Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

This is mostly from #13775

The wrapper solution works well for the string/binary types: the ORC column vector doesn't keep its bytes in a contiguous memory region, so copying the data into Spark's columnar batch has significant overhead. For other types, the wrapper solution performs about the same as the current solution.

I think we can treat the wrapper solution as the baseline and keep improving the write-to-Spark (copy) solution.
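
To make the trade-off concrete, here is a minimal sketch of the wrapper idea for strings. This is an illustration only, not the PR's actual OrcColumnVector: the class name is made up, the structure is simplified, and BytesColumnVector may live under the ORC-shaded package (org.apache.orc.storage.ql.exec.vector) rather than the Hive one, depending on the ORC artifact in use.

import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.spark.unsafe.types.UTF8String;

// Simplified wrapper: serve Spark string reads straight out of ORC's vector.
class StringOrcColumnWrapper {
  private final BytesColumnVector data;

  StringOrcColumnWrapper(BytesColumnVector data) {
    this.data = data;
  }

  UTF8String getUTF8String(int rowId) {
    // ORC stores each value as a (buffer, start, length) triple, so a column's
    // bytes are not contiguous. Wrapping reads them in place and skips the
    // per-value copy into Spark's WritableColumnVector.
    int idx = data.isRepeating ? 0 : rowId;
    return UTF8String.fromBytes(data.vector[idx], data.start[idx], data.length[idx]);
  }
}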

How was this patch tested?

Existing tests.

@cloud-fan
Contributor Author

for (WritableColumnVector vector : columnVectors) {
  vector.reset();
}
columnarBatch.setNumRows(0);
Member

@cloud-fan. Can we keep this like Parquet? At the final empty batch, we need to clean this up.

Member

@dongjoon-hyun Jan 10, 2018
Yep. I meant keeping this here, since we return at lines 390 and 240. Parquet does the same.
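
For context, here is a rough sketch of the pattern under discussion, modeled on Parquet's VectorizedParquetRecordReader.nextBatch; the recordReader and batch fields are assumptions about the surrounding reader class, not code quoted from this PR.

private boolean nextBatch() throws IOException {
  // Reset the reusable vectors and the row count up front, so that even the
  // final, empty batch is handed back in a clean state before we return early.
  for (WritableColumnVector vector : columnVectors) {
    vector.reset();
  }
  columnarBatch.setNumRows(0);

  recordReader.nextBatch(batch);   // assumed field of type org.apache.orc.RecordReader
  int batchSize = batch.size;
  if (batchSize == 0) {
    return false;                  // end of file: the caller sees an empty batch
  }
  columnarBatch.setNumRows(batchSize);
  // ... copy or wrap the ORC vectors for this batch's rows ...
  return true;
}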

Member

@dongjoon-hyun left a comment

Thank you so much!
LGTM except one minor comment.

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 9, 2018

BTW, if you don't mind, could you update the following? It was @viirya's comment, so I made a follow-up patch, but we had better have this in your PR; making another follow-up PR would be overkill. :)

   /**
-   * The default size of batch. We use this value for both ORC and Spark consistently
-   * because they have different default values like the following.
+   * The default size of batch. We use this value for ORC reader to make it consistent
+   * with Spark's columnar batch because they have different default values like the
+   * following.

@SparkQA

SparkQA commented Jan 9, 2018

Test build #85857 has finished for PR 20205 at commit bdf9dbf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class OrcColumnVector extends org.apache.spark.sql.vectorized.ColumnVector

@SparkQA

SparkQA commented Jan 10, 2018

Test build #85901 has finished for PR 20205 at commit b78c6ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Jan 10, 2018
…rc reader

Author: Wenchen Fan <[email protected]>

Closes #20205 from cloud-fan/orc.

(cherry picked from commit eaac60a)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit closed this in eaac60a Jan 10, 2018
int colId = requestedColIds[i];
// Initialize the missing columns once.
if (colId == -1) {
  OnHeapColumnVector missingCol = new OnHeapColumnVector(capacity, dt);
Member

nit: Shouldn't we respect the MEMORY_MODE parameter here?
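
A sketch of what that suggestion could look like, assuming a memoryMode field of type org.apache.spark.memory.MemoryMode on the reader, as in the Parquet vectorized reader:

// Pick the vector implementation from the configured memory mode instead of
// hard-coding the on-heap variant.
WritableColumnVector missingCol = (memoryMode == MemoryMode.OFF_HEAP)
    ? new OffHeapColumnVector(capacity, dt)
    : new OnHeapColumnVector(capacity, dt);
missingCol.putNulls(0, capacity);   // a missing column is entirely null
missingCol.setIsConstant();         // and its content never changes between batches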

int partitionIdx = requiredFields.length;
for (int i = 0; i < partitionValues.numFields(); i++) {
  DataType dt = partitionSchema.fields()[i].dataType();
  OnHeapColumnVector partitionCol = new OnHeapColumnVector(capacity, dt);
Member

ditto.

import org.apache.spark.unsafe.types.UTF8String;

/**
 * A column vector class wrapping Hive's ColumnVector. Because Spark ColumnarBatch only accepts
Member

I think it is not Hive's ColumnVector, but ORC's ColumnVector.

Benchmark                                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Hive built-in ORC                              2940 / 2952          3.6         280.4       0.8X
Native ORC MR                                  2234 / 2255          4.7         213.1       1.0X
Native ORC Vectorized                           854 /  869         12.3          81.4       2.6X
Native ORC Vectorized with copy                1099 / 1128          9.5         104.8       2.0X
Member

Looks like the wrapper approach is usually faster than the copy approach.

Member

In the long term, maybe we can consider removing the copy approach to simplify the code.

@viirya
Member

viirya commented Jan 10, 2018

I'm busy relocating, so sorry for not reviewing promptly. LGTM with a few minor comments.
