Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

This is mostly from #13775

The wrapper solution works well for the string/binary types: the ORC column vector doesn't keep its bytes in a contiguous memory region, so copying the data into Spark's columnar batch has significant overhead. For other types, the wrapper solution performs about the same as the current solution.

I think we can treat the wrapper solution as the baseline and keep improving the write-to-Spark (copy) solution.
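
To make the trade-off concrete, here is a minimal sketch of the wrapper idea for strings. This is an illustration only, not the PR's actual OrcColumnVector: the class name is made up, the structure is simplified, and BytesColumnVector may live under the ORC-shaded package (org.apache.orc.storage.ql.exec.vector) rather than the Hive one, depending on the ORC artifact in use.

import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.spark.unsafe.types.UTF8String;

// Simplified wrapper: serve Spark string reads straight out of ORC's vector.
class StringOrcColumnWrapper {
  private final BytesColumnVector data;

  StringOrcColumnWrapper(BytesColumnVector data) {
    this.data = data;
  }

  UTF8String getUTF8String(int rowId) {
    // ORC stores each value as a (buffer, start, length) triple, so a column's
    // bytes are not contiguous. Wrapping reads them in place and skips the
    // per-value copy into Spark's WritableColumnVector.
    int idx = data.isRepeating ? 0 : rowId;
    return UTF8String.fromBytes(data.vector[idx], data.start[idx], data.length[idx]);
  }
}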

How was this patch tested?

Existing tests.

@cloud-fan
Contributor Author

for (WritableColumnVector vector : columnVectors) {
  vector.reset();
}
columnarBatch.setNumRows(0);
Member

@cloud-fan. Can we keep this like Parquet? At the final empty batch, we need to clean this up.

Member

@dongjoon-hyun Jan 10, 2018
Yep. I meant keeping this here, since we return at lines 390 and 240. Parquet does the same.
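
For context, here is a rough sketch of the pattern under discussion, modeled on Parquet's VectorizedParquetRecordReader.nextBatch; the recordReader and batch fields are assumptions about the surrounding reader class, not code quoted from this PR.

private boolean nextBatch() throws IOException {
  // Reset the reusable vectors and the row count up front, so that even the
  // final, empty batch is handed back in a clean state before we return early.
  for (WritableColumnVector vector : columnVectors) {
    vector.reset();
  }
  columnarBatch.setNumRows(0);

  recordReader.nextBatch(batch);   // assumed field of type org.apache.orc.RecordReader
  int batchSize = batch.size;
  if (batchSize == 0) {
    return false;                  // end of file: the caller sees an empty batch
  }
  columnarBatch.setNumRows(batchSize);
  // ... copy or wrap the ORC vectors for this batch's rows ...
  return true;
}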

Member

@dongjoon-hyun left a comment

Thank you so much!
LGTM except one minor comment.

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 9, 2018

BTW, if you don't mind, could you update the following? It was @viirya's comment, so I made a follow-up patch, but we had better have this in your PR; making another follow-up PR would be overkill. :)

   /**
-   * The default size of batch. We use this value for both ORC and Spark consistently
-   * because they have different default values like the following.
+   * The default size of batch. We use this value for ORC reader to make it consistent
+   * with Spark's columnar batch because they have different default values like the
+   * following.

@SparkQA

SparkQA commented Jan 9, 2018

Test build #85857 has finished for PR 20205 at commit bdf9dbf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class OrcColumnVector extends org.apache.spark.sql.vectorized.ColumnVector

@SparkQA

SparkQA commented Jan 10, 2018

Test build #85901 has finished for PR 20205 at commit b78c6ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Jan 10, 2018
…rc reader

Author: Wenchen Fan <[email protected]>

Closes #20205 from cloud-fan/orc.

(cherry picked from commit eaac60a)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit closed this in eaac60a Jan 10, 2018
int colId = requestedColIds[i];
// Initialize the missing columns once.
if (colId == -1) {
  OnHeapColumnVector missingCol = new OnHeapColumnVector(capacity, dt);
Member

nit: Shouldn't we respect the MEMORY_MODE parameter here?
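
A sketch of what that suggestion could look like, assuming a memoryMode field of type org.apache.spark.memory.MemoryMode on the reader, as in the Parquet vectorized reader:

// Pick the vector implementation from the configured memory mode instead of
// hard-coding the on-heap variant.
WritableColumnVector missingCol = (memoryMode == MemoryMode.OFF_HEAP)
    ? new OffHeapColumnVector(capacity, dt)
    : new OnHeapColumnVector(capacity, dt);
missingCol.putNulls(0, capacity);   // a missing column is entirely null
missingCol.setIsConstant();         // and its content never changes between batches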

int partitionIdx = requiredFields.length;
for (int i = 0; i < partitionValues.numFields(); i++) {
  DataType dt = partitionSchema.fields()[i].dataType();
  OnHeapColumnVector partitionCol = new OnHeapColumnVector(capacity, dt);
Member

ditto.

import org.apache.spark.unsafe.types.UTF8String;

/**
 * A column vector class wrapping Hive's ColumnVector. Because Spark ColumnarBatch only accepts
Member

I think it is not Hive's ColumnVector, but ORC's ColumnVector.

Benchmark                                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Hive built-in ORC                              2940 / 2952          3.6         280.4       0.8X
Native ORC MR                                  2234 / 2255          4.7         213.1       1.0X
Native ORC Vectorized                           854 /  869         12.3          81.4       2.6X
Native ORC Vectorized with copy                1099 / 1128          9.5         104.8       2.0X
Member

Looks like the wrapper approach is usually faster than the copy approach.

Member

In the long term, maybe we can consider removing the copy approach to simplify the code.

@viirya
Member

viirya commented Jan 10, 2018

I'm busy relocating, so sorry for not reviewing promptly. LGTM with a few minor comments.
