[SPARK-24576][BUILD] Upgrade Apache ORC to 1.5.2 #21582

dongjoon-hyun · 2018-06-17T23:58:47Z

What changes were proposed in this pull request?

This issue aims to upgrade the Apache ORC library from 1.4.4 to 1.5.2 in order to bring the following benefits to Apache Spark.

ORC-91 Support for variable length blocks in HDFS (The space currently wasted on padding in ORC is known to be 5%.)
ORC-344 Support for using Decimal64ColumnVector

In addition to that, Apache Hive 3.1 and 3.2 will use ORC 1.5.1 (HIVE-19669) and 1.5.2 (HIVE-19792) respectively. This will improve the compatibility between Apache Spark and Apache Hive by sharing the common library.

How was this patch tested?

Passes Jenkins with all existing tests.

viirya · 2018-06-18T00:04:54Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala

nit: It seems the first _: AtomicType case can be removed, because this one covers all the other cases?

viirya · 2018-06-18T00:05:43Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala

Why this change?

Thank you for the review, @viirya . ORC 1.5 checks field name syntax more strictly. One example is a field name containing a dot.

@dongjoon-hyun Thanks for explaining it.
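To illustrate the stricter field name check discussed above, here is a minimal sketch of quoting field names when building an ORC-style schema string. It is not the code from this PR: `quote_name` and `quoted_schema_string` are hypothetical helpers, and the backtick-doubling escape rule is an assumption about how quoted names are written.

```python
from typing import List, Tuple, Union

# A field type is either a primitive type name (e.g. "int") or a list of
# nested (name, type) pairs that stands for a struct. Hypothetical model only.
FieldType = Union[str, List[Tuple[str, "FieldType"]]]


def quote_name(name: str) -> str:
    """Wrap a field name in backticks, doubling embedded backticks (assumed rule)."""
    return "`" + name.replace("`", "``") + "`"


def quoted_schema_string(fields: List[Tuple[str, FieldType]]) -> str:
    """Render fields as struct<`name`:type,...> with every field name quoted."""
    parts = []
    for name, ftype in fields:
        # Nested structs are rendered recursively; primitives are used as-is.
        rendered = ftype if isinstance(ftype, str) else quoted_schema_string(ftype)
        parts.append(f"{quote_name(name)}:{rendered}")
    return "struct<" + ",".join(parts) + ">"


# A top-level column name containing a dot, as in the col1.x example.
schema = quoted_schema_string([("col1.x", "int"), ("name", "string")])
print(schema)  # struct<`col1.x`:int,`name`:string>
```

Quoting every name unconditionally keeps the helper simple. A real implementation might quote only names that need it.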

SparkQA · 2018-06-18T00:08:03Z

Test build #92006 has finished for PR 21582 at commit 60e461e.

This patch fails build dependency tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-06-18T01:38:31Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala

We don't need to recursively quote udt.sqlType?

Thank you for the review, @maropu . Yes, that is not handled here because the goal is to support user column names like col1.x, which usually appear as top-level column names.

SparkQA · 2018-06-18T04:56:45Z

Test build #92007 has finished for PR 21582 at commit 654aa45.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-09T04:24:19Z

Test build #92730 has finished for PR 21582 at commit d15db23.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-07-09T06:46:06Z

@dbtsai . This seems to be another difference due to recent build system changes.

build/mvn -Phive clean package -DskipTests (Build Success)
build/sbt -Phive clean package (Build Failure)

I'll take a look at this.

viirya · 2018-07-09T06:57:11Z

@dongjoon-hyun What error do you see? I can run the build with sbt without a problem.

dongjoon-hyun · 2018-07-09T07:30:44Z

It occurs when we use a classifier. This PR uses the nohive classifier for orc-core. If you try the above commands on this PR, the build fails for sbt only. It's the same error that occurred on Jenkins.

dongjoon-hyun · 2018-07-09T07:48:17Z

sql/core/pom.xml

+        <exclusion>
+          <groupId>org.apache.hive</groupId>
+          <artifactId>hive-storage-api</artifactId>
+        </exclusion>


I added the above eight lines to be consistent for both mvn and sbt.

SparkQA · 2018-07-09T12:12:18Z

Test build #92743 has finished for PR 21582 at commit 8fe3f11.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-07-09T22:27:29Z

ORC-344 Support for using Decimal64ColumnVector

Can this achieve better perf numbers? Could you share the perf gain?

dongjoon-hyun · 2018-07-11T20:58:55Z

@gatorsmile . For benchmarks, the ORC community is officially working on ORC-386. After it's finalized, I'll try to share results based on that.

dongjoon-hyun · 2018-07-13T20:17:02Z

@gatorsmile . ORC-386 is not merged yet, but I've got the following from Hive's DecimalBench. On the Spark side, we can proceed after this BUILD PR.

Benchmark               (version)  Mode  Cnt         Score       Error  Units
DecimalBench.read        ORIGINAL  avgt    5  10375991.566 ± 99751.161  us/op
DecimalBench.read   USE_DECIMAL64  avgt    5   7368917.561 ± 39257.332  us/op
DecimalBench.write       ORIGINAL  avgt    5    211008.673 ± 30220.403  us/op
DecimalBench.write  USE_DECIMAL64  avgt    5     33881.693 ±   216.601  us/op
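
As a reading aid for the table above, the following sketch computes the ratio between the ORIGINAL and USE_DECIMAL64 scores. The scores are copied from the table. The `speedup` helper and the dictionary layout are illustrative only, and the error margins are ignored.

```python
# Scores copied from the DecimalBench table (average time in us/op; lower is better).
scores = {
    "read": {"ORIGINAL": 10375991.566, "USE_DECIMAL64": 7368917.561},
    "write": {"ORIGINAL": 211008.673, "USE_DECIMAL64": 33881.693},
}


def speedup(original: float, decimal64: float) -> float:
    """Return how many times faster the USE_DECIMAL64 run is than ORIGINAL."""
    return original / decimal64


speedups = {
    op: speedup(times["ORIGINAL"], times["USE_DECIMAL64"])
    for op, times in scores.items()
}

for op, ratio in speedups.items():
    print(f"{op}: {ratio:.2f}x faster with USE_DECIMAL64")
```

On these numbers, reads come out roughly 1.4x faster and writes roughly 6.2x faster. This reflects ORC's own benchmark and not a Spark workload.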

gatorsmile · 2018-07-16T04:39:21Z

@dongjoon-hyun Could you explain how the benchmark works? What is the workload pattern? How does the benchmark invoke Spark?

dongjoon-hyun · 2018-07-17T17:26:13Z

@gatorsmile . We don't use the new feature here (in Spark) yet. This is a BUILD PR. The result comes from official ORC DecimalBench code in ORC-386 (which I mentioned.) We can get the benefit later.

https://github.com/apache/orc/pull/290/files#diff-e1a76ee6f5fe64d831a3e4a2a6c28323R57

gatorsmile · 2018-07-17T18:20:17Z

@dongjoon-hyun Could you submit a PR to use the latest Decimal64ColumnVector like what Hive does https://issues.apache.org/jira/browse/HIVE-19629 ?

gatorsmile · 2018-07-17T18:23:05Z

retest this please

SparkQA · 2018-07-17T22:46:47Z

Test build #93191 has finished for PR 21582 at commit 8fe3f11.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-17T23:06:22Z

Test build #93193 has finished for PR 21582 at commit 8fe3f11.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-07-18T06:52:14Z

Thanks! Merged to master.

dongjoon-hyun · 2018-07-18T21:15:42Z

Thank you so much, @gatorsmile . I will proceed.
Also, thank you, @viirya and @maropu .

This issue aims to upgrade Apache ORC library from 1.4.4 to 1.5.2 in order to bring the following benefits into Apache Spark.

- [ORC-91](https://issues.apache.org/jira/browse/ORC-91) Support for variable length blocks in HDFS (The current space wasted in ORC to padding is known to be 5%.)
- [ORC-344](https://issues.apache.org/jira/browse/ORC-344) Support for using Decimal64ColumnVector

In addition to that, Apache Hive 3.1 and 3.2 will use ORC 1.5.1 ([HIVE-19669](https://issues.apache.org/jira/browse/HIVE-19669)) and 1.5.2 ([HIVE-19792](https://issues.apache.org/jira/browse/HIVE-19792)) respectively. This will improve the compatibility between Apache Spark and Apache Hive by sharing the common library.

Pass the Jenkins with all existing tests.

Author: Dongjoon Hyun

Closes apache#21582 from dongjoon-hyun/SPARK-24576.

viirya reviewed Jun 18, 2018

maropu reviewed Jun 18, 2018

dongjoon-hyun changed the title ~~[SPARK-24576][BUILD] Upgrade Apache ORC to 1.5.1~~ [SPARK-24576][BUILD] Upgrade Apache ORC to 1.5.2 Jul 9, 2018

[SPARK-24576][BUILD] Upgrade Apache ORC to 1.5.2

8fe3f11

asfgit closed this in 3b59d32 Jul 18, 2018

dongjoon-hyun deleted the SPARK-24576 branch July 18, 2018 21:15
