[SPARK-24576][BUILD] Upgrade Apache ORC to 1.5.2 #21582
nit: It seems the first _: AtomicType case can be dropped, because this case covers all other cases?
Why this change?
Thank you for the review, @viirya. ORC 1.5 checks field name syntax more strictly, for example when a field name contains a dot.
@dongjoon-hyun Thanks for explaining it.
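To make the point about stricter field names concrete, here is a minimal sketch of backtick-quoting struct field names when building a schema string. The types below (`DataType`, `Field`, `StructType`, `ArrayType`) are hypothetical simplified stand-ins, not Spark's real classes, and the sketch assumes the ORC schema parser accepts backtick-quoted names. It is an illustration of the idea discussed above, not the code in this PR.

```scala
object QuotingSketch {
  // Hypothetical, simplified stand-ins for Spark SQL data types.
  sealed trait DataType { def catalogString: String }
  case object IntType extends DataType { val catalogString = "int" }
  case object StringType extends DataType { val catalogString = "string" }
  final case class Field(name: String, dataType: DataType)
  final case class StructType(fields: Seq[Field]) extends DataType {
    def catalogString: String =
      fields.map(f => s"${f.name}:${f.dataType.catalogString}").mkString("struct<", ",", ">")
  }
  final case class ArrayType(elementType: DataType) extends DataType {
    def catalogString: String = s"array<${elementType.catalogString}>"
  }

  // Build a type string in which every struct field name is wrapped in backticks,
  // so that a name such as "col1.x" is read as a single identifier.
  def quoted(dt: DataType): String = dt match {
    case StructType(fields) =>
      fields.map(f => s"`${f.name}`:${quoted(f.dataType)}").mkString("struct<", ",", ">")
    case ArrayType(elementType) =>
      s"array<${quoted(elementType)}>"
    // Catch-all: atomic types (and anything else) keep their plain string,
    // which is why a separate leading case for atomic types would be redundant.
    case other => other.catalogString
  }
}
```

With a schema such as `StructType(Seq(Field("col1.x", IntType)))`, `quoted` wraps the dotted name in backticks while `catalogString` leaves it bare. The final catch-all case also illustrates the earlier nit: atomic types need no dedicated case of their own.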
Test build #92006 has finished for PR 21582 at commit
We don't need to recursively quote udt.sqlType?
Thank you for the review, @maropu. Yes, that is not handled here, because the goal is to support users' column names like col1.x, which usually appear as top-level column names.
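The trade-off described here can be sketched as follows. The types are again hypothetical stand-ins rather than Spark's classes, and `UserDefinedType` is an illustrative wrapper around an underlying `sqlType`. Because the wrapper falls into the catch-all case, field names inside its `sqlType` are left unquoted, while top-level names such as `col1.x` are quoted.

```scala
object UdtSketch {
  // Hypothetical, simplified stand-ins for Spark SQL data types.
  sealed trait DataType { def catalogString: String }
  case object DoubleType extends DataType { val catalogString = "double" }
  final case class Field(name: String, dataType: DataType)
  final case class StructType(fields: Seq[Field]) extends DataType {
    def catalogString: String =
      fields.map(f => s"${f.name}:${f.dataType.catalogString}").mkString("struct<", ",", ">")
  }
  // Illustrative user-defined type that only delegates to its underlying SQL type.
  final case class UserDefinedType(sqlType: DataType) extends DataType {
    def catalogString: String = sqlType.catalogString
  }

  def quoted(dt: DataType): String = dt match {
    case StructType(fields) =>
      fields.map(f => s"`${f.name}`:${quoted(f.dataType)}").mkString("struct<", ",", ">")
    // A UserDefinedType lands here: its sqlType is not traversed,
    // so names nested inside it stay unquoted.
    case other => other.catalogString
  }
}
```

If nested names inside a user-defined type ever needed quoting, one option would be an extra case that recurses into `sqlType`; the sketch deliberately omits it to mirror the scope stated in the reply above.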
Test build #92007 has finished for PR 21582 at commit
Test build #92730 has finished for PR 21582 at commit
@dbtsai, this seems to be another difference due to recent build system changes. I'll take a look at this.
@dongjoon-hyun What is the error you see? I can run the build with sbt without problems.
It occurs when we use
<exclusion>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-storage-api</artifactId>
</exclusion>
I added the above eight lines to be consistent for both mvn and sbt.
Test build #92743 has finished for PR 21582 at commit
Can this achieve better performance numbers? Could you share the perf gain?
@gatorsmile, for the benchmark, the ORC community is officially working on ORC-386. After it's finalized, I'll try to share results based on that.
@gatorsmile, ORC-386 is not merged yet, but I've got the following from Hive's
@dongjoon-hyun Could you explain how the benchmark works? What is the workload pattern? How does the benchmark invoke Spark?
@gatorsmile, we don't use the new feature here (in Spark) yet. This is a BUILD PR. The result comes from the official ORC DecimalBench code in ORC-386 (which I mentioned). We can get the benefit later.
@dongjoon-hyun Could you submit a PR to use the latest Decimal64ColumnVector like what Hive does (https://issues.apache.org/jira/browse/HIVE-19629)?
retest this please
Test build #93191 has finished for PR 21582 at commit
Test build #93193 has finished for PR 21582 at commit
Thanks! Merged to master.
Thank you so much, @gatorsmile. I will proceed.
This issue aims to upgrade Apache ORC library from 1.4.4 to 1.5.2 in order to bring the following benefits into Apache Spark.

- [ORC-91](https://issues.apache.org/jira/browse/ORC-91) Support for variable length blocks in HDFS (The current space wasted in ORC to padding is known to be 5%.)
- [ORC-344](https://issues.apache.org/jira/browse/ORC-344) Support for using Decimal64ColumnVector

In addition to that, Apache Hive 3.1 and 3.2 will use ORC 1.5.1 ([HIVE-19669](https://issues.apache.org/jira/browse/HIVE-19465)) and 1.5.2 ([HIVE-19792](https://issues.apache.org/jira/browse/HIVE-19792)) respectively. This will improve the compatibility between Apache Spark and Apache Hive by sharing the common library.

Pass the Jenkins with all existing tests.

Author: Dongjoon Hyun <[email protected]>

Closes apache#21582 from dongjoon-hyun/SPARK-24576.
What changes were proposed in this pull request?
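This issue aims to upgrade Apache ORC library from 1.4.4 to 1.5.2 in order to bring the following benefits into Apache Spark.
- ORC-91 Support for variable length blocks in HDFS (The current space wasted in ORC to padding is known to be 5%.)
- ORC-344 Support for using Decimal64ColumnVector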
In addition to that, Apache Hive 3.1 and 3.2 will use ORC 1.5.1 (HIVE-19669) and 1.5.2 (HIVE-19792) respectively. This will improve the compatibility between Apache Spark and Apache Hive by sharing the common library.
How was this patch tested?
Pass the Jenkins with all existing tests.