Member

dongjoon-hyun commented Jun 17, 2018

What changes were proposed in this pull request?

This issue aims to upgrade the Apache ORC library from 1.4.4 to 1.5.2 in order to bring the following benefits to Apache Spark.

  • ORC-91 Support for variable-length blocks in HDFS (Padding in ORC is currently known to waste 5% of space.)
  • ORC-344 Support for using Decimal64ColumnVector
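
For context on ORC-344: as I understand it, the idea behind a 64-bit decimal vector is that a decimal with at most 18 digits of precision can be stored as an unscaled integer in a signed 64-bit slot, with the scale kept once per column. The Python sketch below illustrates only that encoding. The function and constant names are hypothetical, and this is not ORC's or Hive's Java API.

```python
from decimal import Decimal

# Assumption: 18 is the largest precision whose unscaled value always fits in
# a signed 64-bit integer, since 10**18 - 1 < 2**63 - 1.
MAX_INT64_PRECISION = 18


def to_unscaled(value: Decimal, scale: int) -> int:
    """Encode a decimal as an integer count of 10**-scale units."""
    shifted = value.scaleb(scale)
    if shifted != shifted.to_integral_value():
        raise ValueError(f"{value} has more than {scale} fractional digits")
    unscaled = int(shifted)
    if abs(unscaled) >= 10 ** MAX_INT64_PRECISION:
        raise ValueError(f"{value} needs more than {MAX_INT64_PRECISION} digits")
    return unscaled


def from_unscaled(unscaled: int, scale: int) -> Decimal:
    """Decode an unscaled integer back into a decimal at the given scale."""
    return Decimal(unscaled).scaleb(-scale)


# A decimal(18, 2) column becomes a plain list of integers (hypothetical data).
prices = [Decimal("123.45"), Decimal("0.05"), Decimal("-7.10")]
column = [to_unscaled(p, scale=2) for p in prices]

# Arithmetic such as a sum can then run on integers and be rescaled once.
total = from_unscaled(sum(column), scale=2)
print(column, total)
```

Working on plain integers instead of arbitrary-precision decimal objects is the kind of saving such a representation is meant to enable.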

In addition to that, Apache Hive 3.1 and 3.2 will use ORC 1.5.1 (HIVE-19669) and 1.5.2 (HIVE-19792) respectively. This will improve the compatibility between Apache Spark and Apache Hive by sharing the common library.

How was this patch tested?

Passes Jenkins with all existing tests.

Member

nit: It seems the first _: AtomicType case can be removed, because this case covers all the other cases?

Member

Why this change?

Member Author

Thank you for the review, @viirya. ORC 1.5 checks field name syntax more strictly, for example for a field name that contains a dot.
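
To illustrate the kind of quoting this implies, here is a minimal Python sketch that renders a schema string with every struct field name wrapped in backticks, so that a name such as col1.x is not read as a nested path. The tuple-based type model and the function name are hypothetical, backtick quoting is an assumption, and this is not Spark's actual Scala code.

```python
def quoted_schema_string(data_type) -> str:
    """Render a type as a schema string with backtick-quoted field names.

    Hypothetical type model: atomic types are plain strings, and complex
    types are tuples tagged with "struct", "array", or "map".
    """
    if isinstance(data_type, str):
        # Atomic types such as "int" or "string" need no quoting.
        return data_type
    kind = data_type[0]
    if kind == "struct":
        fields = ",".join(
            f"`{name}`:{quoted_schema_string(field_type)}"
            for name, field_type in data_type[1]
        )
        return f"struct<{fields}>"
    if kind == "array":
        return f"array<{quoted_schema_string(data_type[1])}>"
    if kind == "map":
        key = quoted_schema_string(data_type[1])
        value = quoted_schema_string(data_type[2])
        return f"map<{key},{value}>"
    raise ValueError(f"unsupported type: {data_type!r}")


# A top-level column whose name contains a dot.
schema = ("struct", [("col1.x", "int"), ("tags", ("array", "string"))])
print(quoted_schema_string(schema))
```

The sketch does not escape backticks that appear inside a field name; a real implementation would need a rule for that.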

Member

@dongjoon-hyun Thanks for explaining it.

SparkQA commented Jun 18, 2018

Test build #92006 has finished for PR 21582 at commit 60e461e.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

We don't need to recursively quote udt.sqlType?

Member Author

Thank you for the review, @maropu. Yes, that is not handled here, because the goal is to support users' column names like col1.x, which usually appear as top-level column names.

SparkQA commented Jun 18, 2018

Test build #92007 has finished for PR 21582 at commit 654aa45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun changed the title from "[SPARK-24576][BUILD] Upgrade Apache ORC to 1.5.1" to "[SPARK-24576][BUILD] Upgrade Apache ORC to 1.5.2" on Jul 9, 2018

SparkQA commented Jul 9, 2018

Test build #92730 has finished for PR 21582 at commit d15db23.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

@dbtsai, this seems to be another difference caused by the recent build system changes.

  • build/mvn -Phive clean package -DskipTests (Build Success)
  • build/sbt -Phive clean package (Build Failure)

I'll take a look at this.

Member

viirya commented Jul 9, 2018

@dongjoon-hyun What error do you see? I can run the build with sbt without problems.

Member Author

dongjoon-hyun commented Jul 9, 2018

It occurs when we use a classifier. This PR uses the nohive classifier for orc-core. If you try the above commands on this PR, the build fails for sbt only. It's the same error that occurred on Jenkins.

<exclusion>
<groupId>org.apache.hive</groupId>
<artifactId>hive-storage-api</artifactId>
</exclusion>
Member Author

I added the eight lines above so that mvn and sbt behave consistently.

SparkQA commented Jul 9, 2018

Test build #92743 has finished for PR 21582 at commit 8fe3f11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

ORC-344 Support for using Decimal64ColumnVector

Does this achieve better performance numbers? Could you share the performance gain?

@dongjoon-hyun
Member Author

@gatorsmile, for benchmarks, the ORC community is officially working on ORC-386. After it's finalized, I'll try to share results based on that.

@dongjoon-hyun
Member Author

@gatorsmile, ORC-386 is not merged yet, but I've got the following results from Hive's DecimalBench. On the Spark side, we can proceed after this BUILD PR.

Benchmark               (version)  Mode  Cnt         Score       Error  Units
DecimalBench.read        ORIGINAL  avgt    5  10375991.566 ± 99751.161  us/op
DecimalBench.read   USE_DECIMAL64  avgt    5   7368917.561 ± 39257.332  us/op
DecimalBench.write       ORIGINAL  avgt    5    211008.673 ± 30220.403  us/op
DecimalBench.write  USE_DECIMAL64  avgt    5     33881.693 ±   216.601  us/op
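
To make the table easier to read, the following Python sketch computes the ratio of the ORIGINAL score to the USE_DECIMAL64 score for each operation. It assumes the usual reading of avgt mode, where the score is average time per operation and lower is better. The dictionary layout and the speedup helper are only for illustration.

```python
# Scores copied from the DecimalBench table above, in microseconds per operation.
scores_us_per_op = {
    "read": {"ORIGINAL": 10375991.566, "USE_DECIMAL64": 7368917.561},
    "write": {"ORIGINAL": 211008.673, "USE_DECIMAL64": 33881.693},
}


def speedup(operation: str) -> float:
    """Return how many times faster USE_DECIMAL64 is than ORIGINAL."""
    row = scores_us_per_op[operation]
    return row["ORIGINAL"] / row["USE_DECIMAL64"]


for operation in scores_us_per_op:
    print(f"{operation}: {speedup(operation):.2f}x faster with USE_DECIMAL64")
```

By this calculation, reads are roughly 1.4 times faster and writes roughly 6.2 times faster in this benchmark run, ignoring the reported error margins.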

@gatorsmile
Member

@dongjoon-hyun Could you explain how the benchmark works? What is the workload pattern? How does the benchmark invoke Spark?

@dongjoon-hyun
Member Author

@gatorsmile, we don't use the new feature here (in Spark) yet. This is a BUILD PR. The result comes from the official ORC DecimalBench code in ORC-386 (which I mentioned). We can get the benefit later.

@gatorsmile
Member

@dongjoon-hyun Could you submit a PR to use the latest Decimal64ColumnVector, like what Hive does in https://issues.apache.org/jira/browse/HIVE-19629 ?

@gatorsmile
Member

retest this please

SparkQA commented Jul 17, 2018

Test build #93191 has finished for PR 21582 at commit 8fe3f11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 17, 2018

Test build #93193 has finished for PR 21582 at commit 8fe3f11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master.

asfgit closed this in 3b59d32 on Jul 18, 2018
@dongjoon-hyun
Member Author

Thank you so much, @gatorsmile . I will proceed.
Also, thank you, @viirya and @maropu .

dongjoon-hyun deleted the SPARK-24576 branch on July 18, 2018, 21:15
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023

Author: Dongjoon Hyun

Closes apache#21582 from dongjoon-hyun/SPARK-24576.