[SPARK-27226][SQL] Reduce the code duplicate when upgrading built-in Hive #24166
cc @srowen @gatorsmile Please review this PR first. Thanks.
srowen
I don't feel like I know this part well enough to really review it, but I trust your general judgment about how to handle Orc and the Hive upgrade.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
      (implicit df: DataFrame): Unit = {
    val output = predicate.collect { case a: Attribute => a }.distinct
    val query = df
      .select(output.map(e => Column(e)): _*)
No big deal, but .map(Column)?
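For readers unfamiliar with the shorthand being suggested, here is a minimal sketch of the equivalent spellings. `Attr` and `Col` are hypothetical stand-ins, not Spark's `Attribute` and `Column`. Passing the companion directly compiles here because `Col` is a case class; whether the same form compiles against Spark's own `Column` is not established in this thread.

```scala
// Hypothetical stand-ins for illustration only; not Spark's classes.
object MapShorthandSketch {
  final case class Attr(name: String)
  final case class Col(expr: Attr)

  val output: Seq[Attr] = Seq(Attr("a"), Attr("b"))

  // Explicit lambda, the form used in the diff under review.
  val viaLambda: Seq[Col] = output.map(e => Col(e))

  // Placeholder syntax: the same function with less ceremony.
  val viaPlaceholder: Seq[Col] = output.map(Col(_))

  // Passing the companion itself; this relies on Col being a case class.
  val viaCompanion: Seq[Col] = output.map(Col)
}
```

All three values hold the same sequence, so the choice is purely stylistic in this sketch.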
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcShimUtils.scala
Test build #103764 has finished for PR 24166 at commit
cc @dongjoon-hyun FYI
@wangyum, BTW, please link the related PRs and/or add a short explanation of why this is needed in the description as well. It's kind of difficult to follow why these changes are needed.
  class VectorizedRowBatchWrap(val batch: VectorizedRowBatch) {}

  private[sql] type Operator = OrcOperator
  private[sql] type SearchArgument = OrcSearchArgument
Add these two type aliases to avoid copying OrcV1FilterSuite for Hive 2.3.4.
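A minimal sketch of why the aliases help, under stated assumptions: `orcStorage` is a hypothetical object standing in for the library package, and `ShimSketch` and `FilterSuiteSketch` are illustrative names, not the PR's classes. The idea is that only the shim names the underlying package, so code written against the aliases does not need a second copy when that package changes.

```scala
// Hypothetical stand-in for the library package that gets swapped on upgrade.
object orcStorage {
  final case class SearchArgument(expr: String)
  sealed trait Operator
  case object Equals extends Operator
}

// The shim is the only place that names the underlying package.
object ShimSketch {
  type SearchArgument = orcStorage.SearchArgument
  type Operator = orcStorage.Operator
}

// Code written against the aliases never imports the library package itself.
object FilterSuiteSketch {
  import ShimSketch._

  def describe(sarg: SearchArgument): String = s"leaf: ${sarg.expr}"

  def isOperator(x: Any): Boolean = x match {
    case _: Operator => true
    case _           => false
  }
}
```

In this sketch, pointing the two aliases at a different package would leave `FilterSuiteSketch` unchanged.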
  private[sql] type Operator = OrcOperator
  private[sql] type SearchArgument = OrcSearchArgument

  def getSqlDate(value: Any): Date = value.asInstanceOf[DateWritable].get
The functions below are added to avoid copying OrcDeserializer and OrcSerializer for Hive 2.3.4.
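The same idea applies to functions. The sketch below assumes a hypothetical `DateWritableStandIn` class in place of the library's writable type; `ShimFunctionsSketch` and `DeserializerSketch` are illustrative names as well. The cast to the concrete class lives in one shim function, and the caller depends only on that function.

```scala
import java.sql.Date

// Hypothetical stand-in for the library's writable type; not the real class.
final class DateWritableStandIn(millis: Long) {
  def get: Date = new Date(millis)
}

object ShimFunctionsSketch {
  // Only this function knows the concrete writable class.
  def getSqlDate(value: Any): Date =
    value.asInstanceOf[DateWritableStandIn].get
}

object DeserializerSketch {
  // The caller depends on the shim function, not on the writable's package.
  def readDate(raw: Any): Date = ShimFunctionsSketch.getSqlDate(raw)
}
```

If the writable class moves to another package, only the shim function's cast has to change in this sketch.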
    checkFilterPredicate(df, predicate, checkLogicalOperator)
  }

  protected def checkNoFilterPredicate
Moved to super class
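A minimal sketch of what moving a helper to the parent buys. The trait, the suite classes, and the helper's signature below are all hypothetical; the real signature of checkNoFilterPredicate is truncated in the hunk above and is not reproduced here.

```scala
// Hypothetical parent holding the shared helper; the signature is illustrative.
trait OrcTestSketch {
  protected def checkNoFilterPredicate(created: Option[String]): Unit =
    if (created.nonEmpty) {
      throw new IllegalStateException(s"expected no pushed-down filter, got $created")
    }
}

// Two suites reuse the helper instead of each carrying its own copy.
class FilterSuiteA extends OrcTestSketch {
  def run(): Unit = checkNoFilterPredicate(None)
}

class FilterSuiteB extends OrcTestSketch {
  def run(): Unit = checkNoFilterPredicate(Some("a = 1"))
}
```

Here `FilterSuiteA` passes and `FilterSuiteB` fails the check, and both go through the single definition in the parent.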
@wangyum Thank you for your work! This looks much better. cc @dongjoon-hyun @liancheng
Test build #4666 has finished for PR 24166 at commit
Merged to master. If there are any follow-ups, they can go in the next PR.
/**
 * Various utilities for ORC used to upgrade the built-in Hive.
 */
private[sql] object OrcShimUtils {
Not a big deal, but leaving a note for the future: private[sql] should be removed if this is under the execution package, per SPARK-16964.
Looks fine to me.
What changes were proposed in this pull request?
This PR is related to #24119. It reduces code duplication when upgrading the built-in Hive.

To achieve this, we should avoid using classes in org.apache.orc.storage.* because these classes will be replaced with org.apache.hadoop.hive.* after upgrading the built-in Hive. For example:
- Move usages of org.apache.orc.storage.* to OrcShimUtils:
  - VectorizedRowBatch (reduces code duplication of OrcColumnarBatchReader).
  - OrcDeserializer and OrcSerializer (reduces code duplication of OrcDeserializer and OrcSerializer).
  - Operator and SearchArgument (reduces code duplication of OrcV1FilterSuite).
- OrcFilters (reduces code duplication of OrcFilters).
- checkNoFilterPredicate, moved from OrcFilterSuite to OrcTest (reduces code duplication of OrcFilterSuite).

After this PR, we only need to copy these 4 files: OrcColumnVector, OrcFilters, OrcFilterSuite and OrcShimUtils.
How was this patch tested?
existing tests