Skip to content

Conversation

@wangyum
Copy link
Member

@wangyum wangyum commented Mar 21, 2019

What changes were proposed in this pull request?

This pr related to #24119. Reduce the code duplicate when upgrading built-in Hive.
To achieve this, we should avoid using classes in org.apache.orc.storage.* because these classes will be replaced with org.apache.hadoop.hive.* after upgrading the built-in Hive. Such as:
image

  • Move the usage of org.apache.orc.storage.* to OrcShimUtils:
  1. Add wrapper for VectorizedRowBatch(Reduce code duplication of OrcColumnarBatchReader).
  2. Move some serializer/deserializer method out of OrcDeserializer and OrcSerializer(Reduce code duplication of OrcDeserializer and OrcSerializer).
  3. Defined two type aliases: Operator and SearchArgument(Reduce code duplication of OrcV1FilterSuite).
  • Move duplication code to super class:
  1. Add a trait for OrcFilters(Reduce code duplication of OrcFilters).
  2. Move checkNoFilterPredicate from OrcFilterSuite to OrcTest(Reduce code duplication of OrcFilterSuite).

After this pr. We only need to copy these 4 files: OrcColumnVector, OrcFilters, OrcFilterSuite and OrcShimUtils.

How was this patch tested?

existing tests

@wangyum
Copy link
Member Author

wangyum commented Mar 21, 2019

cc @srowen @gatorsmile Please review this PR first. Thanks.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel like I know this part well enough to really review it, but I trust your general judgment about how to handle Orc and the Hive upgrade.

(implicit df: DataFrame): Unit = {
val output = predicate.collect { case a: Attribute => a }.distinct
val query = df
.select(output.map(e => Column(e)): _*)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No big deal, but .map(Column)?

@SparkQA
Copy link

SparkQA commented Mar 21, 2019

Test build #103764 has finished for PR 24166 at commit 3e51b92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait OrcFiltersBase
  • class VectorizedRowBatchWrap(val batch: VectorizedRowBatch)

@HyukjinKwon
Copy link
Member

cc @dongjoon-hyun FYI

@HyukjinKwon
Copy link
Member

@wangyum, BTW, please link related PRs, and/or short explanation why it needs in the description as well. It's kind of difficult to follow why those changes are needed

class VectorizedRowBatchWrap(val batch: VectorizedRowBatch) {}

private[sql] type Operator = OrcOperator
private[sql] type SearchArgument = OrcSearchArgument
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add these two type aliases to avoid copying OrcV1FilterSuite for Hive 2.3.4.

private[sql] type Operator = OrcOperator
private[sql] type SearchArgument = OrcSearchArgument

def getSqlDate(value: Any): Date = value.asInstanceOf[DateWritable].get
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The below functions to avoid copying OrcDeserializer and OrcSerializer for Hive 2.3.4.

checkFilterPredicate(df, predicate, checkLogicalOperator)
}

protected def checkNoFilterPredicate
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to super class

@gatorsmile
Copy link
Member

@wangyum Thank you for your work! This looks much better. cc @dongjoon-hyun @liancheng

@SparkQA
Copy link

SparkQA commented Mar 26, 2019

Test build #4666 has finished for PR 24166 at commit 3e51b92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait OrcFiltersBase
  • class VectorizedRowBatchWrap(val batch: VectorizedRowBatch)

@srowen
Copy link
Member

srowen commented Mar 26, 2019

Merged to master. If there any follow ups it can be in the next PR.

@srowen srowen closed this in 300ec1a Mar 26, 2019
/**
* Various utilities for ORC used to upgrade the built-in Hive.
*/
private[sql] object OrcShimUtils {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a big deal but just leave a note for the future, private[sql] should be removed if that's under execution package per SPARK-16964

@HyukjinKwon
Copy link
Member

Looks fine to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants