Conversation

@viirya
Member

@viirya viirya commented Aug 18, 2020

What changes were proposed in this pull request?

This PR proposes to fix ORC predicate pushdown under case-insensitive analysis. With case-insensitive analysis enabled, the field names in pushed-down predicates no longer need to match the physical field names in ORC files in exact letter case.

Why are the changes needed?

Currently, ORC predicate pushdown doesn't work with case-insensitive analysis. A predicate "a < 0" cannot be pushed down to an ORC file with field name "A" under case-insensitive analysis.

Parquet predicate pushdown, however, handles this case. We should make ORC predicate pushdown work with case-insensitive analysis too.
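The name-matching problem described above can be sketched outside of Spark as a toy resolution step. All names here are invented for illustration; this is not Spark's actual resolver, and `caseSensitive` merely mirrors the role of the `spark.sql.caseSensitive` setting:

```scala
object CaseInsensitiveResolution {
  // Physical ORC field names, as they appear in the file schema.
  val physicalFields: Seq[String] = Seq("A", "B")

  // Resolve a predicate's attribute name against the physical schema.
  def resolve(attr: String, caseSensitive: Boolean): Option[String] =
    if (caseSensitive) physicalFields.find(_ == attr)
    else physicalFields.find(_.equalsIgnoreCase(attr))

  def main(args: Array[String]): Unit = {
    // With exact-case matching, "a" does not match "A": no pushdown.
    println(resolve("a", caseSensitive = true))   // None
    // With case-insensitive matching, "a" resolves to the physical name "A".
    println(resolve("a", caseSensitive = false))  // Some(A)
  }
}
```

Before this PR, the ORC side effectively behaved like the exact-case branch even when analysis was case-insensitive.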

Does this PR introduce any user-facing change?

Yes, after this PR, under case-insensitive analysis, ORC predicate pushdown will work.

How was this patch tested?

Unit tests.

@SparkQA

SparkQA commented Aug 18, 2020

Test build #127527 has finished for PR 29457 at commit 2f010de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya marked this pull request as draft August 18, 2020 05:58
@SparkQA

SparkQA commented Aug 18, 2020

Test build #127545 has finished for PR 29457 at commit 090747d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Aug 18, 2020

retest this please

@viirya viirya marked this pull request as ready for review August 18, 2020 18:35
@viirya
Member Author

viirya commented Aug 18, 2020

@dongjoon-hyun
Member

Thank you, @viirya .

@SparkQA

SparkQA commented Aug 18, 2020

Test build #127588 has finished for PR 29457 at commit 090747d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    // ORC predicate pushdown
    if (orcFilterPushDown) {
      OrcUtils.readCatalystSchema(filePath, conf, ignoreCorruptFiles).map { fileSchema =>
        OrcFilters.createFilter(fileSchema, filters).foreach { f =>
Contributor

AFAIK the input schema is only used to build the dataTypeMap. Can we directly pass the physical ORC schema here? Then we don't need to convert the ORC schema to a Catalyst schema, which would be consistent with the Parquet side.

Member Author

Our ORC pushdown code, e.g., buildLeafSearchArgument, uses the Catalyst schema in many places. I think it is still possible to use ORC's schema there and remove the Catalyst schema, but that looks like a big change. If we want to do it, I'd suggest doing it in another PR; this diff is already not small.

-      Some(((parentFieldNames :+ f.name).quoted, f.dataType))
+      val fieldName = (parentFieldNames :+ f.name).quoted
+      val orcField = OrcPrimitiveField(fieldName, f.dataType)
+      Some((fieldName, orcField))
Contributor

why do we need the OrcPrimitiveField if fieldName is always orcField.fieldName here?

Member Author

Because we look up the attribute name of pushed-down predicates in the map (dataTypeMap(name)), we need an easy way to retrieve the file schema's field name. If we simply used Map[String, DataType], the attribute name matched under case-insensitive analysis could be in a different letter case from the file schema's field name.

E.g., if the ORC file has an "ABC" field and the pushed-down predicate is "abc > 0", dataTypeMap.contains("abc") is true and we can get dataTypeMap("abc") back. But we need the file schema's field name "ABC".
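The idea above can be sketched with a toy map. This is a simplified illustration, not Spark's actual OrcPrimitiveField or the real dataTypeMap construction; keying by lowercased name here just stands in for case-insensitive lookup:

```scala
// Toy stand-in for the PR's OrcPrimitiveField: the value keeps the
// physical field name even when the lookup key is normalized.
case class ToyOrcField(fieldName: String, dataType: String)

object DataTypeMapSketch {
  // File schema has field "ABC"; the map is keyed case-insensitively
  // (lowercased here), but the value retains the original name.
  val dataTypeMap: Map[String, ToyOrcField] =
    Map("abc" -> ToyOrcField("ABC", "int"))

  def main(args: Array[String]): Unit = {
    // The predicate's attribute "abc" is found in the map...
    val field = dataTypeMap("abc")
    // ...and the exact physical name is still recoverable for
    // building the ORC SearchArgument.
    println(field.fieldName) // ABC
  }
}
```

With a plain Map[String, DataType] the physical name "ABC" would be lost once "abc" was used as the key, which is exactly the problem the wrapper solves.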

It is similar to Parquet:

private val nameToParquetField: Map[String, ParquetPrimitiveField] = {
  // Recursively traverse the parquet schema to get primitive fields that can be pushed-down.
  // `parentFieldNames` is used to keep track of the current nested level when traversing.
  def getPrimitiveFields(
      fields: Seq[Type],
      parentFieldNames: Array[String] = Array.empty): Seq[ParquetPrimitiveField] = {
    fields.flatMap {
      case p: PrimitiveType =>
        Some(ParquetPrimitiveField(fieldNames = parentFieldNames :+ p.getName,
          fieldType = ParquetSchemaType(p.getOriginalType,
            p.getPrimitiveTypeName, p.getTypeLength, p.getDecimalMetadata)))
      // Note that when g is a `Struct`, `g.getOriginalType` is `null`.
      // When g is a `Map`, `g.getOriginalType` is `MAP`.
      // When g is a `List`, `g.getOriginalType` is `LIST`.
      case g: GroupType if g.getOriginalType == null =>
        getPrimitiveFields(g.getFields.asScala.toSeq, parentFieldNames :+ g.getName)
      // Parquet only supports push-down for primitive types; as a result, Map and List types
      // are removed.
      case _ => None
    }
  }
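The recursive traversal pattern quoted above can be shown on a self-contained toy schema ADT. PrimitiveT and StructT are invented here (they are not Parquet's or Spark's types); the point is collecting dotted names of primitive leaves while skipping into structs, like `parentFieldNames :+ name`:

```scala
// Toy schema ADT standing in for Parquet's Type/GroupType/PrimitiveType.
sealed trait ToyType
case class PrimitiveT(name: String) extends ToyType
case class StructT(name: String, fields: Seq[ToyType]) extends ToyType

object PrimitiveFieldCollector {
  // Collect the dotted names of all primitive fields, descending
  // into structs and threading the parent names through the recursion.
  def primitiveNames(fields: Seq[ToyType], parents: Seq[String] = Nil): Seq[String] =
    fields.flatMap {
      case PrimitiveT(n)     => Seq((parents :+ n).mkString("."))
      case StructT(n, inner) => primitiveNames(inner, parents :+ n)
    }

  def main(args: Array[String]): Unit = {
    val schema = Seq(PrimitiveT("a"), StructT("s", Seq(PrimitiveT("b"))))
    println(primitiveNames(schema)) // List(a, s.b)
  }
}
```

The real Parquet code additionally filters out Map and List groups; this sketch omits that to keep the recursion itself visible.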

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in e277ef1 Aug 21, 2020
@HyukjinKwon
Member

late LGTM.

@viirya
Member Author

viirya commented Aug 23, 2020

Master and branch-3.0 both have a few tests failing under the hive-1.2 profile, and this diff missed a change in the hive-1.2 code path, which causes a compilation error. That will make debugging the failed tests harder, so I'd like to revert this first at #29519. cc @cloud-fan @HyukjinKwon

viirya added a commit that referenced this pull request Aug 23, 2020
…se-insensitive analysis"

### What changes were proposed in this pull request?

This reverts commit e277ef1.

### Why are the changes needed?

Master and branch-3.0 both have a few tests failing under the hive-1.2 profile, and PR #29457 missed a change in the hive-1.2 code path, which causes a compilation error. That will make debugging the failed tests harder, so I'd like to revert #29457 first to unblock it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #29519 from viirya/revert-SPARK-32646.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Aug 25, 2020
…wn should work with case-insensitive analysis

### What changes were proposed in this pull request?

This PR proposes to fix ORC predicate pushdown under case-insensitive analysis. With case-insensitive analysis enabled, the field names in pushed-down predicates no longer need to match the physical field names in ORC files in exact letter case.

This re-submits #29457. Because #29457 had a hive-1.2 compilation error, and some tests were failing under the hive-1.2 profile at the same time, #29457 was reverted to unblock others.

### Why are the changes needed?

Currently, ORC predicate pushdown doesn't work with case-insensitive analysis. A predicate "a < 0" cannot be pushed down to an ORC file with field name "A" under case-insensitive analysis.

Parquet predicate pushdown, however, handles this case. We should make ORC predicate pushdown work with case-insensitive analysis too.

### Does this PR introduce _any_ user-facing change?

Yes, after this PR, under case-insensitive analysis, ORC predicate pushdown will work.

### How was this patch tested?

Unit tests.

Closes #29530 from viirya/fix-orc-pushdown.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>