Conversation

@viirya
Member

@viirya viirya commented Aug 18, 2020

What changes were proposed in this pull request?

This PR proposes to fix ORC predicate pushdown under case-insensitive analysis. With case-insensitive analysis enabled, the field names in pushed-down predicates no longer need to match the physical field names in ORC files in exact letter case.

Why are the changes needed?

Currently, ORC predicate pushdown doesn't work with case-insensitive analysis. A predicate "a < 0" cannot be pushed down to an ORC file with field name "A" under case-insensitive analysis.

Parquet predicate pushdown, however, handles this case. We should make ORC predicate pushdown work with case-insensitive analysis too.
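The name-matching problem described above can be sketched outside of Spark as a toy resolution step. All names here are invented for illustration; this is not Spark's actual resolver, and `caseSensitive` merely mirrors the role of the `spark.sql.caseSensitive` setting:

```scala
object CaseInsensitiveResolution {
  // Physical ORC field names, as they appear in the file schema.
  val physicalFields: Seq[String] = Seq("A", "B")

  // Resolve a predicate's attribute name against the physical schema.
  def resolve(attr: String, caseSensitive: Boolean): Option[String] =
    if (caseSensitive) physicalFields.find(_ == attr)
    else physicalFields.find(_.equalsIgnoreCase(attr))

  def main(args: Array[String]): Unit = {
    // With exact-case matching, "a" does not match "A": no pushdown.
    println(resolve("a", caseSensitive = true))   // None
    // With case-insensitive matching, "a" resolves to the physical name "A".
    println(resolve("a", caseSensitive = false))  // Some(A)
  }
}
```

Before this PR, the ORC side effectively behaved like the exact-case branch even when analysis was case-insensitive.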

Does this PR introduce any user-facing change?

Yes, after this PR, under case-insensitive analysis, ORC predicate pushdown will work.

How was this patch tested?

Unit tests.

@SparkQA

SparkQA commented Aug 18, 2020

Test build #127527 has finished for PR 29457 at commit 2f010de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya marked this pull request as draft August 18, 2020 05:58
@SparkQA

SparkQA commented Aug 18, 2020

Test build #127545 has finished for PR 29457 at commit 090747d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Aug 18, 2020

retest this please

@viirya viirya marked this pull request as ready for review August 18, 2020 18:35
@viirya
Member Author

viirya commented Aug 18, 2020

@dongjoon-hyun
Member

Thank you, @viirya .

@SparkQA

SparkQA commented Aug 18, 2020

Test build #127588 has finished for PR 29457 at commit 090747d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    // ORC predicate pushdown
    if (orcFilterPushDown) {
      OrcUtils.readCatalystSchema(filePath, conf, ignoreCorruptFiles).map { fileSchema =>
        OrcFilters.createFilter(fileSchema, filters).foreach { f =>
Contributor

AFAIK the input schema is only used to build the dataTypeMap. Can we directly pass the physical ORC schema here? Then we don't need to convert the ORC schema to a Catalyst schema, which would be consistent with the Parquet side.

Member Author

Our ORC pushdown code, e.g., buildLeafSearchArgument, uses the Catalyst schema in many places. I think it is still possible to use ORC's schema there and remove the Catalyst schema, but that looks like a big change. If we want to do it, I'd suggest doing it in another PR; this diff is already not small.

-      Some(((parentFieldNames :+ f.name).quoted, f.dataType))
+      val fieldName = (parentFieldNames :+ f.name).quoted
+      val orcField = OrcPrimitiveField(fieldName, f.dataType)
+      Some((fieldName, orcField))
Contributor

why do we need the OrcPrimitiveField if fieldName is always orcField.fieldName here?

Member Author

Because we look up the attribute name of pushed-down predicates in the map (dataTypeMap(name)), we need an easy way to retrieve the file schema's field name. If we simply used Map[String, DataType], the attribute name matched under case-insensitive analysis could be in a different letter case from the file schema's field name.

E.g., if the ORC file has an "ABC" field and the pushed-down predicate is "abc > 0", dataTypeMap.contains("abc") is true and we can get dataTypeMap("abc") back. But we need the file schema's field name "ABC".
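The idea above can be sketched with a toy map. This is a simplified illustration, not Spark's actual OrcPrimitiveField or the real dataTypeMap construction; keying by lowercased name here just stands in for case-insensitive lookup:

```scala
// Toy stand-in for the PR's OrcPrimitiveField: the value keeps the
// physical field name even when the lookup key is normalized.
case class ToyOrcField(fieldName: String, dataType: String)

object DataTypeMapSketch {
  // File schema has field "ABC"; the map is keyed case-insensitively
  // (lowercased here), but the value retains the original name.
  val dataTypeMap: Map[String, ToyOrcField] =
    Map("abc" -> ToyOrcField("ABC", "int"))

  def main(args: Array[String]): Unit = {
    // The predicate's attribute "abc" is found in the map...
    val field = dataTypeMap("abc")
    // ...and the exact physical name is still recoverable for
    // building the ORC SearchArgument.
    println(field.fieldName) // ABC
  }
}
```

With a plain Map[String, DataType] the physical name "ABC" would be lost once "abc" was used as the key, which is exactly the problem the wrapper solves.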

It is similar to Parquet:

private val nameToParquetField: Map[String, ParquetPrimitiveField] = {
  // Recursively traverse the parquet schema to get primitive fields that can be pushed-down.
  // `parentFieldNames` is used to keep track of the current nested level when traversing.
  def getPrimitiveFields(
      fields: Seq[Type],
      parentFieldNames: Array[String] = Array.empty): Seq[ParquetPrimitiveField] = {
    fields.flatMap {
      case p: PrimitiveType =>
        Some(ParquetPrimitiveField(fieldNames = parentFieldNames :+ p.getName,
          fieldType = ParquetSchemaType(p.getOriginalType,
            p.getPrimitiveTypeName, p.getTypeLength, p.getDecimalMetadata)))
      // Note that when g is a `Struct`, `g.getOriginalType` is `null`.
      // When g is a `Map`, `g.getOriginalType` is `MAP`.
      // When g is a `List`, `g.getOriginalType` is `LIST`.
      case g: GroupType if g.getOriginalType == null =>
        getPrimitiveFields(g.getFields.asScala.toSeq, parentFieldNames :+ g.getName)
      // Parquet only supports push-down for primitive types; as a result, Map and List types
      // are removed.
      case _ => None
    }
  }
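The recursive traversal pattern quoted above can be shown on a self-contained toy schema ADT. PrimitiveT and StructT are invented here (they are not Parquet's or Spark's types); the point is collecting dotted names of primitive leaves while skipping into structs, like `parentFieldNames :+ name`:

```scala
// Toy schema ADT standing in for Parquet's Type/GroupType/PrimitiveType.
sealed trait ToyType
case class PrimitiveT(name: String) extends ToyType
case class StructT(name: String, fields: Seq[ToyType]) extends ToyType

object PrimitiveFieldCollector {
  // Collect the dotted names of all primitive fields, descending
  // into structs and threading the parent names through the recursion.
  def primitiveNames(fields: Seq[ToyType], parents: Seq[String] = Nil): Seq[String] =
    fields.flatMap {
      case PrimitiveT(n)     => Seq((parents :+ n).mkString("."))
      case StructT(n, inner) => primitiveNames(inner, parents :+ n)
    }

  def main(args: Array[String]): Unit = {
    val schema = Seq(PrimitiveT("a"), StructT("s", Seq(PrimitiveT("b"))))
    println(primitiveNames(schema)) // List(a, s.b)
  }
}
```

The real Parquet code additionally filters out Map and List groups; this sketch omits that to keep the recursion itself visible.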

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in e277ef1 Aug 21, 2020
@HyukjinKwon
Member

late LGTM.

@viirya
Member Author

viirya commented Aug 23, 2020

Master and branch-3.0 both have a few tests failing under the hive-1.2 profile, and this diff missed a change in the hive-1.2 code path, which causes a compilation error. That will make debugging the failed tests harder, so I'd like to revert this first at #29519. cc @cloud-fan @HyukjinKwon

viirya added a commit that referenced this pull request Aug 23, 2020
…se-insensitive analysis"

### What changes were proposed in this pull request?

This reverts commit e277ef1.

### Why are the changes needed?

Master and branch-3.0 both have a few tests failing under the hive-1.2 profile, and PR #29457 missed a change in the hive-1.2 code path, which causes a compilation error. That will make debugging the failed tests harder, so I'd like to revert #29457 first to unblock it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #29519 from viirya/revert-SPARK-32646.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Aug 25, 2020
…wn should work with case-insensitive analysis

### What changes were proposed in this pull request?

This PR proposes to fix ORC predicate pushdown under case-insensitive analysis. With case-insensitive analysis enabled, the field names in pushed-down predicates no longer need to match the physical field names in ORC files in exact letter case.

This re-submits #29457. Because #29457 had a hive-1.2 compilation error, and some tests were failing under the hive-1.2 profile at the same time, #29457 was reverted to unblock others.

### Why are the changes needed?

Currently, ORC predicate pushdown doesn't work with case-insensitive analysis. A predicate "a < 0" cannot be pushed down to an ORC file with field name "A" under case-insensitive analysis.

Parquet predicate pushdown, however, handles this case. We should make ORC predicate pushdown work with case-insensitive analysis too.

### Does this PR introduce _any_ user-facing change?

Yes, after this PR, under case-insensitive analysis, ORC predicate pushdown will work.

### How was this patch tested?

Unit tests.

Closes #29530 from viirya/fix-orc-pushdown.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>