-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-32646][SQL] ORC predicate pushdown should work with case-insensitive analysis #29457
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #127527 has finished for PR 29457 at commit
|
|
Test build #127545 has finished for PR 29457 at commit
|
|
retest this please |
|
Thank you, @viirya . |
|
Test build #127588 has finished for PR 29457 at commit
|
| // ORC predicate pushdown | ||
| if (orcFilterPushDown) { | ||
| OrcUtils.readCatalystSchema(filePath, conf, ignoreCorruptFiles).map { fileSchema => | ||
| OrcFilters.createFilter(fileSchema, filters).foreach { f => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK the input schema is only used to build the dataTypeMap, can we directly pass the physical ORC schema here? Then we don't need to convert the ORC schema to catalyst schema which is consistent with the parquet side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our ORC pushdown code, e.g., buildLeafSearchArgument, uses catalyst schema in many places. I think it is still possible to use ORC's schema there and remove catalyst schema, but it looks like a big change. If we want to do it, I'd suggest to do it in another PR. This diff is already not small now.
| Some(((parentFieldNames :+ f.name).quoted, f.dataType)) | ||
| val fieldName = (parentFieldNames :+ f.name).quoted | ||
| val orcField = OrcPrimitiveField(fieldName, f.dataType) | ||
| Some((fieldName, orcField)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need the OrcPrimitiveField if fieldName is always orcField.fieldName here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we look up attribute name of pushed down predicates in the map dataTypeMap(name), we need to easily retrieve the file schema's field name. If we simply use Map[String, DataType], the matched attribute name under case-insensitive analysis could be possibly in different letter case to the file schema's field name.
E,g, If ORC file has 'ABCfield, pushed down predicate is "abc > 0",dataTypeMap.contain("abc")is true and we can getdataTypeMap("abc")` back. But we need file schema's field name "ABC".
It is similar to Parquet:
Lines 56 to 76 in d6a68e0
| private val nameToParquetField : Map[String, ParquetPrimitiveField] = { | |
| // Recursively traverse the parquet schema to get primitive fields that can be pushed-down. | |
| // `parentFieldNames` is used to keep track of the current nested level when traversing. | |
| def getPrimitiveFields( | |
| fields: Seq[Type], | |
| parentFieldNames: Array[String] = Array.empty): Seq[ParquetPrimitiveField] = { | |
| fields.flatMap { | |
| case p: PrimitiveType => | |
| Some(ParquetPrimitiveField(fieldNames = parentFieldNames :+ p.getName, | |
| fieldType = ParquetSchemaType(p.getOriginalType, | |
| p.getPrimitiveTypeName, p.getTypeLength, p.getDecimalMetadata))) | |
| // Note that when g is a `Struct`, `g.getOriginalType` is `null`. | |
| // When g is a `Map`, `g.getOriginalType` is `MAP`. | |
| // When g is a `List`, `g.getOriginalType` is `LIST`. | |
| case g: GroupType if g.getOriginalType == null => | |
| getPrimitiveFields(g.getFields.asScala.toSeq, parentFieldNames :+ g.getName) | |
| // Parquet only supports push-down for primitive types; as a result, Map and List types | |
| // are removed. | |
| case _ => None | |
| } | |
| } |
|
thanks, merging to master! |
|
late LGTM. |
|
Because master and branch-3.0 both have few tests failed under hive-1.2 profile. And this diff missed a change in hive-1.2 code that causes compilation error. So it will make debugging the failed tests harder. I'd like revert this first at #29519. cc @cloud-fan @HyukjinKwon |
…se-insensitive analysis" ### What changes were proposed in this pull request? This reverts commit e277ef1. ### Why are the changes needed? Because master and branch-3.0 both have few tests failed under hive-1.2 profile. And the PR #29457 missed a change in hive-1.2 code that causes compilation error. So it will make debugging the failed tests harder. I'd like revert #29457 first to unblock it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #29519 from viirya/revert-SPARK-32646. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Liang-Chi Hsieh <[email protected]>
…wn should work with case-insensitive analysis ### What changes were proposed in this pull request? This PR proposes to fix ORC predicate pushdown under case-insensitive analysis case. The field names in pushed down predicates don't need to match in exact letter case with physical field names in ORC files, if we enable case-insensitive analysis. This is re-submitted for #29457. Because #29457 has a hive-1.2 error and there were some tests failed with hive-1.2 profile at the same time, #29457 was reverted to unblock others. ### Why are the changes needed? Currently ORC predicate pushdown doesn't work with case-insensitive analysis. A predicate "a < 0" cannot pushdown to ORC file with field name "A" under case-insensitive analysis. But Parquet predicate pushdown works with this case. We should make ORC predicate pushdown work with case-insensitive analysis too. ### Does this PR introduce _any_ user-facing change? Yes, after this PR, under case-insensitive analysis, ORC predicate pushdown will work. ### How was this patch tested? Unit tests. Closes #29530 from viirya/fix-orc-pushdown. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
What changes were proposed in this pull request?
This PR proposes to fix ORC predicate pushdown under case-insensitive analysis case. The field names in pushed down predicates don't need to match in exact letter case with physical field names in ORC files, if we enable case-insensitive analysis.
Why are the changes needed?
Currently ORC predicate pushdown doesn't work with case-insensitive analysis. A predicate "a < 0" cannot pushdown to ORC file with field name "A" under case-insensitive analysis.
But Parquet predicate pushdown works with this case. We should make ORC predicate pushdown work with case-insensitive analysis too.
Does this PR introduce any user-facing change?
Yes, after this PR, under case-insensitive analysis, ORC predicate pushdown will work.
How was this patch tested?
Unit tests.