[SPARK-25132][SQL] Case-insensitive field resolution when reading from Parquet/ORC #22142

seancxmao · 2018-08-19T16:30:55Z

What changes were proposed in this pull request?

Spark SQL returns NULL for a column whose name differs in letter case between the Hive metastore schema and the Parquet schema, regardless of whether spark.sql.caseSensitive is set to true or false. This applies not only to Parquet but also to ORC. The following is a brief summary:

* ParquetFileFormat doesn't support case-insensitive field resolution.
* native OrcFileFormat supports case-insensitive field resolution; however, it cannot handle duplicate fields.
* hive OrcFileFormat doesn't support case-insensitive field resolution.

#15799 reverted case-insensitive resolution for ParquetFileFormat and hive OrcFileFormat. This PR brings it back and improves it: case-insensitive resolution is performed only if Spark is in case-insensitive mode, and field resolution fails if there is ambiguity, i.e. more than one field matches. ParquetFileFormat, native OrcFileFormat, and hive OrcFileFormat are all supported.
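The resolution rule described above can be illustrated with a minimal, Spark-independent sketch. This is not the code from this PR: `resolve_field`, `AmbiguousFieldError`, and the `case_sensitive` parameter are hypothetical names chosen for illustration, and the sketch assumes field names are plain strings compared after lowercasing.

```python
from typing import Optional, Sequence


class AmbiguousFieldError(Exception):
    """Hypothetical error: more than one file field matches the requested name."""


def resolve_field(
    requested: str, file_fields: Sequence[str], case_sensitive: bool
) -> Optional[str]:
    """Return the file field that `requested` resolves to, or None if absent.

    Illustrative only; the names and signature are assumptions, not Spark's API.
    """
    if case_sensitive:
        # Case-sensitive mode: only an exact name match counts.
        return requested if requested in file_fields else None

    # Case-insensitive mode: compare names after lowercasing both sides.
    matches = [f for f in file_fields if f.lower() == requested.lower()]
    if len(matches) > 1:
        # Ambiguity: fail instead of silently picking one of the candidates.
        raise AmbiguousFieldError(
            f"Found duplicate field(s) {requested!r}: {matches} in case-insensitive mode"
        )
    return matches[0] if matches else None
```

Under these assumptions, a metastore column `a` resolves to the file column `A` only in case-insensitive mode, and a file containing both `A` and `a` is rejected in that mode rather than resolved arbitrarily. A missing field (`None`) corresponds to the NULL column values described in the summary.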

How was this patch tested?

Unit tests added.

…uet/ORC
* Fix ParquetFileFormat
* More than one Parquet column is matched
* Fix OrcFileFormat (both native and hive implementations)
* Fix issues according to review results: refactor test cases, code style, ...
* Test cases: change parquet/orc file schema from a to A
* Test cases: let different columns have different value series
* Refine error message
* Split multi-format test suite
* Simplify test cases for ambiguous resolution
* Simplify test cases to reduce code lines
* Refine tests and comments

AmplabJenkins · 2018-08-19T16:32:48Z

Can one of the admins verify this patch?

seancxmao · 2018-08-20T03:22:35Z

Split this into two PRs, one for Parquet and one for ORC.

seancxmao changed the title ~~[SPARK-25132][SQL] case-insensitive field resolution when reading from Parquet/ORC~~ [SPARK-25132][SQL] Case-insensitive field resolution when reading from Parquet/ORC Aug 20, 2018

seancxmao closed this Aug 20, 2018

seancxmao deleted the SPARK-25132 branch August 22, 2018 09:44
