[SPARK-16632][SQL] Use Spark requested schema to guide vectorized Parquet reader initialization #14278

liancheng · 2016-07-20T06:52:15Z

What changes were proposed in this pull request?

In SpecificParquetRecordReaderBase, which is used by the vectorized Parquet reader, we convert the Parquet requested schema into a Spark schema to guide column reader initialization. However, the Parquet requested schema is tailored from the schema of the physical file being scanned, and may have inaccurate type information due to bugs of other systems (e.g. HIVE-14294).

On the other hand, we already set the real Spark requested schema into Hadoop configuration in ParquetFileFormat. This PR simply reads out this schema to replace the converted one.

How was this patch tested?

New test case added in ParquetQuerySuite.

liancheng · 2016-07-20T07:02:33Z

@vanzin Could you please help verify this fix? The reason why #14272 works is that the Parquet requested schema is generated using clipParquetSchema().

liancheng · 2016-07-20T07:03:25Z

Also cc @yhuai.

viirya · 2016-07-20T07:20:55Z

@liancheng I don't think we should use the Spark requested schema for vectorized Parquet reader. It only works for flat schema. We need the converted schema for complex type support, as I do in #14045. That is because we need the repetition and definition level information for each complex type columns. If we directly use Spark requested schema, we can't get the corresponding info.

liancheng · 2016-07-20T07:56:53Z

@viirya The updated schema field in this PR is only used to guide the vectorized reader to interpret basic Parquet types into logical types (e.g. Parquet int32 to Spark ByteType, and Parquet int96 to Spark TimestampType since Hive doesn't use proper Parquet types). We are still using the properly Parquet schema tailored from physical file schema as Parquet requested schema after this change.

liancheng · 2016-07-20T07:58:38Z

@viirya Basically we are mapping the logic in ParquetRowConverter for the non-vectorized Parquet reader. It's just implemented at a lower lever in the case of vectorized reader.

viirya · 2016-07-20T08:06:26Z

@liancheng Yea, you are right! After double-checking, the sparkSchema is override later. Thanks for clarifying it!

SparkQA · 2016-07-20T08:39:55Z

Test build #62583 has finished for PR 14278 at commit 2ade381.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-20T08:49:23Z

Test build #62585 has finished for PR 14278 at commit 3cff4af.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2016-07-20T17:35:24Z

Change LGTM, thanks. Tried our tests and they work.

yhuai · 2016-07-20T23:42:33Z

...java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java

We still have the requestedSchema in parquet's form, which does not contain the correct annotation. It may still be a potential issue when correct annotations in requestedSchema matters?

Actually it's safer when the Parquet requested schema conforms to the actual physical file to be read. Normally, we shouldn't care about logical types (those with annotations) at the level of Parquet record reader. It's the upper level engine's responsibility to convert basic types like int32 into logical types like INT_8 and INT_16. The vectorized reader has to mix them up because we need to construct value vectors of proper types at this level.

liancheng · 2016-07-21T09:14:15Z

Thanks for the review! I'm merging this to master and branch-2.0. Will send PRs to revert #14272 since this one is a more general fix of the same issue.

…quet reader initialization In `SpecificParquetRecordReaderBase`, which is used by the vectorized Parquet reader, we convert the Parquet requested schema into a Spark schema to guide column reader initialization. However, the Parquet requested schema is tailored from the schema of the physical file being scanned, and may have inaccurate type information due to bugs of other systems (e.g. HIVE-14294). On the other hand, we already set the real Spark requested schema into Hadoop configuration in [`ParquetFileFormat`][1]. This PR simply reads out this schema to replace the converted one. New test case added in `ParquetQuerySuite`. [1]: https://github.com/apache/spark/blob/v2.0.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L292-L294 Author: Cheng Lian <[email protected]> Closes #14278 from liancheng/spark-16632-simpler-fix. (cherry picked from commit 8674054) Signed-off-by: Cheng Lian <[email protected]>

SparkQA · 2016-07-21T11:01:08Z

Test build #62672 has finished for PR 14278 at commit d532e5e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

… parquet schema ## What changes were proposed in this pull request? PR #14278 is a more general and simpler fix for SPARK-16632 than PR #14272. After merging #14278, we no longer need changes made in #14272. So here I revert them. This PR targets both master and branch-2.0. ## How was this patch tested? Existing tests. Author: Cheng Lian <[email protected]> Closes #14300 from liancheng/revert-pr-14272. (cherry picked from commit 69626ad) Signed-off-by: Cheng Lian <[email protected]>

… parquet schema ## What changes were proposed in this pull request? PR #14278 is a more general and simpler fix for SPARK-16632 than PR #14272. After merging #14278, we no longer need changes made in #14272. So here I revert them. This PR targets both master and branch-2.0. ## How was this patch tested? Existing tests. Author: Cheng Lian <[email protected]> Closes #14300 from liancheng/revert-pr-14272.

liancheng mentioned this pull request Jul 20, 2016

[SPARK-16632][sql] Respect Hive schema when merging parquet schema. #14272

Closed

yhuai reviewed Jul 20, 2016
View reviewed changes

liancheng added 2 commits July 21, 2016 17:03

Fixes SPARK-16632

f1e487b

More comments

d532e5e

liancheng force-pushed the spark-16632-simpler-fix branch from 3cff4af to d532e5e Compare July 21, 2016 09:03

asfgit closed this in 8674054 Jul 21, 2016

liancheng deleted the spark-16632-simpler-fix branch July 21, 2016 09:21

liancheng mentioned this pull request Jul 21, 2016

[SPARK-16632][SQL] Revert PR #14272: Respect Hive schema when merging parquet schema #14300

Closed

HyukjinKwon mentioned this pull request Jul 27, 2016

[SPARK-16472][SQL] Force user specified schema to the nullable one #14124

Closed

[SPARK-16632][SQL] Use Spark requested schema to guide vectorized Parquet reader initialization #14278

[SPARK-16632][SQL] Use Spark requested schema to guide vectorized Parquet reader initialization #14278

Uh oh!

Conversation

liancheng commented Jul 20, 2016

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

liancheng commented Jul 20, 2016

Uh oh!

liancheng commented Jul 20, 2016

Uh oh!

viirya commented Jul 20, 2016

Uh oh!

liancheng commented Jul 20, 2016

Uh oh!

liancheng commented Jul 20, 2016

Uh oh!

viirya commented Jul 20, 2016

Uh oh!

SparkQA commented Jul 20, 2016

Uh oh!

SparkQA commented Jul 20, 2016

Uh oh!

vanzin commented Jul 20, 2016

Uh oh!

yhuai Jul 20, 2016

Choose a reason for hiding this comment

Uh oh!

liancheng Jul 21, 2016

Choose a reason for hiding this comment

Uh oh!

liancheng commented Jul 21, 2016

Uh oh!

SparkQA commented Jul 21, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants