[SPARK-16632][SQL] Use Spark requested schema to guide vectorized Parquet reader initialization #14278
|
Also cc @yhuai. |
|
@liancheng I don't think we should use the Spark requested schema for the vectorized Parquet reader. It only works for flat schemas. We need the converted schema for complex type support, as I do in #14045, because we need the repetition and definition level information for each complex-type column. If we use the Spark requested schema directly, we can't get that information. |
|
@viirya The updated schema field in this PR is only used to guide the vectorized reader to interpret basic Parquet types into logical types (e.g. Parquet |
|
@viirya Basically we are mapping the logic in |
|
@liancheng Yea, you are right! After double-checking, the |
|
Test build #62583 has finished for PR 14278 at commit
|
|
Test build #62585 has finished for PR 14278 at commit
|
|
Change LGTM, thanks. Tried our tests and they work. |
We still have the requestedSchema in Parquet's form, which does not contain the correct annotations. Could that still be an issue in cases where correct annotations in requestedSchema matter?
Actually it's safer when the Parquet requested schema conforms to the actual physical file to be read. Normally, we shouldn't care about logical types (those with annotations) at the level of Parquet record reader. It's the upper level engine's responsibility to convert basic types like int32 into logical types like INT_8 and INT_16. The vectorized reader has to mix them up because we need to construct value vectors of proper types at this level.
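The division of labour described in this comment can be sketched roughly as follows. This is a minimal, self-contained illustration and not Spark's actual code: the `LogicalType` enum and the `decode` helper are hypothetical names introduced here. It shows a value stored physically as a Parquet int32 being narrowed according to the logical type the engine asked for, which is the step the vectorized reader has to perform itself because it builds typed value vectors.

```java
class LogicalTypeSketch {
    // Hypothetical logical types an engine might request for a physical int32 column.
    enum LogicalType { INT_8, INT_16, INT_32 }

    // Interpret a physical int32 value as the requested logical type.
    // The narrowing casts stand in for filling a byte or short vector
    // instead of an int vector.
    static Object decode(int physical, LogicalType requested) {
        switch (requested) {
            case INT_8:
                return (byte) physical;   // boxed as Byte
            case INT_16:
                return (short) physical;  // boxed as Short
            default:
                return physical;          // boxed as Integer
        }
    }
}
```

If the requested logical type were derived from a file whose annotations are missing, every such column would fall into the default branch, which is why the source of the requested type matters.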
Force-pushed from 3cff4af to d532e5e.
|
Thanks for the review! I'm merging this to master and branch-2.0. Will send PRs to revert #14272 since this one is a more general fix of the same issue. |
…quet reader initialization

In `SpecificParquetRecordReaderBase`, which is used by the vectorized Parquet reader, we convert the Parquet requested schema into a Spark schema to guide column reader initialization. However, the Parquet requested schema is tailored from the schema of the physical file being scanned, and may have inaccurate type information due to bugs of other systems (e.g. HIVE-14294).

On the other hand, we already set the real Spark requested schema into Hadoop configuration in [`ParquetFileFormat`][1]. This PR simply reads out this schema to replace the converted one.

New test case added in `ParquetQuerySuite`.

[1]: https://github.com/apache/spark/blob/v2.0.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L292-L294

Author: Cheng Lian <[email protected]>

Closes #14278 from liancheng/spark-16632-simpler-fix.

(cherry picked from commit 8674054)
Signed-off-by: Cheng Lian <[email protected]>
|
Test build #62672 has finished for PR 14278 at commit
|
… parquet schema

## What changes were proposed in this pull request?

PR #14278 is a more general and simpler fix for SPARK-16632 than PR #14272. After merging #14278, we no longer need changes made in #14272. So here I revert them.

This PR targets both master and branch-2.0.

## How was this patch tested?

Existing tests.

Author: Cheng Lian <[email protected]>

Closes #14300 from liancheng/revert-pr-14272.

(cherry picked from commit 69626ad)
Signed-off-by: Cheng Lian <[email protected]>
What changes were proposed in this pull request?

In `SpecificParquetRecordReaderBase`, which is used by the vectorized Parquet reader, we convert the Parquet requested schema into a Spark schema to guide column reader initialization. However, the Parquet requested schema is tailored from the schema of the physical file being scanned, and may have inaccurate type information due to bugs of other systems (e.g. HIVE-14294).

On the other hand, we already set the real Spark requested schema into Hadoop configuration in `ParquetFileFormat`. This PR simply reads out this schema to replace the converted one.

How was this patch tested?

New test case added in `ParquetQuerySuite`.
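
The idea behind the change described above can be sketched as below. This is only an illustration under stated assumptions, not the code in this PR: a plain `Map` stands in for the Hadoop configuration, the key `REQUESTED_SCHEMA_KEY` and the `resolveSchema` helper are hypothetical names, and schemas are represented as simple strings rather than Spark schema objects. The fallback to the converted schema is an addition of this sketch; the PR description only says the stored schema replaces the converted one.

```java
import java.util.Map;

class RequestedSchemaSketch {
    // Hypothetical configuration key; the real key Spark uses is not shown in this thread.
    static final String REQUESTED_SCHEMA_KEY = "example.requested.schema";

    // Prefer the schema the engine stored in the configuration. Fall back to the
    // schema converted from the file's Parquet schema only when none was stored
    // (the fallback is an assumption of this sketch).
    static String resolveSchema(Map<String, String> conf, String convertedFromFile) {
        String requested = conf.get(REQUESTED_SCHEMA_KEY);
        return requested != null ? requested : convertedFromFile;
    }
}
```

With this shape, a file whose own schema carries inaccurate type information no longer influences how column readers are initialized, as long as the engine has stored the schema it actually wants.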