Skip to content

Conversation

@liancheng
Copy link
Contributor

What changes were proposed in this pull request?

In SpecificParquetRecordReaderBase, which is used by the vectorized Parquet reader, we convert the Parquet requested schema into a Spark schema to guide column reader initialization. However, the Parquet requested schema is tailored from the schema of the physical file being scanned, and may have inaccurate type information due to bugs of other systems (e.g. HIVE-14294).

On the other hand, we already set the real Spark requested schema into Hadoop configuration in ParquetFileFormat. This PR simply reads out this schema to replace the converted one.

How was this patch tested?

New test case added in ParquetQuerySuite.

@liancheng
Copy link
Contributor Author

@vanzin Could you please help verify this fix? The reason why #14272 works is that the Parquet requested schema is generated using clipParquetSchema().

@liancheng
Copy link
Contributor Author

Also cc @yhuai.

@viirya
Copy link
Member

viirya commented Jul 20, 2016

@liancheng I don't think we should use the Spark requested schema for vectorized Parquet reader. It only works for flat schema. We need the converted schema for complex type support, as I do in #14045. That is because we need the repetition and definition level information for each complex type columns. If we directly use Spark requested schema, we can't get the corresponding info.

@liancheng
Copy link
Contributor Author

@viirya The updated schema field in this PR is only used to guide the vectorized reader to interpret basic Parquet types into logical types (e.g. Parquet int32 to Spark ByteType, and Parquet int96 to Spark TimestampType since Hive doesn't use proper Parquet types). We are still using the properly Parquet schema tailored from physical file schema as Parquet requested schema after this change.

@liancheng
Copy link
Contributor Author

@viirya Basically we are mapping the logic in ParquetRowConverter for the non-vectorized Parquet reader. It's just implemented at a lower lever in the case of vectorized reader.

@viirya
Copy link
Member

viirya commented Jul 20, 2016

@liancheng Yea, you are right! After double-checking, the sparkSchema is override later. Thanks for clarifying it!

@SparkQA
Copy link

SparkQA commented Jul 20, 2016

Test build #62583 has finished for PR 14278 at commit 2ade381.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 20, 2016

Test build #62585 has finished for PR 14278 at commit 3cff4af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor

vanzin commented Jul 20, 2016

Change LGTM, thanks. Tried our tests and they work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still have the requestedSchema in parquet's form, which does not contain the correct annotation. It may still be a potential issue when correct annotations in requestedSchema matters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it's safer when the Parquet requested schema conforms to the actual physical file to be read. Normally, we shouldn't care about logical types (those with annotations) at the level of Parquet record reader. It's the upper level engine's responsibility to convert basic types like int32 into logical types like INT_8 and INT_16. The vectorized reader has to mix them up because we need to construct value vectors of proper types at this level.

@liancheng liancheng force-pushed the spark-16632-simpler-fix branch from 3cff4af to d532e5e Compare July 21, 2016 09:03
@liancheng
Copy link
Contributor Author

Thanks for the review! I'm merging this to master and branch-2.0. Will send PRs to revert #14272 since this one is a more general fix of the same issue.

@asfgit asfgit closed this in 8674054 Jul 21, 2016
asfgit pushed a commit that referenced this pull request Jul 21, 2016
…quet reader initialization

In `SpecificParquetRecordReaderBase`, which is used by the vectorized Parquet reader, we convert the Parquet requested schema into a Spark schema to guide column reader initialization. However, the Parquet requested schema is tailored from the schema of the physical file being scanned, and may have inaccurate type information due to bugs of other systems (e.g. HIVE-14294).

On the other hand, we already set the real Spark requested schema into Hadoop configuration in [`ParquetFileFormat`][1]. This PR simply reads out this schema to replace the converted one.

New test case added in `ParquetQuerySuite`.

[1]: https://github.com/apache/spark/blob/v2.0.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L292-L294

Author: Cheng Lian <[email protected]>

Closes #14278 from liancheng/spark-16632-simpler-fix.

(cherry picked from commit 8674054)
Signed-off-by: Cheng Lian <[email protected]>
@liancheng liancheng deleted the spark-16632-simpler-fix branch July 21, 2016 09:21
@SparkQA
Copy link

SparkQA commented Jul 21, 2016

Test build #62672 has finished for PR 14278 at commit d532e5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jul 21, 2016
… parquet schema

## What changes were proposed in this pull request?

PR #14278 is a more general and simpler fix for SPARK-16632 than PR #14272. After merging #14278, we no longer need changes made in #14272. So here I revert them.

This PR targets both master and branch-2.0.

## How was this patch tested?

Existing tests.

Author: Cheng Lian <[email protected]>

Closes #14300 from liancheng/revert-pr-14272.

(cherry picked from commit 69626ad)
Signed-off-by: Cheng Lian <[email protected]>
asfgit pushed a commit that referenced this pull request Jul 21, 2016
… parquet schema

## What changes were proposed in this pull request?

PR #14278 is a more general and simpler fix for SPARK-16632 than PR #14272. After merging #14278, we no longer need changes made in #14272. So here I revert them.

This PR targets both master and branch-2.0.

## How was this patch tested?

Existing tests.

Author: Cheng Lian <[email protected]>

Closes #14300 from liancheng/revert-pr-14272.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants