Conversation

@vanzin
Contributor

@vanzin vanzin commented Jul 19, 2016

When Hive (or at least certain versions of Hive) creates parquet files
containing tinyint or smallint columns, it stores them as int32, but
doesn't annotate the parquet field as containing the corresponding
int8 / int16 data. When Spark reads those files using the vectorized
reader, it follows the parquet schema for these fields, but when
actually reading the data it tries to use the type fetched from
the metastore, and then fails because data has been loaded into the
wrong fields in OnHeapColumnVector.

So instead of blindly trusting the parquet schema, check whether the
Catalyst-provided schema disagrees with it, and adjust the types so
that the necessary metadata is present when loading the data into
the ColumnVector instance.

Tested with unit tests and with tests that create byte / short columns
in Hive and try to read them from Spark.

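For readers hitting this, here is a minimal reproduction sketch, assuming a Hive-enabled SparkSession and a table whose Parquet files were written by Hive itself; the table and column names are hypothetical:

```scala
// Hypothetical reproduction sketch for SPARK-16632. Assumes `hive_tiny_tbl`
// was created and populated from Hive, e.g.
//   CREATE TABLE hive_tiny_tbl (b TINYINT, s SMALLINT) STORED AS PARQUET;
//   INSERT INTO hive_tiny_tbl VALUES (1, 2);
// so the Parquet files store both columns as plain int32 with no
// INT_8 / INT_16 annotation.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-16632-repro")
  .enableHiveSupport()
  .config("spark.sql.parquet.enableVectorizedReader", "true")
  .getOrCreate()

// The metastore reports ByteType / ShortType for these columns, but the
// vectorized reader used to follow the unannotated Parquet schema, so the
// values were loaded into the int fields of OnHeapColumnVector and the
// subsequent byte/short reads failed.
spark.table("hive_tiny_tbl").show()
```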
@vanzin
Contributor Author

vanzin commented Jul 19, 2016

@yhuai @liancheng

Not sure if this is the best place for the fix, but the problem is gone with the change. It duplicates some minor logic from ParquetSchemaConverter, but it seems weird to call that class from here since this code has no access to config data.
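To make the approach concrete, here is a hedged sketch of the adjustment described above, not the actual patch: when the Catalyst schema expects ByteType or ShortType but the Parquet field is a bare INT32 with no original type, re-annotate the field so the reader carries the intended logical type (the helper name is mine):

```scala
// Sketch only: re-annotate unadorned INT32 fields based on the Catalyst type.
import org.apache.parquet.schema.{OriginalType, Type, Types}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
import org.apache.spark.sql.types.{ByteType, DataType, ShortType}

def annotateIfNeeded(parquetField: Type, catalystType: DataType): Type = {
  val isBareInt32 =
    parquetField.isPrimitive &&
    parquetField.asPrimitiveType().getPrimitiveTypeName == PrimitiveTypeName.INT32 &&
    parquetField.getOriginalType == null

  (catalystType, isBareInt32) match {
    case (ByteType, true) =>
      Types.primitive(PrimitiveTypeName.INT32, parquetField.getRepetition)
        .as(OriginalType.INT_8).named(parquetField.getName)
    case (ShortType, true) =>
      Types.primitive(PrimitiveTypeName.INT32, parquetField.getRepetition)
        .as(OriginalType.INT_16).named(parquetField.getName)
    case _ => parquetField
  }
}
```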

@SparkQA

SparkQA commented Jul 20, 2016

Test build #62561 has finished for PR 14272 at commit d853ba0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

liancheng commented Jul 20, 2016

This LGTM. It's a little bit hacky, since technically the fields in the requested schema passed to the Parquet record reader may have original types (INT_8 and INT_16) that differ from the actual ones (none) defined in the physical file; fortunately, the Parquet record reader doesn't check original types.
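To illustrate the mismatch being tolerated here, a small sketch with a hypothetical column name, showing a Hive-written file schema next to the annotated schema Spark requests:

```scala
// Sketch of the two schemas involved; column name `b` is hypothetical.
import org.apache.parquet.schema.MessageTypeParser

// Physical schema as written by Hive: no original type on the int32 column.
val fileSchema = MessageTypeParser.parseMessageType(
  "message hive_schema { optional int32 b; }")

// Requested schema built from the Catalyst ByteType: annotated with INT_8.
val requestedSchema = MessageTypeParser.parseMessageType(
  "message spark_schema { optional int32 b (INT_8); }")

// parquet-mr's record reader does not compare original types here, so the
// annotated requested schema is accepted against the unannotated file schema.
```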

@liancheng
Contributor

I'm merging this to master.

@yhuai Do we want this in branch-2.0?

@liancheng
Contributor

I'd like to add that, AFAIK, byte and short are the only problematic types that we didn't handle before this PR. Other Hive-Parquet schema conversion quirks, like string (translated into binary without the UTF8 annotation) and timestamp (translated into the deprecated int96), are already worked around in Spark.
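For reference, a sketch (hypothetical column names) of the kind of unannotated schema Hive may produce for those other cases, which Spark's read path already maps back to StringType and TimestampType:

```scala
// Sketch only: Hive may write strings as bare binary (no UTF8 annotation)
// and timestamps as the deprecated int96 physical type.
import org.apache.parquet.schema.MessageTypeParser

val hiveStyleSchema = MessageTypeParser.parseMessageType(
  """message hive_schema {
    |  optional binary name;
    |  optional int96 ts;
    |}""".stripMargin)
```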

@asfgit asfgit closed this in 75146be Jul 20, 2016
@liancheng
Contributor

After discussing with @yhuai, I'm also merging this to branch-2.0.

@vanzin Thanks for fixing this!

@yhuai
Contributor

yhuai commented Jul 20, 2016

Yea, I think the fix is pretty safe. After discussing with @liancheng, it seems the more general fix is to just use the requested Catalyst schema to initialize the vectorized reader.
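A hedged sketch of that more general idea, not the actual #14278 change; `ColumnVectorLike` and `allocateColumnVector` are hypothetical stand-ins for whatever the vectorized reader uses internally:

```scala
// Hypothetical sketch: initialize the batch from the requested Catalyst
// schema instead of a schema converted back from the Parquet file, so the
// column vectors always match the types Spark will read.
import org.apache.spark.sql.types.{DataType, StructType}

trait ColumnVectorLike // hypothetical stand-in for Spark's writable column vector

// Hypothetical allocator standing in for the reader's internal allocation.
def allocateColumnVector(capacity: Int, dataType: DataType): ColumnVectorLike =
  new ColumnVectorLike {}

def initBatch(requestedCatalystSchema: StructType, capacity: Int): Array[ColumnVectorLike] =
  requestedCatalystSchema.fields.map(f => allocateColumnVector(capacity, f.dataType))
```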

asfgit pushed a commit that referenced this pull request Jul 20, 2016

Author: Marcelo Vanzin <[email protected]>

Closes #14272 from vanzin/SPARK-16632.

(cherry picked from commit 75146be)
Signed-off-by: Cheng Lian <[email protected]>
@liancheng
Contributor

Opened #14278 for the simpler yet more general fix.

asfgit pushed a commit that referenced this pull request Jul 21, 2016
… parquet schema

## What changes were proposed in this pull request?

PR #14278 is a more general and simpler fix for SPARK-16632 than PR #14272. After merging #14278, we no longer need the changes made in #14272, so here I revert them.

This PR targets both master and branch-2.0.

## How was this patch tested?

Existing tests.

Author: Cheng Lian <[email protected]>

Closes #14300 from liancheng/revert-pr-14272.

(cherry picked from commit 69626ad)
Signed-off-by: Cheng Lian <[email protected]>
@vanzin vanzin deleted the SPARK-16632 branch August 2, 2016 18:34