Conversation

@vanzin
Contributor

@vanzin vanzin commented Jul 19, 2016

When Hive (or at least certain versions of Hive) creates parquet files
containing tinyint or smallint columns, it stores them as int32, but
doesn't annotate the parquet field as containing the corresponding
int8 / int16 data. When Spark reads those files using the vectorized
reader, it follows the parquet schema for these fields, but when
actually reading the data it tries to use the type fetched from
the metastore, and then fails because data has been loaded into the
wrong fields in OnHeapColumnVector.

So instead of blindly trusting the parquet schema, check whether the
Catalyst-provided schema disagrees with it, and adjust the types so
that the necessary metadata is present when loading the data into
the ColumnVector instance.

Tested with unit tests and with tests that create byte / short columns
in Hive and try to read them from Spark.

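For readers hitting this, here is a minimal reproduction sketch, assuming a Hive-enabled SparkSession and a table whose Parquet files were written by Hive itself; the table and column names are hypothetical:

```scala
// Hypothetical reproduction sketch for SPARK-16632. Assumes `hive_tiny_tbl`
// was created and populated from Hive, e.g.
//   CREATE TABLE hive_tiny_tbl (b TINYINT, s SMALLINT) STORED AS PARQUET;
//   INSERT INTO hive_tiny_tbl VALUES (1, 2);
// so the Parquet files store both columns as plain int32 with no
// INT_8 / INT_16 annotation.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-16632-repro")
  .enableHiveSupport()
  .config("spark.sql.parquet.enableVectorizedReader", "true")
  .getOrCreate()

// The metastore reports ByteType / ShortType for these columns, but the
// vectorized reader used to follow the unannotated Parquet schema, so the
// values were loaded into the int fields of OnHeapColumnVector and the
// subsequent byte/short reads failed.
spark.table("hive_tiny_tbl").show()
```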
@vanzin
Contributor Author

vanzin commented Jul 19, 2016

@yhuai @liancheng

Not sure if this is the best place for the fix, but the problem is gone with the change. It duplicates some minor logic from ParquetSchemaConverter, but it seems weird to call that class from here since this code has no access to config data.
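To make the approach concrete, here is a hedged sketch of the adjustment described above, not the actual patch: when the Catalyst schema expects ByteType or ShortType but the Parquet field is a bare INT32 with no original type, re-annotate the field so the reader carries the intended logical type (the helper name is mine):

```scala
// Sketch only: re-annotate unadorned INT32 fields based on the Catalyst type.
import org.apache.parquet.schema.{OriginalType, Type, Types}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
import org.apache.spark.sql.types.{ByteType, DataType, ShortType}

def annotateIfNeeded(parquetField: Type, catalystType: DataType): Type = {
  val isBareInt32 =
    parquetField.isPrimitive &&
    parquetField.asPrimitiveType().getPrimitiveTypeName == PrimitiveTypeName.INT32 &&
    parquetField.getOriginalType == null

  (catalystType, isBareInt32) match {
    case (ByteType, true) =>
      Types.primitive(PrimitiveTypeName.INT32, parquetField.getRepetition)
        .as(OriginalType.INT_8).named(parquetField.getName)
    case (ShortType, true) =>
      Types.primitive(PrimitiveTypeName.INT32, parquetField.getRepetition)
        .as(OriginalType.INT_16).named(parquetField.getName)
    case _ => parquetField
  }
}
```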

@SparkQA

SparkQA commented Jul 20, 2016

Test build #62561 has finished for PR 14272 at commit d853ba0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

liancheng commented Jul 20, 2016

This LGTM. It's a little bit hacky, since technically the fields in the requested schema passed to the Parquet record reader may have original types (INT_8 and INT_16) that differ from the actual ones (none) defined in the physical file; fortunately, the Parquet record reader doesn't check original types.
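To illustrate the mismatch being tolerated here, a small sketch with a hypothetical column name, showing a Hive-written file schema next to the annotated schema Spark requests:

```scala
// Sketch of the two schemas involved; column name `b` is hypothetical.
import org.apache.parquet.schema.MessageTypeParser

// Physical schema as written by Hive: no original type on the int32 column.
val fileSchema = MessageTypeParser.parseMessageType(
  "message hive_schema { optional int32 b; }")

// Requested schema built from the Catalyst ByteType: annotated with INT_8.
val requestedSchema = MessageTypeParser.parseMessageType(
  "message spark_schema { optional int32 b (INT_8); }")

// parquet-mr's record reader does not compare original types here, so the
// annotated requested schema is accepted against the unannotated file schema.
```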

@liancheng
Contributor

I'm merging this to master.

@yhuai Do we want this in branch-2.0?

@liancheng
Contributor

I'd like to add that, AFAIK, byte and short are the only problematic types that we didn't handle before this PR. Other Hive-Parquet schema conversion quirks, like string (translated into binary without the UTF8 annotation) and timestamp (translated into the deprecated int96), are already worked around in Spark.
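For reference, a sketch (hypothetical column names) of the kind of unannotated schema Hive may produce for those other cases, which Spark's read path already maps back to StringType and TimestampType:

```scala
// Sketch only: Hive may write strings as bare binary (no UTF8 annotation)
// and timestamps as the deprecated int96 physical type.
import org.apache.parquet.schema.MessageTypeParser

val hiveStyleSchema = MessageTypeParser.parseMessageType(
  """message hive_schema {
    |  optional binary name;
    |  optional int96 ts;
    |}""".stripMargin)
```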

@asfgit asfgit closed this in 75146be Jul 20, 2016
@liancheng
Contributor

After discussing with @yhuai, I'm also merging this to branch-2.0.

@vanzin Thanks for fixing this!

@yhuai
Contributor

yhuai commented Jul 20, 2016

Yea, I think the fix is pretty safe. After discussing with @liancheng, it seems the more general fix is to just use the requested Catalyst schema to initialize the vectorized reader.
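A hedged sketch of that more general idea, not the actual #14278 change; `ColumnVectorLike` and `allocateColumnVector` are hypothetical stand-ins for whatever the vectorized reader uses internally:

```scala
// Hypothetical sketch: initialize the batch from the requested Catalyst
// schema instead of a schema converted back from the Parquet file, so the
// column vectors always match the types Spark will read.
import org.apache.spark.sql.types.{DataType, StructType}

trait ColumnVectorLike // hypothetical stand-in for Spark's writable column vector

// Hypothetical allocator standing in for the reader's internal allocation.
def allocateColumnVector(capacity: Int, dataType: DataType): ColumnVectorLike =
  new ColumnVectorLike {}

def initBatch(requestedCatalystSchema: StructType, capacity: Int): Array[ColumnVectorLike] =
  requestedCatalystSchema.fields.map(f => allocateColumnVector(capacity, f.dataType))
```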

asfgit pushed a commit that referenced this pull request Jul 20, 2016

Author: Marcelo Vanzin <[email protected]>

Closes #14272 from vanzin/SPARK-16632.

(cherry picked from commit 75146be)
Signed-off-by: Cheng Lian <[email protected]>
@liancheng
Contributor

Opened #14278 for the simpler yet more general fix.

asfgit pushed a commit that referenced this pull request Jul 21, 2016
… parquet schema

## What changes were proposed in this pull request?

PR #14278 is a more general and simpler fix for SPARK-16632 than PR #14272. After merging #14278, we no longer need the changes made in #14272, so here I revert them.

This PR targets both master and branch-2.0.

## How was this patch tested?

Existing tests.

Author: Cheng Lian <[email protected]>

Closes #14300 from liancheng/revert-pr-14272.

(cherry picked from commit 69626ad)
Signed-off-by: Cheng Lian <[email protected]>
@vanzin vanzin deleted the SPARK-16632 branch August 2, 2016 18:34