[SPARK-16632][sql] Respect Hive schema when merging parquet schema. #14272
Conversation
When Hive (or at least certain versions of Hive) creates parquet files containing tinyint or smallint columns, it stores them as int32, but doesn't annotate the parquet field as containing the corresponding int8 / int16 data. When Spark reads those files using the vectorized reader, it follows the parquet schema for these fields, but when actually reading the data it tries to use the type fetched from the metastore, and then fails because data has been loaded into the wrong fields in OnHeapColumnVector. So instead of blindly trusting the parquet schema, check whether the Catalyst-provided schema disagrees with it, and adjust the types so that the necessary metadata is present when loading the data into the ColumnVector instance. Tested with unit tests and with tests that create byte / short columns in Hive and try to read them from Spark.
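For context, here is a minimal Scala sketch of the clash described above, assuming a table with one tinyint and one smallint column. The schema literals are illustrative, not taken from the PR; only `MessageTypeParser` (parquet-mr) and the Spark SQL types are real APIs.

```scala
// Minimal sketch of the schema clash described above (illustrative only).
import org.apache.parquet.schema.MessageTypeParser
import org.apache.spark.sql.types.{ByteType, ShortType, StructField, StructType}

object HiveParquetSchemaClash {
  // What (certain versions of) Hive write for tinyint / smallint columns:
  // plain int32 fields with no int_8 / int_16 annotation.
  val parquetFileSchema = MessageTypeParser.parseMessageType(
    """message hive_table {
      |  optional int32 tiny_col;
      |  optional int32 small_col;
      |}""".stripMargin)

  // What the metastore (and therefore Catalyst) reports for the same table.
  // The vectorized reader sets up its column vectors from this schema, so data
  // read as int32 lands in the wrong fields of OnHeapColumnVector.
  val catalystSchema = StructType(Seq(
    StructField("tiny_col", ByteType),
    StructField("small_col", ShortType)))
}
```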
Not sure if this is the best place for the fix, but the problem is gone with the change. It duplicates some minor logic from …
Test build #62561 has finished for PR 14272 at commit
This LGTM. Although it's a little bit hacky, since technically the fields in the requested schema passed to the Parquet record reader may have different original types (…
I'm merging this to master. @yhuai Do we want this in branch-2.0?
Would like to add that AFAIK byte and short are the only problematic types that we don't handle before this PR. Other Hive-Parquet schema conversion quirks like string (translated into …
Yeah, I think the fix is pretty safe. After discussing with @liancheng, it seems the more general fix is just to use the requested Catalyst schema to initialize the vectorized reader.
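For reference, a rough sketch of that direction (the approach #14278 later took more generally), not Spark's actual internals; `toParquetSchema` and `initColumnVectors` are hypothetical placeholders standing in for Spark's schema converter and reader setup.

```scala
// Illustrative sketch only; the two function parameters are hypothetical
// placeholders, not real Spark APIs.
import org.apache.parquet.schema.MessageType
import org.apache.spark.sql.types.StructType

def initVectorizedReader(
    requestedCatalystSchema: StructType,
    toParquetSchema: StructType => MessageType,
    initColumnVectors: MessageType => Unit): Unit = {
  // Derive the Parquet read schema from the requested Catalyst schema (which
  // still carries ByteType / ShortType) instead of trusting the file footer,
  // so column vectors are allocated with the types the query expects.
  val readSchema = toParquetSchema(requestedCatalystSchema)
  initColumnVectors(readSchema)
}
```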
Author: Marcelo Vanzin <[email protected]>
Closes #14272 from vanzin/SPARK-16632.
(cherry picked from commit 75146be)
Signed-off-by: Cheng Lian <[email protected]>
Opened #14278 for the simpler yet more general fix.
… parquet schema

## What changes were proposed in this pull request?

PR #14278 is a more general and simpler fix for SPARK-16632 than PR #14272. After merging #14278, we no longer need the changes made in #14272, so here I revert them. This PR targets both master and branch-2.0.

## How was this patch tested?

Existing tests.

Author: Cheng Lian <[email protected]>
Closes #14300 from liancheng/revert-pr-14272.
(cherry picked from commit 69626ad)
Signed-off-by: Cheng Lian <[email protected]>