Conversation


@budde budde commented Mar 26, 2015

Opening to replace #5188.

When Spark SQL infers a schema for a DataFrame, it will take the union of all field types present in the structured source data (e.g. an RDD of JSON data). When the source data for a row doesn't define a particular field on the DataFrame's schema, a null value will simply be assumed for this field. This workflow makes it very easy to construct tables and query over a set of structured data with a nonuniform schema. However, this behavior is not consistent in some cases when dealing with Parquet files and an external table managed by an external Hive metastore.

In our particular use case, we use Spark Streaming to parse and transform our input data and then apply a window function to save an arbitrary-sized batch of data as a Parquet file, which itself will be added as a partition to an external Hive table via an "ALTER TABLE... ADD PARTITION..." statement. Since our input data is nonuniform, it is expected that not every partition batch will contain every field present in the table's schema obtained from the Hive metastore. As such, we expect that the schema of some of our Parquet files may not contain the same set of fields present in the full metastore schema.

In such cases, it seems natural that Spark SQL would simply assume null values for any missing fields in the partition's Parquet file, provided these fields are specified as nullable by the metastore schema. This is not the case in the current implementation of ParquetRelation2. The mergeMetastoreParquetSchema() method used to reconcile differences between a Parquet file's schema and a schema retrieved from the Hive metastore will raise an exception if the Parquet file doesn't contain the same set of fields specified by the metastore.

This pull request alters the behavior of mergeMetastoreParquetSchema() by having it first add any nullable fields from the metastore schema to the Parquet file schema if they aren't already present there.
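
In outline, the proposed step behaves like the following minimal Scala sketch (a simplified stand-in with an illustrative helper name, not the actual ParquetRelation2 code):

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Simplified sketch: append any nullable metastore-only fields to the
// Parquet schema so the later reconciliation step no longer fails on
// missing fields. Case-insensitive name matching is an assumption here,
// mirroring Hive's lowercased field names.
def addMissingNullableFields(
    metastoreSchema: StructType,
    parquetSchema: StructType): StructType = {
  val parquetNames = parquetSchema.map(_.name.toLowerCase).toSet
  val missingNullable = metastoreSchema.filter { field =>
    field.nullable && !parquetNames.contains(field.name.toLowerCase)
  }
  StructType(parquetSchema ++ missingNullable)
}
```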

@AmplabJenkins

Can one of the admins verify this patch?

@marmbrus
Contributor

ok to test


SparkQA commented Mar 26, 2015

Test build #29254 has started for PR 5214 at commit 9041bfa.

  • This patch merges cleanly.


SparkQA commented Mar 26, 2015

Test build #29254 has finished for PR 5214 at commit 9041bfa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29254/
Test FAILed.

@budde budde (Author) commented Mar 26, 2015

Taking a look at why these tests failed.

@budde budde (Author) commented Mar 26, 2015

I must've accidentally run the tests on an old build artifact before opening this PR. It turns out that the tests included in #5141 expect failure in scenarios now permitted by this PR, while the tests originally included in this PR also expect failure in scenarios now permitted by #5141. I've cleared this up and the tests should pass now.


SparkQA commented Mar 26, 2015

Test build #29257 has started for PR 5214 at commit a52d378.

  • This patch merges cleanly.


SparkQA commented Mar 26, 2015

Test build #29257 has finished for PR 5214 at commit a52d378.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29257/
Test PASSed.

Contributor

I'm afraid diff and ++ are not OK here. For example, if the metastore schema has fields <a, b, c>, the Parquet schema has fields <a, c>, then the result schema would be <a, c, b>.
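
Concretely, with plain Scala collections standing in for the schema fields:

```scala
val metastore = Seq("a", "b", "c")
val parquet   = Seq("a", "c")

// Fields present only in the metastore schema
val missing = metastore.diff(parquet)   // Seq("b")

// Appending puts "b" after "c" instead of between "a" and "c"
val merged = parquet ++ missing         // Seq("a", "c", "b")
```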

Author

What is the expected order of fields in a schema? Is it lexicographic? Should we maintain the order of the metastore schema?

Contributor

Not lexicographic; the order of fields in the result schema should be the same as in the metastore schema.

Author

How should we deal with potential ambiguities that may be introduced due to #5141? For instance, say we are merging the following schemas:

Metastore schema    Parquet schema
Foo                 Foo
Bar                 Bar
Baz                 Bop
Bat                 Bat

The following options come to mind:

  • Attempt to merge the orderings and accept any possibility when there are ambiguities (e.g. both Foo Bar Baz Bop Bat and Foo Bar Bop Baz Bat are acceptable).
  • The fields defined in the metastore schema always come first, in metastore order, followed by any additional fields defined in the Parquet schema (e.g. Foo Bar Baz Bat Bop is the only accepted ordering; see the sketch below).
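
A quick sketch of the second option, with plain strings standing in for fields (illustrative only):

```scala
val metastore = Seq("Foo", "Bar", "Baz", "Bat")
val parquet   = Seq("Foo", "Bar", "Bop", "Bat")

// Metastore fields first, in metastore order, then any Parquet-only extras
val merged = metastore ++ parquet.diff(metastore)
// merged: Seq("Foo", "Bar", "Baz", "Bat", "Bop")
```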

Contributor

When the metastore schema is available, we are actually converting a metastore Parquet table into ParquetRelation2. Thus, the final reconciled schema should have exactly the same fields as the metastore schema, and simply drop any fields that only appear in the Parquet data file.

Author

I see. Based on the change made in #5141, it looks like the schema returned by mergeMissingNullableFields() will still contain any additional fields defined in parquetSchema (lines 766-767). How would you feel about simply removing the additional parquetSchema fields in the mergeMissingNullableFields() method?

Execution would look something like this:
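
(A hedged sketch of the proposed flow, with plain strings standing in for StructFields; the helper below mirrors the names in this discussion rather than actual Spark code.)

```scala
val metastoreSchema = Seq("Foo", "Bar", "Baz", "Bat")
val parquetSchema   = Seq("Foo", "Bar", "Bop", "Bat")

// Proposed change: also drop Parquet-only fields here, so the merged
// schema carries exactly the metastore fields
def mergeMissingNullableFields(metastore: Seq[String], parquet: Seq[String]): Seq[String] =
  parquet.filter(f => metastore.contains(f)) ++ metastore.diff(parquet)

val merged = mergeMissingNullableFields(metastoreSchema, parquetSchema)
// merged: Seq("Foo", "Bar", "Bat", "Baz") -- "Bop" dropped, "Baz" appended
```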

Author

Actually, now that I consider it, I'm not convinced that having the mergeMissingNullableFields() method return the fields in non-metastore order is a problem here. Lines 766-767 of mergeMetastoreParquetSchema() should handle putting them in the proper order.

Removing the additional fields is still an option to consider, however.

Contributor

Ah, yeah, you're right :) Totally forgot that mergeMetastoreParquetSchema already handles field reordering here. And all additional Parquet fields are removed via this zip call.
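
A toy run of that reorder-then-zip step (plain strings in place of StructFields; the sort-key trick approximates rather than quotes the real code):

```scala
// Field names from the example above, after the missing nullable "Baz"
// has been appended to the Parquet side
val metastore = Seq("Foo", "Bar", "Baz", "Bat")
val parquet   = Seq("Foo", "Bar", "Bop", "Bat", "Baz")

val ordinal = metastore.zipWithIndex.toMap

// Sort Parquet fields into metastore order; unknown fields sink past the end
val reordered = parquet.sortBy(f => ordinal.getOrElse(f, metastore.size + 1))
// reordered: Seq("Foo", "Bar", "Baz", "Bat", "Bop")

// zip truncates at the shorter side, silently dropping the extra "Bop"
val reconciled = metastore.zip(reordered)
// reconciled: Seq((Foo,Foo), (Bar,Bar), (Baz,Baz), (Bat,Bat))
```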

@liancheng
Contributor

add to whitelist


SparkQA commented Mar 28, 2015

Test build #29333 has started for PR 5214 at commit a52d378.

  • This patch merges cleanly.

asfgit pushed a commit that referenced this pull request Mar 28, 2015
Add missing nullable Metastore fields when merging a Parquet schema


Author: Adam Budde <[email protected]>

Closes #5214 from budde/nullable-fields and squashes the following commits:

a52d378 [Adam Budde] Refactor ParquetSchemaSuite.scala for cases now permitted by SPARK-6471 and SPARK-6538
9041bfa [Adam Budde] Add missing nullable Metastore fields when merging a Parquet schema

(cherry picked from commit 5909f09)
Signed-off-by: Cheng Lian <[email protected]>
@asfgit asfgit closed this in 5909f09 Mar 28, 2015
@liancheng
Contributor

LGTM. Merged to master and 1.3. Thanks for working on this!


SparkQA commented Mar 28, 2015

Test build #29333 has finished for PR 5214 at commit a52d378.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29333/
Test PASSed.

asfgit pushed a commit that referenced this pull request Jun 23, 2015
This PR adds a section about Hive metastore Parquet table conversion. It documents:

1. Schema reconciliation rules introduced in #5214 (see [this comment][1] in #5188)
2. Metadata refreshing requirement introduced in #5339

[1]: #5188 (comment)

Author: Cheng Lian <[email protected]>

Closes #5348 from liancheng/sql-doc-parquet-conversion and squashes the following commits:

42ae0d0 [Cheng Lian] Adds Python `refreshTable` snippet
4c9847d [Cheng Lian] Resorts to SQL for Python metadata refreshing snippet
756e660 [Cheng Lian] Adds Python snippet for metadata refreshing
50675db [Cheng Lian] Addes Hive metastore Parquet table conversion section
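
For reference, the metadata refreshing documented in item 2 can be triggered from Scala as well (a minimal sketch; the context setup and table name are placeholders):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext("local", "refresh-example")
val hiveContext = new HiveContext(sc)

// Invalidate cached metadata so newly added Parquet partitions are visible
hiveContext.refreshTable("my_table")  // "my_table" is a placeholder table name
```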