[SPARK-5950][SQL] Enable inserting array into Hive table saved as Parquet using DataSource API #4729
Conversation
Test build #27853 has finished for PR 4729 at commit
Test build #27870 has finished for PR 4729 at commit
Test build #27883 has finished for PR 4729 at commit
cc @marmbrus.
/cc @liancheng
I don't think this is right here. ParquetConversions is an analysis rule, which only processes logical plans. However, InsertIntoHiveTable is a physical plan node.
InsertIntoHiveTable is a LogicalPlan defined in HiveMetastoreCatalog.
Oh sorry, I mistook this for the physical plan with the same name...
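For context, an analysis rule in Catalyst only ever sees logical plans, which is why this placement works once the target is the logical `InsertIntoHiveTable` rather than the physical node. A minimal illustrative sketch of the shape of such a rule (the rule name and body here are made up, not taken from this PR):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// An analysis rule is a LogicalPlan => LogicalPlan transformation, so it can only
// rewrite logical nodes (such as the logical InsertIntoHiveTable mentioned above),
// never physical SparkPlan nodes.
object ExampleRewriteRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    // Match the logical node of interest and return a rewritten logical node here.
    case other => other
  }
}
```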
Hey @viirya, this PR actually fixes two issues, the …
@liancheng Unlike the issue of …
@viirya I think the issue here is that data written by the Hive parquet serde may not be read back by our own data source parquet. I have changed the title of the jira. It would be great if you could change your PR title.
@yhuai That problem is not caused by the Hive parquet serde. You can see the unit test I added. The table is created using the data source API.
When the jira was created, we did not correctly replace the destination table of the insert into with our data source table. We were actually calling InsertIntoHive to do the work. f02394d fixed this problem. Now, you need to turn off our metastore conversion to see the problem.
@yhuai I see. That issue was first fixed by this pr. You can see the earlier commits. Even when the destination table is replaced, the issue with array (or map) is still there.
Can you try your unit test (without any other change) with master? Thanks!
In fact, even #4782 doesn't solve the table replacing issue I reported in this pr. The unit test fails before hitting the data insertion issue...
@liancheng @yhuai Actually I don't know why you opened #4782 in order to fix the first issue. As far as I can see, the commit in #4782 is just a part of my commits.
Maybe I did not explain it clearly. SPARK-6023 and SPARK-5950 are two bugs. The first one is that we failed to replace the destination MetastoreRelation in InsertIntoTable even when we ask Spark SQL to convert all MetastoreRelations associated with parquet tables to our data source parquet tables. The root cause for this one was clear and the fix is pretty simple. The second bug is that arrays (maybe maps and structs?) written by Hive's parquet serde may not be readable by a data source parquet table. SPARK-5950 is for this bug. Since this pr is not ready (I will leave comments later), I made #4782 and we checked it in first to fix SPARK-6023.
I just tried this test with our master, and it did not fail. I think you need to first turn off the conversion for the write path and then turn on the conversion for the read path. You can use spark.sql.parquet.useDataSourceApi to control it.
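For illustration, the flow being suggested might look roughly like this (a sketch assuming a Spark 1.3-era `HiveContext`; the table names are placeholders, not from this PR):

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext `sc`

// Write path: turn the conversion off so the insert goes through Hive's parquet serde.
hiveContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
hiveContext.sql("INSERT INTO TABLE some_parquet_table SELECT * FROM some_source_table")

// Read path: turn the conversion back on so the scan uses the data source parquet path.
hiveContext.setConf("spark.sql.parquet.useDataSourceApi", "true")
hiveContext.sql("SELECT * FROM some_parquet_table").collect()
```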
spark.sql.parquet.useDataSourceApi is already turned on in the unit test I added. It failed on the master I just pulled.
It seems Hive's parquet serde always assumes values are nullable. Can you double check it? Also, we need to check if …
@yhuai Yes, I know that. I know there are two bugs, and I reported them in this pr and fixed them in the commits. You should read the description of this pr and my commits first. You only solved part of the first issue. As I said, the unit test I added still fails on the current master. That is because your commit is just part of my commits in this pr. Because of that, I don't know why you wanted to open another pr instead of just using my commits. As I have said, the second issue is not caused by "data written by Hive's parquet serde may not be readable by a data source parquet table", because I create the parquet table using the data source API, not the Hive parquet serde.
OK. Now I understand what's going on. For SPARK-5950, we cannot do the insert because …
I can't tell if SPARK-5508 is using … If you simply replace … For SPARK-5950, there are a few issues:
This pr has solved all three problems (I will update for …). Except for the second one, I already briefly explained them in the description of this pr at the beginning.
Merge remote-tracking branch 'upstream/master' into hive_parquet

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
Test build #28070 has finished for PR 4729 at commit
@viirya Thank you for working on it! Our discussions helped me clearly understand the problem. After discussions with @liancheng, I am proposing a different approach to address this issue in #4826. Please feel free to leave comments there.
… should work when using datasource api

This PR contains the following changes:

1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is the middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it does `equalsIgnoreNullability` and also checks whether the nullability of `from` is compatible with that of `to`. For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However, the nullability of `ArrayType(IntegerType, containsNull = true)` is incompatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values).
2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` instead of the equality check of the data types.
3. For our data source write path, when appending data, we always use the schema of the existing table to write the data. This is important for parquet, since nullability directly impacts the way to encode/decode values. If we do not do this, we may see corrupted values when reading values from a set of parquet files generated with different nullability settings.
4. When generating a new parquet table, we always set nullable/containsNull/valueContainsNull to true. So, we will not face situations where we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust.
5. Update the equality check of the JSON relation. Since JSON does not really care about nullability, `equalsIgnoreNullability` seems a better choice for comparing schemata from two JSON tables.

JIRA: https://issues.apache.org/jira/browse/SPARK-5950

Thanks viirya for the initial work in #4729.

cc marmbrus liancheng

Author: Yin Huai <[email protected]>

Closes #4826 from yhuai/insertNullabilityCheck and squashes the following commits:

3b61a04 [Yin Huai] Revert change on equals.
80e487e [Yin Huai] asNullable in UDT.
587d88b [Yin Huai] Make methods private.
0cb7ea2 [Yin Huai] marmbrus's comments.
3cec464 [Yin Huai] Cheng's comments.
486ed08 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
d3747d1 [Yin Huai] Remove unnecessary change.
8360817 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
8a3f237 [Yin Huai] Use equalsIgnoreNullability instead of equality check.
0eb5578 [Yin Huai] Fix tests.
f6ed813 [Yin Huai] Update old parquet path.
e4f397c [Yin Huai] Unit tests.
b2c06f8 [Yin Huai] Ignore nullability in JSON relation's equality check.
8bd008b [Yin Huai] nullable, containsNull, and valueContainsNull will be always true for parquet data.
bf50d73 [Yin Huai] When appending data, we use the schema of the existing table instead of the schema of the new data.
0a703e7 [Yin Huai] Test failed again since we cannot read correct content.
9a26611 [Yin Huai] Make InsertIntoTable happy.
8f19fe5 [Yin Huai] equalsIgnoreCompatibleNullability
4ec17fd [Yin Huai] Failed test.

(cherry picked from commit 1259994)
Signed-off-by: Michael Armbrust <[email protected]>
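A small self-contained sketch of the "compatible nullability" idea from change 1 above: writing data that never contains nulls into a column that allows nulls is safe, while the reverse is not. This is a simplified illustration, not the actual `DataType.equalsIgnoreCompatibleNullability` implementation (which also recurses into structs and maps):

```scala
import org.apache.spark.sql.types.{ArrayType, IntegerType}

// Simplified compatibility check for array element nullability only.
def compatibleArrayNullability(from: ArrayType, to: ArrayType): Boolean =
  from.elementType == to.elementType && (to.containsNull || !from.containsNull)

val noNulls      = ArrayType(IntegerType, containsNull = false)
val mayHaveNulls = ArrayType(IntegerType, containsNull = true)

compatibleArrayNullability(noNulls, mayHaveNulls)  // true: null-free data fits a nullable column
compatibleArrayNullability(mayHaveNulls, noNulls)  // false: cannot promise the data has no nulls
```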
Currently `ParquetConversions` in `HiveMetastoreCatalog` does not really work. One reason is that the table is not part of the child nodes of `InsertIntoTable`, so the replacement does not happen.

When we create a Parquet table in Hive with an ARRAY field, by default `ArrayType` has `containsNull` set to true, which affects the table's schema. But when inserting data into the table later, the schema of the inserted data can have `containsNull` as either true or false. That makes the insert/read fail. A similar problem is reported in https://issues.apache.org/jira/browse/SPARK-5508.

Hive seems to only support arrays with nullable elements, so this pr enables the same behavior.
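A rough sketch of the failure scenario described above, assuming a Spark 1.3-era `HiveContext`; the table and column names are made up for illustration:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._

val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext `sc`

// Data whose array element type is declared with containsNull = false.
val schema = StructType(Seq(
  StructField("ints", ArrayType(IntegerType, containsNull = false), nullable = true)))
val rows = hiveContext.sparkContext.parallelize(Seq(Row(Seq(1, 2, 3))))
val df = hiveContext.createDataFrame(rows, schema)

// The table created through the data source API records the element type with
// containsNull = true, so appending data declared with containsNull = false is
// the case that used to fail.
df.saveAsTable("ints_parquet", "org.apache.spark.sql.parquet")
hiveContext.createDataFrame(rows, schema).insertInto("ints_parquet")
```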