[SPARK-5950][SQL] Enable inserting array into Hive table saved as Parquet using DataSource API #4729
Conversation
Test build #27853 has finished for PR 4729 at commit
Test build #27870 has finished for PR 4729 at commit
Test build #27883 has finished for PR 4729 at commit
cc @marmbrus.
/cc @liancheng
I don't think this is right here. ParquetConversions is an analysis rule, which only processes logical plans. However, InsertIntoHiveTable is a physical plan node.
InsertIntoHiveTable is a LogicalPlan defined in HiveMetastoreCatalog.
Oh sorry, I mistook this for the physical plan with the same name...
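For context, an analysis rule in Catalyst only ever sees logical plans, which is why this placement works once the target is the logical `InsertIntoHiveTable` rather than the physical node. A minimal illustrative sketch of the shape of such a rule (the rule name and body here are made up, not taken from this PR):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// An analysis rule is a LogicalPlan => LogicalPlan transformation, so it can only
// rewrite logical nodes (such as the logical InsertIntoHiveTable mentioned above),
// never physical SparkPlan nodes.
object ExampleRewriteRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    // Match the logical node of interest and return a rewritten logical node here.
    case other => other
  }
}
```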
Hey @viirya, this PR actually fixes two issues, the …
@liancheng Unlike the issue of …
@viirya I think the issue here is that data written by the Hive parquet serde may not be read back by our own data source parquet. I have changed the title of the jira. It would be great if you could change your PR title.
@yhuai That problem is not caused by the Hive parquet serde. You can see the unit test I added. The table is created using the data source API.
When the jira was created, we did not correctly replace the destination table of the insert into with our data source table. We were actually calling InsertIntoHive to do the work. f02394d fixed this problem. Now, you need to turn off our metastore conversion to see the problem.
@yhuai I see. That issue was first fixed by this pr. You can see the earlier commits. Even when the destination table is replaced, the issue with array (or map) is still there.
Can you try your unit test (without any other change) with master? Thanks!
In fact, even #4782 doesn't solve the table replacing issue I reported in this pr. The unit test fails before hitting the data insertion issue...
@liancheng @yhuai Actually I don't know why you opened #4782 in order to fix the first issue. As far as I can see, the commit in #4782 is just a part of my commits.
Maybe I did not explain it clearly. SPARK-6023 and SPARK-5950 are two bugs. The first one is that we failed to replace the destination MetastoreRelation in InsertIntoTable even when we ask Spark SQL to convert all MetastoreRelations associated with parquet tables to our data source parquet tables. The root cause for this one was clear and the fix is pretty simple. The second bug is that arrays (maybe maps and structs?) written by Hive's parquet serde may not be readable by a data source parquet table. SPARK-5950 is for this bug. Since this pr is not ready (I will leave comments later), I made #4782 and we checked it in first to fix SPARK-6023.
I just tried this test with our master, and it did not fail. I think you need to first turn off the conversion for the write path and then turn on the conversion for the read path. You can use spark.sql.parquet.useDataSourceApi to control it.
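For illustration, the flow being suggested might look roughly like this (a sketch assuming a Spark 1.3-era `HiveContext`; the table names are placeholders, not from this PR):

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext `sc`

// Write path: turn the conversion off so the insert goes through Hive's parquet serde.
hiveContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
hiveContext.sql("INSERT INTO TABLE some_parquet_table SELECT * FROM some_source_table")

// Read path: turn the conversion back on so the scan uses the data source parquet path.
hiveContext.setConf("spark.sql.parquet.useDataSourceApi", "true")
hiveContext.sql("SELECT * FROM some_parquet_table").collect()
```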
spark.sql.parquet.useDataSourceApi is already turned on in the unit test I added. It failed on the master I just pulled.
It seems Hive's parquet serde always assumes values are nullable. Can you double check it? Also, we need to check if …
@yhuai Yes, I know that. I know there are two bugs, and I reported them in this pr and fixed them in the commits. You should read the description of this pr and my commits first. You only solved part of the first issue. As I said, the unit test I added still fails on the current master. That is because your commit is just part of my commits in this pr. Because of that, I don't know why you wanted to open another pr instead of just using my commits. As I have said, the second issue is not caused by "data written by Hive's parquet serde may not be readable by a data source parquet table", because I create the parquet table using the data source API, not the Hive parquet serde.
OK. Now I understand what's going on. For SPARK-5950, we cannot do the insert because …
I can't tell if SPARK-5508 is using … If you simply replace … For SPARK-5950, there are a few issues:
This pr has solved all three problems (I will update for …). Except for the second one, I already briefly explained them in the description of this pr at the beginning.
Merge remote-tracking branch 'upstream/master' into hive_parquet

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
Test build #28070 has finished for PR 4729 at commit
@viirya Thank you for working on it! Our discussions helped me clearly understand the problem. After discussions with @liancheng, I am proposing a different approach to address this issue in #4826. Please feel free to leave comments there.
… should work when using datasource api

This PR contains the following changes:

1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is the middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it does `equalsIgnoreNullability` and also checks whether the nullability of `from` is compatible with that of `to`. For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However, the nullability of `ArrayType(IntegerType, containsNull = true)` is incompatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values).
2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` instead of the equality check of the data types.
3. For our data source write path, when appending data, we always use the schema of the existing table to write the data. This is important for parquet, since nullability directly impacts the way to encode/decode values. If we do not do this, we may see corrupted values when reading values from a set of parquet files generated with different nullability settings.
4. When generating a new parquet table, we always set nullable/containsNull/valueContainsNull to true. So, we will not face situations where we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust.
5. Update the equality check of the JSON relation. Since JSON does not really care about nullability, `equalsIgnoreNullability` seems a better choice for comparing schemata from two JSON tables.

JIRA: https://issues.apache.org/jira/browse/SPARK-5950

Thanks viirya for the initial work in #4729.

cc marmbrus liancheng

Author: Yin Huai <[email protected]>

Closes #4826 from yhuai/insertNullabilityCheck and squashes the following commits:

3b61a04 [Yin Huai] Revert change on equals.
80e487e [Yin Huai] asNullable in UDT.
587d88b [Yin Huai] Make methods private.
0cb7ea2 [Yin Huai] marmbrus's comments.
3cec464 [Yin Huai] Cheng's comments.
486ed08 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
d3747d1 [Yin Huai] Remove unnecessary change.
8360817 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
8a3f237 [Yin Huai] Use equalsIgnoreNullability instead of equality check.
0eb5578 [Yin Huai] Fix tests.
f6ed813 [Yin Huai] Update old parquet path.
e4f397c [Yin Huai] Unit tests.
b2c06f8 [Yin Huai] Ignore nullability in JSON relation's equality check.
8bd008b [Yin Huai] nullable, containsNull, and valueContainsNull will be always true for parquet data.
bf50d73 [Yin Huai] When appending data, we use the schema of the existing table instead of the schema of the new data.
0a703e7 [Yin Huai] Test failed again since we cannot read correct content.
9a26611 [Yin Huai] Make InsertIntoTable happy.
8f19fe5 [Yin Huai] equalsIgnoreCompatibleNullability
4ec17fd [Yin Huai] Failed test.

(cherry picked from commit 1259994)
Signed-off-by: Michael Armbrust <[email protected]>
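A small self-contained sketch of the "compatible nullability" idea from change 1 above: writing data that never contains nulls into a column that allows nulls is safe, while the reverse is not. This is a simplified illustration, not the actual `DataType.equalsIgnoreCompatibleNullability` implementation (which also recurses into structs and maps):

```scala
import org.apache.spark.sql.types.{ArrayType, IntegerType}

// Simplified compatibility check for array element nullability only.
def compatibleArrayNullability(from: ArrayType, to: ArrayType): Boolean =
  from.elementType == to.elementType && (to.containsNull || !from.containsNull)

val noNulls      = ArrayType(IntegerType, containsNull = false)
val mayHaveNulls = ArrayType(IntegerType, containsNull = true)

compatibleArrayNullability(noNulls, mayHaveNulls)  // true: null-free data fits a nullable column
compatibleArrayNullability(mayHaveNulls, noNulls)  // false: cannot promise the data has no nulls
```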
Currently `ParquetConversions` in `HiveMetastoreCatalog` does not really work. One reason is that the table is not part of the child nodes of `InsertIntoTable`, so the replacement does not happen.

When we create a Parquet table in Hive with an ARRAY field, by default `ArrayType` has `containsNull` set to true, which affects the table's schema. But when inserting data into the table later, the schema of the inserted data can have `containsNull` as either true or false. That makes the insert/read fail. A similar problem is reported in https://issues.apache.org/jira/browse/SPARK-5508.

Hive seems to only support arrays with nullable elements, so this pr enables the same behavior.
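A rough sketch of the failure scenario described above, assuming a Spark 1.3-era `HiveContext`; the table and column names are made up for illustration:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._

val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext `sc`

// Data whose array element type is declared with containsNull = false.
val schema = StructType(Seq(
  StructField("ints", ArrayType(IntegerType, containsNull = false), nullable = true)))
val rows = hiveContext.sparkContext.parallelize(Seq(Row(Seq(1, 2, 3))))
val df = hiveContext.createDataFrame(rows, schema)

// The table created through the data source API records the element type with
// containsNull = true, so appending data declared with containsNull = false is
// the case that used to fail.
df.saveAsTable("ints_parquet", "org.apache.spark.sql.parquet")
hiveContext.createDataFrame(rows, schema).insertInto("ints_parquet")
```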