
Conversation

@viirya
Member

@viirya viirya commented Feb 23, 2015

Currently, ParquetConversions in HiveMetastoreCatalog does not really work. One reason is that the table is not among the children of InsertIntoTable, so the replacement never happens.

When we create a Parquet table in Hive with an ARRAY field, ArrayType has containsNull set to true by default, and that is what ends up in the table's schema. But when data is inserted into the table later, the schema of the inserted data may have containsNull as either true or false, which makes the insert/read fail.

A similar problem is reported in https://issues.apache.org/jira/browse/SPARK-5508.

Hive seems to only support arrays whose elements may be null, so this PR enables the same behavior.
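
A minimal sketch of the mismatch described above (illustrative only; the field name is made up, and the types come from org.apache.spark.sql.types):

```scala
import org.apache.spark.sql.types._

// The table is created with the default ArrayType, so containsNull = true.
val tableSchema = StructType(Seq(
  StructField("data", ArrayType(IntegerType), nullable = true)))

// The data being inserted was inferred with containsNull = false.
val dataSchema = StructType(Seq(
  StructField("data", ArrayType(IntegerType, containsNull = false), nullable = true)))

// A strict equality check treats the two schemas as different, which is
// what makes the insert (and the later read) fail.
println(tableSchema == dataSchema)  // false
```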

@SparkQA

SparkQA commented Feb 23, 2015

Test build #27853 has finished for PR 4729 at commit 4e3bd55.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(

@SparkQA

SparkQA commented Feb 24, 2015

Test build #27870 has finished for PR 4729 at commit 0e07bb8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(

@SparkQA

SparkQA commented Feb 24, 2015

Test build #27883 has finished for PR 4729 at commit 175966f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Feb 24, 2015

cc @marmbrus.

@marmbrus
Contributor

/cc @liancheng

Contributor

I don't think this is right here. ParquetConversions is an analysis rule, which only processes logical plans. However, InsertIntoHiveTable is a physical plan node.

Member Author

InsertIntoHiveTable is a LogicalPlan defined in HiveMetastoreCatalog.

Contributor

Oh sorry, I mistook this for the physical plan with the same name...

@liancheng
Contributor

Hey @viirya, this PR actually fixes two issues: the ParquetConversions one and the array insertion one. However, both fixes need some tweaks. As the 1.3 release is really close, @yhuai opened #4782 based on your work to fix the first issue. As for the array insertion issue, I'm hesitant to add the fix in RowWriteSupport, since this should be a Hive-specific issue. Also, map and struct should suffer from the same issue, right?

@viirya
Member Author

viirya commented Feb 26, 2015

@liancheng Unlike the ParquetConversions issue, I think the array insertion issue may not be just a Hive-specific one. The problem is that when we create a Parquet table that includes an array (or map, struct), by default we use a schema that sets containsNull to true. But when we later insert data, the data schema could have containsNull as either true or false. Hive seems to solve this by only supporting such fields with nullable elements. So no matter whether the inserted data contains nulls or not, we set its schema to have containsNull as true before writing it to the Parquet file. Since I think we don't want to explicitly change the data schema and affect other parts, doing it in RowWriteSupport should be OK, unless you have other concerns.
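
A hedged sketch of the kind of adjustment described above, assuming a recursive helper that relaxes nullability on the write-side schema (the name relaxNullability and its placement are hypothetical, not the actual RowWriteSupport change):

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: force containsNull / valueContainsNull / nullable to true
// recursively, so the data schema matches the schema the table was created with.
def relaxNullability(dt: DataType): DataType = dt match {
  case ArrayType(elementType, _) =>
    ArrayType(relaxNullability(elementType), containsNull = true)
  case MapType(keyType, valueType, _) =>
    MapType(relaxNullability(keyType), relaxNullability(valueType), valueContainsNull = true)
  case StructType(fields) =>
    StructType(fields.map(f =>
      f.copy(dataType = relaxNullability(f.dataType), nullable = true)))
  case other => other
}
```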

@yhuai
Contributor

yhuai commented Feb 26, 2015

@viirya I think the issue here is that data written by the Hive Parquet SerDe may not be readable by our own data source Parquet implementation. I have changed the title of the JIRA. It would be great if you could change your PR title as well.

@viirya
Member Author

viirya commented Feb 26, 2015

@yhuai That problem is not caused by the Hive Parquet SerDe. You can see the unit test I added: the table is created using the data source API.

@yhuai
Contributor

yhuai commented Feb 26, 2015

When the JIRA was created, we did not correctly replace the destination table of an INSERT INTO with our data source table; we were actually calling InsertIntoHive to do the work. f02394d fixed this problem. Now you need to turn off our metastore conversion to see the problem.

@viirya
Member Author

viirya commented Feb 27, 2015

@yhuai I see. That issue was first fixed by this PR; you can see the earlier commits. Even with the destination table replaced, the issue with arrays (or maps) is still there.

@yhuai
Contributor

yhuai commented Feb 27, 2015

Can you try your unit test (without any other change) with master? Thanks!

@viirya
Member Author

viirya commented Feb 27, 2015

In fact, even #4782 doesn't solve the table replacement issue I reported in this PR. The unit test fails before it even hits the data insertion issue...

@viirya
Member Author

viirya commented Feb 27, 2015

@liancheng @yhuai Actually, I don't know why you opened #4782 to fix the first issue, because as far as I can see, the commit in #4782 is just a subset of my commits.

@yhuai
Contributor

yhuai commented Feb 27, 2015

Maybe I did not explain it clearly. SPARK-6023 and SPARK-5950 are two bugs. The first is that we failed to replace the destination MetastoreRelation in InsertIntoTable even when we ask Spark SQL to convert all MetastoreRelations associated with Parquet tables to our data source Parquet tables; the root cause was clear and the fix is pretty simple. The second bug is that arrays (maybe maps and structs too?) written by Hive's Parquet SerDe may not be readable through a data source Parquet table; SPARK-5950 tracks that one. Since this PR is not ready (I will leave comments later), I made #4782 and we checked it in first to fix SPARK-6023.

Contributor

I just tried this test on our master and it did not fail. I think you need to first turn off the conversion for the write path and then turn on the conversion for the read path. You can use spark.sql.parquet.useDataSourceApi to control it.
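
For reference, a rough sketch of the reproduction being suggested here (table names are made up, and a HiveContext named hiveContext is assumed to exist, e.g. in a spark-shell session):

```scala
// Write with the flag turned off, so the old Hive write path is exercised.
hiveContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
hiveContext.sql("INSERT INTO TABLE parquet_array_table SELECT * FROM source_table")

// Read with the flag turned on, so the data source Parquet read path is exercised.
hiveContext.setConf("spark.sql.parquet.useDataSourceApi", "true")
hiveContext.sql("SELECT * FROM parquet_array_table").collect()
```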

Member Author

spark.sql.parquet.useDataSourceApi is already turned on in the unit test I added. It failed on the master I just pulled.

@yhuai
Contributor

yhuai commented Feb 27, 2015

It seems Hive's Parquet SerDe always treats values as nullable. Can you double-check that? Also, we need to check whether StructType and MapType are affected by this bug.

@viirya
Member Author

viirya commented Feb 27, 2015

@yhuai Yes, I know that. I know there are two bugs, and I reported them in this PR and fixed them in its commits. Please read the description of this PR and my commits first.

You only solved part of the first issue. As I said, the unit test I added still fails on the current master. That is because your commit is only part of my commits in this PR, which is why I don't understand why you opened another PR instead of just using my commits.

As I have said, the second issue is not that "data written by Hive's Parquet SerDe may not be readable by a data source Parquet table", because I create the Parquet table using the data source API, not the Hive Parquet SerDe.

@yhuai
Contributor

yhuai commented Feb 27, 2015

OK, now I understand what's going on. For SPARK-5950, we cannot do the insert because InsertIntoTable will not be resolved, and you see an org.apache.spark.sql.AnalysisException, right? For SPARK-5508, the problem is that data is inserted through InsertIntoHive and we cannot read it back through our data source API read path. Are you trying to resolve both in this PR?

@viirya
Member Author

viirya commented Feb 27, 2015

I can't tell whether SPARK-5508 is using InsertIntoHive or not; I didn't see whether spark.sql.parquet.useDataSourceApi was turned on or off in that JIRA.

If you simply replace InsertIntoTable's relation in ParquetConversions, then you will get an org.apache.spark.sql.AnalysisException. So I don't know why you said the test passed.

For SPARK-5950, there are a few issues:

  1. The ParquetConversions problem: as you addressed in [SPARK-6023][SQL] ParquetConversions fails to replace the destination MetastoreRelation of an InsertIntoTable node to ParquetRelation2 #4782, InsertIntoTable's table is never replaced.
  2. The AnalysisException: this is why I replace InsertIntoTable with InsertIntoHiveTable in ParquetConversions, because InsertIntoHiveTable doesn't check the equality of containsNull.
  3. Since containsNull of ArrayType (and the corresponding flags of MapType and StructType) defaults to true, the schema of a created Parquet table always has containsNull as true. Later, when you try to insert data whose schema is identical except for a different containsNull value, the Parquet library complains that the schemas differ, so the insertion fails.

This PR solves all three problems (I will update it for MapType and StructType); #4782 only addresses the first one.

Except for the second one, I already briefly explained these in the description of this PR at the beginning.
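
For what it's worth, a rough sketch of the strict resolution check behind the AnalysisException mentioned in the second point above (illustrative only, not the actual InsertIntoTable code):

```scala
import org.apache.spark.sql.types._

// The insert is only considered resolved when every incoming column's type
// equals the destination column's type exactly, so a containsNull mismatch
// (ArrayType(IntegerType, false) vs. ArrayType(IntegerType, true)) leaves the
// plan unresolved and analysis fails.
def insertResolved(dataTypes: Seq[DataType], tableTypes: Seq[DataType]): Boolean =
  dataTypes.length == tableTypes.length &&
    dataTypes.zip(tableTypes).forall { case (d, t) => d == t }
```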

Merge remote-tracking branch 'upstream/master' into hive_parquet

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
@SparkQA

SparkQA commented Feb 27, 2015

Test build #28070 has finished for PR 4729 at commit 2949324.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logError("User class threw exception: " + cause.getMessage, cause)

@yhuai
Contributor

yhuai commented Feb 28, 2015

@viirya Thank you for working on it! Our discussions helped me clearly understand the problem. After discussions with @liancheng, I am proposing a different approach to address this issue in #4826. Please feel free to leave comments there.

@viirya viirya closed this Feb 28, 2015
asfgit pushed a commit that referenced this pull request Mar 3, 2015
… should work when using datasource api

This PR contains the following changes:
1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is the middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it does `equalsIgnoreNullability` and additionally checks whether the nullability of `from` is compatible with that of `to`. For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However, the nullability of `ArrayType(IntegerType, containsNull = true)` is incompatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values).
2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` to replace the equality check of the data types.
3. For our data source write path, when appending data, we always use the schema of the existing table to write the data. This is important for Parquet, since nullability directly impacts the way values are encoded/decoded. If we do not do this, we may see corrupted values when reading a set of Parquet files generated with different nullability settings.
4. When generating a new Parquet table, we always set nullable/containsNull/valueContainsNull to true. So we will not face situations where we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust.
5. Update the equality check of the JSON relation. Since JSON does not really care about nullability, `equalsIgnoreNullability` seems a better choice for comparing the schemata of two JSON tables.

JIRA: https://issues.apache.org/jira/browse/SPARK-5950

Thanks viirya for the initial work in #4729.

cc marmbrus liancheng

Author: Yin Huai <[email protected]>

Closes #4826 from yhuai/insertNullabilityCheck and squashes the following commits:

3b61a04 [Yin Huai] Revert change on equals.
80e487e [Yin Huai] asNullable in UDT.
587d88b [Yin Huai] Make methods private.
0cb7ea2 [Yin Huai] marmbrus's comments.
3cec464 [Yin Huai] Cheng's comments.
486ed08 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
d3747d1 [Yin Huai] Remove unnecessary change.
8360817 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
8a3f237 [Yin Huai] Use equalsIgnoreNullability instead of equality check.
0eb5578 [Yin Huai] Fix tests.
f6ed813 [Yin Huai] Update old parquet path.
e4f397c [Yin Huai] Unit tests.
b2c06f8 [Yin Huai] Ignore nullability in JSON relation's equality check.
8bd008b [Yin Huai] nullable, containsNull, and valueContainsNull will be always true for parquet data.
bf50d73 [Yin Huai] When appending data, we use the schema of the existing table instead of the schema of the new data.
0a703e7 [Yin Huai] Test failed again since we cannot read correct content.
9a26611 [Yin Huai] Make InsertIntoTable happy.
8f19fe5 [Yin Huai] equalsIgnoreCompatibleNullability
4ec17fd [Yin Huai] Failed test.

(cherry picked from commit 1259994)
Signed-off-by: Michael Armbrust <[email protected]>
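
A minimal sketch of the compatibility rule described in point 1 of the commit message above (an illustrative reimplementation under assumed semantics; the real `DataType.equalsIgnoreCompatibleNullability` lives inside Spark SQL and may differ in detail):

```scala
import org.apache.spark.sql.types._

// `from` is compatible with `to` when `from` never claims "may contain null"
// for a position where `to` claims "never null".
def compatibleNullability(from: DataType, to: DataType): Boolean = (from, to) match {
  case (ArrayType(fromElem, fromNull), ArrayType(toElem, toNull)) =>
    (toNull || !fromNull) && compatibleNullability(fromElem, toElem)
  case (MapType(fromKey, fromValue, fromNull), MapType(toKey, toValue, toNull)) =>
    (toNull || !fromNull) &&
      compatibleNullability(fromKey, toKey) && compatibleNullability(fromValue, toValue)
  case (StructType(fromFields), StructType(toFields)) =>
    fromFields.length == toFields.length &&
      fromFields.zip(toFields).forall { case (f, t) =>
        f.name == t.name && (t.nullable || !f.nullable) &&
          compatibleNullability(f.dataType, t.dataType)
      }
  case (f, t) => f == t
}

// containsNull = false data fits into a containsNull = true column...
compatibleNullability(ArrayType(IntegerType, containsNull = false),
                      ArrayType(IntegerType, containsNull = true))   // true
// ...but not the other way around.
compatibleNullability(ArrayType(IntegerType, containsNull = true),
                      ArrayType(IntegerType, containsNull = false))  // false
```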
@viirya viirya deleted the hive_parquet branch December 27, 2023 18:17