Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

StructField has very similar semantics to CatalogColumn, except that CatalogColumn uses a string to express the data type. I think it's reasonable to use StructType as CatalogTable.schema and remove CatalogColumn.
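
For context, a minimal sketch of the two shapes involved (the CatalogColumn definition is abbreviated for illustration, not quoted from this PR):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Before: the catalog column carries its data type as a raw string.
case class CatalogColumn(
    name: String,
    dataType: String, // e.g. "varchar(50)" straight from the metastore
    nullable: Boolean = true,
    comment: Option[String] = None)

// After: the table schema is a StructType, so each column type is parsed up front.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true)))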

How was this patch tested?

Existing tests.

@cloud-fan
Contributor Author

cc @yhuai @liancheng

@SparkQA

SparkQA commented Jul 26, 2016

Test build #62873 has finished for PR 14363 at commit 776b267.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 26, 2016

Test build #62879 has finished for PR 14363 at commit 3f5480c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 26, 2016

Test build #62889 has finished for PR 14363 at commit addc585.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 28, 2016

Test build #62965 has finished for PR 14363 at commit 97b0492.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • implicit class SchemaAttribute(f: StructField)

@SparkQA

SparkQA commented Jul 28, 2016

Test build #62969 has finished for PR 14363 at commit b781ef8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • implicit class SchemaAttribute(f: StructField)

-      dataType = hc.getType,
+      dataType = CatalystSqlParser.parseDataType(hc.getType),
       nullable = true,
       comment = Option(hc.getComment))
Member

@gatorsmile commented Jul 28, 2016


This is the change we have to make if we convert CatalogColumn to StructField. Previously, we did the data type parsing only when we needed to use it; here, we parse it when we read from the Hive catalog. That means it could break the behavior in some extreme cases. For example, it sounds like hc.getType could return null, or Hive could return some data types we might not recognize. We could hit an exception from the parser, right?

That means the caller of fromHiveColumn, which is getTableOption, will also get the exception. I am just wondering if we want to see this kind of exception when doing getTableOption, or whether we should issue a nicer error message here.

Contributor Author


So the behaviour change is: previously, if a hive table contained a type string that we can't parse, we were still able to describe it, but threw an exception if we tried to read it. After this PR, we will throw an exception as soon as we read its table metadata from the hive metastore.

I think it's OK to break it, but we need a better error message. What do you think? cc @yhuai @liancheng
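
A minimal sketch of what a nicer error message could look like (the wrapper shape here is an assumption for illustration, not the code in this diff):

import org.apache.hadoop.hive.metastore.api.FieldSchema
import org.apache.spark.SparkException
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.StructField

object HiveColumnConversion {
  // Parse the hive type string eagerly, but rethrow with a message that names
  // the offending column and its raw type string instead of a bare parser error.
  def fromHiveColumn(hc: FieldSchema): StructField = {
    val dataType =
      try CatalystSqlParser.parseDataType(hc.getType)
      catch {
        case e: Exception =>
          throw new SparkException(
            s"Cannot recognize hive type string '${hc.getType}' of column ${hc.getName}", e)
      }
    StructField(name = hc.getName, dataType = dataType, nullable = true)
  }
}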

@SparkQA

SparkQA commented Jul 30, 2016

Test build #63045 has finished for PR 14363 at commit 80d2f50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Jul 31, 2016

Do we know which hive type strings cannot be parsed by Spark?

@yhuai
Contributor

yhuai commented Aug 1, 2016

LGTM. Thanks. Merging to master.

@asfgit closed this in 301fb0d Aug 1, 2016
@cloud-fan
Contributor Author

Do we know which hive type strings cannot be parsed by Spark?

varchar(length) and char(length). See #14363 (comment) for what we break.

@yhuai
Contributor

yhuai commented Aug 1, 2016

Thanks. But what specific cases are not supported? If there are any, we should make a change to support them, right?

@cloud-fan
Contributor Author

For a hive table (created by hive) with a varchar(length) column, before this PR we could describe it but couldn't read data from it. Now we can't even describe it. Do you think we should fix it? BTW there is no test for this case.

@gatorsmile
Member

TestHive.sessionState.metadataHive.runSqlHive("CREATE TABLE test (id varchar(50))")
TestHive.sessionState.metadataHive.runSqlHive("INSERT INTO TABLE test VALUES ('4')")
spark.sql("select * from test").show()
spark.sql("describe test").show()

Are you referring to this case? I tried it, and it works.

@cloud-fan
Contributor Author

Oh sorry, I misread our parser rules. varchar(length) is supported, but the length is ignored. I checked with hive again; it looks like the only unsupported data type is UNIONTYPE.
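
A quick way to check this against the parser (the outputs shown in comments reflect the behavior described above and are assumptions for this era of Spark):

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// varchar/char parse fine, but the length is dropped: both become StringType.
println(CatalystSqlParser.parseDataType("varchar(50)")) // StringType
println(CatalystSqlParser.parseDataType("char(10)"))    // StringType

// Hive's uniontype has no catalyst counterpart, so this would throw a parse error:
// CatalystSqlParser.parseDataType("uniontype<int,string>")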

@gatorsmile
Member

Your concern is valid. We are missing test cases for verifying these scenarios.

I saw a discussion in a WeChat group about integration issues between Hive and Spark. They are complaining that Spark is unable to read data written by Hive. In the Hive refactoring, I am wondering if we also need to build test cases to cover these scenarios.

@lianhuiwang
Contributor

@cloud-fan There is a case that I met. The varchar(length)/char(length) type is not a string type, but Spark SQL now considers them string types, so the following example produces different results:
TestHive.sessionState.metadataHive.runSqlHive("CREATE TABLE test (id varchar(50))")
TestHive.sessionState.metadataHive.runSqlHive("INSERT INTO TABLE test VALUES ('abcdef')")
TestHive.sessionState.metadataHive.runSqlHive("CREATE TABLE test_parquet (id varchar(2)) STORED AS parquet")
TestHive.sessionState.metadataHive.runSqlHive("insert overwrite table test_parquet select * from test")
The result in test_parquet is 'ab', because Hive truncates the value to the varchar length.
spark.sql("insert overwrite table test_parquet select * from test")
The result in test_parquet is 'abcdef', because Spark SQL treats the column as a plain string.

@cloud-fan
Contributor Author

Well, Spark SQL is not claimed to be fully compatible with hive, so I think it's reasonable to have some issues. cc @rxin @yhuai, should we fix this?

@gatorsmile
Member

gatorsmile commented Aug 1, 2016

@lianhuiwang Writing a Hive table in Parquet format is a little bit different here. For performance reasons, we convert it to a data source table when inserting rows into Parquet. To get the expected results, you just need to set spark.sql.hive.convertMetastoreParquet to false.

If you choose textfile, it works as expected.
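
For example, reusing the tables from the comment above (a minimal illustration, not taken from this thread):

// Disable the metastore Parquet conversion so the insert goes through the
// Hive SerDe path and respects the declared varchar(2) length.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.sql("insert overwrite table test_parquet select * from test")
spark.sql("select * from test_parquet").show() // expected: 'ab'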

asfgit pushed a commit that referenced this pull request Nov 30, 2016
… fail

## What changes were proposed in this pull request?

Spark SQL only has `StringType`; when reading a hive table with a varchar column, we will read that column as `StringType`. However, we still need to use a varchar `ObjectInspector` to read the varchar column in the hive table, which means we need to know the actual column type on the hive side.

In Spark 2.1, after #14363, we parse the hive type string to a catalyst type, which means the actual column type on the hive side is erased. Then we may use a string `ObjectInspector` to read the varchar column and fail.

This PR keeps the original hive column type string in the metadata of `StructField`, and uses it when we convert it back to a hive column.

## How was this patch tested?

Newly added regression test.

Author: Wenchen Fan <[email protected]>

Closes #16060 from cloud-fan/varchar.
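
A minimal sketch of the idea in that fix (the metadata key name and the lookup shape are assumptions for illustration):

import org.apache.spark.sql.types._

// Keep the raw hive type string alongside the parsed catalyst type, so the
// original declaration survives the round trip through the catalog.
val metadata = new MetadataBuilder()
  .putString("HIVE_TYPE_STRING", "varchar(50)") // key name assumed
  .build()
val field = StructField("id", StringType, nullable = true, metadata = metadata)

// When converting back to a hive column, prefer the stored raw string.
val hiveType =
  if (field.metadata.contains("HIVE_TYPE_STRING")) field.metadata.getString("HIVE_TYPE_STRING")
  else field.dataType.catalogString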
asfgit pushed a commit that referenced this pull request Nov 30, 2016
… fail

(cherry picked from commit 3f03c90)
Signed-off-by: Reynold Xin <[email protected]>
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
… fail
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
… fail