[SPARK-16731][SQL] use StructType in CatalogTable and remove CatalogColumn #14363
Conversation
cc @yhuai @liancheng
Test build #62873 has finished for PR 14363 at commit
Test build #62879 has finished for PR 14363 at commit
Test build #62889 has finished for PR 14363 at commit
Test build #62965 has finished for PR 14363 at commit
Test build #62969 has finished for PR 14363 at commit
-      dataType = hc.getType,
+      dataType = CatalystSqlParser.parseDataType(hc.getType),
       nullable = true,
       comment = Option(hc.getComment))
This is the change we have to make if we convert CatalogColumn to StructField. Previously, we parsed the data type only when we needed to use it; here, we parse it as soon as we read from the Hive catalog. That means it could change the behavior in some extreme cases. For example, it sounds like hc.getType could return null, or Hive could return a data type we don't recognize. We could hit an exception from the parser, right?
That means the caller of fromHiveColumn, which is getTableOption, will also get the exception. I am just wondering whether we want this kind of exception to surface from getTableOption, or whether we should issue a nicer error message here.
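For context, a minimal sketch (not from the PR) of the failure mode being discussed, assuming a type string that Spark's parser does not accept:

```scala
import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}

object ParseFailureSketch extends App {
  // A well-formed type string parses fine:
  val ok = CatalystSqlParser.parseDataType("array<int>")

  // A Hive type string Spark does not recognize (uniontype is assumed
  // unsupported here) raises a ParseException, which after this change
  // would surface while reading the table metadata itself:
  try {
    CatalystSqlParser.parseDataType("uniontype<int,string>")
  } catch {
    case e: ParseException =>
      println(s"cannot parse hive type string: ${e.getMessage}")
  }
}
```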
So the behaviour change is: previously, if a Hive table contained a type string we can't parse, we could still describe the table but threw an exception when we tried to read it. After this PR, we throw the exception as soon as we read the table metadata from the Hive metastore.
I think it's OK to break this, but we need a better error message. What do you think? cc @yhuai @liancheng
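A minimal sketch of what a friendlier failure could look like; the helper name and message are assumptions, not the code that was actually merged:

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.DataType

// Hypothetical helper: parse the Hive type string eagerly, but rethrow
// with an error that names the column and the offending type string,
// instead of surfacing a raw parser error to getTableOption's caller.
def parseHiveType(column: String, hiveType: String): DataType = {
  try {
    CatalystSqlParser.parseDataType(hiveType)
  } catch {
    case e: Exception =>
      throw new IllegalArgumentException(
        s"Cannot recognize the data type '$hiveType' of column '$column' " +
          "read from the Hive metastore.", e)
  }
}
```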
Test build #63045 has finished for PR 14363 at commit
Do we know which hive type strings cannot be parsed by spark?
LGTM. Thanks. Merging to master.
Thanks. But what specific cases are not supported? If there are any, we should make a change to support them, right?
For a Hive table (created by Hive) with …
TestHive.sessionState.metadataHive.runSqlHive("CREATE TABLE test (id varchar(50))")
TestHive.sessionState.metadataHive.runSqlHive("INSERT INTO TABLE test VALUES ('4')")
spark.sql("select * from test").show()
spark.sql("describe test").show()Are you saying this case? I tried. It works. |
Oh sorry I misread our parser rules.
Your concern is valid. We are missing test cases for verifying these scenarios. I saw a discussion in a WeChat group about issues in the integration between Hive and Spark; people were complaining that Spark is unable to read data written by Hive. In the Hive refactoring, I am wondering if we also need to build test cases to cover these scenarios?
@cloud-fan There is a case that I met. The varchar(length)/char(length) types are not string types, but Spark SQL currently considers them string types, so Hive and Spark can give different results, as in the example below:
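The original example is not shown above; what follows is a hypothetical reproduction of the char/varchar difference, with the table name, column name, and values all assumed:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

// A Hive table with a fixed-length char column, stored as Parquet.
spark.sql("CREATE TABLE padded (c CHAR(10)) STORED AS PARQUET")
spark.sql("INSERT INTO TABLE padded VALUES ('a')")

// Hive pads CHAR(10) values with spaces to length 10, while Spark SQL
// models the column as StringType, so the observed value and length can
// differ depending on which reader handles the scan.
spark.sql("SELECT c, length(c) FROM padded").show()
```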
@lianhuiwang Writing a Hive table in Parquet format is a little bit different here. For performance reasons, we convert it to a data source table when inserting rows into Parquet. To get the expected results, you just need to set … If you are choosing …
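The exact setting is elided above and not recoverable from the text. A plausible candidate, given the context, is `spark.sql.hive.convertMetastoreParquet`, the Spark flag that controls the Parquet conversion the comment mentions; treat its use here as an assumption. Continuing the hypothetical table from the sketch above:

```scala
// Assumed flag: disabling the metastore Parquet conversion makes Spark
// read and write the table through the Hive SerDe instead of the
// built-in Parquet data source.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

// The same query now goes through the Hive SerDe path, which applies
// Hive's char/varchar padding semantics when reading the column.
spark.sql("SELECT c, length(c) FROM padded").show()
```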
… fail

## What changes were proposed in this pull request?

Spark SQL only has `StringType`; when reading a Hive table with a varchar column, we read that column as `StringType`. However, we still need to use the varchar `ObjectInspector` to read a varchar column in a Hive table, which means we need to know the actual column type on the Hive side. In Spark 2.1, after #14363, we parse the Hive type string to a Catalyst type, which means the actual column type on the Hive side is erased. Then we may use the string `ObjectInspector` to read a varchar column and fail. This PR keeps the original Hive column type string in the metadata of `StructField` and uses it when converting back to a Hive column.

## How was this patch tested?

Newly added regression test.

Author: Wenchen Fan <[email protected]>

Closes #16060 from cloud-fan/varchar.
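A minimal sketch of the approach the commit message describes; the metadata key name and the simplified conversion logic are assumptions:

```scala
import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField}

// Assumed metadata key under which the original Hive type is stashed.
val hiveTypeKey = "HIVE_TYPE_STRING"

// When converting a Hive column whose type Spark models as StringType
// (e.g. varchar(50)), keep the original Hive type string in the field's
// metadata so it is not erased by the parse.
val field = StructField(
  name = "id",
  dataType = StringType,
  nullable = true,
  metadata = new MetadataBuilder()
    .putString(hiveTypeKey, "varchar(50)")
    .build())

// When converting back to a Hive column, prefer the preserved type string
// so the varchar ObjectInspector (not the string one) is chosen.
val hiveType =
  if (field.metadata.contains(hiveTypeKey)) field.metadata.getString(hiveTypeKey)
  else field.dataType.catalogString
```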
What changes were proposed in this pull request?
`StructField` has very similar semantics to `CatalogColumn`, except that `CatalogColumn` uses a string to express its data type. I think it's reasonable to use `StructType` as the type of `CatalogTable.schema` and remove `CatalogColumn`.

How was this patch tested?

Existing tests.
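A hedged sketch of the shape of this change; both case classes are heavily simplified, since the real `CatalogTable` has many more fields:

```scala
import org.apache.spark.sql.types.StructType

// Before (simplified): the catalog kept its own column class, with the
// data type expressed as a raw string and parsed lazily on use.
case class CatalogColumn(
    name: String,
    dataType: String,
    nullable: Boolean = true,
    comment: Option[String] = None)

case class CatalogTableBefore(schema: Seq[CatalogColumn])

// After (simplified): the schema is a StructType, so type strings are
// parsed once, when table metadata is read from the metastore.
case class CatalogTableAfter(schema: StructType)
```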