Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

StructField has very similar semantics to CatalogColumn, except that CatalogColumn uses a string to express the data type. I think it's reasonable to use StructType as CatalogTable.schema and remove CatalogColumn.
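
For context, a minimal sketch of the two shapes involved (the CatalogColumn definition is abbreviated for illustration, not quoted from this PR):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Before: the catalog column carries its data type as a raw string.
case class CatalogColumn(
    name: String,
    dataType: String, // e.g. "varchar(50)" straight from the metastore
    nullable: Boolean = true,
    comment: Option[String] = None)

// After: the table schema is a StructType, so each column type is parsed up front.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true)))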

How was this patch tested?

Existing tests.

@cloud-fan
Contributor Author

cc @yhuai @liancheng

@SparkQA

SparkQA commented Jul 26, 2016

Test build #62873 has finished for PR 14363 at commit 776b267.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 26, 2016

Test build #62879 has finished for PR 14363 at commit 3f5480c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 26, 2016

Test build #62889 has finished for PR 14363 at commit addc585.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 28, 2016

Test build #62965 has finished for PR 14363 at commit 97b0492.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • implicit class SchemaAttribute(f: StructField)

@SparkQA

SparkQA commented Jul 28, 2016

Test build #62969 has finished for PR 14363 at commit b781ef8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • implicit class SchemaAttribute(f: StructField)

-      dataType = hc.getType,
+      dataType = CatalystSqlParser.parseDataType(hc.getType),
       nullable = true,
       comment = Option(hc.getComment))
Member

@gatorsmile commented Jul 28, 2016


This is the change we have to make if we convert CatalogColumn to StructField. Previously, we did the data type parsing only when we needed to use it; here, we parse it when we read from the Hive catalog. That means it could break the behavior in some extreme cases. For example, it sounds like hc.getType could return null, or Hive could return some data types we might not recognize. We could hit an exception from the parser, right?

That means the caller of fromHiveColumn, which is getTableOption, will also get the exception. I am just wondering if we want to see this kind of exception when doing getTableOption, or whether we should issue a nicer error message here.

Contributor Author


So the behaviour change is: previously, if a hive table contained a type string that we can't parse, we were still able to describe it, but threw an exception if we tried to read it. After this PR, we will throw an exception as soon as we read its table metadata from the hive metastore.

I think it's OK to break it, but we need a better error message. What do you think? cc @yhuai @liancheng
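
A minimal sketch of what a nicer error message could look like (the wrapper shape here is an assumption for illustration, not the code in this diff):

import org.apache.hadoop.hive.metastore.api.FieldSchema
import org.apache.spark.SparkException
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.StructField

object HiveColumnConversion {
  // Parse the hive type string eagerly, but rethrow with a message that names
  // the offending column and its raw type string instead of a bare parser error.
  def fromHiveColumn(hc: FieldSchema): StructField = {
    val dataType =
      try CatalystSqlParser.parseDataType(hc.getType)
      catch {
        case e: Exception =>
          throw new SparkException(
            s"Cannot recognize hive type string '${hc.getType}' of column ${hc.getName}", e)
      }
    StructField(name = hc.getName, dataType = dataType, nullable = true)
  }
}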

@SparkQA

SparkQA commented Jul 30, 2016

Test build #63045 has finished for PR 14363 at commit 80d2f50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Jul 31, 2016

Do we know which hive type strings cannot be parsed by Spark?

@yhuai
Contributor

yhuai commented Aug 1, 2016

LGTM. Thanks. Merging to master.

@asfgit closed this in 301fb0d Aug 1, 2016
@cloud-fan
Contributor Author

Do we know which hive type strings cannot be parsed by Spark?

varchar(length) and char(length). See #14363 (comment) for what we break.

@yhuai
Contributor

yhuai commented Aug 1, 2016

Thanks. But what specific cases are not supported? If there are any, we should make a change to support them, right?

@cloud-fan
Contributor Author

For a hive table (created by hive) with a varchar(length) column, before this PR we could describe it but couldn't read data from it. Now we can't even describe it. Do you think we should fix it? BTW there is no test for this case.

@gatorsmile
Member

TestHive.sessionState.metadataHive.runSqlHive("CREATE TABLE test (id varchar(50))")
TestHive.sessionState.metadataHive.runSqlHive("INSERT INTO TABLE test VALUES ('4')")
spark.sql("select * from test").show()
spark.sql("describe test").show()

Are you referring to this case? I tried it, and it works.

@cloud-fan
Contributor Author

Oh sorry, I misread our parser rules. varchar(length) is supported, but the length is ignored. I checked with hive again; it looks like the only unsupported data type is UNIONTYPE.
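
A quick way to check this against the parser (the outputs shown in comments reflect the behavior described above and are assumptions for this era of Spark):

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// varchar/char parse fine, but the length is dropped: both become StringType.
println(CatalystSqlParser.parseDataType("varchar(50)")) // StringType
println(CatalystSqlParser.parseDataType("char(10)"))    // StringType

// Hive's uniontype has no catalyst counterpart, so this would throw a parse error:
// CatalystSqlParser.parseDataType("uniontype<int,string>")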

@gatorsmile
Member

Your concern is valid. We are missing test cases for verifying these scenarios.

I saw a discussion in a WeChat group about integration issues between Hive and Spark. They are complaining that Spark is unable to read data written by Hive. In the Hive refactoring, I am wondering if we also need to build test cases to cover these scenarios.

@lianhuiwang
Contributor

@cloud-fan There is a case that I met. The varchar(length)/char(length) type is not a string type, but Spark SQL now considers them string types, so the following example produces different results:
TestHive.sessionState.metadataHive.runSqlHive("CREATE TABLE test (id varchar(50))")
TestHive.sessionState.metadataHive.runSqlHive("INSERT INTO TABLE test VALUES ('abcdef')")
TestHive.sessionState.metadataHive.runSqlHive("CREATE TABLE test_parquet (id varchar(2)) STORED AS parquet")
TestHive.sessionState.metadataHive.runSqlHive("insert overwrite table test_parquet select * from test")
The result in test_parquet is 'ab', because Hive truncates the value to the varchar length.
spark.sql("insert overwrite table test_parquet select * from test")
The result in test_parquet is 'abcdef', because Spark SQL treats the column as a plain string.

@cloud-fan
Contributor Author

Well, Spark SQL is not claimed to be fully compatible with hive, so I think it's reasonable to have some issues. cc @rxin @yhuai, should we fix this?

@gatorsmile
Member

gatorsmile commented Aug 1, 2016

@lianhuiwang Writing a Hive table in Parquet format is a little bit different here. For performance reasons, we convert it to a data source table when inserting rows into Parquet. To get the expected results, you just need to set spark.sql.hive.convertMetastoreParquet to false.

If you choose textfile, it works as expected.
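
For example, reusing the tables from the comment above (a minimal illustration, not taken from this thread):

// Disable the metastore Parquet conversion so the insert goes through the
// Hive SerDe path and respects the declared varchar(2) length.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.sql("insert overwrite table test_parquet select * from test")
spark.sql("select * from test_parquet").show() // expected: 'ab'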

asfgit pushed a commit that referenced this pull request Nov 30, 2016
… fail

## What changes were proposed in this pull request?

Spark SQL only has `StringType`; when reading a hive table with a varchar column, we will read that column as `StringType`. However, we still need to use a varchar `ObjectInspector` to read the varchar column in the hive table, which means we need to know the actual column type on the hive side.

In Spark 2.1, after #14363, we parse the hive type string to a catalyst type, which means the actual column type on the hive side is erased. Then we may use a string `ObjectInspector` to read the varchar column and fail.

This PR keeps the original hive column type string in the metadata of `StructField`, and uses it when we convert it back to a hive column.

## How was this patch tested?

Newly added regression test.

Author: Wenchen Fan <[email protected]>

Closes #16060 from cloud-fan/varchar.
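
A minimal sketch of the idea in that fix (the metadata key name and the lookup shape are assumptions for illustration):

import org.apache.spark.sql.types._

// Keep the raw hive type string alongside the parsed catalyst type, so the
// original declaration survives the round trip through the catalog.
val metadata = new MetadataBuilder()
  .putString("HIVE_TYPE_STRING", "varchar(50)") // key name assumed
  .build()
val field = StructField("id", StringType, nullable = true, metadata = metadata)

// When converting back to a hive column, prefer the stored raw string.
val hiveType =
  if (field.metadata.contains("HIVE_TYPE_STRING")) field.metadata.getString("HIVE_TYPE_STRING")
  else field.dataType.catalogString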
asfgit pushed a commit that referenced this pull request Nov 30, 2016
… fail

(cherry picked from commit 3f03c90)
Signed-off-by: Reynold Xin <[email protected]>
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
… fail
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
… fail