[SPARK-19459][SQL] Add Hive datatype (char/varchar) to StructField metadata #16804
Conversation
… issues with char/varchar columns in ORC.
Test build #72371 has finished for PR 16804 at commit
```scala
test("read varchar column from orc tables created by hive") {
  try {
    // This is an ORC file with a single VARCHAR(10) column that's created using Hive 1.2.1
```
Hi, @hvanhovell.
Nit: it's three columns.

```
Structure for orc/orc_text_types.orc
File Version: 0.12 with HIVE_8732
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:char(10),_col2:varchar(10)>
```
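For what it's worth, a quick way to see the Spark-facing schema of that file (a sketch; the resource path here is hypothetical):

```scala
// Hypothetical path; adjust to the actual test resource location.
val df = spark.read.orc("src/test/resources/orc/orc_text_types.orc")
// The char(10) and varchar(10) columns surface as plain strings on the Spark side.
df.printSchema()
```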
Test build #72412 has finished for PR 16804 at commit
```scala
dataType match {
  case p: PrimitiveDataTypeContext =>
    val dt = p.identifier.getText.toLowerCase
    (dt, p.INTEGER_VALUE().asScala.toList) match {
```
nit:

```scala
p.identifier.getText.toLowerCase match {
  case "varchar" | "char" => builder.putString(HIVE_TYPE_STRING, dataType.getText.toLowerCase)
}
```
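As written, that match would throw a `MatchError` for any other primitive type, so presumably it needs a fall-through; a minimal sketch, reusing `p`, `builder`, and `dataType` from the quoted diff:

```scala
p.identifier.getText.toLowerCase match {
  case "varchar" | "char" =>
    // Keep the raw Hive type around, since Spark maps both to StringType.
    builder.putString(HIVE_TYPE_STRING, dataType.getText.toLowerCase)
  case _ => // other primitives have a direct Spark SQL counterpart; nothing to record
}
```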
```scala
/**
 * Metadata key used to store the Hive type name. This is relevant for datatypes that do not
 * have a direct Spark SQL counterpart, such as CHAR and VARCHAR.
 */
val HIVE_TYPE_STRING = "HIVE_TYPE_STRING"
```
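As a minimal sketch of how this key is meant to be used (assuming the constant lives in the `org.apache.spark.sql.types` package object, as this PR proposes):

```scala
import org.apache.spark.sql.types._

// A Hive VARCHAR(10) column becomes a StringType field on the Spark side,
// with the raw Hive type preserved in the field's metadata.
val metadata = new MetadataBuilder()
  .putString(HIVE_TYPE_STRING, "varchar(10)")
  .build()
val field = StructField("name", StringType, nullable = true, metadata)

// Consumers such as the Hive client can recover the original type:
val hiveType =
  if (field.metadata.contains(HIVE_TYPE_STRING)) field.metadata.getString(HIVE_TYPE_STRING)
  else field.dataType.catalogString
```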
shall we remove `HiveUtils.HIVE_TYPE_STRING`?
Yeah we should.
```scala
test("read varchar column from orc tables created by hive") {
  try {
```
how about

```scala
val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
try {
  hiveClient.runSqlHive("CREATE TABLE hive_orc(a VARCHAR(10)) STORED AS orc LOCATION xxx")
  hiveClient.runSqlHive("INSERT INTO TABLE hive_orc SELECT 'a' FROM (SELECT 1) t")
  sql("CREATE EXTERNAL TABLE spark_orc ...")
  checkAnswer...
} finally {
  sql("DROP TABLE IF EXISTS ...")
  ...
}
```

Then we don't need to create the orc file manually.
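A fleshed-out version of that suggestion might look like the following (a sketch only: it assumes a `QueryTest`-style Hive suite where `spark`, `sql`, and `checkAnswer` are in scope, and the table definitions and location handling are guesses at the elided parts):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveExternalCatalog
import org.apache.spark.util.Utils

val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
val location = Utils.createTempDir().toURI
try {
  // Let Hive itself write the ORC data, so the file carries a real varchar column.
  hiveClient.runSqlHive(
    s"CREATE TABLE hive_orc(a VARCHAR(10)) STORED AS orc LOCATION '$location'")
  hiveClient.runSqlHive("INSERT INTO TABLE hive_orc SELECT 'a' FROM (SELECT 1) t")
  // Point a Spark-created external table at the same files and read it back.
  sql(s"CREATE EXTERNAL TABLE spark_orc(a VARCHAR(10)) STORED AS orc LOCATION '$location'")
  checkAnswer(spark.table("spark_orc"), Row("a"))
} finally {
  sql("DROP TABLE IF EXISTS hive_orc")
  sql("DROP TABLE IF EXISTS spark_orc")
}
```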
```
# Conflicts:
#	sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala
```
Test build #72518 has finished for PR 16804 at commit
```diff
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics}
 import org.apache.spark.sql.execution.FileRelation
-import org.apache.spark.sql.types.StructField
+import org.apache.spark.sql.types._
```
Unnecessary change?
That just makes it easier to use `HIVE_TYPE_STRING`.
```scala
{
  function
}
```
What is the reason for this?
That is my bad.
```diff
-package object types
+package object types {
+  /**
+   * Metadata key used to store the the raw hive type string in the metadata of StructField. This
```
Nit: the the -> the
will do
Test build #72522 has finished for PR 16804 at commit
```scala
  s"ALTER TABLE hive_orc SET LOCATION '$location'")
hiveClient.runSqlHive(
  "INSERT INTO TABLE hive_orc SELECT 'a', 'b', 'c' FROM (SELECT 1) t")
```
How about adding one more check?

```scala
checkAnswer(spark.table("hive_orc"), Row("a", "b         ", "c"))
```

(The `CHAR(10)` value is space-padded to length 10, hence the trailing whitespace.) Then we can remove the test case `SPARK-18220: read Hive orc table with varchar column`.
yeah that makes sense
Done.
```
# Conflicts:
#	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
```
Test build #72587 has finished for PR 16804 at commit
retest this please
LGTM pending test
Test build #72604 has finished for PR 16804 at commit
```diff
-test("SPARK-18220: read Hive orc table with varchar column") {
+test("SPARK-19459/SPARK-18220: read char/varchar column written by Hive") {
+  val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+  val location = Utils.createTempDir().toURI
```
shall we remove this temp dir in the finally block?
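Presumably something like this, keeping a handle on the directory itself (a sketch; `Utils.deleteRecursively` is the existing Spark helper, and the table names match the quoted test):

```scala
import org.apache.spark.util.Utils

val dir = Utils.createTempDir()
val location = dir.toURI
try {
  // ... create the Hive/Spark tables against `location` and run the checks ...
} finally {
  sql("DROP TABLE IF EXISTS hive_orc")
  sql("DROP TABLE IF EXISTS spark_orc")
  // Remove the temp dir explicitly instead of relying on the JVM shutdown hook.
  Utils.deleteRecursively(dir)
}
```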
Test build #72648 has finished for PR 16804 at commit
LGTM, merging to master!
[SPARK-19459][SQL] Add Hive datatype (char/varchar) to StructField metadata

## What changes were proposed in this pull request?

Reading from an existing ORC table which contains `char` or `varchar` columns can fail with a `ClassCastException` if the table metadata has been created using Spark. This is caused by the fact that Spark internally replaces `char` and `varchar` columns with a `string` column. This PR fixes this by adding the Hive type to the `StructField`'s metadata under the `HIVE_TYPE_STRING` key. This is picked up by the `HiveClient` and the ORC reader; see apache#16060 for more details on how the metadata is used.

## How was this patch tested?

Added a regression test to `OrcSourceSuite`.

Author: Herman van Hovell <[email protected]>

Closes apache#16804 from hvanhovell/SPARK-19459.
## What changes were proposed in this pull request?

This PR is a small follow-up on apache#16804. This PR also adds support for nested char/varchar fields in orc.

## How was this patch tested?

I have added a regression test to the `OrcSourceSuite`.

Author: Herman van Hovell <[email protected]>

Closes apache#17030 from hvanhovell/SPARK-19459-follow-up.
## What changes were proposed in this pull request?

This PR is a small follow-up on apache#16804. This PR also adds support for nested char/varchar fields in orc.

## How was this patch tested?

I have added a regression test to the `OrcSourceSuite`.

Author: Herman van Hovell <[email protected]>

Closes apache#17030 from hvanhovell/SPARK-19459-follow-up.

```
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
#	sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala
```
This doesn't solve the problem when reading a CHAR/VARCHAR column in Hive from a table created using Spark, does it? Hive will fail when trying to convert the String to its CHAR/VARCHAR type.
What changes were proposed in this pull request?

Reading from an existing ORC table which contains `char` or `varchar` columns can fail with a `ClassCastException` if the table metadata has been created using Spark. This is caused by the fact that Spark internally replaces `char` and `varchar` columns with a `string` column. This PR fixes this by adding the Hive type to the `StructField`'s metadata under the `HIVE_TYPE_STRING` key. This is picked up by the `HiveClient` and the ORC reader; see #16060 for more details on how the metadata is used.

How was this patch tested?

Added a regression test to `OrcSourceSuite`.
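To make the mechanism concrete, here is a sketch of the round trip on a build containing this patch (the table name is made up, and a Hive-enabled session is assumed):

```scala
// Declaring char/varchar through Spark now records the raw Hive type in the
// StructField metadata instead of silently degrading the columns to plain strings.
spark.sql("CREATE TABLE chars_demo(c CHAR(10), v VARCHAR(10)) STORED AS ORC")
spark.table("chars_demo").schema.fields.foreach { f =>
  val hiveType =
    if (f.metadata.contains("HIVE_TYPE_STRING")) f.metadata.getString("HIVE_TYPE_STRING")
    else f.dataType.catalogString
  // Expected: c -> string (hive type char(10)), v -> string (hive type varchar(10))
  println(s"${f.name}: spark type = ${f.dataType.simpleString}, hive type = $hiveType")
}
```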