Conversation

@seancxmao
Contributor

@seancxmao seancxmao commented Aug 29, 2018

What changes were proposed in this pull request?

Apache Spark doesn't create Hive tables with duplicated fields in either case-sensitive or case-insensitive mode. However, if Spark first creates ORC files in case-sensitive mode with columns that differ only in letter case, and then creates a table on that location, the table is created successfully. In this situation, field resolution should fail in case-insensitive mode; otherwise, we don't know which columns will be returned or filtered. Previously, SPARK-25132 fixed the same issue for Parquet.

Here is a simple example:

val data = spark.range(5).selectExpr("id as a", "id * 2 as A")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

sql("CREATE TABLE orc_data_source (A LONG) USING orc LOCATION '/user/hive/warehouse/orc_data'")
spark.conf.set("spark.sql.caseSensitive", false)
sql("select A from orc_data_source").show
+---+
|  A|
+---+
|  3|
|  2|
|  4|
|  1|
|  0|
+---+

See #22148 for more details about the Parquet data source reader.
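The intended behavior can be illustrated with a standalone sketch of ambiguity-aware field resolution. This is hypothetical code, not Spark's implementation: the function name, error messages, and exception types are made up, and Spark's real logic lives in the ORC deserializer and reports an AnalysisException.

```scala
// Resolve a required column name against the physical file schema.
// Hypothetical standalone sketch of the rule this PR enforces.
def resolveField(
    physicalNames: Seq[String],
    required: String,
    caseSensitive: Boolean): String = {
  val matches =
    if (caseSensitive) physicalNames.filter(_ == required)
    else physicalNames.filter(_.equalsIgnoreCase(required))
  matches match {
    case Seq(unique) => unique
    case Seq() =>
      throw new NoSuchElementException(s"Field $required not found")
    case multiple =>
      // Ambiguity: more than one physical column matches case-insensitively.
      throw new RuntimeException(
        s"Found duplicate field(s) for $required: ${multiple.mkString(", ")} " +
          "in case-insensitive mode")
  }
}
```

With the example above, resolving `A` against physical columns `a` and `A` succeeds in case-sensitive mode but fails with an ambiguity error in case-insensitive mode, instead of silently picking one column.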

How was this patch tested?

Unit tests added.

@seancxmao
Contributor Author

@dongjoon-hyun @cloud-fan @gatorsmile Could you please kindly review this?

@dongjoon-hyun
Member

dongjoon-hyun commented Sep 2, 2018

@seancxmao . Could you explain why we need this PR more specifically in the PR description? Apache Spark 2.3.1 already shows exceptions like the following for both ORC and Parquet, doesn't it?

scala> spark.version
res5: String = 2.3.1

scala> sql("set spark.sql.caseSensitive=true")
scala> spark.read.orc("/tmp/o").printSchema
root
 |-- a: integer (nullable = true)
 |-- A: integer (nullable = true)

scala> sql("set spark.sql.caseSensitive=false")
scala> spark.read.orc("/tmp/o").printSchema
18/09/01 20:06:05 WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `a`;
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `a`;

In general, we had better be more specific. The PR claims a general issue, but the test case seems to cover only the following very specific case.

  • When you create an ORC file directly in a case-sensitive manner with case-insensitively duplicated columns like 'a' and 'A'.
  • Then, when you try to access it as a Hive table with a single column 'a' or 'A' (because Hive doesn't allow having both).

@seancxmao
Contributor Author

@dongjoon-hyun I have updated the PR description to explain in more detail. As you mentioned, this PR is specific to the case of reading from a data source table persisted in the metastore.

@seancxmao seancxmao changed the title [SPARK-25175][SQL] Field resolution should fail if there is ambiguity for ORC native reader [SPARK-25175][SQL] Field resolution should fail if there is ambiguity for ORC native data source table persisted in metastore Sep 5, 2018
@dongjoon-hyun
Member

Thank you, @seancxmao . I'll review tonight again.

@dongjoon-hyun
Member

dongjoon-hyun commented Sep 7, 2018

Sorry, but I'm still feeling that this PR is losing focus. How about mentioning what you do in this PR like the following?

Apache Spark doesn't create Hive tables with duplicated fields in either case-sensitive or
case-insensitive mode. However, if Spark first creates ORC files in case-sensitive mode
and then creates a Hive table on that location, the table is created successfully. In this
situation, field resolution should fail in case-insensitive mode. Otherwise, we don't know
which columns will be returned or filtered. Previously, SPARK-25132 fixed the same issue in Parquet.

Here is a simple example:
...

@seancxmao
Contributor Author

I updated the PR description. Thank you for pointing out that the PR description should stay focused. I also think it's clearer now.

@dongjoon-hyun
Member

Thank you, @seancxmao .
Also, I made a PR to you, seancxmao#1 , to simplify the logic.
Could you review and merge that if you think that's okay?

@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95792 has finished for PR 22262 at commit fa2a45f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun Sep 8, 2018


Sorry, @seancxmao . We need to update once more. I'll make a PR to you.

Contributor Author


That's all right :)

@dongjoon-hyun
Member

@seancxmao . I made a mistake yesterday. Could you restore your first commit? In the first commit, please adjust the indentation at line 140. Sorry for the back and forth!

@seancxmao
Contributor Author

@dongjoon-hyun That's all right :). I have reverted to the first commit and adjusted the indentation.

@dongjoon-hyun
Member

dongjoon-hyun commented Sep 8, 2018

BTW, I think we need this duplication check in case-sensitive mode, too. I'll ping on the previous Parquet PR. Never mind, it also throws a RuntimeException, just with a different message, in case-sensitive mode.

scala> sql("select * from parquet").show
java.lang.RuntimeException: [id] optional int32 id was added twice
  at org.apache.parquet.hadoop.ColumnChunkPageReadStore.addColumn(ColumnChunkPageReadStore.java:175)
...

@seancxmao
Contributor Author

seancxmao commented Sep 8, 2018

... we need this duplication check in case-sensitive mode ...

Do you mean we may define an ORC/Parquet schema with identical field names (even with the same letter case)? Could you please explain a bit more?

@dongjoon-hyun
Member

dongjoon-hyun commented Sep 8, 2018

The following is the sequence. LOCATION is accident-prone, but I don't think that is in the scope of this PR.

This is not related to spark.sql.caseSensitive.

scala> sql("insert overwrite local directory '/tmp/parquet' stored as parquet select 1 id, 2 id")
$ parquet-tools schema /tmp/parquet
message hive_schema {
  optional int32 id;
  optional int32 id;
}

The following occurs when spark.sql.caseSensitive is set to true.

scala> sql("create table parquet(id int) USING parquet LOCATION '/tmp/parquet'")
res3: org.apache.spark.sql.DataFrame = []

scala> sql("select * from parquet")
res4: org.apache.spark.sql.DataFrame = [id: int]

scala> sql("select * from parquet").show
18/09/07 23:31:03 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.RuntimeException: [id] INT32 was added twice
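An analysis-time check could catch this before execution. The sketch below is a simplified, hypothetical version in the spirit of Spark's SchemaUtils.checkColumnNameDuplication; the real utility takes a resolver function and throws AnalysisException:

```scala
// Fail fast if a schema contains duplicate column names.
// Simplified sketch: normalizes by lowercasing in case-insensitive mode
// instead of taking a resolver, and throws a generic exception.
def checkDuplicates(columnNames: Seq[String], caseSensitive: Boolean): Unit = {
  val normalized =
    if (caseSensitive) columnNames else columnNames.map(_.toLowerCase)
  val duplicates = normalized.groupBy(identity).collect {
    case (name, group) if group.size > 1 => name
  }
  if (duplicates.nonEmpty) {
    throw new IllegalArgumentException(
      s"Found duplicate column(s): ${duplicates.mkString(", ")}")
  }
}
```

Note that `Seq("id", "id")` fails even with `caseSensitive = true`, so a check like this at table-creation or read-path analysis time would surface the `[id] INT32 was added twice` case above before the executor-side RuntimeException.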

@SparkQA

SparkQA commented Sep 8, 2018

Test build #95825 has finished for PR 22262 at commit 26b4710.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 8, 2018

Test build #95824 has finished for PR 22262 at commit 366bb35.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Sep 8, 2018

Test build #95827 has finished for PR 22262 at commit 26b4710.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM.

cc @cloud-fan and @gatorsmile

@dongjoon-hyun
Member

Merged to master/2.4.

asfgit pushed a commit that referenced this pull request Sep 10, 2018
… for ORC native data source table persisted in metastore

## What changes were proposed in this pull request?
Apache Spark doesn't create Hive tables with duplicated fields in either case-sensitive or case-insensitive mode. However, if Spark first creates ORC files in case-sensitive mode with columns that differ only in letter case, and then creates a table on that location, the table is created successfully. In this situation, field resolution should fail in case-insensitive mode; otherwise, we don't know which columns will be returned or filtered. Previously, SPARK-25132 fixed the same issue for Parquet.

Here is a simple example:

```
val data = spark.range(5).selectExpr("id as a", "id * 2 as A")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

sql("CREATE TABLE orc_data_source (A LONG) USING orc LOCATION '/user/hive/warehouse/orc_data'")
spark.conf.set("spark.sql.caseSensitive", false)
sql("select A from orc_data_source").show
+---+
|  A|
+---+
|  3|
|  2|
|  4|
|  1|
|  0|
+---+
```

See #22148 for more details about the Parquet data source reader.

## How was this patch tested?
Unit tests added.

Closes #22262 from seancxmao/SPARK-25175.

Authored-by: seancxmao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit a0aed47)
Signed-off-by: Dongjoon Hyun <[email protected]>
@asfgit asfgit closed this in a0aed47 Sep 10, 2018
@dongjoon-hyun
Member

Thank you, @seancxmao .

@seancxmao
Contributor Author

@dongjoon-hyun Thank you!

