Conversation

@seancxmao
Contributor

@seancxmao seancxmao commented Aug 29, 2018

What changes were proposed in this pull request?

Apache Spark doesn't create Hive tables with duplicated fields in either case-sensitive or case-insensitive mode. However, if Spark first creates ORC files in case-sensitive mode with columns that differ only in letter case, and then creates a table on that location, the table is created successfully. In this situation, field resolution should fail in case-insensitive mode; otherwise, we don't know which columns will be returned or filtered. Previously, SPARK-25132 fixed the same issue for Parquet.

Here is a simple example:

val data = spark.range(5).selectExpr("id as a", "id * 2 as A")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

sql("CREATE TABLE orc_data_source (A LONG) USING orc LOCATION '/user/hive/warehouse/orc_data'")
spark.conf.set("spark.sql.caseSensitive", false)
sql("select A from orc_data_source").show
+---+
|  A|
+---+
|  3|
|  2|
|  4|
|  1|
|  0|
+---+

See #22148 for more details about the Parquet data source reader.
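The intended behavior can be illustrated with a standalone sketch of ambiguity-aware field resolution. This is hypothetical code, not Spark's implementation: the function name, error messages, and exception types are made up, and Spark's real logic lives in the ORC deserializer and reports an AnalysisException.

```scala
// Resolve a required column name against the physical file schema.
// Hypothetical standalone sketch of the rule this PR enforces.
def resolveField(
    physicalNames: Seq[String],
    required: String,
    caseSensitive: Boolean): String = {
  val matches =
    if (caseSensitive) physicalNames.filter(_ == required)
    else physicalNames.filter(_.equalsIgnoreCase(required))
  matches match {
    case Seq(unique) => unique
    case Seq() =>
      throw new NoSuchElementException(s"Field $required not found")
    case multiple =>
      // Ambiguity: more than one physical column matches case-insensitively.
      throw new RuntimeException(
        s"Found duplicate field(s) for $required: ${multiple.mkString(", ")} " +
          "in case-insensitive mode")
  }
}
```

With the example above, resolving `A` against physical columns `a` and `A` succeeds in case-sensitive mode but fails with an ambiguity error in case-insensitive mode, instead of silently picking one column.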

How was this patch tested?

Unit tests added.

@seancxmao
Contributor Author

@dongjoon-hyun @cloud-fan @gatorsmile Could you please kindly review this?

@dongjoon-hyun
Member

dongjoon-hyun commented Sep 2, 2018

@seancxmao . Could you explain why we need this PR more specifically in the PR description? Apache Spark 2.3.1 already shows exceptions like the following for both ORC and Parquet, doesn't it?

scala> spark.version
res5: String = 2.3.1

scala> sql("set spark.sql.caseSensitive=true")
scala> spark.read.orc("/tmp/o").printSchema
root
 |-- a: integer (nullable = true)
 |-- A: integer (nullable = true)

scala> sql("set spark.sql.caseSensitive=false")
scala> spark.read.orc("/tmp/o").printSchema
18/09/01 20:06:05 WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `a`;
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `a`;

In general, we had better be more specific. The PR claims a general issue, but the test case seems to cover only the following very specific case.

  • When you create an ORC file directly in a case-sensitive manner with case-insensitively duplicated columns like 'a' and 'A'.
  • Then, when you try to access it as a Hive table with a single column 'a' or 'A' (because Hive doesn't allow having both).

@seancxmao
Contributor Author

@dongjoon-hyun I have updated the PR description to explain in more detail. As you mentioned, this PR is specific to the case of reading from a data source table persisted in the metastore.

@seancxmao seancxmao changed the title [SPARK-25175][SQL] Field resolution should fail if there is ambiguity for ORC native reader [SPARK-25175][SQL] Field resolution should fail if there is ambiguity for ORC native data source table persisted in metastore Sep 5, 2018
@dongjoon-hyun
Member

Thank you, @seancxmao . I'll review tonight again.

@dongjoon-hyun
Member

dongjoon-hyun commented Sep 7, 2018

Sorry, but I'm still feeling that this PR is losing focus. How about mentioning what you do in this PR like the following?

Apache Spark doesn't create Hive tables with duplicated fields in either case-sensitive or
case-insensitive mode. However, if Spark first creates ORC files in case-sensitive mode
and then creates a Hive table on that location, the table is created successfully. In this
situation, field resolution should fail in case-insensitive mode. Otherwise, we don't know
which columns will be returned or filtered. Previously, SPARK-25132 fixed the same issue in Parquet.

Here is a simple example:
...

@seancxmao
Contributor Author

I updated the PR description. Thank you for pointing out that the PR description should stay focused. I also think it's clearer now.

@dongjoon-hyun
Member

Thank you, @seancxmao .
Also, I made a PR to you, seancxmao#1 , to simplify the logic.
Could you review and merge that if you think that's okay?

@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95792 has finished for PR 22262 at commit fa2a45f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun Sep 8, 2018


Sorry, @seancxmao . We need to update once more. I'll make a PR to you.

Contributor Author


That's all right :)

@dongjoon-hyun
Member

@seancxmao . I made a mistake yesterday. Could you restore your first commit? In the first commit, please adjust the indentation at line 140. Sorry for the back and forth!

@seancxmao
Contributor Author

@dongjoon-hyun That's all right :). I have reverted to the first commit and adjusted the indentation.

@dongjoon-hyun
Member

dongjoon-hyun commented Sep 8, 2018

BTW, I think we need this duplication check in case-sensitive mode, too. I'll ping on the previous Parquet PR. Never mind, it also throws a RuntimeException, just with a different message, in case-sensitive mode.

scala> sql("select * from parquet").show
java.lang.RuntimeException: [id] optional int32 id was added twice
  at org.apache.parquet.hadoop.ColumnChunkPageReadStore.addColumn(ColumnChunkPageReadStore.java:175)
...

@seancxmao
Contributor Author

seancxmao commented Sep 8, 2018

... we need this duplication check in case-sensitive mode ...

Do you mean we may define an ORC/Parquet schema with identical field names (even with the same letter case)? Could you please explain a bit more?

@dongjoon-hyun
Member

dongjoon-hyun commented Sep 8, 2018

The following is the sequence. LOCATION is accident-prone, but I don't think that is in the scope of this PR.

This is not related to spark.sql.caseSensitive.

scala> sql("insert overwrite local directory '/tmp/parquet' stored as parquet select 1 id, 2 id")
$ parquet-tools schema /tmp/parquet
message hive_schema {
  optional int32 id;
  optional int32 id;
}

The following occurs when spark.sql.caseSensitive is set to true.

scala> sql("create table parquet(id int) USING parquet LOCATION '/tmp/parquet'")
res3: org.apache.spark.sql.DataFrame = []

scala> sql("select * from parquet")
res4: org.apache.spark.sql.DataFrame = [id: int]

scala> sql("select * from parquet").show
18/09/07 23:31:03 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.RuntimeException: [id] INT32 was added twice
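An analysis-time check could catch this before execution. The sketch below is a simplified, hypothetical version in the spirit of Spark's SchemaUtils.checkColumnNameDuplication; the real utility takes a resolver function and throws AnalysisException:

```scala
// Fail fast if a schema contains duplicate column names.
// Simplified sketch: normalizes by lowercasing in case-insensitive mode
// instead of taking a resolver, and throws a generic exception.
def checkDuplicates(columnNames: Seq[String], caseSensitive: Boolean): Unit = {
  val normalized =
    if (caseSensitive) columnNames else columnNames.map(_.toLowerCase)
  val duplicates = normalized.groupBy(identity).collect {
    case (name, group) if group.size > 1 => name
  }
  if (duplicates.nonEmpty) {
    throw new IllegalArgumentException(
      s"Found duplicate column(s): ${duplicates.mkString(", ")}")
  }
}
```

Note that `Seq("id", "id")` fails even with `caseSensitive = true`, so a check like this at table-creation or read-path analysis time would surface the `[id] INT32 was added twice` case above before the executor-side RuntimeException.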

@SparkQA

SparkQA commented Sep 8, 2018

Test build #95825 has finished for PR 22262 at commit 26b4710.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 8, 2018

Test build #95824 has finished for PR 22262 at commit 366bb35.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Sep 8, 2018

Test build #95827 has finished for PR 22262 at commit 26b4710.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM.

cc @cloud-fan and @gatorsmile

@dongjoon-hyun
Member

Merged to master/2.4.

asfgit pushed a commit that referenced this pull request Sep 10, 2018
… for ORC native data source table persisted in metastore

## What changes were proposed in this pull request?
Apache Spark doesn't create Hive tables with duplicated fields in either case-sensitive or case-insensitive mode. However, if Spark first creates ORC files in case-sensitive mode with columns that differ only in letter case, and then creates a table on that location, the table is created successfully. In this situation, field resolution should fail in case-insensitive mode; otherwise, we don't know which columns will be returned or filtered. Previously, SPARK-25132 fixed the same issue for Parquet.

Here is a simple example:

```
val data = spark.range(5).selectExpr("id as a", "id * 2 as A")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

sql("CREATE TABLE orc_data_source (A LONG) USING orc LOCATION '/user/hive/warehouse/orc_data'")
spark.conf.set("spark.sql.caseSensitive", false)
sql("select A from orc_data_source").show
+---+
|  A|
+---+
|  3|
|  2|
|  4|
|  1|
|  0|
+---+
```

See #22148 for more details about the Parquet data source reader.

## How was this patch tested?
Unit tests added.

Closes #22262 from seancxmao/SPARK-25175.

Authored-by: seancxmao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit a0aed47)
Signed-off-by: Dongjoon Hyun <[email protected]>
@asfgit asfgit closed this in a0aed47 Sep 10, 2018
@dongjoon-hyun
Member

Thank you, @seancxmao .

@seancxmao
Contributor Author

@dongjoon-hyun Thank you!

