[SPARK-25175][SQL] Field resolution should fail if there is ambiguity for ORC native data source table persisted in metastore #22262
Conversation
…for ORC native reader
@dongjoon-hyun @cloud-fan @gatorsmile Could you please kindly review this?
@seancxmao. Could you explain why we need this PR more specifically in the PR description? Apache Spark 2.3.1 already shows exceptions like the following for both ORC and Parquet, doesn't it?

```
scala> spark.version
res5: String = 2.3.1

scala> sql("set spark.sql.caseSensitive=true")
scala> spark.read.orc("/tmp/o").printSchema
root
 |-- a: integer (nullable = true)
 |-- A: integer (nullable = true)

scala> sql("set spark.sql.caseSensitive=false")
scala> spark.read.orc("/tmp/o").printSchema
18/09/01 20:06:05 WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `a`;
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `a`;
```

In general, we had better be more specific. The PR claims a general issue, but the test case seems to cover only the following very specific case.
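The duplicate-column check shown in that shell session can be sketched outside Spark. The function below is a toy model, not Spark's actual implementation (the real logic lives in Spark's `SchemaUtils`): in case-insensitive mode names are compared after lower-casing, so `a` and `A` collide.

```python
def check_column_name_duplication(names, case_sensitive):
    # Toy model of Spark's duplicate-column check (illustrative only):
    # compare names verbatim in case-sensitive mode, lower-cased otherwise.
    seen = set()
    for name in names:
        key = name if case_sensitive else name.lower()
        if key in seen:
            raise ValueError(
                "Found duplicate column(s) in the data schema: `%s`;" % key)
        seen.add(key)

# Case-sensitive mode: `a` and `A` are distinct columns, so no error.
check_column_name_duplication(["a", "A"], case_sensitive=True)

# Case-insensitive mode: the same schema is rejected as ambiguous.
try:
    check_column_name_duplication(["a", "A"], case_sensitive=False)
except ValueError as e:
    print(e)
```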
@dongjoon-hyun I have updated the PR description to explain in more detail. As you mentioned, this PR is specific to the case of reading from a data source table persisted in the metastore.
Thank you, @seancxmao. I'll review again tonight.
Sorry, but I still feel this PR is losing focus. How about describing what you do in this PR like the following?
I updated the PR description. Thank you for pointing out that the PR description should stay focused. I also think it's clearer now.
Thank you, @seancxmao.
ok to test
Test build #95792 has finished for PR 22262 at commit
Sorry, @seancxmao . We need to update once more. I'll make a PR to you.
That's all right :)
@seancxmao. I made a mistake yesterday. Could you restore your first commit? In the first commit, please adjust the indentation at line 140. Sorry for the back and forth!
fa2a45f to 366bb35 (force-pushed)
@dongjoon-hyun That's all right :). I have reverted to the first commit and adjusted the indentation.
```
scala> sql("select * from parquet").show
java.lang.RuntimeException: [id] optional int32 id was added twice
	at org.apache.parquet.hadoop.ColumnChunkPageReadStore.addColumn(ColumnChunkPageReadStore.java:175)
...
```
Do you mean we may define an ORC/Parquet schema with identical field names (even with the same letter case)? Would you please explain a bit more about this?
The following is the sequence. This is not related to:

```
scala> sql("insert overwrite local directory '/tmp/parquet' stored as parquet select 1 id, 2 id")
```

The following occurs when:

```
scala> sql("create table parquet(id int) USING parquet LOCATION '/tmp/parquet'")
res3: org.apache.spark.sql.DataFrame = []

scala> sql("select * from parquet")
res4: org.apache.spark.sql.DataFrame = [id: int]

scala> sql("select * from parquet").show
18/09/07 23:31:03 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.RuntimeException: [id] INT32 was added twice
```
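The "was added twice" failure above can be modeled with a small sketch. The class below only mimics the behavior of Parquet's `ColumnChunkPageReadStore.addColumn` and is not the real implementation: the reader keeps one page store per column path, so a file whose schema lists the same path twice cannot be registered.

```python
class PageReadStore:
    # Hypothetical model of a Parquet-style per-column page store:
    # each column path may be registered exactly once.
    def __init__(self):
        self._readers = {}

    def add_column(self, path, reader):
        if path in self._readers:
            raise RuntimeError("[%s] was added twice" % path)
        self._readers[path] = reader

store = PageReadStore()
store.add_column("id", object())

# A file written with "select 1 id, 2 id" carries two physical `id`
# columns, so registering the second one fails:
try:
    store.add_column("id", object())
except RuntimeError as e:
    print(e)  # [id] was added twice
```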
Test build #95825 has finished for PR 22262 at commit
Test build #95824 has finished for PR 22262 at commit
Retest this please.
Test build #95827 has finished for PR 22262 at commit
dongjoon-hyun left a comment:
+1, LGTM.
cc @cloud-fan and @gatorsmile
Merged to master/2.4.
[SPARK-25175][SQL] Field resolution should fail if there is ambiguity for ORC native data source table persisted in metastore

## What changes were proposed in this pull request?

Apache Spark doesn't create a Hive table with duplicated fields in either case-sensitive or case-insensitive mode. However, if Spark creates the ORC files in case-sensitive mode first and a Hive table is then created on that location, the underlying files may contain fields whose names differ only in letter case. In this situation, field resolution should fail in case-insensitive mode; otherwise, we don't know which columns will be returned or filtered. Previously, SPARK-25132 fixed the same issue for Parquet.

Here is a simple example:

```
val data = spark.range(5).selectExpr("id as a", "id * 2 as A")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

sql("CREATE TABLE orc_data_source (A LONG) USING orc LOCATION '/user/hive/warehouse/orc_data'")
spark.conf.set("spark.sql.caseSensitive", false)
sql("select A from orc_data_source").show
+---+
|  A|
+---+
|  3|
|  2|
|  4|
|  1|
|  0|
+---+
```

See #22148 for more details about the Parquet data source reader.

## How was this patch tested?

Unit tests added.

Closes #22262 from seancxmao/SPARK-25175.

Authored-by: seancxmao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit a0aed47)
Signed-off-by: Dongjoon Hyun <[email protected]>
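The core of the fix can be sketched as a field-resolution helper. The function below is illustrative only (the name and signature are not Spark's actual API): in case-insensitive mode, when more than one physical field in the file matches the requested column, resolution fails instead of silently picking one match.

```python
def resolve_field(requested, physical_fields, case_sensitive):
    # Hypothetical sketch of ambiguity-aware field resolution:
    # match the requested column against the physical file schema
    # and raise when the match is ambiguous.
    if case_sensitive:
        matches = [f for f in physical_fields if f == requested]
    else:
        matches = [f for f in physical_fields
                   if f.lower() == requested.lower()]
    if len(matches) > 1:
        raise RuntimeError(
            'Found duplicate field(s) "%s": %s in case-insensitive mode'
            % (requested, matches))
    return matches[0] if matches else None

# Case-sensitive mode: exactly one field named `A`, resolution succeeds.
print(resolve_field("A", ["a", "A"], case_sensitive=True))

# Case-insensitive mode: both `a` and `A` match, so resolution fails
# rather than returning arbitrary data as in the example above.
try:
    resolve_field("A", ["a", "A"], case_sensitive=False)
except RuntimeError as e:
    print(e)
```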
Thank you, @seancxmao.
@dongjoon-hyun Thank you!