[SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources #29234
| .add("camelcase", LongType) | ||
| .add("CamelCase", LongType)) | ||
| ).foreach { case (selectExpr: Seq[String], caseInsensitiveSchema: StructType) => | ||
| withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") { |
shall we test both v1 and v2?
`AvroSuite` tests both. We could run the entire `FileBasedDataSourceSuite` for v1 and v2.
I checked both v1 and v2; see `NestedDataSourceV1Suite` and `NestedDataSourceV2Suite`.
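For context, here is a minimal sketch of how such a v1/v2 split is typically wired up in Spark's test framework. The trait structure below is an assumption, not the PR's exact code; only the suite names come from the comment above.

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SharedSparkSession

// Shared checks for duplicate nested columns would live in the base trait;
// the concrete suites only pin the file sources to the v1 or v2 path.
trait NestedDataSourceSuiteBase extends QueryTest with SharedSparkSession

class NestedDataSourceV1Suite extends NestedDataSourceSuiteBase {
  // Keep the built-in file sources on the v1 code path.
  override protected def sparkConf =
    super.sparkConf.set(SQLConf.USE_V1_SOURCE_LIST.key, "avro,json,orc,parquet")
}

class NestedDataSourceV2Suite extends NestedDataSourceSuiteBase {
  // An empty list routes all of them through DataSource V2.
  override protected def sparkConf =
    super.sparkConf.set(SQLConf.USE_V1_SOURCE_LIST.key, "")
}
```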
@HyukjinKwon @cloud-fan Could you review this PR?
thanks, merging to master!
…atasource

### What changes were proposed in this pull request?
Check that there are no duplicate column names on the same level (top level or nested levels) when reading from the JDBC datasource. If such duplicate columns exist, throw the exception:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the customSchema option value:
```
The check takes into account the SQL config `spark.sql.caseSensitive` (`false` by default).

### Why are the changes needed?
To make the handling of duplicate nested columns similar to the handling of duplicate top-level columns, i.e. output the same error:
```scala
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the customSchema option value: `camelcase`
```
Checking of top-level duplicates was introduced by #17758, and duplicates in nested structures by #29234.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Added the new test suite `JdbcNestedDataSourceSuite`.

Closes #29317 from MaxGekk/jdbc-dup-nested-columns.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
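As an illustration of the follow-up's behavior, here is a hedged sketch of a JDBC read that would trip the new check. The URL and table name are placeholders, and an active SparkSession `spark` is assumed.

```scala
// Placeholder URL/table; assumes an active SparkSession `spark`.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/testdb")
  .option("dbtable", "t")
  // Nested fields that collide when spark.sql.caseSensitive=false:
  .option("customSchema", "s STRUCT<camelcase: BIGINT, CamelCase: BIGINT>")
  .load()
// Expected failure after this follow-up:
// org.apache.spark.sql.AnalysisException:
//   Found duplicate column(s) in the customSchema option value: `camelcase`
```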
- In Spark 3.1, `from_unixtime`, `unix_timestamp`, `to_unix_timestamp`, `to_timestamp` and `to_date` will fail if the specified datetime pattern is invalid. In Spark 3.0 or earlier, they return `NULL`.
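A quick sketch of that behavior change, assuming an active SparkSession `spark` and that `i` is not a recognized pattern letter:

```scala
// `i` is not a valid datetime pattern letter, so in Spark 3.1 the pattern
// itself is rejected instead of the expression evaluating to NULL.
spark.sql("SELECT to_timestamp('2020-07', 'yyyy-ii')").show()
// Spark 3.1:           fails with an invalid-pattern error
// Spark 3.0 and older: the expression evaluated to NULL
```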
- In Spark 3.1, the Parquet, ORC, Avro and JSON datasources throw the exception `org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema` in read if they detect duplicate names in top-level columns as well as in nested structures. The datasources take into account the SQL config `spark.sql.caseSensitive` while detecting column name duplicates.
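A minimal sketch of the new read-time behavior; the file path, its contents, and the exact message text are assumptions:

```scala
import org.apache.spark.sql.AnalysisException

// Assume /tmp/dup.json contains: {"s": {"camelcase": 1, "CamelCase": 2}}
try {
  spark.read.json("/tmp/dup.json").show()
} catch {
  case e: AnalysisException =>
    // e.g. "Found duplicate column(s) in the data schema: ..."
    println(e.getMessage)
}
```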
### What changes were proposed in this pull request?
Check that there are no duplicate column names on the same level (top level or nested levels) when reading from the built-in datasources Parquet, ORC, Avro and JSON. If such duplicate columns exist, throw the exception `org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema`.
The check takes into account the SQL config `spark.sql.caseSensitive` (`false` by default).
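For intuition, a self-contained sketch of the idea behind the check (not the exact `SchemaUtils` code): recursively walk structs, arrays and maps, and flag sibling fields whose names collide under the active case sensitivity.

```scala
import java.util.Locale

import org.apache.spark.sql.types._

// At each struct level, compare sibling field names (lower-cased when
// case-insensitive) and fail on any collision, then recurse into children.
// The real implementation throws AnalysisException; require() stands in here.
def checkNestedDuplicates(dt: DataType, caseSensitive: Boolean): Unit = dt match {
  case st: StructType =>
    val names = st.fields.toSeq.map { f =>
      if (caseSensitive) f.name else f.name.toLowerCase(Locale.ROOT)
    }
    val dups = names.groupBy(identity).collect { case (n, vs) if vs.size > 1 => n }
    require(dups.isEmpty,
      s"Found duplicate column(s) in the data schema: ${dups.mkString(", ")}")
    st.fields.foreach(f => checkNestedDuplicates(f.dataType, caseSensitive))
  case ArrayType(elementType, _) =>
    checkNestedDuplicates(elementType, caseSensitive)
  case MapType(keyType, valueType, _) =>
    checkNestedDuplicates(keyType, caseSensitive)
    checkNestedDuplicates(valueType, caseSensitive)
  case _ => // leaf type: nothing to check
}
```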
### Why are the changes needed?
To make the handling of duplicate nested columns similar to the handling of duplicate top-level columns, i.e. output the same `Found duplicate column(s) in the data schema` error. Checking of top-level duplicates was introduced by #17758.
### Does this PR introduce _any_ user-facing change?
Yes. For the example from SPARK-32431:

ORC:

JSON:

Parquet:

Avro:

After the changes, Parquet, ORC, JSON and Avro output the same error: `org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema`.
### How was this patch tested?
Ran the modified test suites and added a new UT to `SchemaUtilsSuite`.
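A hedged sketch of what such a unit test could look like; the suite name and the asserted message fragment are assumptions, and the real tests live in `SchemaUtilsSuite` and the NestedDataSource suites.

```scala
import org.apache.spark.sql.{AnalysisException, QueryTest}
import org.apache.spark.sql.test.SharedSparkSession
import org.apache.spark.sql.types._

class DupNestedColumnsSketchSuite extends QueryTest with SharedSparkSession {
  import testImplicits._

  test("duplicate nested column names are rejected in case-insensitive mode") {
    val schema = new StructType()
      .add("s", new StructType()
        .add("camelcase", LongType)
        .add("CamelCase", LongType))
    val e = intercept[AnalysisException] {
      // The user-specified read schema carries the colliding nested fields.
      spark.read.schema(schema)
        .json(Seq("""{"s": {"camelcase": 1}}""").toDS())
        .collect()
    }
    assert(e.getMessage.contains("Found duplicate column(s)"))
  }
}
```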