[SPARK-32510][SQL] Check duplicate nested columns in read from JDBC datasource #29317
Conversation
SchemaUtils.checkSchemaColumnNameDuplication(
  userSchema,
Here is the fix: replacing checkColumnNameDuplication with checkSchemaColumnNameDuplication.
@cloud-fan @HyukjinKwon @maropu Please review this PR.
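To illustrate why a schema-wide check is needed here: a flat duplicate check only compares top-level names, while a schema-wide check must also recurse into nested struct fields and apply the check per nesting level. The following is a simplified toy sketch of that idea (the types, names, and error text are illustrative, not Spark's actual SchemaUtils API):

```scala
// Toy model of a per-level duplicate-name check that recurses into
// nested fields. `Field` stands in for StructField; this is NOT the
// actual Spark implementation, just a sketch of the technique.
case class Field(name: String, children: Seq[Field] = Nil)

def checkDuplicates(fields: Seq[Field], caseSensitive: Boolean): Unit = {
  // Fold names case-insensitively unless case-sensitive analysis is on.
  val names = fields.map(f => if (caseSensitive) f.name else f.name.toLowerCase)
  val dups = names.groupBy(identity).collect { case (n, xs) if xs.size > 1 => n }
  if (dups.nonEmpty) {
    throw new IllegalArgumentException(
      s"Found duplicate column(s): ${dups.mkString(", ")}")
  }
  // Recurse: duplicates are checked within each level, not across levels.
  fields.foreach(f => checkDuplicates(f.children, caseSensitive))
}
```

With this shape, `a` and `A` collide at the same level when `caseSensitive` is false, but a field `a` at the top level and another `a` inside a struct do not collide, which matches the "same level" wording in the PR description.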
import org.apache.spark.sql.types.StructType
import org.apache.spark.util.Utils
class JdbcNestedDataSourceSuite extends NestedDataSourceSuiteBase {
nit: Jdbc->JDBC
I have found 7 files whose names start with Jdbc:
➜ apache-spark git:(master) find . -name "Jdbc*.scala" -type f
./core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala
./core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtilsSuite.scala
./sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
./sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/JdbcConnectionUriSuite.scala
and 8 that start with JDBC:
➜ apache-spark git:(master) find . -name "JDBC*.scala" -type f
./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala
./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala
If you don't mind, I will rename all those files (and classes) to use the JDBC prefix.
Here is the PR for the renaming: #29323
I would prefer JDBC, as @maropu suggests, but I don't feel strongly about whether we should rename the others.
    tableSchema: StructType,
    customSchema: String,
    nameEquality: Resolver): StructType = {
  if (null != customSchema && customSchema.nonEmpty) {
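For context, the diff above shows the parameters of the helper that applies the customSchema option. A simplified sketch of the surrounding flow, with this PR's check added after parsing (the body below is illustrative and abridged, not the actual Spark code; the message string is an assumption):

```scala
// Illustrative sketch only: shows where the duplicate-name check fits
// relative to the parameters visible in the diff. Bodies are abridged.
def getCustomSchema(
    tableSchema: StructType,
    customSchema: String,
    nameEquality: Resolver): StructType = {
  if (null != customSchema && customSchema.nonEmpty) {
    val userSchema = CatalystSqlParser.parseTableSchema(customSchema)
    // The fix: validate the parsed user schema at every nesting level,
    // not only the top level, before merging it with the table schema.
    SchemaUtils.checkSchemaColumnNameDuplication(
      userSchema, "in the customSchema option value", nameEquality)
    // ... merge userSchema into tableSchema (elided) ...
    tableSchema
  } else {
    tableSchema
  }
}
```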
I don't know which JDBC server supports nested schema. But IIUC this feature is to specify the type, and I think it can be used to specify the data type of nested fields as well.
Yea, it can be, and accepting nested fields looks okay. Either way, I think we need more test cases for customSchema with nested fields, arrays, maps, ...
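A sketch of the kind of test case being requested, in ScalaTest style; the test name, the `urlWithUserAndPass` helper, the table name, and the message fragment are assumptions for illustration, not taken from the actual suite:

```scala
// Hypothetical test sketch: a customSchema whose nested struct has
// field names differing only by case should fail under the default
// case-insensitive resolution. Helper names here are placeholders.
test("customSchema with duplicate nested field names") {
  val e = intercept[AnalysisException] {
    spark.read.format("jdbc")
      .option("url", urlWithUserAndPass)
      .option("dbtable", "TEST.PEOPLE")
      .option("customSchema", "st STRUCT<a: INT, A: STRING>")
      .load()
  }
  assert(e.getMessage.contains("Found duplicate column(s)"))
}
```

Analogous cases could cover arrays of structs and map value types, exercising the same per-level check through different nesting shapes.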
JDBC spec mentions the STRUCT type, for example https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#STRUCT.
At least, you can access a Spark cluster from another Spark cluster via JDBC ;-)
Test build #126888 has finished for PR 29317 at commit
Test build #126893 has finished for PR 29317 at commit
Thanks, merging to master!
What changes were proposed in this pull request?
Check that there are no duplicate column names on the same level (top level or nested levels) when reading from the JDBC datasource. If such duplicate columns exist, throw an exception. The check takes into account the SQL config spark.sql.caseSensitive (false by default).
Why are the changes needed?
To make the handling of duplicate nested columns similar to the handling of duplicate top-level columns, i.e. to output the same error. Checking of top-level duplicates was introduced by #17758, and checking of duplicates in nested structures by #29234.
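For illustration, the user-facing change could look like the following: a customSchema with nested field names that differ only by case fails under the default case-insensitive resolution, while enabling spark.sql.caseSensitive makes the names distinct. The URL, table name, and error wording are placeholders, not taken from the PR:

```scala
// Hypothetical illustration of the user-facing change. The JDBC URL,
// table name, and exact exception message are placeholders.
spark.conf.set("spark.sql.caseSensitive", false)  // the default
spark.read.format("jdbc")
  .option("url", "jdbc:h2:mem:testdb")
  .option("dbtable", "people")
  .option("customSchema", "st STRUCT<a: INT, A: STRING>")
  .load()  // after this PR: throws, duplicate column `a` in the struct

spark.conf.set("spark.sql.caseSensitive", true)
// With case-sensitive resolution, `a` and `A` are distinct names,
// so the same read is accepted.
```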
Does this PR introduce any user-facing change?
Yes.
How was this patch tested?
Added the new test suite JdbcNestedDataSourceSuite.