[SPARK-32510][SQL] Check duplicate nested columns in read from JDBC datasource #29317
Conversation
SchemaUtils.checkSchemaColumnNameDuplication(
  userSchema,
Here is the fix: replacing checkColumnNameDuplication with checkSchemaColumnNameDuplication.
@cloud-fan @HyukjinKwon @maropu Please review this PR.
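To illustrate why a schema-wide check is needed here: a flat duplicate check only compares top-level names, while a schema-wide check must also recurse into nested struct fields and apply the check per nesting level. The following is a simplified toy sketch of that idea (the types, names, and error text are illustrative, not Spark's actual SchemaUtils API):

```scala
// Toy model of a per-level duplicate-name check that recurses into
// nested fields. `Field` stands in for StructField; this is NOT the
// actual Spark implementation, just a sketch of the technique.
case class Field(name: String, children: Seq[Field] = Nil)

def checkDuplicates(fields: Seq[Field], caseSensitive: Boolean): Unit = {
  // Fold names case-insensitively unless case-sensitive analysis is on.
  val names = fields.map(f => if (caseSensitive) f.name else f.name.toLowerCase)
  val dups = names.groupBy(identity).collect { case (n, xs) if xs.size > 1 => n }
  if (dups.nonEmpty) {
    throw new IllegalArgumentException(
      s"Found duplicate column(s): ${dups.mkString(", ")}")
  }
  // Recurse: duplicates are checked within each level, not across levels.
  fields.foreach(f => checkDuplicates(f.children, caseSensitive))
}
```

With this shape, `a` and `A` collide at the same level when `caseSensitive` is false, but a field `a` at the top level and another `a` inside a struct do not collide, which matches the "same level" wording in the PR description.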
import org.apache.spark.sql.types.StructType
import org.apache.spark.util.Utils
class JdbcNestedDataSourceSuite extends NestedDataSourceSuiteBase {
nit: Jdbc->JDBC
I have found 7 files whose names start with Jdbc:
➜ apache-spark git:(master) find . -name "Jdbc*.scala" -type f
./core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala
./core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtilsSuite.scala
./sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
./sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/JdbcConnectionUriSuite.scala
and 8 that start with JDBC:
➜ apache-spark git:(master) find . -name "JDBC*.scala" -type f
./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala
./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala
If you don't mind, I will rename all those files (and classes) to use the JDBC prefix.
Here is the PR for the renaming: #29323
I would prefer JDBC, as @maropu suggests, but I don't feel strongly about whether we should rename the others.
    tableSchema: StructType,
    customSchema: String,
    nameEquality: Resolver): StructType = {
  if (null != customSchema && customSchema.nonEmpty) {
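For context, the diff above shows the parameters of the helper that applies the customSchema option. A simplified sketch of the surrounding flow, with this PR's check added after parsing (the body below is illustrative and abridged, not the actual Spark code; the message string is an assumption):

```scala
// Illustrative sketch only: shows where the duplicate-name check fits
// relative to the parameters visible in the diff. Bodies are abridged.
def getCustomSchema(
    tableSchema: StructType,
    customSchema: String,
    nameEquality: Resolver): StructType = {
  if (null != customSchema && customSchema.nonEmpty) {
    val userSchema = CatalystSqlParser.parseTableSchema(customSchema)
    // The fix: validate the parsed user schema at every nesting level,
    // not only the top level, before merging it with the table schema.
    SchemaUtils.checkSchemaColumnNameDuplication(
      userSchema, "in the customSchema option value", nameEquality)
    // ... merge userSchema into tableSchema (elided) ...
    tableSchema
  } else {
    tableSchema
  }
}
```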
I don't know which JDBC server supports nested schema. But IIUC this feature is to specify the type, and I think it can be used to specify the data type of nested fields as well.
Yea, it can be, and accepting nested fields looks okay. Either way, I think we need more test cases for customSchema with nested fields, arrays, maps, ...
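A sketch of the kind of test case being requested, in ScalaTest style; the test name, the `urlWithUserAndPass` helper, the table name, and the message fragment are assumptions for illustration, not taken from the actual suite:

```scala
// Hypothetical test sketch: a customSchema whose nested struct has
// field names differing only by case should fail under the default
// case-insensitive resolution. Helper names here are placeholders.
test("customSchema with duplicate nested field names") {
  val e = intercept[AnalysisException] {
    spark.read.format("jdbc")
      .option("url", urlWithUserAndPass)
      .option("dbtable", "TEST.PEOPLE")
      .option("customSchema", "st STRUCT<a: INT, A: STRING>")
      .load()
  }
  assert(e.getMessage.contains("Found duplicate column(s)"))
}
```

Analogous cases could cover arrays of structs and map value types, exercising the same per-level check through different nesting shapes.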
JDBC spec mentions the STRUCT type, for example https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#STRUCT.
At least, you can access a Spark cluster from another Spark cluster via JDBC ;-)
Test build #126888 has finished for PR 29317 at commit
Test build #126893 has finished for PR 29317 at commit
Thanks, merging to master!
What changes were proposed in this pull request?
Check that there are no duplicate column names on the same level (top level or nested levels) when reading from the JDBC datasource. If such duplicate columns exist, throw an exception. The check takes into account the SQL config spark.sql.caseSensitive (false by default).
Why are the changes needed?
To make the handling of duplicate nested columns similar to the handling of duplicate top-level columns, i.e. to output the same error. Checking of top-level duplicates was introduced by #17758, and checking of duplicates in nested structures by #29234.
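For illustration, the user-facing change could look like the following: a customSchema with nested field names that differ only by case fails under the default case-insensitive resolution, while enabling spark.sql.caseSensitive makes the names distinct. The URL, table name, and error wording are placeholders, not taken from the PR:

```scala
// Hypothetical illustration of the user-facing change. The JDBC URL,
// table name, and exact exception message are placeholders.
spark.conf.set("spark.sql.caseSensitive", false)  // the default
spark.read.format("jdbc")
  .option("url", "jdbc:h2:mem:testdb")
  .option("dbtable", "people")
  .option("customSchema", "st STRUCT<a: INT, A: STRING>")
  .load()  // after this PR: throws, duplicate column `a` in the struct

spark.conf.set("spark.sql.caseSensitive", true)
// With case-sensitive resolution, `a` and `A` are distinct names,
// so the same read is accepted.
```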
Does this PR introduce any user-facing change?
Yes.
How was this patch tested?
Added the new test suite JdbcNestedDataSourceSuite.