Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Jul 31, 2020

What changes were proposed in this pull request?

Check that there are not duplicate column names on the same level (top level or nested levels) in reading from JDBC datasource. If such duplicate columns exist, throw the exception:

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the customSchema option value:

The check takes into account the SQL config spark.sql.caseSensitive (false by default).

Why are the changes needed?

To make handling of duplicate nested columns is similar to handling of duplicate top-level columns i. e. output the same error:

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the customSchema option value: `camelcase`

Checking of top-level duplicates was introduced by #17758, and duplicates in nested structures by #29234.

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

Added new test suite JdbcNestedDataSourceSuite.

@MaxGekk MaxGekk changed the title [WIP][SQL] Check duplicate nested columns in read from JDBC datasource [SPARK-32510][SQL] Check duplicate nested columns in read from JDBC datasource Jul 31, 2020
Comment on lines +822 to +823
SchemaUtils.checkSchemaColumnNameDuplication(
userSchema,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the fix - replacing checkColumnNameDuplication by checkSchemaColumnNameDuplication

@MaxGekk
Copy link
Member Author

MaxGekk commented Jul 31, 2020

@cloud-fan @HyukjinKwon @maropu Please, review this PR.

import org.apache.spark.sql.types.StructType
import org.apache.spark.util.Utils

class JdbcNestedDataSourceSuite extends NestedDataSourceSuiteBase {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Jdbc->JDBC

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have found 7 files with Jdbc:

➜  apache-spark git:(master) find . -name "Jdbc*.scala" -type f
./core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala
./core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtilsSuite.scala
./sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
./sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/JdbcConnectionUriSuite.scala

and 8 starts from JDBC:

➜  apache-spark git:(master) find . -name "JDBC*.scala" -type f
./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala
./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala

If you don't mind I will rename all those files (and classes) to JDBC.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the PR for renaming #29323

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like JDBC more as @maropu's suggestion but don't feel strongly about if we should rename the others.

tableSchema: StructType,
customSchema: String,
nameEquality: Resolver): StructType = {
if (null != customSchema && customSchema.nonEmpty) {
Copy link
Member

@maropu maropu Jul 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. We need to accept a nested schema in customSchema? I checked the original PR #18266, but I couldn't find test cases for nested schemas. So, I'm not sure this is an expected behaviour... cc: @wangyum

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know which JDBC server supports nested schema. But IIUC this feature is to specify the type, and I think it can be used to specify the data type of nested fields as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, it can be and accepting nested fields looks okay. Either way, I think we need more test cases for customeSchema with nested fields, arrays, map, ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JDBC spec mentions the STRUCT type, for example https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#STRUCT.

At least, you can access to Spark cluster from another Spark cluster via JDBC ;-)

@SparkQA
Copy link

SparkQA commented Jul 31, 2020

Test build #126888 has finished for PR 29317 at commit be2c224.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 31, 2020

Test build #126893 has finished for PR 29317 at commit 5d1dcaf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class JDBCNestedDataSourceSuite extends NestedDataSourceSuiteBase

@cloud-fan cloud-fan closed this in fda397d Aug 3, 2020
@cloud-fan
Copy link
Contributor

thanks, merging to master!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants