@MaxGekk MaxGekk commented Jul 25, 2020

What changes were proposed in this pull request?

Check that there are no duplicate column names on the same level (top level or nested levels) when reading from the built-in datasources Parquet, ORC, Avro and JSON. If such duplicate columns exist, throw the exception:

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema:

The check takes into account the SQL config spark.sql.caseSensitive (false by default).
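As an illustration only (not Spark's actual code, which is Scala in `SchemaUtils.checkColumnNameDuplication`), the single-level duplicate check can be sketched in plain Python; the function name and inputs here are hypothetical:

```python
from collections import Counter

def check_column_name_duplication(column_names, case_sensitive=False):
    """Raise if the list of column names contains duplicates.

    Mirrors the idea behind Spark's SchemaUtils.checkColumnNameDuplication:
    when case_sensitive is False (matching the default of
    spark.sql.caseSensitive), names are compared after lower-casing.
    """
    names = column_names if case_sensitive else [n.lower() for n in column_names]
    duplicates = [name for name, count in Counter(names).items() if count > 1]
    if duplicates:
        quoted = ", ".join(f"`{d}`" for d in duplicates)
        raise ValueError(f"Found duplicate column(s) in the data schema: {quoted}")

# With the default case-insensitive comparison these two names collide:
# check_column_name_duplication(["camelcase", "CamelCase"])  # raises ValueError
```

With `case_sensitive=True` the same pair is accepted, which is why the check must consult the `spark.sql.caseSensitive` config.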

Why are the changes needed?

To make the handling of duplicate nested columns similar to the handling of duplicate top-level columns, i.e. output the same error:

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`

Checking of top-level duplicates was introduced by #17758.

Does this PR introduce any user-facing change?

Yes. For the example from SPARK-32431:

ORC:

java.io.IOException: Error reading file: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-c02c2f9a-0cdc-4859-94fc-b9c809ca58b1/part-00001-63e8c3f0-7131-4ec9-be02-30b3fdd276f4-c000.snappy.orc
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1329)
	at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
...
Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 3 kind DATA position: 6 length: 6 range: 0 offset: 12 limit: 12 range 0 = 0 to 6 uncompressed: 3 to 3
	at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
	at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)

JSON:

+------------+
|StructColumn|
+------------+
|        [,,]|
+------------+

Parquet:

+------------+
|StructColumn|
+------------+
|     [0,, 1]|
+------------+

Avro:

+------------+
|StructColumn|
+------------+
|        [,,]|
+------------+

After the changes, Parquet, ORC, JSON and Avro output the same error:

Found duplicate column(s) in the data schema: `camelcase`;
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`;
	at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:112)
	at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:51)
	at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:67)
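The stack trace above shows `checkSchemaColumnNameDuplication` calling itself as it descends from the top-level schema into nested fields. A hedged sketch of that recursion over a toy schema representation (lists of `(name, type)` pairs standing in for Spark's `StructType`; not the actual Scala implementation):

```python
def check_schema_column_name_duplication(schema, case_sensitive=False):
    """Recursively check a toy schema for duplicate field names at every
    nesting level, case-insensitively by default.

    A schema is a list of (name, dtype) pairs; a nested struct is itself
    such a list, so the function recurses into it.
    """
    seen = set()
    for name, dtype in schema:
        key = name if case_sensitive else name.lower()
        if key in seen:
            raise ValueError(
                f"Found duplicate column(s) in the data schema: `{key}`")
        seen.add(key)
        if isinstance(dtype, list):  # nested struct: check one level down
            check_schema_column_name_duplication(dtype, case_sensitive)

# A duplicate at a nested level is caught just like a top-level one:
nested = [("StructColumn", [("camelcase", "long"), ("CamelCase", "long")])]
# check_schema_column_name_duplication(nested)  # raises ValueError
```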

How was this patch tested?

Ran the modified test suites:

$ build/sbt "sql/test:testOnly org.apache.spark.sql.FileBasedDataSourceSuite"
$ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.*"

and added a new unit test to SchemaUtilsSuite.

SparkQA commented Jul 25, 2020

Test build #126539 has finished for PR 29234 at commit cc971f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 27, 2020

Test build #126654 has finished for PR 29234 at commit d0764d2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 27, 2020

Test build #126667 has finished for PR 29234 at commit 8906732.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [SPARK-32431][SQL][TESTS] Check of consistent error for nested and top-level duplicate columns [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources Jul 28, 2020

SparkQA commented Jul 28, 2020

Test build #126701 has finished for PR 29234 at commit e66b03c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 28, 2020

Test build #126726 has finished for PR 29234 at commit bd03de5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class AvroSuite extends QueryTest with SharedSparkSession with NestedDataSourceSuiteBase
  • trait NestedDataSourceSuiteBase extends QueryTest with SharedSparkSession

.add("camelcase", LongType)
.add("CamelCase", LongType))
).foreach { case (selectExpr: Seq[String], caseInsensitiveSchema: StructType) =>
withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
Contributor
shall we test both v1 and v2?

Member Author

AvroSuite tests both. We could run the entire FileBasedDataSourceSuite for v1 and v2.

Member Author

I checked both v1 and v2, see NestedDataSourceV1Suite and NestedDataSourceV2Suite.


SparkQA commented Jul 29, 2020

Test build #126765 has finished for PR 29234 at commit ae7268c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 29, 2020

Test build #126769 has finished for PR 29234 at commit 9d72467.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 30, 2020

Test build #126788 has finished for PR 29234 at commit 918c77c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


MaxGekk commented Jul 30, 2020

@HyukjinKwon @cloud-fan Could you review this PR?

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 99a8555 Jul 30, 2020
cloud-fan pushed a commit that referenced this pull request Aug 3, 2020
…atasource

### What changes were proposed in this pull request?
Check that there are no duplicate column names on the same level (top level or nested levels) when reading from the JDBC datasource. If such duplicate columns exist, throw the exception:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the customSchema option value:
```
The check takes into account the SQL config `spark.sql.caseSensitive` (`false` by default).

### Why are the changes needed?
To make the handling of duplicate nested columns similar to the handling of duplicate top-level columns, i.e. output the same error:
```Scala
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the customSchema option value: `camelcase`
```
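The `customSchema` option value is a comma-separated list of `name type` pairs (e.g. `"id DECIMAL(38, 0), name STRING"`). A toy sketch of checking such a string for case-insensitive duplicate names; the splitting logic here is a simplified stand-in, not Spark's actual DDL parser:

```python
def check_custom_schema_option(custom_schema, case_sensitive=False):
    """Check a JDBC-style customSchema option value for duplicate column
    names. Splits on top-level commas only, since type arguments such as
    DECIMAL(38, 0) contain commas of their own, then compares the leading
    name token of each field, lower-cased unless case_sensitive is set.
    """
    fields, depth, start = [], 0, 0
    for i, ch in enumerate(custom_schema):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            fields.append(custom_schema[start:i].strip())
            start = i + 1
    fields.append(custom_schema[start:].strip())

    seen = set()
    for field in fields:
        name = field.split()[0]
        key = name if case_sensitive else name.lower()
        if key in seen:
            raise ValueError(
                "Found duplicate column(s) in the customSchema option value: "
                f"`{key}`")
        seen.add(key)

# check_custom_schema_option("camelcase LONG, CamelCase LONG")  # raises ValueError
```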

Checking of top-level duplicates was introduced by #17758, and duplicates in nested structures by #29234.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Added new test suite `JdbcNestedDataSourceSuite`.

Closes #29317 from MaxGekk/jdbc-dup-nested-columns.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

- In Spark 3.1, `from_unixtime`, `unix_timestamp`, `to_unix_timestamp`, `to_timestamp` and `to_date` will fail if the specified datetime pattern is invalid. In Spark 3.0 or earlier, they return `NULL`.

- In Spark 3.1, the Parquet, ORC, Avro and JSON datasources throw the exception `org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema` during reads if they detect duplicate names in top-level columns as well as in nested structures. The datasources take into account the SQL config `spark.sql.caseSensitive` while detecting column name duplicates.
Member

@MaxGekk Also update this for the changes made in #29317?

@MaxGekk MaxGekk deleted the nested-case-insensitive-column branch December 11, 2020 20:27