@MaxGekk MaxGekk commented Jul 25, 2020

What changes were proposed in this pull request?

Check that there are no duplicate column names on the same level (top level or nested levels) when reading from the built-in datasources Parquet, ORC, Avro and JSON. If such duplicate columns exist, throw the exception:

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema:

The check takes into account the SQL config spark.sql.caseSensitive (false by default).
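As an illustration only (not Spark's actual code, which is Scala in `SchemaUtils.checkColumnNameDuplication`), the single-level duplicate check can be sketched in plain Python; the function name and inputs here are hypothetical:

```python
from collections import Counter

def check_column_name_duplication(column_names, case_sensitive=False):
    """Raise if the list of column names contains duplicates.

    Mirrors the idea behind Spark's SchemaUtils.checkColumnNameDuplication:
    when case_sensitive is False (matching the default of
    spark.sql.caseSensitive), names are compared after lower-casing.
    """
    names = column_names if case_sensitive else [n.lower() for n in column_names]
    duplicates = [name for name, count in Counter(names).items() if count > 1]
    if duplicates:
        quoted = ", ".join(f"`{d}`" for d in duplicates)
        raise ValueError(f"Found duplicate column(s) in the data schema: {quoted}")

# With the default case-insensitive comparison these two names collide:
# check_column_name_duplication(["camelcase", "CamelCase"])  # raises ValueError
```

With `case_sensitive=True` the same pair is accepted, which is why the check must consult the `spark.sql.caseSensitive` config.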

Why are the changes needed?

To make the handling of duplicate nested columns similar to the handling of duplicate top-level columns, i.e. output the same error:

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`

Checking of top-level duplicates was introduced by #17758.

Does this PR introduce any user-facing change?

Yes. For the example from SPARK-32431:

ORC:

java.io.IOException: Error reading file: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-c02c2f9a-0cdc-4859-94fc-b9c809ca58b1/part-00001-63e8c3f0-7131-4ec9-be02-30b3fdd276f4-c000.snappy.orc
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1329)
	at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
...
Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 3 kind DATA position: 6 length: 6 range: 0 offset: 12 limit: 12 range 0 = 0 to 6 uncompressed: 3 to 3
	at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
	at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)

JSON:

+------------+
|StructColumn|
+------------+
|        [,,]|
+------------+

Parquet:

+------------+
|StructColumn|
+------------+
|     [0,, 1]|
+------------+

Avro:

+------------+
|StructColumn|
+------------+
|        [,,]|
+------------+

After the changes, Parquet, ORC, JSON and Avro output the same error:

Found duplicate column(s) in the data schema: `camelcase`;
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`;
	at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:112)
	at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:51)
	at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:67)
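The stack trace above shows `checkSchemaColumnNameDuplication` calling itself as it descends from the top-level schema into nested fields. A hedged sketch of that recursion over a toy schema representation (lists of `(name, type)` pairs standing in for Spark's `StructType`; not the actual Scala implementation):

```python
def check_schema_column_name_duplication(schema, case_sensitive=False):
    """Recursively check a toy schema for duplicate field names at every
    nesting level, case-insensitively by default.

    A schema is a list of (name, dtype) pairs; a nested struct is itself
    such a list, so the function recurses into it.
    """
    seen = set()
    for name, dtype in schema:
        key = name if case_sensitive else name.lower()
        if key in seen:
            raise ValueError(
                f"Found duplicate column(s) in the data schema: `{key}`")
        seen.add(key)
        if isinstance(dtype, list):  # nested struct: check one level down
            check_schema_column_name_duplication(dtype, case_sensitive)

# A duplicate at a nested level is caught just like a top-level one:
nested = [("StructColumn", [("camelcase", "long"), ("CamelCase", "long")])]
# check_schema_column_name_duplication(nested)  # raises ValueError
```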

How was this patch tested?

Ran the modified test suites:

$ build/sbt "sql/test:testOnly org.apache.spark.sql.FileBasedDataSourceSuite"
$ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.*"

and added a new unit test to SchemaUtilsSuite.

SparkQA commented Jul 25, 2020

Test build #126539 has finished for PR 29234 at commit cc971f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 27, 2020

Test build #126654 has finished for PR 29234 at commit d0764d2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 27, 2020

Test build #126667 has finished for PR 29234 at commit 8906732.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [SPARK-32431][SQL][TESTS] Check of consistent error for nested and top-level duplicate columns [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources Jul 28, 2020

SparkQA commented Jul 28, 2020

Test build #126701 has finished for PR 29234 at commit e66b03c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 28, 2020

Test build #126726 has finished for PR 29234 at commit bd03de5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class AvroSuite extends QueryTest with SharedSparkSession with NestedDataSourceSuiteBase
  • trait NestedDataSourceSuiteBase extends QueryTest with SharedSparkSession

.add("camelcase", LongType)
.add("CamelCase", LongType))
).foreach { case (selectExpr: Seq[String], caseInsensitiveSchema: StructType) =>
withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
Contributor
shall we test both v1 and v2?

Member Author

AvroSuite tests both. We could run the entire FileBasedDataSourceSuite for v1 and v2.

Member Author

I checked both v1 and v2, see NestedDataSourceV1Suite and NestedDataSourceV2Suite.


SparkQA commented Jul 29, 2020

Test build #126765 has finished for PR 29234 at commit ae7268c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 29, 2020

Test build #126769 has finished for PR 29234 at commit 9d72467.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 30, 2020

Test build #126788 has finished for PR 29234 at commit 918c77c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


MaxGekk commented Jul 30, 2020

@HyukjinKwon @cloud-fan Could you review this PR?

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 99a8555 Jul 30, 2020
cloud-fan pushed a commit that referenced this pull request Aug 3, 2020
…atasource

### What changes were proposed in this pull request?
Check that there are no duplicate column names on the same level (top level or nested levels) when reading from the JDBC datasource. If such duplicate columns exist, throw the exception:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the customSchema option value:
```
The check takes into account the SQL config `spark.sql.caseSensitive` (`false` by default).

### Why are the changes needed?
To make the handling of duplicate nested columns similar to the handling of duplicate top-level columns, i.e. output the same error:
```Scala
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the customSchema option value: `camelcase`
```
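The `customSchema` option value is a comma-separated list of `name type` pairs (e.g. `"id DECIMAL(38, 0), name STRING"`). A toy sketch of checking such a string for case-insensitive duplicate names; the splitting logic here is a simplified stand-in, not Spark's actual DDL parser:

```python
def check_custom_schema_option(custom_schema, case_sensitive=False):
    """Check a JDBC-style customSchema option value for duplicate column
    names. Splits on top-level commas only, since type arguments such as
    DECIMAL(38, 0) contain commas of their own, then compares the leading
    name token of each field, lower-cased unless case_sensitive is set.
    """
    fields, depth, start = [], 0, 0
    for i, ch in enumerate(custom_schema):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            fields.append(custom_schema[start:i].strip())
            start = i + 1
    fields.append(custom_schema[start:].strip())

    seen = set()
    for field in fields:
        name = field.split()[0]
        key = name if case_sensitive else name.lower()
        if key in seen:
            raise ValueError(
                "Found duplicate column(s) in the customSchema option value: "
                f"`{key}`")
        seen.add(key)

# check_custom_schema_option("camelcase LONG, CamelCase LONG")  # raises ValueError
```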

Checking of top-level duplicates was introduced by #17758, and duplicates in nested structures by #29234.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Added new test suite `JdbcNestedDataSourceSuite`.

Closes #29317 from MaxGekk/jdbc-dup-nested-columns.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

- In Spark 3.1, `from_unixtime`, `unix_timestamp`, `to_unix_timestamp`, `to_timestamp` and `to_date` will fail if the specified datetime pattern is invalid. In Spark 3.0 or earlier, they return `NULL`.

- In Spark 3.1, the Parquet, ORC, Avro and JSON datasources throw the exception `org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema` during reads if they detect duplicate names in top-level columns as well as in nested structures. The datasources take into account the SQL config `spark.sql.caseSensitive` while detecting column name duplicates.
Member

@MaxGekk Also update this for the changes made in #29317?

@MaxGekk MaxGekk deleted the nested-case-insensitive-column branch December 11, 2020 20:27