Conversation

@gengliangwang (Member)

What changes were proposed in this pull request?

Currently, File source v2 allows each data source to specify its supported data types by implementing the method supportsDataType in FileScan and FileWriteBuilder.

However, in the read path the validation checks all the data types in readSchema, which might contain partition columns. This is actually a regression: e.g. the Text data source only supports the String data type, yet its partition columns can still contain Integer types, since partition columns are processed by Spark itself.

This PR is to:

  1. Refactor the schema validation to check the data schema only.
  2. Filter partition columns out of the data schema when a user-specified schema is provided (see the sketch after this list).
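
A minimal sketch of the proposed filtering (the names mirror the snippets quoted later in this review; the inline resolver is a stand-in assumption for Spark's conf.resolver):

import org.apache.spark.sql.types._

// A user-specified schema mixing a data column and a partition column.
val userSpecifiedSchema = StructType(Seq(
  StructField("value", StringType),   // data column: Text supports String only
  StructField("p", IntegerType)))     // partition column, processed by Spark itself
val partitionSchema = StructType(Seq(StructField("p", IntegerType)))
// Stand-in for sparkSession.sessionState.conf.resolver (case-insensitive by default).
val resolver: (String, String) => Boolean = _.equalsIgnoreCase(_)

// Drop partition columns so that type validation runs against data columns only.
val dataSchema = StructType(
  userSpecifiedSchema.filterNot(f => partitionSchema.exists(p => resolver(p.name, f.name))))
assert(dataSchema == StructType(Seq(StructField("value", StringType))))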

How was this patch tested?

Unit test

SparkQA commented Mar 25, 2019

Test build #103913 has finished for PR 24203 at commit 43743b3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment

Looks good except for one of Dongjoon's comments and the test failure.

  }.asNullable
  lazy val dataSchema: StructType = userSpecifiedSchema.map { schema =>
    val partitionSchema = fileIndex.partitionSchema
    val equality = sparkSession.sessionState.conf.resolver
@dongjoon-hyun (Member)

equality -> resolver?

@gengliangwang (Member Author)

The naming follows DataSource.scala line 185. I think it is OK.

@dongjoon-hyun (Member) commented Mar 26, 2019

Do you mean the one line written two years ago? All the other new instances use resolver = sparkSession.sessionState.conf.resolver (more than seven).

@dongjoon-hyun (Member) commented Mar 26, 2019

If you search for conf.resolver, there are more instances with val resolver.
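
For context, the suggested convention typically reads as below (a sketch only; it assumes a SparkSession value named sparkSession in scope, and sessionState is Spark-internal API, so this only compiles inside Spark's own sql packages):

// The session's resolver compares column names according to
// spark.sql.caseSensitive (false by default, i.e. case-insensitive).
val resolver = sparkSession.sessionState.conf.resolver
resolver("P", "p")  // true under the default configuration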

    val partitionSchema = fileIndex.partitionSchema
    val equality = sparkSession.sessionState.conf.resolver
    StructType(schema.filterNot(f => partitionSchema.exists(p => equality(p.name, f.name))))
  }.orElse {
@dongjoon-hyun (Member) commented Mar 26, 2019

Indentation? (https://github.com/databricks/scala-style-guide#pattern-matching)
Lines 50 ~ 58 should be updated.

val table = new DummyFileTable(spark, options, Seq(pathName), expectedDataSchema, None)
assert(table.dataSchema == expectedDataSchema)
val expectedPartitionSchema = StructType(Seq(StructField("p", IntegerType, true)))
assert(table.fileIndex.partitionSchema == expectedPartitionSchema)
Member

Nit: additional space after ==.

  def inferSchema(files: Seq[FileStatus]): Option[StructType]

  /**
   * Returns whether this format supports the given [[DataType]] in write path.
Member

write -> read/write.
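
For illustration, a hypothetical format implementing this hook could look like the sketch below (assumed trait and method names; not the actual FileScan/FileWriteBuilder code):

import org.apache.spark.sql.types.{DataType, StringType, StructType}

// Hypothetical trait modeling the supportsDataType hook discussed above.
trait SimpleFileFormat {
  /** Returns whether this format supports the given DataType in the read/write path. */
  def supportsDataType(dataType: DataType): Boolean

  /** Validates the data schema only; partition columns never reach this check. */
  def verifyDataSchema(dataSchema: StructType): Unit =
    dataSchema.foreach { field =>
      require(supportsDataType(field.dataType),
        s"Column ${field.name} has an unsupported type: ${field.dataType.catalogString}")
    }
}

// A Text-like format that stores a single string column per row.
object TextLikeFormat extends SimpleFileFormat {
  override def supportsDataType(dataType: DataType): Boolean = dataType == StringType
}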

@dongjoon-hyun (Member) left a comment

@gengliangwang, sorry for missing the regression last time.
This PR seems to have a different issue. Previously, we checked the schema in the read path via toBatch; now that check is removed. Currently, we only remove partition columns by name, and we do not check the column types in the data schema.

SparkQA commented Mar 26, 2019

Test build #103936 has finished for PR 24203 at commit 214bd8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

@dongjoon-hyun, do you mean the issue is that we don't check the schema in toBatch but lazily within Table? The data schema looks like it is checked via the overridden schema at FileTable in the read path.
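
A minimal sketch of that shape (assumed names, not the actual FileTable source): the check runs lazily when the combined schema is first materialized, instead of eagerly in toBatch.

import org.apache.spark.sql.types.{DataType, StructType}

// Sketch: schema is a lazy val, so an unsupported data type surfaces the
// first time the combined schema is requested, not at batch construction.
abstract class SketchFileTable {
  def dataSchema: StructType        // data columns only, per this PR
  def partitionSchema: StructType   // handled by Spark, never type-checked here
  def supportsDataType(dt: DataType): Boolean

  lazy val schema: StructType = {
    dataSchema.foreach { f =>
      require(supportsDataType(f.dataType), s"Unsupported type: ${f.dataType.catalogString}")
    }
    StructType(dataSchema ++ partitionSchema)
  }
}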

SparkQA commented Mar 26, 2019

Test build #103967 has finished for PR 24203 at commit 63466b1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 26, 2019

Test build #103966 has finished for PR 24203 at commit d8b2638.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

Only one nit comment about naming. Thank you, @gengliangwang.

SparkQA commented Mar 26, 2019

Test build #103984 has finished for PR 24203 at commit 4ca742d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Merged to master.
