Conversation

@gengliangwang (Member)

What changes were proposed in this pull request?

Currently, File source v2 allows each data source to specify its supported data types by implementing the method supportsDataType in FileScan and FileWriteBuilder.

However, in the read path the validation checks all the data types in readSchema, which might contain partition columns. This is actually a regression: e.g. the Text data source only supports the String data type, yet its partition columns can still contain Integer types, since partition columns are processed by Spark itself.

This PR is to:

  1. Refactor the schema validation to check the data schema only.
  2. Filter partition columns out of the data schema when a user-specified schema is provided (see the sketch after this list).
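
A minimal sketch of the proposed filtering (the names mirror the snippets quoted later in this review; the inline resolver is a stand-in assumption for Spark's conf.resolver):

import org.apache.spark.sql.types._

// A user-specified schema mixing a data column and a partition column.
val userSpecifiedSchema = StructType(Seq(
  StructField("value", StringType),   // data column: Text supports String only
  StructField("p", IntegerType)))     // partition column, processed by Spark itself
val partitionSchema = StructType(Seq(StructField("p", IntegerType)))
// Stand-in for sparkSession.sessionState.conf.resolver (case-insensitive by default).
val resolver: (String, String) => Boolean = _.equalsIgnoreCase(_)

// Drop partition columns so that type validation runs against data columns only.
val dataSchema = StructType(
  userSpecifiedSchema.filterNot(f => partitionSchema.exists(p => resolver(p.name, f.name))))
assert(dataSchema == StructType(Seq(StructField("value", StringType))))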

How was this patch tested?

Unit test

SparkQA commented Mar 25, 2019

Test build #103913 has finished for PR 24203 at commit 43743b3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment

Looks good except for one of Dongjoon's comments and the test failure.

  }.asNullable
  lazy val dataSchema: StructType = userSpecifiedSchema.map { schema =>
    val partitionSchema = fileIndex.partitionSchema
    val equality = sparkSession.sessionState.conf.resolver
@dongjoon-hyun (Member)

equality -> resolver?

@gengliangwang (Member Author)

The naming follows DataSource.scala line 185. I think it is OK.

@dongjoon-hyun (Member) commented Mar 26, 2019

Do you mean the one line written two years ago? All the other new instances use resolver = sparkSession.sessionState.conf.resolver (more than seven).

@dongjoon-hyun (Member) commented Mar 26, 2019

If you search for conf.resolver, there are more instances with val resolver.
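
For context, the suggested convention typically reads as below (a sketch only; it assumes a SparkSession value named sparkSession in scope, and sessionState is Spark-internal API, so this only compiles inside Spark's own sql packages):

// The session's resolver compares column names according to
// spark.sql.caseSensitive (false by default, i.e. case-insensitive).
val resolver = sparkSession.sessionState.conf.resolver
resolver("P", "p")  // true under the default configuration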

    val partitionSchema = fileIndex.partitionSchema
    val equality = sparkSession.sessionState.conf.resolver
    StructType(schema.filterNot(f => partitionSchema.exists(p => equality(p.name, f.name))))
  }.orElse {
@dongjoon-hyun (Member) commented Mar 26, 2019

Indentation? (https://github.com/databricks/scala-style-guide#pattern-matching)
Lines 50 ~ 58 should be updated.

val table = new DummyFileTable(spark, options, Seq(pathName), expectedDataSchema, None)
assert(table.dataSchema == expectedDataSchema)
val expectedPartitionSchema = StructType(Seq(StructField("p", IntegerType, true)))
assert(table.fileIndex.partitionSchema == expectedPartitionSchema)
Member

Nit: additional space after ==.

  def inferSchema(files: Seq[FileStatus]): Option[StructType]

  /**
   * Returns whether this format supports the given [[DataType]] in write path.
Member

write -> read/write.
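
For illustration, a hypothetical format implementing this hook could look like the sketch below (assumed trait and method names; not the actual FileScan/FileWriteBuilder code):

import org.apache.spark.sql.types.{DataType, StringType, StructType}

// Hypothetical trait modeling the supportsDataType hook discussed above.
trait SimpleFileFormat {
  /** Returns whether this format supports the given DataType in the read/write path. */
  def supportsDataType(dataType: DataType): Boolean

  /** Validates the data schema only; partition columns never reach this check. */
  def verifyDataSchema(dataSchema: StructType): Unit =
    dataSchema.foreach { field =>
      require(supportsDataType(field.dataType),
        s"Column ${field.name} has an unsupported type: ${field.dataType.catalogString}")
    }
}

// A Text-like format that stores a single string column per row.
object TextLikeFormat extends SimpleFileFormat {
  override def supportsDataType(dataType: DataType): Boolean = dataType == StringType
}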

@dongjoon-hyun (Member) left a comment

@gengliangwang, sorry for missing the regression last time.
This PR seems to have a different issue. Previously, we checked the schema in the read path via toBatch; now that check is removed. Currently, we only remove partition columns by name, and we do not check the column types in the data schema.

SparkQA commented Mar 26, 2019

Test build #103936 has finished for PR 24203 at commit 214bd8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

@dongjoon-hyun, do you mean the issue is that we don't check the schema in toBatch but lazily within Table? The data schema looks like it is checked via the overridden schema at FileTable in the read path.
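
A minimal sketch of that shape (assumed names, not the actual FileTable source): the check runs lazily when the combined schema is first materialized, instead of eagerly in toBatch.

import org.apache.spark.sql.types.{DataType, StructType}

// Sketch: schema is a lazy val, so an unsupported data type surfaces the
// first time the combined schema is requested, not at batch construction.
abstract class SketchFileTable {
  def dataSchema: StructType        // data columns only, per this PR
  def partitionSchema: StructType   // handled by Spark, never type-checked here
  def supportsDataType(dt: DataType): Boolean

  lazy val schema: StructType = {
    dataSchema.foreach { f =>
      require(supportsDataType(f.dataType), s"Unsupported type: ${f.dataType.catalogString}")
    }
    StructType(dataSchema ++ partitionSchema)
  }
}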

SparkQA commented Mar 26, 2019

Test build #103967 has finished for PR 24203 at commit 63466b1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 26, 2019

Test build #103966 has finished for PR 24203 at commit d8b2638.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

Only one nit comment about naming. Thank you, @gengliangwang.

SparkQA commented Mar 26, 2019

Test build #103984 has finished for PR 24203 at commit 4ca742d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Merged to master.
