
Conversation

@gengliangwang
Member

What changes were proposed in this pull request?

Spark supports writing to file data sources without getting the table schema or validating the output against it.
For example,

spark.range(10).write.orc(path)
val newDF = spark.range(20).map(id => (id.toDouble, id.toString)).toDF("double", "string")
newDF.write.mode("overwrite").orc(path)
  1. There are no actions to get/infer the schema from the table/path.
  2. The schema of newDF can differ from the original table schema.

However, from https://github.com/apache/spark/pull/23606/files#r255319992 we can see that the feature above is still missing in data source V2. Currently, data source V2 always validates the output query against the table schema. Even after catalog support for DS V2 is implemented, I think it will be hard to support both behaviors with the current API/framework.

This PR proposes to create a new mix-in interface, SupportsDirectWrite. With the interface, Spark will write to the table location directly, without schema inference or validation, on DataFrameWriter.save().

The PR also re-enables Orc data source V2.
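
A self-contained toy sketch of the proposed marker-interface pattern (the "Toy" names below are made up for illustration and are not Spark's actual DS V2 classes): a table that mixes in the marker trait is written to directly, while any other table goes through the usual output-schema validation.

object DirectWriteToy {
  // Toy stand-ins for the DS V2 Table/SupportsWrite hierarchy.
  trait ToyTable {
    def schema: Seq[String]
    def write(rows: Seq[Seq[Any]]): Unit = println(s"wrote ${rows.size} rows")
  }
  // Empty mix-in: implementing it signals "write directly, skip schema validation".
  trait ToySupportsDirectWrite extends ToyTable

  def save(table: ToyTable, rows: Seq[Seq[Any]], querySchema: Seq[String]): Unit =
    table match {
      case direct: ToySupportsDirectWrite =>
        direct.write(rows) // no schema inference or validation, like file sources in DS V1
      case other =>
        require(querySchema == other.schema, // current DS V2 behavior: always validate
          s"Query schema $querySchema does not match table schema ${other.schema}")
        other.write(rows)
    }
}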

How was this patch tested?

Unit test

@gengliangwang
Member Author

* <p>
* If a {@link Table} implements this interface, the
* {@link SupportsWrite#newWriteBuilder(DataSourceOptions)} must return a {@link WriteBuilder}
* with {@link WriteBuilder#buildForBatch()} implemented.
Contributor


This is wrong; please remove the entire <p>..</p>.

val options = sessionOptions ++ extraOptions + checkFilesExistsOption
val dsOptions = new DataSourceOptions(options.asJava)
provider.getTable(dsOptions) match {
  case table: SupportsDirectWrite =>
Contributor


This should work without a save mode. That said, we should add a new flag in AppendData and other operators to indicate whether they need schema validation.
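
A self-contained toy sketch of the flag being suggested (ToyAppendData is hypothetical; Spark's real AppendData logical plan is defined differently): the write operator records whether the analyzer should validate the query output against the table schema.

object AppendDataToy {
  // Toy stand-in, not Spark's actual logical plan class.
  case class ToyAppendData(
      tableName: String,
      tableSchema: Seq[String],
      querySchema: Seq[String],
      validateSchema: Boolean = true) // false => skip output-schema validation

  def resolve(plan: ToyAppendData): Unit = {
    if (plan.validateSchema && plan.querySchema != plan.tableSchema) {
      throw new IllegalArgumentException(
        s"Cannot write columns ${plan.querySchema} to table ${plan.tableName} " +
          s"with schema ${plan.tableSchema}")
    }
    // otherwise proceed to plan the physical write without comparing schemas
  }
}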

Member Author


Creating DataSourceV2Relation requires the table schema. Taking file sources as an example, Spark doesn't need to infer the data schema here.
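
For context, a toy sketch of the trade-off for file sources (hypothetical helpers, not Spark code): building a relation with the existing table schema forces inference over the files already at the path, while a direct write only needs the incoming query's own schema.

object SchemaInferenceToy {
  // Stand-in for reading every existing file's footer just to learn its columns.
  def inferSchemaFromFiles(existingFiles: Seq[String]): Seq[String] =
    existingFiles.map(f => s"columns-of-$f").distinct

  // A direct write uses only the query's schema; no listing or reading of existing files.
  def directWrite(querySchema: Seq[String], rowCount: Int): Unit =
    println(s"writing $rowCount rows with schema $querySchema")
}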

* </p>
*/
@Evolving
public interface SupportsDirectWrite extends SupportsWrite {}
Contributor


I'm fine with it but eventually we should put it in the capability API.

@cloud-fan
Contributor

IIRC there was a discussion about schema validation before. @rdblue, what are your use cases for it?

@gengliangwang
Copy link
Member Author

gengliangwang commented Feb 18, 2019

It seems that supporting direct write requires supporting save modes. Creating #23829 and closing this one.

@SparkQA

SparkQA commented Feb 18, 2019

Test build #102465 has finished for PR 23824 at commit 4e76dde.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Feb 18, 2019

I think this is a bad idea if the intent is to be a work-around for not having the new table catalog API (from the link in the description). If there is some other reason why this is necessary, then definitely write up a clear proposal for how this is supposed to work and we can discuss the extension to the v2 API. Otherwise, I'm -1.

@HyukjinKwon
Member

I am going to join the meetup on the 21st this month anyway, but I really think we need a way to avoid the schema verification. For some cases, it doesn't make sense at all to read the schema in the write path. I think we got some feedback from the mailing list as well.

@rdblue
Contributor

rdblue commented Feb 19, 2019

@HyukjinKwon, I agree that this is a valid use case.

Unfortunately, there are now three PRs for this, so comments are hard to keep track of. What I've said elsewhere is that we really just need to decide which tables should have this behavior and how those tables should communicate it to Spark. I think adding an API before understanding the current behavior and use cases is going to cause problems.
