
Conversation

@gengliangwang
Member

What changes were proposed in this pull request?

Spark supports writing to file data sources without getting the table schema or validating the output against it.
For example,

spark.range(10).write.orc(path)
val newDF = spark.range(20).map(id => (id.toDouble, id.toString)).toDF("double", "string")
newDF.write.mode("overwrite").orc(path)
  1. There are no actions to get/infer the schema from the table/path.
  2. The schema of newDF can differ from the original table schema.

However, from https://github.com/apache/spark/pull/23606/files#r255319992 we can see that the feature above is still missing in data source V2. Currently, data source V2 always validates the output query against the table schema. Even after catalog support for DS V2 is implemented, I think it will be hard to support both behaviors with the current API/framework.

This PR proposes to create a new mix-in interface, SupportsDirectWrite. With the interface, Spark will write to the table location directly, without schema inference or validation, on DataFrameWriter.save().

The PR also re-enables Orc data source V2.
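
A self-contained toy sketch of the proposed marker-interface pattern (the "Toy" names below are made up for illustration and are not Spark's actual DS V2 classes): a table that mixes in the marker trait is written to directly, while any other table goes through the usual output-schema validation.

object DirectWriteToy {
  // Toy stand-ins for the DS V2 Table/SupportsWrite hierarchy.
  trait ToyTable {
    def schema: Seq[String]
    def write(rows: Seq[Seq[Any]]): Unit = println(s"wrote ${rows.size} rows")
  }
  // Empty mix-in: implementing it signals "write directly, skip schema validation".
  trait ToySupportsDirectWrite extends ToyTable

  def save(table: ToyTable, rows: Seq[Seq[Any]], querySchema: Seq[String]): Unit =
    table match {
      case direct: ToySupportsDirectWrite =>
        direct.write(rows) // no schema inference or validation, like file sources in DS V1
      case other =>
        require(querySchema == other.schema, // current DS V2 behavior: always validate
          s"Query schema $querySchema does not match table schema ${other.schema}")
        other.write(rows)
    }
}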

How was this patch tested?

Unit test

@gengliangwang
Member Author

* <p>
* If a {@link Table} implements this interface, the
* {@link SupportsWrite#newWriteBuilder(DataSourceOptions)} must return a {@link WriteBuilder}
* with {@link WriteBuilder#buildForBatch()} implemented.
Contributor


This is wrong; please remove the entire <p>..</p>.

val options = sessionOptions ++ extraOptions + checkFilesExistsOption
val dsOptions = new DataSourceOptions(options.asJava)
provider.getTable(dsOptions) match {
  case table: SupportsDirectWrite =>
Contributor


This should work without a save mode. That said, we should add a new flag in AppendData and other operators to indicate whether they need schema validation.
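
A self-contained toy sketch of the flag being suggested (ToyAppendData is hypothetical; Spark's real AppendData logical plan is defined differently): the write operator records whether the analyzer should validate the query output against the table schema.

object AppendDataToy {
  // Toy stand-in, not Spark's actual logical plan class.
  case class ToyAppendData(
      tableName: String,
      tableSchema: Seq[String],
      querySchema: Seq[String],
      validateSchema: Boolean = true) // false => skip output-schema validation

  def resolve(plan: ToyAppendData): Unit = {
    if (plan.validateSchema && plan.querySchema != plan.tableSchema) {
      throw new IllegalArgumentException(
        s"Cannot write columns ${plan.querySchema} to table ${plan.tableName} " +
          s"with schema ${plan.tableSchema}")
    }
    // otherwise proceed to plan the physical write without comparing schemas
  }
}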

Member Author


Creating DataSourceV2Relation requires the table schema. Taking file sources as an example, Spark doesn't need to infer the data schema here.
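
For context, a toy sketch of the trade-off for file sources (hypothetical helpers, not Spark code): building a relation with the existing table schema forces inference over the files already at the path, while a direct write only needs the incoming query's own schema.

object SchemaInferenceToy {
  // Stand-in for reading every existing file's footer just to learn its columns.
  def inferSchemaFromFiles(existingFiles: Seq[String]): Seq[String] =
    existingFiles.map(f => s"columns-of-$f").distinct

  // A direct write uses only the query's schema; no listing or reading of existing files.
  def directWrite(querySchema: Seq[String], rowCount: Int): Unit =
    println(s"writing $rowCount rows with schema $querySchema")
}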

* </p>
*/
@Evolving
public interface SupportsDirectWrite extends SupportsWrite {}
Contributor


I'm fine with it but eventually we should put it in the capability API.

@cloud-fan
Contributor

IIRC there was a discussion about schema validation before. @rdblue, what are your use cases for it?

@gengliangwang
Copy link
Member Author

gengliangwang commented Feb 18, 2019

It seems that supporting direct write requires supporting save modes. Creating #23829 and closing this one.

@SparkQA

SparkQA commented Feb 18, 2019

Test build #102465 has finished for PR 23824 at commit 4e76dde.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Feb 18, 2019

I think this is a bad idea if the intent is to be a work-around for not having the new table catalog API (from the link in the description). If there is some other reason why this is necessary, then definitely write up a clear proposal for how this is supposed to work and we can discuss the extension to the v2 API. Otherwise, I'm -1.

@HyukjinKwon
Member

I am going to join the meetup on the 21st this month anyway, but I really think we need a way to avoid the schema verification. For some cases, it doesn't make sense at all to read the schema in the write path. I think we got some feedback from the mailing list as well.

@rdblue
Contributor

rdblue commented Feb 19, 2019

@HyukjinKwon, I agree that this is a valid use case.

Unfortunately, there are now three PRs for this, so comments are hard to keep track of. What I've said elsewhere is that we really just need to decide which tables should have this behavior and how those tables should communicate it to Spark. I think adding an API before understanding the current behavior and use cases is going to cause problems.
