[SPARK-26913][SQL] New data source V2 API: SupportsDirectWrite #23824
Conversation
```java
 * <p>
 * If a {@link Table} implements this interface, the
 * {@link SupportsWrite#newWriteBuilder(DataSourceOptions)} must return a {@link WriteBuilder}
 * with {@link WriteBuilder#buildForBatch()} implemented.
```
this is wrong, please remove the entire <p>..</p>
```scala
val options = sessionOptions ++ extraOptions + checkFilesExistsOption
val dsOptions = new DataSourceOptions(options.asJava)
provider.getTable(dsOptions) match {
  case table: SupportsDirectWrite =>
```
this should work without save mode. That said, we should add a new flag in AppendData and other operators to indicate if it needs schema validation.
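A standalone sketch of the flag idea (not Spark code; the node and field names here are hypothetical simplifications of Spark's actual `AppendData` operator):

```scala
// Standalone sketch: a boolean on the append operator tells the analyzer
// whether to validate the query's output schema against the table schema.
sealed trait LogicalPlan
case class Relation(name: String, schema: Seq[String]) extends LogicalPlan

case class AppendData(
    table: Relation,
    query: LogicalPlan,
    needsSchemaValidation: Boolean = true)  // hypothetical new flag
  extends LogicalPlan

// The analyzer would only compare schemas when the flag is set.
def checkAppend(a: AppendData, querySchema: Seq[String]): Unit = {
  if (a.needsSchemaValidation) {
    require(querySchema == a.table.schema,
      s"schema mismatch: $querySchema vs ${a.table.schema}")
  }
}
```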
Creating DataSourceV2Relation requires the table schema. Take the file source as an example: Spark doesn't need to infer the data schema here.
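To make that cost concrete, a rough sketch of what forcing a schema on the write path means for a file source (assuming a `SparkSession` named `spark` and an existing Orc directory at `path`):

```scala
// Sketch: obtaining the schema of an existing Orc path triggers file listing
// and footer reads -- pure overhead if we only intend to overwrite the path.
val existingSchema = spark.read.orc(path).schema
```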
```java
 * </p>
 */
@Evolving
public interface SupportsDirectWrite extends SupportsWrite {}
```
I'm fine with it but eventually we should put it in the capability API.
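For reference, a minimal sketch of the capability-style alternative (standalone model; these names are hypothetical and are not the API in this PR):

```scala
// Hypothetical sketch: the table advertises a capability instead of mixing
// in a marker interface, and Spark skips schema validation when present.
object TableCapability extends Enumeration {
  val BatchWrite, AcceptAnySchema = Value
}

trait Table {
  def capabilities: Set[TableCapability.Value]
}

// Spark-side check (sketch): validate only when the capability is absent.
def needsSchemaValidation(table: Table): Boolean =
  !table.capabilities.contains(TableCapability.AcceptAnySchema)
```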
IIRC there was a discussion about schema validation before. @rdblue what are your use cases for it?
It seems that supporting direct write requires supporting save modes. Created #23829 and closing this one.
Test build #102465 has finished for PR 23824 at commit
I think this is a bad idea if the intent is to be a workaround for not having the new table catalog API (from the link in the description). If there is some other reason why this is necessary, then definitely write up a clear proposal for how this is supposed to work and we can discuss the extension to the v2 API. Otherwise, I'm -1.
I am going to join the meetup on the 21st this month anyway, but I really think we need a way to avoid the schema verification. For some cases, it doesn't make sense at all to read the schema in the write path. I think we got some feedback from the mailing list as well.
@HyukjinKwon, I agree that this is a valid use case. Unfortunately, there are now 3 PRs for this, so comments are hard to keep track of, but what I've said elsewhere is that we really just need to decide which tables should have this behavior and how those tables should communicate it to Spark. I think adding an API before understanding the current behavior and use case is going to cause problems.
What changes were proposed in this pull request?
Spark supports writing to file data sources without fetching the table schema and validating the output against it.
For example, the schema of `newDF` below can differ from the original table's schema:
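A minimal sketch of such a write (assuming a `SparkSession` named `spark`; the path and column names are illustrative):

```scala
import spark.implicits._  // needed for the Encoder in map()
val path = "/tmp/direct_write_example"  // illustrative path

// The first write establishes the "table" (an Orc directory) with one schema.
spark.range(10).write.orc(path)

// newDF has a completely different schema, yet overwriting the same path
// succeeds: no schema inference or validation happens on the write path.
val newDF = spark.range(20)
  .map(id => (id.toDouble, id.toString))
  .toDF("double_col", "string_col")
newDF.write.mode("overwrite").orc(path)
```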
However, from https://github.com/apache/spark/pull/23606/files#r255319992 we can see that the feature above is still missing in data source V2. Currently, data source V2 always validates the output query against the table schema. Even after catalog support for DS V2 is implemented, I think it is hard to support both behaviors with the current API/framework.
This PR proposes to create a new mix-in interface `SupportsDirectWrite`. With this interface, Spark writes to the table location directly, without schema inference or validation, on `DataFrameWriter.save()`. The PR also re-enables the Orc data source V2.
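A standalone sketch of the dispatch this enables (a model of the idea, not the actual Spark classes; the trait members and `save` helper here are hypothetical simplifications):

```scala
// Model: DataFrameWriter.save() pattern-matches on the marker interface and
// skips schema validation for tables that mix it in.
trait SupportsWrite { def writeBatch(data: Seq[String]): Unit }
trait SupportsDirectWrite extends SupportsWrite  // marker: write directly

def save(table: SupportsWrite, data: Seq[String],
         tableSchema: Seq[String], dataSchema: Seq[String]): Unit =
  table match {
    case direct: SupportsDirectWrite =>
      direct.writeBatch(data)  // no schema inference or validation
    case other =>
      require(dataSchema == tableSchema, "schema mismatch")
      other.writeBatch(data)
  }
```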
How was this patch tested?
Unit test