[SPARK-26811][SQL] Add capabilities to v2.Table #24012
22f3953 to
23746e7
Compare
Test build #103161 has finished for PR 24012 at commit
@cloud-fan, can you take a look at this PR? It adds capabilities like we discussed.
Shall we remove this interface as well? We can move newScanBuilder to Table and throw an exception by default. Tables that report batch/stream scan capability should override newScanBuilder.
I like having this because it maintains separation between the read/write API and the catalog API. We could update the read and write API later or add a new one by adding a different read trait, without changing how catalogs and tables work. So I think it is worth keeping SupportsRead and SupportsWrite.
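The two designs discussed in this thread can be contrasted with a minimal Java sketch. Everything here is a simplified stand-in, not Spark's code: `ScanBuilder`, `TableWithDefault`, `ReadableTable`, `WriteOnlyTable`, and `canRead` are hypothetical names, and the zero-argument `newScanBuilder()` signature is an assumption made to keep the example short.

```java
final class ReadTraitSketch {
    // Hypothetical stand-in for a scan builder.
    interface ScanBuilder { String describe(); }

    // Catalog-facing interface: knows nothing about how reads are performed.
    interface Table { String name(); }

    // Design 1: the read API stays in a separate mix-in on top of Table.
    interface SupportsRead extends Table {
        ScanBuilder newScanBuilder();
    }

    // Design 2 (the reviewer's suggestion): a single interface whose
    // newScanBuilder throws unless an implementation overrides it.
    interface TableWithDefault {
        String name();
        default ScanBuilder newScanBuilder() {
            throw new UnsupportedOperationException(name() + " does not support reads");
        }
    }

    static final class ReadableTable implements SupportsRead {
        @Override public String name() { return "readable"; }
        @Override public ScanBuilder newScanBuilder() { return () -> "scan of " + name(); }
    }

    static final class WriteOnlyTable implements TableWithDefault {
        @Override public String name() { return "write_only"; }
    }

    // With the mix-in, read support is visible in the type and can be
    // checked before any read method is called.
    static boolean canRead(Table table) { return table instanceof SupportsRead; }
}
```

In the first design a different read trait could be added later without touching `Table`; in the second, unsupported reads only surface when the default method is invoked.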
I don't think we will have tons of capabilities, so maybe an Array is good enough? Array is also more Java/Scala friendly.
I don't think it's a good idea to use an array when the storage should be a set, just because it is necessary to call asJava when returning it.
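The Set-versus-Array trade-off can be illustrated with a short Java sketch. The enum below is a stand-in with illustrative constants, and `asSet`, `asArray`, `supports`, and `coerce` are hypothetical helpers, not part of the PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

final class CapabilitySetSketch {
    // Stand-in for the capability enum; the constants are illustrative.
    enum TableCapability { BATCH_READ, BATCH_WRITE, TRUNCATE }

    // Set-returning variant: no duplicates, and membership checks are direct.
    static Set<TableCapability> asSet() {
        return Collections.unmodifiableSet(
            EnumSet.of(TableCapability.BATCH_READ, TableCapability.TRUNCATE));
    }

    // Array-returning variant: nothing prevents an implementation from
    // listing the same capability twice.
    static TableCapability[] asArray() {
        return new TableCapability[] {
            TableCapability.BATCH_READ, TableCapability.TRUNCATE, TableCapability.BATCH_READ
        };
    }

    // How a caller would consume the set.
    static boolean supports(Set<TableCapability> capabilities, TableCapability wanted) {
        return capabilities.contains(wanted);
    }

    // What a caller would end up doing with an array: coerce it to a set.
    static Set<TableCapability> coerce(TableCapability[] capabilities) {
        EnumSet<TableCapability> set = EnumSet.noneOf(TableCapability.class);
        set.addAll(Arrays.asList(capabilities));
        return set;
    }
}
```

The array variant pushes the conversion onto every caller, which is the coercion the comment above argues against.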
8636867 to
0d44757
Compare
@cloud-fan, I've rebased to pick up the changes in master introduced by the move to Although I see your point with returning an Array of capabilities, I think it is better to return a Set. That's how Spark uses the data and I see no reason to use the wrong kind of storage -- which we would no doubt coerce to a set -- just to avoid calling
Test build #103455 has finished for PR 24012 at commit
sounds good.
Retest this please.
Test build #103510 has finished for PR 24012 at commit
nit: this doesn't need to be lazy val
I'll fix this since I need to resolve conflicts.
93c77f5 to
69e729e
Compare
## What changes were proposed in this pull request?

The data source option check_files_exist was introduced in #23383, when the file source V2 framework was implemented. In that PR, FileIndex was created as a member of FileTable, so that we could implement partition pruning like 0f9fcab in the future. At that time, `FileIndex`es were always created for file writes, so we needed the option to decide whether to check file existence.

After #23774, the option is no longer needed, since DataFrame writes won't create an unnecessary FileIndex. This PR removes the option.

## How was this patch tested?

Unit test.

Closes #24069 from gengliangwang/removeOptionCheckFilesExist.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan, I've fixed the commit conflict caused by 6d22ee3. As I noted on that commit, please do not commit non-functional changes that cause unnecessary conflicts. That problem delayed getting this work in by another day. I've also removed
Test build #103546 has finished for PR 24012 at commit
@cloud-fan, tests are passing on this so it is ready for another look. Thank you!
thanks, merging to master!
Thank you for reviewing this, @cloud-fan!
Is there a plan documented on what the final API would look like? It's super confusing to have half capability via traits and half capability via enums.
#24129 is adding streaming read/write capability. Eventually we should have all the capabilities via enum.
## What changes were proposed in this pull request?

This is a followup of #24012, to add the corresponding capabilities for streaming.

## How was this patch tested?

existing tests

Closes #24129 from cloud-fan/capability.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
This adds a new method, `capabilities` to `v2.Table` that returns a set of `TableCapability`. Capabilities are used to fail queries during analysis checks, `V2WriteSupportCheck`, when the table does not support operations, like truncation. Existing tests for regressions, added new analysis suite, `V2WriteSupportCheckSuite`, for new capability checks. Closes apache#24012 from rdblue/SPARK-26811-add-capabilities. Authored-by: Ryan Blue <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?

This is a followup of apache#24012 to fix two documentation issues:
1. `SupportsRead` and `SupportsWrite` are no longer internal. They are public interfaces now.
2. `Scan` should link to `BATCH_READ` instead of hardcoding it.

## How was this patch tested?

N/A

Closes apache#24285 from cloud-fan/doc.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?

This adds a new method, `capabilities`, to `v2.Table` that returns a set of `TableCapability`. Capabilities are used to fail queries during analysis checks, `V2WriteSupportCheck`, when the table does not support operations, like truncation.

## How was this patch tested?

Existing tests for regressions, added new analysis suite, `V2WriteSupportCheckSuite`, for new capability checks.
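As a rough illustration of how a capability set lets a check reject a query before execution, here is a minimal Java sketch. It is not the code of `V2WriteSupportCheck`: the `Table` interface and enum are simplified stand-ins, and `AnalysisFailure`, `tableOf`, and `checkWrite` are hypothetical names introduced only for this example.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

final class WriteSupportCheckSketch {
    // Stand-in enum; the constants are illustrative, not a complete list.
    enum TableCapability { BATCH_READ, BATCH_WRITE, TRUNCATE }

    // Stand-in for a table that reports its capabilities as a set.
    interface Table {
        String name();
        Set<TableCapability> capabilities();
    }

    // Hypothetical exception type standing in for an analysis-time failure.
    static final class AnalysisFailure extends RuntimeException {
        AnalysisFailure(String message) { super(message); }
    }

    // Hypothetical helper that builds a table reporting the given capabilities.
    static Table tableOf(String name, TableCapability first, TableCapability... rest) {
        Set<TableCapability> caps = Collections.unmodifiableSet(EnumSet.of(first, rest));
        return new Table() {
            @Override public String name() { return name; }
            @Override public Set<TableCapability> capabilities() { return caps; }
        };
    }

    // Rejects a write up front when the table lacks a required capability.
    static void checkWrite(Table table, boolean truncate) {
        Set<TableCapability> caps = table.capabilities();
        if (!caps.contains(TableCapability.BATCH_WRITE)) {
            throw new AnalysisFailure("Table does not support batch write: " + table.name());
        }
        if (truncate && !caps.contains(TableCapability.TRUNCATE)) {
            throw new AnalysisFailure("Table does not support truncate: " + table.name());
        }
    }
}
```

For example, a table created with `tableOf("append_only", TableCapability.BATCH_WRITE)` passes `checkWrite(table, false)` but is rejected by `checkWrite(table, true)`.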