[SPARK-26811][SQL] Add capabilities to v2.Table #24012

rdblue · 2019-03-07T20:06:36Z

What changes were proposed in this pull request?

This adds a new method, `capabilities`, to `v2.Table` that returns a set of `TableCapability`. Capabilities are used by the analysis check `V2WriteSupportCheck` to fail queries when the table does not support an operation, such as truncation.

How was this patch tested?

Existing tests cover regressions. A new analysis suite, `V2WriteSupportCheckSuite`, covers the new capability checks.
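The capability check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `Table` and `TableCapability` here are simplified stand-ins, the constants `BATCH_WRITE` and `TRUNCATE` are assumed names, `AnalysisCheck.requireCapability` is a hypothetical helper, and `IllegalStateException` stands in for whatever error the real analysis rule raises.

```java
import java.util.EnumSet;
import java.util.Set;

// Simplified stand-ins for illustration; not the actual Spark classes.
enum TableCapability { BATCH_READ, BATCH_WRITE, TRUNCATE }

interface Table {
    String name();

    // The set of operations this table declares it supports.
    Set<TableCapability> capabilities();
}

final class AnalysisCheck {
    // Hypothetical helper: fail before execution when a required capability is missing.
    static void requireCapability(Table table, TableCapability required) {
        if (!table.capabilities().contains(required)) {
            throw new IllegalStateException(
                "Table " + table.name() + " does not support " + required);
        }
    }
}

// Example table that accepts batch writes but cannot be truncated.
final class AppendOnlyTable implements Table {
    public String name() { return "append_only"; }

    public Set<TableCapability> capabilities() {
        return EnumSet.of(TableCapability.BATCH_WRITE);
    }
}
```

With this sketch, a plan that needs `TRUNCATE` on `AppendOnlyTable` is rejected during the check instead of failing partway through a write.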

SparkQA · 2019-03-08T00:54:53Z

Test build #103161 has finished for PR 24012 at commit 8636867.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rdblue · 2019-03-08T16:34:01Z

@cloud-fan, can you take a look at this PR? It adds capabilities like we discussed.

cloud-fan · 2019-03-11T06:05:25Z

sql/core/src/main/java/org/apache/spark/sql/sources/v2/SupportsRead.java

Shall we remove this interface as well? We can move newScanBuilder to Table and throw an exception by default. Tables that report batch/stream scan capability should override newScanBuilder.

rdblue · 2019-03-12

I like having this because it maintains separation between the read/write API and the catalog API. We could update the read and write API later, or add a new one by adding a different read trait, without changing how catalogs and tables work. So I think it is worth keeping SupportsRead and SupportsWrite.
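One way to picture the separation argued for here is the sketch below. It is an illustration only: the interfaces are simplified stand-ins (the `newScanBuilder` shown takes no arguments, which is an assumption), and `ReadPlanner`, `ReadableTable`, and `WriteOnlyTable` are hypothetical names. The table declares what it supports through capabilities, while the read API stays in its own mix-in interface.

```java
import java.util.EnumSet;
import java.util.Set;

// Simplified stand-ins for illustration; not the actual Spark interfaces.
enum TableCapability { BATCH_READ, BATCH_WRITE }

interface ScanBuilder {
    String describe();
}

// Catalog-facing API: identity plus declared capabilities.
interface Table {
    String name();
    Set<TableCapability> capabilities();
}

// Read-facing API, kept as a separate mix-in so it can evolve independently.
interface SupportsRead extends Table {
    ScanBuilder newScanBuilder();
}

final class ReadPlanner {
    // Hypothetical planner step: require both the mix-in and the declared capability.
    static ScanBuilder planBatchScan(Table table) {
        if (table instanceof SupportsRead
                && table.capabilities().contains(TableCapability.BATCH_READ)) {
            return ((SupportsRead) table).newScanBuilder();
        }
        throw new UnsupportedOperationException(
            "Table " + table.name() + " does not support batch reads");
    }
}

final class ReadableTable implements SupportsRead {
    public String name() { return "readable"; }

    public Set<TableCapability> capabilities() {
        return EnumSet.of(TableCapability.BATCH_READ);
    }

    public ScanBuilder newScanBuilder() {
        return () -> "scan of " + name();
    }
}

final class WriteOnlyTable implements Table {
    public String name() { return "write_only"; }

    public Set<TableCapability> capabilities() {
        return EnumSet.of(TableCapability.BATCH_WRITE);
    }
}
```

In this shape, a different read interface could be added later without touching `Table`, which is the point being made in the reply.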

cloud-fan · 2019-03-12T16:25:01Z

sql/core/src/main/java/org/apache/spark/sql/sources/v2/Table.java

I don't think we will have tons of capabilities, so maybe Array is good enough? Array is also more Java/Scala friendly.

rdblue · 2019-03-13

I don't think it's a good idea to use an array when the storage should be a set, just to avoid calling asJava when returning it.

cloud-fan · 2019-03-13T17:40:46Z

LGTM except https://github.com/apache/spark/pull/24012/files#r264765864

rdblue · 2019-03-13T21:07:54Z

@cloud-fan, I've rebased to pick up the changes in master introduced by the move to CaseInsensitiveStringMap. I think this is ready to go.

Although I see your point with returning an Array of capabilities, I think it is better to return a Set. That's how Spark uses the data and I see no reason to use the wrong kind of storage -- which we would no doubt coerce to a set -- just to avoid calling asJava in a few places.
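The trade-off discussed here can be illustrated with a small sketch. The enum is a simplified stand-in, and `CapabilityStorage`, `toSet`, and `supports` are hypothetical names. If the API returned an array, callers that need membership tests would likely coerce it to a set first, so returning a set avoids that extra step and rules out duplicates.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Set;

// Simplified stand-in for illustration; constant names other than BATCH_READ are assumed.
enum TableCapability { BATCH_READ, BATCH_WRITE, TRUNCATE }

final class CapabilityStorage {
    // What a caller would do with an array-returning API: coerce to a set before use.
    static Set<TableCapability> toSet(TableCapability[] declared) {
        Set<TableCapability> set = EnumSet.noneOf(TableCapability.class);
        set.addAll(Arrays.asList(declared));
        return set;
    }

    // Membership test directly against a set, with no conversion needed.
    static boolean supports(Set<TableCapability> capabilities, TableCapability wanted) {
        return capabilities.contains(wanted);
    }
}
```

An array can also carry the same capability twice, which a set cannot, so the set type states the intended semantics more precisely.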

SparkQA · 2019-03-13T22:17:27Z

Test build #103455 has finished for PR 24012 at commit 0d44757.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-03-14T03:16:39Z

sounds good.

rdblue · 2019-03-14T20:21:23Z

Retest this please.

SparkQA · 2019-03-15T01:39:36Z

Test build #103510 has finished for PR 24012 at commit 93c77f5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-03-15T14:53:43Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala

nit: this doesn't need to be lazy val

rdblue · 2019-03-15

I'll fix this since I need to resolve conflicts.

## What changes were proposed in this pull request?

The data source option check_files_exist was introduced in #23383, when the file source V2 framework was implemented. In that PR, FileIndex was created as a member of FileTable so that we could implement partition pruning like 0f9fcab in the future. At that time, `FileIndex`es were always created for file writes, so we needed the option to decide whether to check file existence.

After #23774, the option is no longer needed, since DataFrame writes won't create an unnecessary FileIndex. This PR removes the option.

## How was this patch tested?

Unit test.

Closes #24069 from gengliangwang/removeOptionCheckFilesExist.

Authored-by: Gengliang Wang
Signed-off-by: Wenchen Fan

rdblue · 2019-03-15T18:31:54Z

@cloud-fan, I've fixed the commit conflict caused by 6d22ee3. As I noted on that commit, please do not commit non-functional changes that cause unnecessary conflicts. That problem delayed getting this work in by another day.

I've also removed lazy from that capabilities val.

SparkQA · 2019-03-15T22:48:40Z

Test build #103546 has finished for PR 24012 at commit 69e729e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rdblue · 2019-03-16T19:08:53Z

@cloud-fan, tests are passing on this so it is ready for another look. Thank you!

cloud-fan · 2019-03-18T10:25:22Z

thanks, merging to master!

rdblue · 2019-03-18T16:11:33Z

Thank you for reviewing this, @cloud-fan!

rxin · 2019-04-15T23:11:00Z

Is there a plan documented on what the final API would look like? It's super confusing to have half capability via traits and half capability via enums.

cloud-fan · 2019-04-16T02:01:35Z

#24129 is adding streaming read/write capability. Eventually we should have all the capabilities via enum.

## What changes were proposed in this pull request?

This is a followup of #24012, to add the corresponding capabilities for streaming.

## How was this patch tested?

Existing tests.

Closes #24129 from cloud-fan/capability.

Authored-by: Wenchen Fan
Signed-off-by: Wenchen Fan

Closes apache#24012 from rdblue/SPARK-26811-add-capabilities.

Authored-by: Ryan Blue
Signed-off-by: Wenchen Fan

## What changes were proposed in this pull request?

This is a followup of apache#24012 to fix two documentation issues:

1. `SupportsRead` and `SupportsWrite` are not internal anymore. They are public interfaces now.
2. `Scan` should link to `BATCH_READ` instead of hardcoding it.

## How was this patch tested?

N/A

Closes apache#24285 from cloud-fan/doc.

Authored-by: Wenchen Fan
Signed-off-by: Wenchen Fan

rdblue force-pushed the SPARK-26811-add-capabilities branch from 22f3953 to 23746e7 Compare March 7, 2019 20:12

rdblue changed the title ~~[SPARK-26811][SQL] Add capabilities to v2.Table.~~ [SPARK-26811][SQL] Add capabilities to v2.Table Mar 7, 2019

cloud-fan reviewed Mar 11, 2019

cloud-fan reviewed Mar 12, 2019

rdblue force-pushed the SPARK-26811-add-capabilities branch from 8636867 to 0d44757 Compare March 13, 2019 21:05

cloud-fan reviewed Mar 15, 2019

rdblue added 4 commits March 15, 2019 11:26

bd9fe02 Add capabilities to v2.Table.
d25ba82 Add docs to TableCapability.
520ade1 Fix Kafka source.
69e729e Update V2WriteSupportCheckSuite for CaseInsensitiveStringMap change.

rdblue force-pushed the SPARK-26811-add-capabilities branch from 93c77f5 to 69e729e Compare March 15, 2019 18:27

cloud-fan closed this in e348f14 Mar 18, 2019

cloud-fan mentioned this pull request Mar 18, 2019

[SPARK-27190][SQL] add table capability for streaming #24129

Closed

rdblue deleted the SPARK-26811-add-capabilities branch March 18, 2019 16:11

cloud-fan mentioned this pull request Apr 3, 2019

[SPARK-26811][SQL][followup] fix some documentation #24285

Closed
