Conversation

Contributor

@cloud-fan cloud-fan commented Dec 9, 2018

What changes were proposed in this pull request?

As discussed in https://github.com/apache/spark/pull/23208/files#r239684490 , we should put newScanBuilder in read-related mix-in traits like SupportsBatchRead, to support write-only tables.

In the Append operator, we should skip schema validation when it is not necessary. In the future we will introduce a capability API, so that a data source can tell Spark that it doesn't want to do validation.
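
For illustration, here is a minimal sketch of the resulting interface layout. The types are simplified stand-ins, not the exact Spark source; the point is that newScanBuilder lives on a read-related mix-in, so a write-only table only needs to implement Table:

interface StructType { }   // stand-in for Spark's schema type
interface ScanBuilder { }  // stand-in for Spark's scan builder

interface Table {
  String name();
  StructType schema();  // stays on Table; a write-only table may report an empty schema
}

interface SupportsRead extends Table {
  ScanBuilder newScanBuilder();  // options parameter elided in this sketch
}

interface SupportsBatchRead extends SupportsRead { }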

How was this patch tested?

Existing tests.

@cloud-fan
Contributor Author

cc @rdblue @HyukjinKwon @gatorsmile

Contributor Author

I'm not sure about this. Maybe it's OK to leave schema in Table, and ask write-only tables to report an empty schema.

Member

To me, +1 for the current change.

Contributor

I think that schema should be a method on Table. Write-only tables still need to access the table's schema to validate a write.
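
A minimal sketch of that argument, reusing the stand-in Table and StructType types from the sketch in the description above (the equality test is a simplification; real validation would check schema compatibility rather than strict equality):

final class WriteValidation {
  // Needs only Table.schema(); the read path is never touched.
  static void validate(Table table, StructType querySchema) {
    if (!querySchema.equals(table.schema())) {
      throw new IllegalArgumentException(
          "Cannot write: query schema does not match table schema");
    }
  }
}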

Member

@rdblue. Validation is one important use case, but there are other use cases. We've already received such a request previously:

The schema is defined by the dataframe itself, not by the data source, i.e. it should be extracted from df.schema and not by source.createReader

Contributor

@dongjoon-hyun, the problem that you pointed out was that the schema shouldn't require a reader.
That's what I'm saying here, too: the table should have a schema and it shouldn't need to implement the read path to validate a write using the table schema.

Contributor

I should also note that writing to a source that can't be read and doesn't have a consistent schema is far outside the primary uses of this API. Keep in mind that we should design primarily for tables with schemas; it's great to make that use case possible, but we should not alter the design too much to do it.

Member

I'd move the schema stuff to a new single interface. That would be better for the Single Responsibility Principle. @rdblue, in that case, we are good, right? Are there any other concerns?

I don't see the value in moving the schema to a different interface, and I think that moving the schema to an interface specific to the read path is worse because it causes the problem that a table must be readable to be writable.

Contributor

I think that the use case for a table with no schema is extremely narrow, if not laughably unlikely.

Silly use cases should not cause us to make changes to an API when there is a reasonable alternative: a source with no schema should return an empty schema and signal that it wants to disable write validation (using a capability).

There are two advantages to this approach:

  1. Implementations need to implement schema, and validation rules are enabled by default. If an implementation needs to remember to add a HasSchema interface to turn on validation rules, then the default is wrong. The result would be that people less familiar with the API will not have validated writes, and that's a problem.
  2. The API is simpler and easier to understand.

I'm -1 on moving this from Table.
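
A sketch of that alternative with hypothetical names (no capability API existed at the time of this comment; Spark later added TableCapability.ACCEPT_ANY_SCHEMA along these lines). It reuses the stand-in Table type from the earlier sketch:

enum TableCapability { ACCEPT_ANY_SCHEMA }  // hypothetical name at the time

interface CapabilityTable extends Table {
  java.util.Set<TableCapability> capabilities();
}

final class WriteValidationPolicy {
  // Validation stays on by default; a schema-less source must opt out explicitly.
  static boolean shouldValidate(CapabilityTable table) {
    return !table.capabilities().contains(TableCapability.ACCEPT_ANY_SCHEMA);
  }
}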

Member

Okay. You think so. But, obviously, we already wasted a lot of time on this laughably unlikely stuff, didn't we?

Contributor

Did we? What are the examples of sources that are write-only and don't need schema validation? I could be wrong here, but I didn't think that there were many.

SparkQA commented Dec 9, 2018

Test build #99886 has finished for PR 23266 at commit da520cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

According to this update, maybe a logical structured data set -> a logical named data set?

Member

+1

Member

@HyukjinKwon HyukjinKwon left a comment

To me +1, but some more input might be needed (maybe from @rdblue).

Contributor

Why would newScanBuilder be exposed by a batch interface? I think this should be SupportsRead instead.

Checking whether a table can be used for batch processing should be done using a different interface.

Contributor

This should be "A mix-in interface for readable tables ... This adds newScanBuilder used to create a scan for batch, micro-batch, or continuous processing."

Contributor

This doesn't need to explain how ScanBuilder works.

Contributor

Spark will call this method to configure each scan. Using "scanning query" implies that it will be called just once per query, which isn't true if the table is scanned twice in a query.

Contributor

Why does this say "later"?

@cloud-fan
Contributor Author

It's a little weird to have a table without a schema. I'm leaving schema in Table, with a comment saying that implementations can return an empty schema if the table is not readable.

For a sink that can accept data in any schema, the data source API might not be a good option; Dataset.foreach could be better.
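
A minimal sketch of that alternative using Spark's Java API (the println is a placeholder for a real external sink):

import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

final class ForeachSink {
  // Accepts rows of any schema, without going through data source write validation.
  static void writeAnySchema(Dataset<Row> df) {
    df.foreach((ForeachFunction<Row>) row ->
        System.out.println(row.mkString(",")));  // placeholder sink
  }
}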

@cloud-fan
Contributor Author

retest this please

@dongjoon-hyun
Member

Could you update the title? We move only one method, newScanBuilder, now.

@cloud-fan cloud-fan changed the title [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits [SPARK-26313][SQL] move newScanBuilder from Table to read related mix-in traits Dec 12, 2018

SparkQA commented Dec 12, 2018

Test build #100025 has finished for PR 23266 at commit d9ae0fe.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

rdblue commented Dec 12, 2018

+1

 */
@Evolving
-public interface SupportsBatchRead extends Table { }
+public interface SupportsBatchRead extends SupportsRead { }
Member

@dongjoon-hyun dongjoon-hyun Dec 12, 2018

@cloud-fan What about the following?

-public interface SupportsBatchRead extends SupportsRead { }
+public interface SupportsBatchRead extends Table, SupportsRead { }
-interface SupportsRead extends Table {
+interface SupportsRead {

Contributor Author

SupportsRead only makes sense in the context of a table. It also simplifies the code from

public interface SupportsBatchRead extends Table, SupportsRead { }
public interface SupportsMicroBatchRead extends Table, SupportsRead { }
public interface SupportsContinuousRead extends Table, SupportsRead { }

to

public interface SupportsBatchRead extends SupportsRead { }
public interface SupportsMicroBatchRead extends SupportsRead { }
public interface SupportsContinuousRead extends SupportsRead { }

BTW, SupportsRead is package-private.

SparkQA commented Dec 12, 2018

Test build #100028 has finished for PR 23266 at commit 99e9fe9.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM!

@HyukjinKwon
Member

retest this please

SparkQA commented Dec 13, 2018

Test build #100080 has finished for PR 23266 at commit 99e9fe9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

SparkQA commented Dec 13, 2018

Test build #100086 has finished for PR 23266 at commit 99e9fe9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master!

@asfgit asfgit closed this in 6c1f7ba Dec 13, 2018
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
[SPARK-26313][SQL] move newScanBuilder from Table to read related mix-in traits

Closes apache#23266 from cloud-fan/ds-read.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
[SPARK-26313][SQL] move newScanBuilder from Table to read related mix-in traits

Closes apache#23266 from cloud-fan/ds-read.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>