[SPARK-26313][SQL] move newScanBuilder from Table to read related mix-in traits
#23266
Conversation
I'm not sure about this. Maybe it's OK to leave schema in Table, and ask write-only tables to report an empty schema.
To me, +1 for the current change.
I think that schema should be a method on Table. Write-only tables still need to access the table's schema to validate a write.
@rdblue . Validating is one of the important use cases, but there are other use cases too. We've already received such a request previously:

> The schema is defined by the dataframe itself, not by the data source, i.e. it should be extracted from df.schema and not by source.createReader
@dongjoon-hyun, the problem that you pointed out was that the schema shouldn't require a reader.
That's what I'm saying here, too: the table should have a schema and it shouldn't need to implement the read path to validate a write using the table schema.
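For reference, a minimal sketch of the shape being argued for here, with the read path moved out but `schema` kept on the base interface (the method set follows the `Table` abstraction from the linked PR #23208, so treat it as an approximation):

```java
import org.apache.spark.sql.types.StructType;

// The position argued here: every table, including a write-only one,
// exposes a schema; only the read path moves out to mix-in traits.
public interface Table {
  // A name to identify this table in error messages and logs.
  String name();

  // The schema of this table, available without implementing any read
  // trait, so a write can be validated against it.
  StructType schema();
}
```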
I should also note that writing to a source that can't be read and doesn't have a consistent schema is far outside the primary uses of this API. Keep in mind that we should design primarily for tables with schemas; it's great to make that use case possible, but we should not alter the design too much to do it.
I'd move the schema stuff to a new, single interface. That would be better for the Single Responsibility Principle. @rdblue, in that case we are good, right? Are there any other concerns?
I don't see the value in moving the schema to a different interface, and I think that moving the schema to an interface specific to the read path is worse because it causes the problem that a table must be readable to be writable.
I think that the use case for a table with no schema is extremely narrow, if not laughably unlikely.
Silly use cases should not cause us to make changes to an API when there is a reasonable alternative: a source with no schema should return an empty schema and signal that it wants to disable write validation (using a capability).
There are two advantages to this approach:
- Implementations need to implement `schema`, and validation rules are enabled by default. If an implementation needs to remember to add a `HasSchema` interface to turn on validation rules, then the default is wrong. The result would be that people less familiar with the API will not have validated writes, and that's a problem.
- The API is simpler and easier to understand.
I'm -1 on moving this from Table.
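To make the capability alternative concrete, here is a minimal sketch. The capability API did not exist at the time of this discussion; Spark later added `TableCapability.ACCEPT_ANY_SCHEMA` along these lines, and the sketch below uses that later Spark 3 package layout:

```java
import java.util.Collections;
import java.util.Set;

import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableCapability;
import org.apache.spark.sql.types.StructType;

// Hypothetical write-only sink: schema() stays on Table, reporting an
// empty schema, and a capability opts the table out of write validation.
public class SchemalessSink implements Table {
  @Override
  public String name() {
    return "schemaless-sink";
  }

  @Override
  public StructType schema() {
    // No fixed schema: report an empty one rather than dropping the method.
    return new StructType();
  }

  @Override
  public Set<TableCapability> capabilities() {
    // Disable write-side schema validation for this table.
    return Collections.singleton(TableCapability.ACCEPT_ANY_SCHEMA);
  }
}
```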
Okay. You think so. But, obviously, we already wasted a lot of time on this laughably unlikely stuff, didn't we?
Did we? What are the examples of sources that are write-only and don't need schema validation? I could be wrong here, but I didn't think that there were many.
Test build #99886 has finished for PR 23266 at commit
According to this update, maybe change "a logical structured data set" -> "a logical named data set"?
+1
HyukjinKwon
left a comment
To me +1, but some more input might be needed (maybe from @rdblue).
Why would `newScanBuilder` be exposed by a batch interface? I think this should be `SupportsRead` instead.
Checking whether a table can be used for batch processing should be done using a different interface.
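A rough sketch of the layout this comment argues for, which is what the PR ends up with. The surrounding types (`Table`, `ScanBuilder`, `DataSourceOptions`) come from the DSv2 API of that era; the exact parameter type of `newScanBuilder` is an assumption:

```java
// newScanBuilder lives on a generic SupportsRead trait; the
// batch/micro-batch/continuous interfaces only mark how the
// returned scan may be consumed.
interface SupportsRead extends Table {
  // Called by Spark once for each scan of this table.
  ScanBuilder newScanBuilder(DataSourceOptions options);
}

public interface SupportsBatchRead extends SupportsRead { }
```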
This should be "A mix-in interface for readable tables ... This adds `newScanBuilder` used to create a scan for batch, micro-batch, or continuous processing."
This doesn't need to explain how ScanBuilder works.
Spark will call this method to configure each scan. Using "scanning query" implies that it will be called just once per query, which isn't true if the table is scanned twice in a query.
Why does this say "later"?
It's a little weird to have a table without a schema. For a sink that can accept data in any schema, the data source API might not be a good option.
retest this please
Could you update the title? Because we move only one method: `newScanBuilder` from `Table` to read-related mix-in traits.
Test build #100025 has finished for PR 23266 at commit
+1
```diff
  */
 @Evolving
-public interface SupportsBatchRead extends Table { }
+public interface SupportsBatchRead extends SupportsRead { }
```
@cloud-fan What about the following?
```diff
-public interface SupportsBatchRead extends SupportsRead { }
+public interface SupportsBatchRead extends Table, SupportsRead { }
```

```diff
-interface SupportsRead extends Table {
+interface SupportsRead {
```
`SupportsRead` only makes sense in the table context. It also simplifies the code from

```java
public interface SupportsBatchRead extends Table, SupportsRead { }
public interface SupportsMicroBatchRead extends Table, SupportsRead { }
public interface SupportsContinuousRead extends Table, SupportsRead { }
```

to

```java
public interface SupportsBatchRead extends SupportsRead { }
public interface SupportsMicroBatchRead extends SupportsRead { }
public interface SupportsContinuousRead extends SupportsRead { }
```

BTW, `SupportsRead` is package-private.
Test build #100028 has finished for PR 23266 at commit
HyukjinKwon
left a comment
LGTM!
retest this please
Test build #100080 has finished for PR 23266 at commit
retest this please
Test build #100086 has finished for PR 23266 at commit
Merged to master!
What changes were proposed in this pull request?
As discussed in https://github.com/apache/spark/pull/23208/files#r239684490, we should put `newScanBuilder` in read-related mix-in traits like `SupportsBatchRead`, to support write-only tables.

In the `Append` operator, we should skip schema validation if not necessary. In the future we will introduce a capability API, so that a data source can tell Spark that it doesn't want to do validation.

How was this patch tested?
Existing tests.

Closes apache#23266 from cloud-fan/ds-read.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
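To illustrate what the new layout buys on the consumer side, here is a hedged sketch, not Spark's actual planner code: the helper method and its error message are invented for this example, and the `DataSourceOptions` parameter mirrors the read API of that era.

```java
// Sketch: dispatching on the new trait. A write-only table simply
// doesn't implement SupportsBatchRead, so no scan can be requested
// from it, while its schema() remains available for write validation.
static Scan buildBatchScan(Table table, DataSourceOptions options) {
  if (table instanceof SupportsBatchRead) {
    // newScanBuilder is only reachable through the read mix-in.
    return ((SupportsBatchRead) table).newScanBuilder(options).build();
  }
  throw new UnsupportedOperationException(
      table.name() + " does not support batch reads");
}
```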