Conversation

Contributor

@cloud-fan cloud-fan commented Dec 9, 2018

What changes were proposed in this pull request?

As discussed in https://github.com/apache/spark/pull/23208/files#r239684490 , we should put newScanBuilder in read-related mix-in traits like SupportsBatchRead, to support write-only tables.

In the Append operator, we should skip schema validation when it is not necessary. In the future we will introduce a capability API, so that a data source can tell Spark that it doesn't want to do validation.
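
For illustration, here is a minimal sketch of the resulting interface layout. The types are simplified stand-ins, not the exact Spark source; the point is that newScanBuilder lives on a read-related mix-in, so a write-only table only needs to implement Table:

interface StructType { }   // stand-in for Spark's schema type
interface ScanBuilder { }  // stand-in for Spark's scan builder

interface Table {
  String name();
  StructType schema();  // stays on Table; a write-only table may report an empty schema
}

interface SupportsRead extends Table {
  ScanBuilder newScanBuilder();  // options parameter elided in this sketch
}

interface SupportsBatchRead extends SupportsRead { }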

How was this patch tested?

Existing tests.

@cloud-fan
Contributor Author

cc @rdblue @HyukjinKwon @gatorsmile

Contributor Author

I'm not sure about this. Maybe it's OK to leave schema in Table, and ask write-only tables to report an empty schema.

Member

To me, +1 for the current change.

Contributor

I think that schema should be a method on Table. Write-only tables still need to access the table's schema to validate a write.
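
A minimal sketch of that argument, reusing the stand-in Table and StructType types from the sketch in the description above (the equality test is a simplification; real validation would check schema compatibility rather than strict equality):

final class WriteValidation {
  // Needs only Table.schema(); the read path is never touched.
  static void validate(Table table, StructType querySchema) {
    if (!querySchema.equals(table.schema())) {
      throw new IllegalArgumentException(
          "Cannot write: query schema does not match table schema");
    }
  }
}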

Member

@rdblue. Validation is one important use case, but there are other use cases. We've already received such a request previously:

The schema is defined by the dataframe itself, not by the data source, i.e. it should be extracted from df.schema and not by source.createReader

Contributor

@dongjoon-hyun, the problem that you pointed out was that the schema shouldn't require a reader.
That's what I'm saying here, too: the table should have a schema and it shouldn't need to implement the read path to validate a write using the table schema.

Contributor

I should also note that writing to a source that can't be read and doesn't have a consistent schema is far outside the primary uses of this API. Keep in mind that we should design primarily for tables with schemas; it's great to make that use case possible, but we should not alter the design too much to do it.

Member

I'd move the schema stuff to a new single interface. That would be better for the Single Responsibility Principle. @rdblue, in that case, we are good, right? Are there any other concerns?

I don't see the value in moving the schema to a different interface, and I think that moving the schema to an interface specific to the read path is worse because it causes the problem that a table must be readable to be writable.

Contributor

I think that the use case for a table with no schema is extremely narrow, if not laughably unlikely.

Silly use cases should not cause us to make changes to an API when there is a reasonable alternative: a source with no schema should return an empty schema and signal that it wants to disable write validation (using a capability).

There are two advantages to this approach:

  1. Implementations need to implement schema, and validation rules are enabled by default. If an implementation needs to remember to add a HasSchema interface to turn on validation rules, then the default is wrong. The result would be that people less familiar with the API will not have validated writes, and that's a problem.
  2. The API is simpler and easier to understand.

I'm -1 on moving this from Table.
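
A sketch of that alternative with hypothetical names (no capability API existed at the time of this comment; Spark later added TableCapability.ACCEPT_ANY_SCHEMA along these lines). It reuses the stand-in Table type from the earlier sketch:

enum TableCapability { ACCEPT_ANY_SCHEMA }  // hypothetical name at the time

interface CapabilityTable extends Table {
  java.util.Set<TableCapability> capabilities();
}

final class WriteValidationPolicy {
  // Validation stays on by default; a schema-less source must opt out explicitly.
  static boolean shouldValidate(CapabilityTable table) {
    return !table.capabilities().contains(TableCapability.ACCEPT_ANY_SCHEMA);
  }
}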

Member

Okay. You think so. But, obviously, we already wasted a lot of time on this laughably unlikely stuff, didn't we?

Contributor

Did we? What are the examples of sources that are write-only and don't need schema validation? I could be wrong here, but I didn't think that there were many.

SparkQA commented Dec 9, 2018

Test build #99886 has finished for PR 23266 at commit da520cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

According to this update, maybe a logical structured data set -> a logical named data set?

Member

+1

Member

@HyukjinKwon HyukjinKwon left a comment

To me +1, but some more input might be needed (maybe from @rdblue).

Contributor

Why would newScanBuilder be exposed by a batch interface? I think this should be SupportsRead instead.

Checking whether a table can be used for batch processing should be done using a different interface.

Contributor

This should be "A mix-in interface for readable tables ... This adds newScanBuilder used to create a scan for batch, micro-batch, or continuous processing."

Contributor

This doesn't need to explain how ScanBuilder works.

Contributor

Spark will call this method to configure each scan. Using "scanning query" implies that it will be called just once per query, which isn't true if the table is scanned twice in a query.

Contributor

Why does this say "later"?

@cloud-fan
Contributor Author

It's a little weird to have a table without a schema. I'm leaving schema in Table, with a comment saying that implementations can return an empty schema if the table is not readable.

For a sink that can accept data in any schema, the data source API might not be a good option; Dataset.foreach could be better.
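
A minimal sketch of that alternative using Spark's Java API (the println is a placeholder for a real external sink):

import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

final class ForeachSink {
  // Accepts rows of any schema, without going through data source write validation.
  static void writeAnySchema(Dataset<Row> df) {
    df.foreach((ForeachFunction<Row>) row ->
        System.out.println(row.mkString(",")));  // placeholder sink
  }
}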

@cloud-fan
Contributor Author

retest this please

@dongjoon-hyun
Member

Could you update the title? We move only one method, newScanBuilder, now.

@cloud-fan cloud-fan changed the title [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits [SPARK-26313][SQL] move newScanBuilder from Table to read related mix-in traits Dec 12, 2018

SparkQA commented Dec 12, 2018

Test build #100025 has finished for PR 23266 at commit d9ae0fe.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

rdblue commented Dec 12, 2018

+1

 */
@Evolving
-public interface SupportsBatchRead extends Table { }
+public interface SupportsBatchRead extends SupportsRead { }
Member

@dongjoon-hyun dongjoon-hyun Dec 12, 2018

@cloud-fan What about the following?

-public interface SupportsBatchRead extends SupportsRead { }
+public interface SupportsBatchRead extends Table, SupportsRead { }
-interface SupportsRead extends Table {
+interface SupportsRead {

Contributor Author

SupportsRead only makes sense in the context of a table. It also simplifies the code from

public interface SupportsBatchRead extends Table, SupportsRead { }
public interface SupportsMicroBatchRead extends Table, SupportsRead { }
public interface SupportsContinuousRead extends Table, SupportsRead { }

to

public interface SupportsBatchRead extends SupportsRead { }
public interface SupportsMicroBatchRead extends SupportsRead { }
public interface SupportsContinuousRead extends SupportsRead { }

BTW, SupportsRead is package-private.

SparkQA commented Dec 12, 2018

Test build #100028 has finished for PR 23266 at commit 99e9fe9.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM!

@HyukjinKwon
Member

retest this please

SparkQA commented Dec 13, 2018

Test build #100080 has finished for PR 23266 at commit 99e9fe9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

SparkQA commented Dec 13, 2018

Test build #100086 has finished for PR 23266 at commit 99e9fe9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master!

@asfgit asfgit closed this in 6c1f7ba Dec 13, 2018
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
[SPARK-26313][SQL] move newScanBuilder from Table to read related mix-in traits

Closes apache#23266 from cloud-fan/ds-read.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
[SPARK-26313][SQL] move newScanBuilder from Table to read related mix-in traits

Closes apache#23266 from cloud-fan/ds-read.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>