
Conversation


@rdblue rdblue commented May 11, 2018

What changes were proposed in this pull request?

Adds a DeleteSupport mix-in for DataSourceV2. This mix-in provides a method to delete data matching Catalyst expressions, in support of DELETE FROM and overwrite logical operations.
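For readers skimming the thread, the mix-in boils down to a single driver-side method. Below is a minimal, self-contained sketch of that shape; the `Filter` type and the `InMemoryTable` source are simplified stand-ins invented for illustration (the real API works against Spark's `org.apache.spark.sql.sources.Filter`, and this PR adds only the interface, not any implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Minimal sketch of the DeleteSupport mix-in shape proposed in this PR.
// Filter stands in for Spark's org.apache.spark.sql.sources.Filter, and
// InMemoryTable is a hypothetical source used only for illustration.
public class DeleteSupportSketch {

  interface Filter extends Predicate<Integer> {}

  interface DeleteSupport {
    // Delete every row matching all filters. This is a driver-side
    // operation, so sources may reject deletes they cannot perform.
    void deleteWhere(Filter[] filters);
  }

  static class InMemoryTable implements DeleteSupport {
    final List<Integer> rows = new ArrayList<>();

    @Override
    public void deleteWhere(Filter[] filters) {
      // Delete a row only when it matches every filter.
      rows.removeIf(row -> {
        for (Filter f : filters) {
          if (!f.test(row)) {
            return false; // keep the row: at least one filter does not match
          }
        }
        return true; // delete the row: all filters match
      });
    }
  }
}
```

The key design point, discussed at length below, is that `deleteWhere` runs on the driver with no tasks or commit protocol involved.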

How was this patch tested?

No tests; this patch only adds an interface.

Contributor

Does putting the delete method here (as opposed to, say, in DataDeleters or some other thing parallel to the DataWriters) imply that this is a driver-side operation only? I understand the use case is deleting partitions, which is usually only a file system operation, but will that always be the case?

Contributor Author

Yes, this is a driver-side operation. That's why the source can reject the delete. Anything that requires a parallel operation should really be implemented as read, filter, and replace data.

Contributor Author

Do you think it would be more clear if this were explicitly a driver-side operation?

Contributor

> Do you think it would be more clear if this were explicitly a driver-side operation?

Possibly. Maybe in the big data world this is already obvious. To me, it looks like a general purpose delete. Maybe deletePartitions? (I am bad at naming things, however).

Contributor Author

There aren't necessarily partitions in these data sources, so I wouldn't add partitions to the method name. I think we can make this more clear with better docs though.

Contributor

nit: is this a duplicate of the paragraph above?

Contributor Author

@rdblue rdblue May 24, 2018

No, these are distinct.

UnsupportedOperationException indicates that the source doesn't understand a filter. For example, it could be date(ts) = '2018-05-13' and the source doesn't support the conversion from timestamp to date.

IllegalArgumentException is thrown when the expression is understood by the source, but the work required to perform the delete is not supported. For example, suppose you have data partitioned by hour(ts) and the delete expression is ts > '2018-05-13T00:05:00' and ts < '2018-05-13T00:10:00'. Deleting a 5-minute window when data is partitioned by hour probably isn't possible without rewriting data files, so the source can reject it.
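To make the second case concrete, here is a hypothetical hour-partitioned source (the class and method names are invented for illustration, not code from this PR) that rejects any delete whose time range does not line up with partition boundaries:

```java
// Hypothetical illustration of the IllegalArgumentException case above:
// a source whose data is partitioned by hour(ts) accepts a delete only
// when the requested time range covers whole hour partitions, and
// rejects ranges that would require rewriting data files.
public class HourPartitionedDeletes {
  static final long HOUR_MS = 60L * 60L * 1000L;

  // Delete rows with startMs <= ts < endMs as a metadata-only partition
  // drop. Returns the number of hour partitions dropped.
  static long deleteRange(long startMs, long endMs) {
    if (startMs % HOUR_MS != 0 || endMs % HOUR_MS != 0) {
      // The filter is understood, but honoring it would mean rewriting
      // files inside a partition, so the source rejects the delete.
      throw new IllegalArgumentException(
          "Delete range must align to hour partition boundaries");
    }
    return (endMs - startMs) / HOUR_MS;
  }
}
```

A whole-hour range succeeds as a cheap metadata operation, while the 5-minute window from the example above is rejected.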

Contributor Author

After updating this to use Filter, the UnsupportedOperationException is no longer needed, so I removed it. That should also cut down on the confusion here.

@rdblue
Contributor Author

rdblue commented Jul 27, 2018

#21888 shows how this is used to implement DELETE FROM.

@rdblue rdblue force-pushed the SPARK-24253-add-v2-delete-support branch from ffbd3cb to db77b9a on August 15, 2018 at 19:32
@rdblue
Contributor Author

rdblue commented Aug 15, 2018

@rxin, I've updated this API to use Filter instead of Expression. I'd ideally like to get it in soon if you guys have a chance to review it. It's pretty small.

cc @cloud-fan


@SparkQA

SparkQA commented Aug 15, 2018

Test build #94818 has finished for PR 21308 at commit e32e6c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue rdblue changed the title SPARK-24253: Add DeleteSupport mix-in for DataSourceV2. [SPARK-24253][SQL] Add DeleteSupport mix-in for DataSourceV2. Aug 20, 2018
@tigerquoll
Contributor

I am assuming this API was intended to support the "drop partition" use case. I'm arguing that adding and deleting partitions deal with a concept slightly higher-level than just a bunch of records that match a filter. Backing this up is the fact that partitions are defined independently of any records they may or may not contain: you can add an empty partition and the underlying state of the system will change.

Also, as an end user I would be very upset if I meant to drop a partition but, because of a transcription error, accidentally started a delete with a filter that didn't exactly match a partition definition and took a million times as long to execute.

Partitions are an implementation optimisation that has leaked into higher-level APIs because it is an extremely useful and performant one. I am wondering if we should represent them in this API as something slightly higher-level than just a filter definition.

@rdblue
Contributor Author

rdblue commented Sep 4, 2018

@tigerquoll, what we come up with needs to work across a variety of data sources, including those like JDBC that can delete at a lower granularity than partition.

For Hive tables, the partition columns are exposed directly, so users would supply a predicate that matches partition columns. A Hive table source would also be free to reject delete requests -- by throwing the documented exception -- that would require rewriting data. This avoids the case you're describing: the predicate must match entire partitions, and the source can reject predicates on non-partition columns, or predicates that can't be cleanly applied as a metadata operation.

@tigerquoll
Contributor

@rdblue what about those data sources that support record deletion and partition dropping as two semantically different operations, Kudu and HBase being two examples?

All systems that support partitions have a different API for dealing with partition-level ops. Even file-based table storage systems support the different levels of manipulation. (Look at the SQL DDL that Impala supports for Parquet partitions for an example: they use a filter, but the command means "this partition op applies to the partition that is defined by this filter", not "apply this op to all records that match this filter".)

The difference is subtle, but it is an important one, and every system that supports partitions enforces that difference for a reason.

@rdblue
Contributor Author

rdblue commented Sep 6, 2018

@tigerquoll, there is currently no support for exposing partitions through the v2 API; that would be a different operation. If you wanted to implement partition operations through this API, then you would need to follow the guarantees specified here: if you need to delete by partition, then the expression must match records at partition boundaries, or the source must reject the delete operation.

@tigerquoll
Contributor

@rdblue I think our debate is whether we should expose an API to represent direct operations on partitions in the new datasource api.

@rdblue
Contributor Author

rdblue commented Sep 7, 2018

@tigerquoll, I'm not debating whether we should or shouldn't expose partitions here. In general, I'm undecided. I don't think that the API proposed here needs to support a first-class partition concept for tables, largely because partitions aren't currently exposed in the v2 API.

The issue you linked to, SPARK-22389, exposes Spark's view of partitioning -- as in repartition(col) -- which is to say data rows are grouped together. That's not the same thing as partitions in a data source that can exist independent of data rows.

@tigerquoll
Contributor

@rdblue when you say you "don't think the API proposed here needs to support a first-class partition concept", are you referring to the "DeleteSupport" interface, or to DataSourceV2 in general?
If you are referring to DeleteSupport, then do you have the same objections to a separate "DropPartition"/"AddPartition" interface?
If you mean that you don't think DataSourceV2 requires supporting partitions as a first-class concept, then how are users of Spark supposed to perform operations like adding, altering, removing, and listing partitions on those data sources that are represented by particular instances of DataSourceV2?

* @param filters filter expressions, used to select rows to delete when all expressions match
* @throws IllegalArgumentException If the delete is rejected due to required effort
*/
void deleteWhere(Filter[] filters);
Contributor

This seems different from what we discussed in the dev list about the new abstraction. I expect to see

Write newDeleteWrite(Filter[] filters);

Am I missing something?

Contributor Author

Maybe it's a little unclear: this delete is not a write. It is a driver-side operation using table metadata, like dropping matching partitions in a Hive table or dropping matching files in an Iceberg table. That way, there are no tasks and we don't need to use the commit protocol.

If we want to filter data files, the overwrite API I've proposed is the right way to do it. Spark could read, filter the rows, and replace all of the files that were read.

If there are files that have both rows that should be removed and rows that should be kept, the source should throw IllegalArgumentException to reject the delete.
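That decision rule can be sketched as follows. This is a hypothetical illustration only (the `DataFile` stats and the `deleteBelow` helper are invented, not the PR's code): using per-file min/max statistics, the driver drops a file when the predicate matches all of its rows, keeps it when the predicate matches none, and rejects the delete when a file straddles the predicate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the driver-side behavior described above: drop
// a data file only when the delete predicate matches all of its rows
// (judged from per-file min/max statistics), and reject the delete when
// a file holds both matching and non-matching rows.
public class MetadataDelete {
  static class DataFile {
    final long min;
    final long max; // per-file column statistics for the filtered column
    DataFile(long min, long max) { this.min = min; this.max = max; }
  }

  // Apply "DELETE WHERE value < threshold" by dropping whole files.
  static List<DataFile> deleteBelow(List<DataFile> files, long threshold) {
    List<DataFile> kept = new ArrayList<>();
    for (DataFile f : files) {
      if (f.max < threshold) {
        continue; // every row matches: drop the file, metadata-only
      }
      if (f.min < threshold) {
        // The file straddles the predicate: a clean metadata delete is
        // impossible, so reject instead of rewriting data on the driver.
        throw new IllegalArgumentException(
            "File contains both matching and non-matching rows");
      }
      kept.add(f); // no row matches: keep the file untouched
    }
    return kept;
  }
}
```

Anything that would hit the rejection branch is exactly the case where the read-filter-replace overwrite path is the right tool instead.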

@rdblue
Contributor Author

rdblue commented Sep 10, 2018

@tigerquoll, I'm talking about the DataSourceV2 API in general. I'm not sure whether I think there is value in exposing partitions, but I'd be happy to hear why you think they are valuable and think through how it would fit with the existing API.

I think that partitions that aren't hidden make tables much harder for users to work with, which is why Iceberg hides partitioning and automatically translates from row filters to partition filters. For Kudu, maybe it is different. Could you write up the use case with a bit more context about what empty partitions are used for, and send it to the dev list?

If we think that the v2 API should expose a partition concept, then that would definitely include a way to add or drop partitions.

@cloud-fan
Contributor

The DELETE support is already merged; closing this.

@cloud-fan cloud-fan closed this Sep 19, 2019