
Conversation

@aokolnychyi (Contributor) commented Feb 4, 2022

What changes were proposed in this pull request?

This PR contains changes to rewrite DELETE operations for V2 data sources that can replace groups of data (e.g. files, partitions).

Why are the changes needed?

These changes are needed to support row-level operations in Spark per SPIP SPARK-35801.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This PR comes with tests.

@github-actions bot added the SQL label Feb 4, 2022
@aokolnychyi (Contributor, Author)

The test failure seems unrelated.

annotations failed mypy checks:
python/pyspark/ml/stat.py:478: error: Item "None" of "Optional[Any]" has no attribute "summary"  [union-attr]
Found 1 error in 1 file (checked 324 source files)

@kazuyukitanimura (Contributor)

> The test failure seems unrelated.
>
> annotations failed mypy checks:
> python/pyspark/ml/stat.py:478: error: Item "None" of "Optional[Any]" has no attribute "summary"  [union-attr]
> Found 1 error in 1 file (checked 324 source files)

I see that there is a revert commit upstream: e34d8ee.

@aokolnychyi (Contributor, Author)

Alright, the tests are green and the PR is ready for a detailed review.

Contributor

is it possible to push down the negated filter in the rewrite plan?

@aokolnychyi (Contributor, Author) commented Feb 16, 2022

We actually have to prevent that (added the new rule to the list of rules that cannot be excluded).

Here is what a DELETE command may look like:

== Analyzed Logical Plan ==
DeleteFromTable (id#88 <= 1)
:- RelationV2[id#88, dep#89] cat.ns1.test_table
+- ReplaceData RelationV2[id#88, dep#89] cat.ns1.test_table
   +- Filter NOT ((id#88 <= 1) <=> true)
      +- RelationV2[id#88, dep#89, _partition#91] cat.ns1.test_table

The condition we should push down to the source is the DELETE condition (id <= 1), not the condition in the Filter on top of the scan. Suppose we have a data source that can replace files, with two files: File A contains IDs 1, 2, 3 and File B contains IDs 5, 6, 7. If we want to delete the record with ID 1, we should push down the actual delete condition (id <= 1) for correct file pruning. Once the data source determines that only File A contains records to delete, we need to read that file in its entirety and determine which records did not match the condition (that's what the Filter on top of the scan does). Those records (IDs 2 and 3 in our example) have to be written back, because the data source can only replace whole files. That's why pushing down the Filter condition would actually be wrong, and we have to prevent it.
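
To make the file example concrete, here is a small, self-contained Scala sketch of the group-replacement semantics described above (a toy model with plain collections; it is not the actual data source or Spark code):

```scala
// Toy model of a group-based DELETE: the source can only replace whole files.
object GroupDeleteToy {
  // Two files and the IDs they contain.
  val files: Map[String, Seq[Int]] = Map(
    "fileA" -> Seq(1, 2, 3),
    "fileB" -> Seq(5, 6, 7))

  // DELETE FROM t WHERE id <= 1
  val deleteCond: Int => Boolean = _ <= 1

  def main(args: Array[String]): Unit = {
    // 1. Push down the DELETE condition to prune files: only files that may
    //    contain matching records have to be rewritten (fileA here).
    val affected = files.filter { case (_, ids) => ids.exists(deleteCond) }

    // 2. Read the affected files fully and keep the records that do NOT match
    //    the condition (the Filter NOT(cond <=> true) on top of the scan).
    val rowsToWriteBack = affected.map { case (file, ids) => file -> ids.filterNot(deleteCond) }

    // fileA is replaced with IDs 2 and 3; fileB is left untouched.
    println(rowsToWriteBack) // Map(fileA -> List(2, 3))
  }
}
```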

@aokolnychyi (Contributor, Author)

Because we need to push down the command condition, I couldn't use the existing rule. If anyone has ideas on how to avoid a separate rule, I'll be glad to do that.

Contributor

This is okay. I think you could probably add a pushdown function in the existing pushdown class that uses the RewrittenRowLevelCommand matcher but returns the ScanBuilderHolder that is now used. But since pushdown for the row-level rewrite commands is so specific, I think it's probably more readable and maintainable over time to use a separate rule like this.

@aokolnychyi (Contributor, Author)

I took another look at V2ScanRelationPushDown. I think we could make filter pushdown work there by adding separate branches for RewrittenRowLevelCommand, but it does not seem to help; it would just make the existing rule even more complicated. Apart from that, we also can't apply the regular logic for aggregate pushdown, as we have to look at the condition in the row-level operation. Essentially, we would have to make sure that none of the logic in the existing rule applies to row-level operations. At this point, I agree that keeping a separate rule seems cleaner.

Contributor

+1 for separate rules. The other one is complicated to allow extra pushdown that isn't needed here.

Member

Following the previous question, how do we know whether the data source can replace files? If it cannot, do we still need to push down the command filter?

@aokolnychyi (Contributor, Author)

If a data source does not support replacing groups, it won't extend SupportsRowLevelOperations and we will fail in the analyzer.

@aokolnychyi (Contributor, Author)

The import looks a bit weird. I can do an aliased import if that's any better.

Contributor

I would probably move DeleteFromTableWithFilters to a follow-up commit since it is an optimization and not needed for correctness.

@cloud-fan (Contributor) commented Feb 23, 2022

Well, Spark can already plan filter-based DELETE today, so not supporting it would be a regression.

@aokolnychyi (Contributor, Author) commented Feb 23, 2022

@cloud-fan, DeleteFromTableWithFilters is an optimization for SupportsRowLevelOperations. Existing deletes with filters would be unaffected. That being said, I am going to combine the existing logic in DataSourceV2Strategy with the optimizer rule I added, as discussed here. That way, we will have the filter conversion logic in just one place. Let me know if you agree with that.

Contributor

+1

@aokolnychyi (Contributor, Author) commented Feb 16, 2022

This optimizer rule contains logic similar to what we have in DataSourceV2Strategy. However, it is done in the optimizer to avoid building a Scan and a Write if a DELETE operation can be handled using filters.
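
For context, here is a minimal, self-contained sketch of the idea (hypothetical, simplified types; these are not Spark's actual classes): the group-based rewrite is skipped only when every conjunct of the DELETE condition translates into a source filter and the source agrees to delete by those filters, so no Scan or Write ever gets built.

```scala
// Hypothetical sketch of a metadata-only DELETE decision; all names are illustrative.
object MetadataDeleteSketch {
  sealed trait Expr
  case class LessThanOrEqual(column: String, value: Int) extends Expr
  case class InSubquery(column: String, sql: String) extends Expr

  sealed trait SourceFilter
  case class SourceLessThanOrEqual(column: String, value: Int) extends SourceFilter

  // Translate a single conjunct into a data source filter, if possible.
  def translate(e: Expr): Option[SourceFilter] = e match {
    case LessThanOrEqual(c, v) => Some(SourceLessThanOrEqual(c, v))
    case _: InSubquery         => None // subqueries cannot become source filters
  }

  trait GroupBasedTable {
    def canDeleteWhere(filters: Seq[SourceFilter]): Boolean
  }

  // The DELETE can be handled by filters alone only if the whole condition translates;
  // otherwise we fall back to the group-based rewrite (scan + filter + replace groups).
  def planDelete(table: GroupBasedTable, conjuncts: Seq[Expr]): String = {
    val translated = conjuncts.map(translate)
    if (translated.forall(_.isDefined) && table.canDeleteWhere(translated.flatten)) {
      "delete using filters (no Scan/Write built)"
    } else {
      "group-based rewrite"
    }
  }
}
```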

Contributor

It would be good to mention why this always finds the read relation rather than constructing the RowLevelCommand with a hard reference to it. My understanding is that it may be changed by the optimizer. It could be removed based on the condition and there may be more than one depending on the planning for UPDATE queries. Is that right?

@aokolnychyi (Contributor, Author)

I kept the minimum required logic for group-based deletes for now. You are right, this extractor will change to support UPDATE and delta-based sources. What about updating the description once we make those changes? For now, there will be exactly one read relation.

Contributor

Is it possible to merge RowLevelCommandScanRelationPushDown into V2ScanRelationPushDown so they're in one place?

@aokolnychyi (Contributor, Author)

Keeping them separate for now as V2ScanRelationPushDown is already complicated and none of that logic applies.

Contributor

After looking at this more, I agree with this direction. There is no need to overcomplicate either case.

@viirya (Member) commented Apr 11, 2022

@dongjoon-hyun @HyukjinKwon do you have any idea about what the GA workflow error is?

Error: Unhandled error: Error: There was a new unsynced commit pushed. Please retrigger the workflow.

@dongjoon-hyun (Member)

Hi, @viirya. It is happening on multiple PRs, so I don't think it's our issue. However, although I looked at their status, there is no clue there either.

@viirya (Member) commented Apr 11, 2022

Thanks @dongjoon-hyun. Then it's weird. Keeping an eye on it...

@dongjoon-hyun (Member)

FYI, this is the code path for the error.

if (runs.data.workflow_runs[0].head_sha != context.payload.pull_request.head.sha) {
  throw new Error('There was a new unsynced commit pushed. Please retrigger the workflow.');
}

@aokolnychyi (Contributor, Author) commented Apr 11, 2022

> Shall we apply filter pushdown twice for simple DELETE execution? E.g. we first push down the DELETE condition to identify the files we need to replace, then we push down the negated DELETE condition to prune the Parquet row groups.

@cloud-fan, I think discarding entire row groups is possible only for DELETEs when the whole condition was successfully translated into data source filters. This isn’t something we can support for other commands like UPDATE or when certain parts of the condition couldn’t be converted to a data source filter (e.g. subquery).

A few points on my mind right now:

  • How will data sources know which condition is for filtering files and which is for filtering row groups without changes to the API?
  • Creating a scan builder in one rule and then configuring it further in another one will make the main planning rule even more complicated than it is today.

Technically, if we simply extend the scan builder API to indicate that the entire condition is being pushed down, it should be sufficient for data sources to discard entire row groups of deleted records. We already pass the SQL command and the condition. Data sources just don't know whether it is the entire condition and whether row groups can be discarded.
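
As a purely hypothetical illustration of that suggestion (this trait is not part of Spark's DSv2 API or of this PR), the extra signal could look roughly like this:

```scala
// Hypothetical sketch only: a scan builder that is told whether the pushed filters
// cover the complete command condition, so it can decide to drop whole row groups.
object FullyPushedConditionSketch {
  case class PushedFilter(column: String, op: String, value: Any)

  trait RowLevelScanBuilderSketch {
    def pushFilters(filters: Seq[PushedFilter]): Unit

    // The missing signal being discussed: true only if `filters` represent the entire
    // DELETE condition, with no residual predicates (e.g. subqueries) left behind.
    def notifyConditionFullyPushed(fullyPushed: Boolean): Unit
  }

  class ExampleScanBuilder extends RowLevelScanBuilderSketch {
    private var filters: Seq[PushedFilter] = Nil
    private var fullyPushed = false

    override def pushFilters(fs: Seq[PushedFilter]): Unit = filters = fs
    override def notifyConditionFullyPushed(f: Boolean): Unit = fullyPushed = f

    // Discarding matching row groups is safe only when the condition was fully pushed.
    def canDiscardMatchingRowGroups: Boolean = fullyPushed && filters.nonEmpty
  }
}
```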


override lazy val isByName: Boolean = false
override lazy val references: AttributeSet = query.outputSet
override lazy val stringArgs: Iterator[Any] = Iterator(table, query, write)
Contributor

nit: these can be val as they are just constants.

@aokolnychyi (Contributor, Author)

Replaced isByName and stringArgs with val. Kept references lazy to avoid computing the attribute set eagerly.

// metadata columns may be needed to request a correct distribution or ordering
// but are not passed back to the data source during writes

table.skipSchemaResolution || (dataInput.size == table.output.size &&
Contributor

Do we really need to check this? The input query is built by Spark and directly reads the table.

@aokolnychyi (Contributor, Author)

It may be redundant in the case of DELETE, but it will be required for UPDATE and MERGE, where the incoming values no longer depend solely on what was read. This will prevent writing nullable values into non-nullable attributes, for instance.
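
To illustrate with a toy check (plain Scala, not the code in this PR): for DELETE the written rows are exactly what was read, so the check trivially holds, while for UPDATE/MERGE the assigned values can change nullability and the check can legitimately fail.

```scala
// Toy output-compatibility check, illustrative only.
object OutputCompatibilitySketch {
  case class Attr(name: String, nullable: Boolean)

  // Incoming attributes must match the table positionally, and a nullable value
  // may not be written into a non-nullable table attribute.
  def compatible(tableAttrs: Seq[Attr], dataInput: Seq[Attr]): Boolean =
    dataInput.size == tableAttrs.size &&
      dataInput.zip(tableAttrs).forall { case (in, out) =>
        in.name == out.name && (out.nullable || !in.nullable)
      }

  def main(args: Array[String]): Unit = {
    val table = Seq(Attr("id", nullable = false), Attr("dep", nullable = true))

    // DELETE: rows come straight from the table, so nullability is preserved.
    println(compatible(table, table)) // true

    // UPDATE/MERGE: an assignment may produce a nullable id, which must be rejected.
    println(compatible(table, Seq(Attr("id", nullable = true), Attr("dep", nullable = true)))) // false
  }
}
```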

pushedFilters.right.get.mkString(", ")
}

val (scan, output) = PushDownUtils.pruneColumns(scanBuilder, relation, relation.output, Nil)
Contributor

This means we don't do column pruning at all. We can make the code a bit simpler:

val scan = scanBuilder.scan
...
DataSourceV2ScanRelation(r, scan, r.output)

@aokolnychyi (Contributor, Author)

You are right that we don't do column pruning, but this makes sure metadata columns are projected. Otherwise, the scan would just report the table attributes.

WriteToDataSourceV2(relation, microBatchWrite, newQuery, customMetrics)

case rd @ ReplaceData(r: DataSourceV2Relation, _, query, _, None) =>
val rowSchema = StructType.fromAttributes(rd.dataInput)
Contributor

Can we simply use rd.originalTable.output?

@aokolnychyi (Contributor, Author) commented Apr 12, 2022

We have to use dataInput as it will hold the correct nullability info for UPDATE and MERGE.

@cloud-fan (Contributor) left a comment

Did one more round of review and left a few more minor comments. Great job!

@aokolnychyi (Contributor, Author)

Thanks for reviewing, @cloud-fan! Could you take one more look? I either addressed comments or replied.

I don't know what happened, but the notify test workflow keeps failing for this PR and tests are not triggered. I tried updating the branch and reopening the PR; neither worked.

@viirya (Member) commented Apr 12, 2022

Currently we can check the PR test results at https://github.com/aokolnychyi/spark/actions/workflows/build_and_test.yml

@aokolnychyi (Contributor, Author)

Looks like the tests are green.


@cloud-fan (Contributor)

thanks, merging to master/3.3!

@cloud-fan closed this in 5a92ecc Apr 13, 2022
cloud-fan pushed a commit that referenced this pull request Apr 13, 2022
…sed sources

This PR contains changes to rewrite DELETE operations for V2 data sources that can replace groups of data (e.g. files, partitions).

These changes are needed to support row-level operations in Spark per SPIP SPARK-35801.

No.

This PR comes with tests.

Closes #35395 from aokolnychyi/spark-38085.

Authored-by: Anton Okolnychyi <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 5a92ecc)
Signed-off-by: Wenchen Fan <[email protected]>
@aokolnychyi (Contributor, Author)

Appreciate all the reviews, @cloud-fan @viirya @huaxingao @rdblue @sunchao @dongjoon-hyun!

@viirya (Member) commented Apr 13, 2022

Thanks @aokolnychyi and all, great work!
