
Conversation

@aokolnychyi (Contributor) commented Feb 4, 2022

What changes were proposed in this pull request?

This PR contains changes to rewrite DELETE operations for V2 data sources that can replace groups of data (e.g. files, partitions).

Why are the changes needed?

These changes are needed to support row-level operations in Spark per SPIP SPARK-35801.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This PR comes with tests.

@github-actions bot added the SQL label Feb 4, 2022
@aokolnychyi (Contributor, Author)

The test failure seems unrelated.

annotations failed mypy checks:
python/pyspark/ml/stat.py:478: error: Item "None" of "Optional[Any]" has no attribute "summary"  [union-attr]
Found 1 error in 1 file (checked 324 source files)

@kazuyukitanimura (Contributor)

> The test failure seems unrelated.
>
> annotations failed mypy checks:
> python/pyspark/ml/stat.py:478: error: Item "None" of "Optional[Any]" has no attribute "summary"  [union-attr]
> Found 1 error in 1 file (checked 324 source files)

I see that there is a revert commit upstream: e34d8ee.

@aokolnychyi (Contributor, Author)

Alright, the tests are green and the PR is ready for a detailed review.

Contributor

is it possible to push down the negated filter in the rewrite plan?

@aokolnychyi (Contributor, Author) commented Feb 16, 2022

We actually have to prevent that (added the new rule to the list of rules that cannot be excluded).

Here is what a DELETE command may look like:

== Analyzed Logical Plan ==
DeleteFromTable (id#88 <= 1)
:- RelationV2[id#88, dep#89] cat.ns1.test_table
+- ReplaceData RelationV2[id#88, dep#89] cat.ns1.test_table
   +- Filter NOT ((id#88 <= 1) <=> true)
      +- RelationV2[id#88, dep#89, _partition#91] cat.ns1.test_table

The condition we should push down to the source is the DELETE condition (id <= 1), not the condition in the Filter on top of the scan. Suppose we have a data source that can replace files, with two files: File A contains IDs 1, 2, 3 and File B contains IDs 5, 6, 7. If we want to delete the record with ID 1, we should push down the actual delete condition (id <= 1) for correct file pruning. Once the data source determines that only File A contains records to delete, we need to read that file in its entirety and determine which records did not match the condition (that's what the Filter on top of the scan does). Those records (IDs 2 and 3 in our example) have to be written back, because the data source can only replace whole files. That's why pushing down the Filter condition would actually be wrong, and we have to prevent it.
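
To make the file example concrete, here is a small, self-contained Scala sketch of the group-replacement semantics described above (a toy model with plain collections; it is not the actual data source or Spark code):

```scala
// Toy model of a group-based DELETE: the source can only replace whole files.
object GroupDeleteToy {
  // Two files and the IDs they contain.
  val files: Map[String, Seq[Int]] = Map(
    "fileA" -> Seq(1, 2, 3),
    "fileB" -> Seq(5, 6, 7))

  // DELETE FROM t WHERE id <= 1
  val deleteCond: Int => Boolean = _ <= 1

  def main(args: Array[String]): Unit = {
    // 1. Push down the DELETE condition to prune files: only files that may
    //    contain matching records have to be rewritten (fileA here).
    val affected = files.filter { case (_, ids) => ids.exists(deleteCond) }

    // 2. Read the affected files fully and keep the records that do NOT match
    //    the condition (the Filter NOT(cond <=> true) on top of the scan).
    val rowsToWriteBack = affected.map { case (file, ids) => file -> ids.filterNot(deleteCond) }

    // fileA is replaced with IDs 2 and 3; fileB is left untouched.
    println(rowsToWriteBack) // Map(fileA -> List(2, 3))
  }
}
```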

@aokolnychyi (Contributor, Author)

Because we need to push down the command condition, I couldn't use the existing rule. If anyone has ideas on how to avoid a separate rule, I'll be glad to do that.

Contributor

This is okay. I think you could probably add a pushdown function in the existing pushdown class that uses the RewrittenRowLevelCommand matcher but returns the ScanBuilderHolder that is now used. But since pushdown for the row-level rewrite commands is so specific, I think it's probably more readable and maintainable over time to use a separate rule like this.

@aokolnychyi (Contributor, Author)

I took another look at V2ScanRelationPushDown. I think we could make filter pushdown work there by adding separate branches for RewrittenRowLevelCommand, but it does not seem to help; it would just make the existing rule even more complicated. Apart from that, we also can't apply the regular logic for aggregate pushdown, as we have to look at the condition in the row-level operation. Essentially, we would have to make sure that none of the logic in the existing rule applies to row-level operations. At this point, I agree that keeping a separate rule seems cleaner.

Contributor

+1 for separate rules. The other one is complicated to allow extra pushdown that isn't needed here.

Member

Following the previous question, how do we know whether the data source can replace files? If it cannot, do we still need to push down the command filter?

@aokolnychyi (Contributor, Author)

If a data source does not support replacing groups, it won't extend SupportsRowLevelOperations and we will fail in the analyzer.

@aokolnychyi (Contributor, Author)

The import looks a bit weird. I can do an aliased import if that's any better.

Contributor

I would probably move DeleteFromTableWithFilters to a follow-up commit since it is an optimization and not needed for correctness.

@cloud-fan (Contributor) commented Feb 23, 2022

Well, Spark can already plan filter-based DELETE today, so not supporting it would be a regression.

@aokolnychyi (Contributor, Author) commented Feb 23, 2022

@cloud-fan, DeleteFromTableWithFilters is an optimization for SupportsRowLevelOperations. Existing deletes with filters would be unaffected. That being said, I am going to combine the existing logic in DataSourceV2Strategy with the optimizer rule I added, as discussed here. That way, we will have the filter conversion logic in just one place. Let me know if you agree with that.

Contributor

+1

@aokolnychyi (Contributor, Author) commented Feb 16, 2022

This optimizer rule contains logic similar to what we have in DataSourceV2Strategy. However, it is done in the optimizer to avoid building a Scan and a Write if a DELETE operation can be handled using filters.
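
For context, here is a minimal, self-contained sketch of the idea (hypothetical, simplified types; these are not Spark's actual classes): the group-based rewrite is skipped only when every conjunct of the DELETE condition translates into a source filter and the source agrees to delete by those filters, so no Scan or Write ever gets built.

```scala
// Hypothetical sketch of a metadata-only DELETE decision; all names are illustrative.
object MetadataDeleteSketch {
  sealed trait Expr
  case class LessThanOrEqual(column: String, value: Int) extends Expr
  case class InSubquery(column: String, sql: String) extends Expr

  sealed trait SourceFilter
  case class SourceLessThanOrEqual(column: String, value: Int) extends SourceFilter

  // Translate a single conjunct into a data source filter, if possible.
  def translate(e: Expr): Option[SourceFilter] = e match {
    case LessThanOrEqual(c, v) => Some(SourceLessThanOrEqual(c, v))
    case _: InSubquery         => None // subqueries cannot become source filters
  }

  trait GroupBasedTable {
    def canDeleteWhere(filters: Seq[SourceFilter]): Boolean
  }

  // The DELETE can be handled by filters alone only if the whole condition translates;
  // otherwise we fall back to the group-based rewrite (scan + filter + replace groups).
  def planDelete(table: GroupBasedTable, conjuncts: Seq[Expr]): String = {
    val translated = conjuncts.map(translate)
    if (translated.forall(_.isDefined) && table.canDeleteWhere(translated.flatten)) {
      "delete using filters (no Scan/Write built)"
    } else {
      "group-based rewrite"
    }
  }
}
```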

Contributor

It would be good to mention why this always finds the read relation rather than constructing the RowLevelCommand with a hard reference to it. My understanding is that it may be changed by the optimizer. It could be removed based on the condition and there may be more than one depending on the planning for UPDATE queries. Is that right?

@aokolnychyi (Contributor, Author)

I kept the minimum required logic for group-based deletes for now. You are right, this extractor will change to support UPDATE and delta-based sources. What about updating the description once we make those changes? For now, there will be exactly one read relation.

Contributor

Is it possible to merge RowLevelCommandScanRelationPushDown into V2ScanRelationPushDown so they're in one place?

@aokolnychyi (Contributor, Author)

Keeping them separate for now as V2ScanRelationPushDown is already complicated and none of that logic applies.

Contributor

After looking at this more, I agree with this direction. There is no need to overcomplicate either case.

@viirya (Member) commented Apr 11, 2022

@dongjoon-hyun @HyukjinKwon do you have any idea about what the GA workflow error is?

Error: Unhandled error: Error: There was a new unsynced commit pushed. Please retrigger the workflow.

@dongjoon-hyun (Member)

Hi, @viirya. It is happening on multiple PRs, so I don't think it's our issue. However, although I looked at their status, there is no clue there either.

@viirya (Member) commented Apr 11, 2022

Thanks @dongjoon-hyun. Then it's weird. Keeping an eye on it...

@dongjoon-hyun (Member)

FYI, this is the code path for the error.

if (runs.data.workflow_runs[0].head_sha != context.payload.pull_request.head.sha) {
  throw new Error('There was a new unsynced commit pushed. Please retrigger the workflow.');
}

@aokolnychyi (Contributor, Author) commented Apr 11, 2022

> Shall we apply filter pushdown twice for simple DELETE execution? E.g. we first push down the DELETE condition to identify the files we need to replace, then we push down the negated DELETE condition to prune the Parquet row groups.

@cloud-fan, I think discarding entire row groups is possible only for DELETEs when the whole condition was successfully translated into data source filters. This isn’t something we can support for other commands like UPDATE or when certain parts of the condition couldn’t be converted to a data source filter (e.g. subquery).

A few points on my mind right now:

  • How will data sources know which condition is for filtering files and which is for filtering row groups without changes to the API?
  • Creating a scan builder in one rule and then configuring it further in another one will make the main planning rule even more complicated than it is today.

Technically, if we simply extend the scan builder API to indicate that the entire condition is being pushed down, it should be sufficient for data sources to discard entire row groups of deleted records. We already pass the SQL command and the condition. Data sources just don't know whether it is the entire condition and whether row groups can be discarded.
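
As a purely hypothetical illustration of that suggestion (this trait is not part of Spark's DSv2 API or of this PR), the extra signal could look roughly like this:

```scala
// Hypothetical sketch only: a scan builder that is told whether the pushed filters
// cover the complete command condition, so it can decide to drop whole row groups.
object FullyPushedConditionSketch {
  case class PushedFilter(column: String, op: String, value: Any)

  trait RowLevelScanBuilderSketch {
    def pushFilters(filters: Seq[PushedFilter]): Unit

    // The missing signal being discussed: true only if `filters` represent the entire
    // DELETE condition, with no residual predicates (e.g. subqueries) left behind.
    def notifyConditionFullyPushed(fullyPushed: Boolean): Unit
  }

  class ExampleScanBuilder extends RowLevelScanBuilderSketch {
    private var filters: Seq[PushedFilter] = Nil
    private var fullyPushed = false

    override def pushFilters(fs: Seq[PushedFilter]): Unit = filters = fs
    override def notifyConditionFullyPushed(f: Boolean): Unit = fullyPushed = f

    // Discarding matching row groups is safe only when the condition was fully pushed.
    def canDiscardMatchingRowGroups: Boolean = fullyPushed && filters.nonEmpty
  }
}
```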


override lazy val isByName: Boolean = false
override lazy val references: AttributeSet = query.outputSet
override lazy val stringArgs: Iterator[Any] = Iterator(table, query, write)
Contributor

nit: these can be val as they are just constants.

@aokolnychyi (Contributor, Author)

Replaced isByName and stringArgs with val. Kept references lazy to avoid computing the attribute set eagerly.

// metadata columns may be needed to request a correct distribution or ordering
// but are not passed back to the data source during writes

table.skipSchemaResolution || (dataInput.size == table.output.size &&
Contributor

Do we really need to check this? The input query is built by Spark and directly reads the table.

@aokolnychyi (Contributor, Author)

It may be redundant in the case of DELETE, but it will be required for UPDATE and MERGE, where the incoming values no longer depend solely on what was read. This will prevent writing nullable values into non-nullable attributes, for instance.
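
To illustrate with a toy check (plain Scala, not the code in this PR): for DELETE the written rows are exactly what was read, so the check trivially holds, while for UPDATE/MERGE the assigned values can change nullability and the check can legitimately fail.

```scala
// Toy output-compatibility check, illustrative only.
object OutputCompatibilitySketch {
  case class Attr(name: String, nullable: Boolean)

  // Incoming attributes must match the table positionally, and a nullable value
  // may not be written into a non-nullable table attribute.
  def compatible(tableAttrs: Seq[Attr], dataInput: Seq[Attr]): Boolean =
    dataInput.size == tableAttrs.size &&
      dataInput.zip(tableAttrs).forall { case (in, out) =>
        in.name == out.name && (out.nullable || !in.nullable)
      }

  def main(args: Array[String]): Unit = {
    val table = Seq(Attr("id", nullable = false), Attr("dep", nullable = true))

    // DELETE: rows come straight from the table, so nullability is preserved.
    println(compatible(table, table)) // true

    // UPDATE/MERGE: an assignment may produce a nullable id, which must be rejected.
    println(compatible(table, Seq(Attr("id", nullable = true), Attr("dep", nullable = true)))) // false
  }
}
```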

pushedFilters.right.get.mkString(", ")
}

val (scan, output) = PushDownUtils.pruneColumns(scanBuilder, relation, relation.output, Nil)
Contributor

This means we don't do column pruning at all. We can make the code a bit simpler:

val scan = scanBuilder.scan
...
DataSourceV2ScanRelation(r, scan, r.output)

@aokolnychyi (Contributor, Author)

You are right that we don't do column pruning, but this makes sure metadata columns are projected. Otherwise, the scan would just report the table attributes.

WriteToDataSourceV2(relation, microBatchWrite, newQuery, customMetrics)

case rd @ ReplaceData(r: DataSourceV2Relation, _, query, _, None) =>
val rowSchema = StructType.fromAttributes(rd.dataInput)
Contributor

Can we simply use rd.originalTable.output?

@aokolnychyi (Contributor, Author) commented Apr 12, 2022

We have to use dataInput as it will hold the correct nullability info for UPDATE and MERGE.

@cloud-fan (Contributor) left a comment

Did one more round of review and left a few more minor comments. Great job!

@aokolnychyi (Contributor, Author)

Thanks for reviewing, @cloud-fan! Could you take one more look? I either addressed comments or replied.

I don't know what happened, but the notify test workflow keeps failing for this PR and tests are not triggered. I tried updating the branch and reopening the PR; neither worked.

@viirya (Member) commented Apr 12, 2022

Currently we can check the PR test results at https://github.com/aokolnychyi/spark/actions/workflows/build_and_test.yml

@aokolnychyi (Contributor, Author)

Looks like the tests are green.


@cloud-fan (Contributor)

thanks, merging to master/3.3!

@cloud-fan closed this in 5a92ecc Apr 13, 2022
cloud-fan pushed a commit that referenced this pull request Apr 13, 2022
…sed sources

This PR contains changes to rewrite DELETE operations for V2 data sources that can replace groups of data (e.g. files, partitions).

These changes are needed to support row-level operations in Spark per SPIP SPARK-35801.

No.

This PR comes with tests.

Closes #35395 from aokolnychyi/spark-38085.

Authored-by: Anton Okolnychyi <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 5a92ecc)
Signed-off-by: Wenchen Fan <[email protected]>
@aokolnychyi (Contributor, Author)

Appreciate all the reviews, @cloud-fan @viirya @huaxingao @rdblue @sunchao @dongjoon-hyun!

@viirya (Member) commented Apr 13, 2022

Thanks @aokolnychyi and all, great work!
