Conversation

@cloud-fan (Contributor) commented Apr 28, 2025

What changes were proposed in this pull request?

For file source v1, if you do

```
Seq(1 -> "a").toDF().write.option("path", p).saveAsTable("t")
Seq(2 -> "b").toDF().write.mode("overwrite").option("path", p).saveAsTable("t")
```

At the end, the data of `t` is `[2, "b"]`, because the v1 command `CreateDataSourceTableAsSelectCommand` uses `Overwrite` mode to write the data to the file directory.
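For reference, a self-contained version of the snippet above (a sketch assuming a local Spark session; `/tmp/rtas_repro` is an arbitrary scratch path):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val p = "/tmp/rtas_repro"              // arbitrary scratch directory
spark.sql("DROP TABLE IF EXISTS t")    // guard for re-runs
Seq(1 -> "a").toDF().write.option("path", p).saveAsTable("t")
Seq(2 -> "b").toDF().write.mode("overwrite").option("path", p).saveAsTable("t")
spark.table("t").show()                // file source v1: only [2, b] remains
```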

With DS v2, we use the v2 command `ReplaceTableAsSelect`, which uses `AppendData` to write to the new table. If the new table still contains the old data (which can happen for file source tables, since DROP TABLE does not delete the external location), the behavior differs from file source v1.

This PR fixes this inconsistency by using `OverwriteByExpression` in `ReplaceTableAsSelect` physical commands.
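The actual patch touches the physical `ReplaceTableAsSelect` commands; the following is a rough logical-plan-level sketch of the same idea (`newTableRelation` and `query` are placeholders, not names from the patch):

```scala
import org.apache.spark.sql.catalyst.analysis.NamedRelation
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{AppendData, LogicalPlan, OverwriteByExpression}

// Placeholders for the freshly staged table's relation and the RTAS query.
val newTableRelation: NamedRelation = ???
val query: LogicalPlan = ???

// Before: append into the new table. Stale files left in an external
// location survive, so the "replaced" table can still expose old rows.
val before: LogicalPlan = AppendData.byPosition(newTableRelation, query)

// After: overwrite with an always-true delete condition ("delete everything,
// then write"), matching v1's Overwrite mode.
val after: LogicalPlan = OverwriteByExpression.byPosition(
  newTableRelation, query, Literal.TrueLiteral)
```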

Why are the changes needed?

Fixes a potential inconsistency between file source v1 and v2. For now we are fine, as we don't support file source v2 tables yet.
This is also helpful for third-party v2 sources that may retain old data in the new table.

Does this PR introduce any user-facing change?

No, file source v2 tables are not supported yet.

How was this patch tested?

Updated an existing test.

Was this patch authored or co-authored using generative AI tooling?

No.

@cloud-fan (Contributor Author)

cc @gengliangwang

github-actions bot added the SQL label Apr 28, 2025
Member

On the updated test:

```scala
     |OPTIONS (PATH '$path')
     |AS VALUES (2, 3)
     |""".stripMargin)
checkAnswer(sql("SELECT * FROM test"), Seq(Row(0, 1), Row(0, 1), Row(1, 2), Row(2, 3)))
```

I am a bit confused. Let's add comments about when setting this.

Contributor Author
RTAS with DS v2 sources is newly supported in 4.0: #44190

It's not too late to change it as 4.0 is not released yet.

@cloud-fan (Contributor Author)

Thanks for the review, merging into master/4.0/3.5!

cloud-fan closed this in aea5836 Apr 29, 2025
cloud-fan added a commit that referenced this pull request Apr 29, 2025
…e instead of append


Closes #50739 from cloud-fan/RTAS.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan added a commit that referenced this pull request Apr 29, 2025
…e instead of append

yhuang-db pushed a commit to yhuang-db/spark that referenced this pull request Jun 9, 2025
…e instead of append

vkorukanti added a commit to delta-io/delta that referenced this pull request Aug 27, 2025
… StagedDeltaTableV2 (#4919)


#### Which Delta project/connector is this regarding?

- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
### Summary

Added support for TRUNCATE operations in Delta tables by:
- Implementing the `SupportsTruncate` interface in `DeltaV1WriteBuilder`
- Adding TRUNCATE and OVERWRITE_BY_FILTER capabilities to the table's supported capabilities
- Importing the required `SupportsTruncate` class from the Spark SQL connector API

This change enables proper handling of TRUNCATE TABLE operations on
Delta tables.


### Root Cause
Spark's change apache/spark#50739 shipped in version 3.5.6.

Spark 3.5.6 introduced stricter validation in the V2Writes rule (PR #50739), which requires tables to properly implement the overwrite interfaces rather than just declare the capabilities. In particular, it added stricter capability checks for TRUNCATE TABLE operations.

The issue came from Delta tables not declaring the TRUNCATE capability, which broke compatibility with Spark 3.5.6's stricter validation.

### Fix
Updated the StagedDeltaTableV2 class in DeltaCatalog.scala to properly support overwrite operations, roughly as sketched below.
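
A hedged sketch of the shape of the fix (illustrative names such as `ExampleV1WriteBuilder`; not the actual Delta source):

```scala
import java.util.{EnumSet => JEnumSet, Set => JSet}

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.connector.catalog.TableCapability
import org.apache.spark.sql.connector.write.{SupportsTruncate, V1Write, WriteBuilder}
import org.apache.spark.sql.sources.InsertableRelation

// Illustrative v1 write builder: implementing SupportsTruncate lets Spark's
// V2Writes rule turn an overwrite-everything plan into a truncating write
// instead of failing its stricter interface check.
class ExampleV1WriteBuilder extends WriteBuilder with SupportsTruncate {
  private var shouldTruncate = false

  override def truncate(): WriteBuilder = {
    shouldTruncate = true
    this
  }

  override def build(): V1Write = new V1Write {
    override def toInsertableRelation: InsertableRelation =
      new InsertableRelation {
        override def insert(data: DataFrame, overwrite: Boolean): Unit = {
          // A real connector would commit `data` here, first removing the
          // existing rows when `shouldTruncate` (or `overwrite`) is set.
        }
      }
  }
}

// The table itself must also advertise the matching capabilities, roughly:
//   override def capabilities(): JSet[TableCapability] =
//     JEnumSet.of(TableCapability.V1_BATCH_WRITE,
//                 TableCapability.TRUNCATE,
//                 TableCapability.OVERWRITE_BY_FILTER)
```

With both pieces in place, TRUNCATE TABLE and CREATE OR REPLACE writes pass the V2Writes checks instead of being rejected.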


Resolves #4671 

## How was this patch tested?

Tested locally by running scripts that create and replace tables.

## Does this PR introduce _any_ user-facing changes?

No

---------

Co-authored-by: Venki Korukanti <[email protected]>
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 14, 2025
…e instead of append
