[SPARK-51936][SQL] ReplaceTableAsSelect should overwrite the new table instead of append #50739
Conversation
I am a bit confused. Let's add comments about when setting this.
```scala
    |OPTIONS (PATH '$path')
    |AS VALUES (2, 3)
    |""".stripMargin)
checkAnswer(sql("SELECT * FROM test"), Seq(Row(0, 1), Row(0, 1), Row(1, 2), Row(2, 3)))
```
RTAS with ds v2 source is newly supported in 4.0: #44190
It's not too late to change it as 4.0 is not released yet.
thanks for the review, merging into master/4.0/3.5!
### What changes were proposed in this pull request?
For file source v1, if you do
```scala
Seq(1 -> "a").toDF().write.option("path", p).saveAsTable("t")
Seq(2 -> "b").toDF().write.mode("overwrite").option("path", p).saveAsTable("t")
```
At the end, the data of `t` is `[2, "b"]`, because the v1 command `CreateDataSourceTableAsSelectCommand` uses `Overwrite` mode to write the data to the file directory.
With DS v2, we use the v2 command `ReplaceTableAsSelect`, which writes to the new table with `AppendData`. If the new table still contains the old data, which can happen for file source tables because DROP TABLE does not delete the external location, the behavior diverges from file source v1.
This PR fixes this inconsistency by using `OverwriteByExpression` in `ReplaceTableAsSelect` physical commands.
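The semantic difference can be sketched with a minimal, self-contained Scala model (hypothetical names, not Spark's actual API) of a table whose external location survives DROP TABLE:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical model, not Spark's API: the external directory backing a
// table, which DROP TABLE does not clear for external file source tables.
object RtasSemantics {
  val externalLocation = ArrayBuffer[(Int, String)]()

  def createTableAsSelect(rows: Seq[(Int, String)]): Unit =
    externalLocation ++= rows

  // Before the fix: ReplaceTableAsSelect planned an AppendData write, so
  // rows left behind in the external location survived the "replace".
  def replaceWithAppend(rows: Seq[(Int, String)]): Unit =
    externalLocation ++= rows

  // After the fix: an OverwriteByExpression(true) write deletes existing
  // data first, matching v1's CreateDataSourceTableAsSelectCommand, which
  // writes with Overwrite mode.
  def replaceWithOverwrite(rows: Seq[(Int, String)]): Unit = {
    externalLocation.clear()
    externalLocation ++= rows
  }
}
```

With the append behavior, replacing the table leaves both `(1, "a")` and `(2, "b")` in the location; with overwrite, only `(2, "b")` remains, which matches the v1 result.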
### Why are the changes needed?
Fixes a potential inconsistency between file source v1 and v2. For now we are fine, as file source v2 tables are not supported yet.
This is also helpful for third-party v2 sources that may retain old data in the new table.
### Does this PR introduce _any_ user-facing change?
No, file source v2 table is not supported yet.
### How was this patch tested?
Updated an existing test.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#50739 from cloud-fan/RTAS.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
… StagedDeltaTableV2 (#4919)

#### Which Delta project/connector is this regarding?
- [X] Spark

## Description
### Summary
Added support for TRUNCATE operations in Delta tables by:
- Implementing the `SupportsTruncate` interface in `DeltaV1WriteBuilder`
- Adding the TRUNCATE and OVERWRITE_BY_FILTER capabilities to the table's supported capabilities
- Importing the required `SupportsTruncate` class from the Spark SQL connector API

This change enables proper handling of TRUNCATE TABLE operations on Delta tables.

### Root Cause
Spark 3.5.6 (via apache/spark#50739) introduced stricter validation in the `V2Writes` rule: tables must actually implement the overwrite interfaces, not merely declare the capabilities. Delta tables did not declare the TRUNCATE capability, which caused compatibility problems with Spark's stricter checks for TRUNCATE TABLE operations starting in 3.5.6.

### Fix
Updated the `StagedDeltaTableV2` class in `DeltaCatalog.scala` to properly support overwrite operations.

Resolves #4671

## How was this patch tested?
Tested locally by running scripts to Create and Replace Table.

## Does this PR introduce _any_ user-facing changes?
No

Co-authored-by: Venki Korukanti <[email protected]>
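The root cause can be illustrated with a self-contained Scala sketch (simplified, with hypothetical types standing in for Spark's `WriteBuilder` and `SupportsTruncate` interfaces): the stricter planning rule pattern-matches on the builder's runtime type, so merely declaring a capability is not enough.

```scala
object TruncateCheck {
  // Hypothetical stand-ins for the Spark DSv2 write interfaces.
  trait WriteBuilder { def build(): String }
  trait SupportsTruncate extends WriteBuilder { def truncate(): WriteBuilder }

  // A Delta-like builder that actually implements SupportsTruncate.
  class DeltaLikeWriteBuilder extends WriteBuilder with SupportsTruncate {
    private var truncateFirst = false
    def truncate(): WriteBuilder = { truncateFirst = true; this }
    def build(): String = if (truncateFirst) "truncate-then-write" else "append"
  }

  // A builder that only "declares" support without implementing the trait.
  class PlainBuilder extends WriteBuilder { def build(): String = "append" }

  // Mirrors the stricter rule: truncate is only planned when the builder
  // implements SupportsTruncate; otherwise the write stays an append.
  def planReplaceTableWrite(b: WriteBuilder): String = b match {
    case t: SupportsTruncate => t.truncate().build()
    case other               => other.build()
  }
}
```

In the actual fix, `DeltaV1WriteBuilder` mixes in `org.apache.spark.sql.connector.write.SupportsTruncate`, and the table reports `TableCapability.TRUNCATE` and `TableCapability.OVERWRITE_BY_FILTER`, so the check passes on both the capability and the interface level.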