Skip to content

Conversation

@Udbhav30
Copy link
Contributor

What changes were proposed in this pull request?

Instead of deleting the data, we can move the data to trash.
Based on the configuration provided by the user it will be deleted permanently from the trash.

Why are the changes needed?

Instead of directly deleting the data, we can provide flexibility to move data to the trash and then delete it permanently.

Does this PR introduce any user-facing change?

Yes, After insert overwrite the data is not permanently deleted now.
It is first moved to the trash and then after the given time deleted permanently;

How was this patch tested?

Manually

@Udbhav30 Udbhav30 force-pushed the trashSupport branch 2 times, most recently from c2ab531 to 9f6729a Compare July 31, 2020 13:08
@maropu
Copy link
Member

maropu commented Jul 31, 2020

Instead of directly deleting the data, we can provide flexibility to move data to the trash and then delete it permanently.

hm, we need to move data into a trash even though users intend to OVERWRITE it with this command? Probably, you need to describe more about what's a benefit for users.

@Udbhav30
Copy link
Contributor Author

Instead of directly deleting the data, we can provide flexibility to move data to the trash and then delete it permanently.

hm, we need to move data into a trash even though users intend to OVERWRITE it with this command? Probably, you need to describe more about what's a benefit for users.

A similar jira was raised for truncate table also SPARK-32481, I think it is just to be safe to first move it to trash first and it also provides the flexibility to user if they want to refer the deleted data. To not force this implementation I have added a configuration also to disable this.

@Udbhav30 Udbhav30 force-pushed the trashSupport branch 4 times, most recently from 9447091 to 14b040b Compare July 31, 2020 18:19
@Udbhav30 Udbhav30 force-pushed the trashSupport branch 4 times, most recently from 2c21f5d to 8b67fee Compare August 9, 2020 07:21
@Udbhav30
Copy link
Contributor Author

Hi @dongjoon-hyun , can you please review this one also.

@Udbhav30
Copy link
Contributor Author

Gentle ping @dongjoon-hyun

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Aug 26, 2020

Test build #127921 has finished for PR 29319 at commit c7b56a4.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 26, 2020

Test build #127918 has finished for PR 29319 at commit 8b67fee.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 26, 2020

Test build #127923 has finished for PR 29319 at commit 812dd7d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 30, 2020

Test build #128050 has finished for PR 29319 at commit 6be846b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Copy link
Contributor

I am doing this too. this pr can avoid our user write wrong dir path that have data (such as DB's path, happened before).
Move data to trash can make recovery of production data faster in the event of such a disaster.
FYI @maropu

throw new RuntimeException(
s"Cannot remove partition directory '$path'")
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are too many case

  1. Hive version > 2.0 & Hive version < 2.9
  2. insert overwrite table
  3. dynamic partition insert

if (Option(existFile.getPath) != createdTempDir) fs.delete(existFile.getPath, true)
if (Option(existFile.getPath) != createdTempDir) {
Utils.moveToTrashOrDelete(fs, existFile.getPath, isTrashEnabled, hadoopConf)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am doing this too. this pr (such as INSERT OVERWRITE DIRECTION part ) can avoid our user write wrong dir path that have data (such as DB's path, happened before).
Move data to trash can make recovery of production data faster in the event of such a disaster.
FYI @maropu

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 13, 2020
@github-actions github-actions bot closed this Dec 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants