
Conversation

@wForget (Member) commented Jul 30, 2022

What changes were proposed in this pull request?

Add a forceUseStagingDir config to force the use of a staging directory when writing.

When forceUseStagingDir is set to true, InsertIntoHadoopFsRelationCommand sets committerOutputPath to the staging dir, and in the HadoopMapReduceCommitProtocol.newTaskTempFile method I calculate the absolute output dir and call newTaskTempFileAbsPath.
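A minimal sketch of what that could look like (hypothetical: the class name and forceUseStagingDir flag are illustrative names taken from this PR's description, not the merged code, and the method signatures follow older Spark releases, so treat them as approximate):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

// Illustrative sketch only, not the actual patch: when the (assumed)
// forceUseStagingDir flag is on, every task file is registered with its
// absolute final directory, so commitJob renames it out of the staging
// dir into the real partition directory.
class ForceStagingCommitProtocol(
    jobId: String,
    path: String,
    forceUseStagingDir: Boolean)
  extends HadoopMapReduceCommitProtocol(jobId, path) {

  override def newTaskTempFile(
      taskContext: TaskAttemptContext,
      dir: Option[String],
      ext: String): String = {
    if (forceUseStagingDir) {
      // Resolve the final partition directory under the table location.
      val absoluteDir = dir.map(d => new Path(path, d).toString).getOrElse(path)
      newTaskTempFileAbsPath(taskContext, absoluteDir, ext)
    } else {
      super.newTaskTempFile(taskContext, dir, ext)
    }
  }
}
```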

Why are the changes needed?

As discussed in SPARK-37210, errors or data loss may occur in some concurrent write scenarios.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added test case in InsertSuite.

@wForget (Member Author) commented Jul 30, 2022

Hi @dongjoon-hyun, could you please help review this?

@AmplabJenkins

Can one of the admins verify this patch?

@dongjoon-hyun changed the title from "[SPARK-37210] Allow forced use of staging directory" to "[SPARK-37210][CORE][SQL] Allow forced use of staging directory" on Aug 1, 2022
@dongjoon-hyun (Member) commented

Thank you for making a PR, @wForget .

To @viirya and @sunchao: this issue has a reproducible example in the JIRA.

@viirya (Member) left a comment

Why is this an issue particular to InsertIntoHadoopFsRelationCommand?

@wForget (Member Author) commented Aug 2, 2022

Why is this an issue particular to InsertIntoHadoopFsRelationCommand?

InsertIntoHiveTable always uses the Hive staging dir:

val tmpLocation = getExternalTmpPath(sparkSession, hadoopConf, tableLocation)

InsertIntoHadoopFsRelationCommand, on the other hand, only uses the Spark staging dir in dynamic partition overwrite mode; otherwise it writes through table_location/_temporary, which leads to concurrency conflicts.
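In condensed form, the behavior described above is roughly the following (a paraphrase for illustration, not the exact Spark source):

```scala
import org.apache.hadoop.fs.Path

// Paraphrase of the current behavior: only dynamic partition overwrite
// gets a job-unique staging dir; every other mode commits through the
// table location, where FileOutputCommitter keeps its task output under
// a <tableLocation>/_temporary directory shared by all running jobs.
def committerOutputPath(
    tableLocation: Path,
    jobId: String,
    dynamicPartitionOverwrite: Boolean): Path =
  if (dynamicPartitionOverwrite) {
    new Path(tableLocation, s".spark-staging-$jobId") // unique per job
  } else {
    tableLocation // tasks write under <tableLocation>/_temporary
  }
```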

@viirya (Member) left a comment

The use case looks suspicious to me. Is it a valid one? I'm not sure that InsertIntoHadoopFsRelationCommand guarantees concurrent writes to the same table.

@wForget (Member Author) commented Aug 3, 2022

The use case looks suspicious to me. Is it a valid one? I'm not sure that InsertIntoHadoopFsRelationCommand guarantees concurrent writes to the same table.

It seems like a reasonable requirement to concurrently write to different partitions of the same table. Are there any blocking issues?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions bot added the Stale label on Nov 12, 2022
@github-actions bot closed this on Nov 13, 2022
@zhengchenyu (Contributor) commented

@viirya @wForget @dongjoon-hyun
I ran into the same phenomenon. Here is the audit log:

# first rename op
cmd=rename	src=/user/testuser/testdb.db/test_table/_temporary/0/task_xxx/pt=20250908000000
dst=/user/testuser/testdb.db/test_table/pt=20250908000000	
# second delete op
cmd=delete	src=/user/testuser/testdb.db/test_table/_temporary

Any application writing a partition will delete /user/testuser/testdb.db/test_table/_temporary. When multiple applications for different partitions are running at the same time, data loss may occur. We can solve this problem by replacing /user/testuser/testdb.db/test_table/_temporary with a unique directory.

How about reopening this PR? And I don't think the use case is suspicious. For example, if I want to recalculate the partition data for the last month, I will run multiple applications in parallel.
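A sketch of that scenario (hypothetical table and source names; each INSERT runs in its own concurrently launched application with its own spark session, matching the audit log above):

```scala
// Application A: overwrite one partition of the shared table.
spark.sql(
  """INSERT OVERWRITE TABLE testdb.test_table PARTITION (pt='20250908000000')
    |SELECT * FROM source_a""".stripMargin)

// Application B, launched at the same time: a different partition.
spark.sql(
  """INSERT OVERWRITE TABLE testdb.test_table PARTITION (pt='20250909000000')
    |SELECT * FROM source_b""".stripMargin)

// Both applications stage task output under the shared
// /user/testuser/testdb.db/test_table/_temporary directory, so whichever
// job commits first deletes that directory and can wipe out the other
// application's still-uncommitted task files.
```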

@zhengchenyu (Contributor) commented

@viirya @dongjoon-hyun @wForget

After some research, I discovered that the .spark-staging-xxx directory is only used for custom partition paths (introduced in #15814) and dynamic partition overwrite (introduced in #18714, with appropriate modifications in #29000). I suspect .spark-staging-xxx was introduced to avoid conflicts, for example to prevent data contamination in dynamic partition overwrite scenarios.

I believe the issue of running multiple partition applications in parallel is similar to the two cases above. Could we make writing to .spark-staging-xxx the default behavior? That would not only solve this problem but also make the code structure cleaner.
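For reference, the job-unique staging path those two features use is derived from the job ID, roughly like this (a sketch of what FileCommitProtocol.getStagingDir does as of #29000; treat the details as approximate):

```scala
import org.apache.hadoop.fs.Path

// A per-job ".spark-staging-<jobId>" directory under the output path,
// so concurrent jobs never share a temporary directory.
def getStagingDir(path: String, jobId: String): Path =
  new Path(path, ".spark-staging-" + jobId)

// e.g. getStagingDir("/user/testuser/testdb.db/test_table", "job1")
//   => /user/testuser/testdb.db/test_table/.spark-staging-job1
```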

new Path(Option(f.getWorkPath).map(_.toString).getOrElse(path))
case _ => new Path(path)
}
if (forceUseStagingDir && !dynamicPartitionOverwrite) {
Contributor left a comment on the diff above:

When spark.sql.hive.convertMetastoreParquet or spark.sql.hive.convertMetastoreOrc is false, the Hive SerDe is used. We also call newTaskTempFileAbsPath in that case, which triggers a rename here. I suspect this conflicts with the Hive SerDe logic.
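For example (an illustrative snippet showing the configuration being referred to):

```scala
// With the conversion disabled, inserts into a Parquet-backed Hive table
// go through the Hive SerDe path (InsertIntoHiveTable) instead of
// InsertIntoHadoopFsRelationCommand.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
```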

