[SPARK-37210][CORE][SQL] Allow forced use of staging directory #37346
Conversation
Hi @dongjoon-hyun, could you please help me review it?
Can one of the admins verify this patch?
viirya
left a comment
Why is it an issue particular to InsertIntoHadoopFsRelationCommand?
spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala (line 107 in b0c831d)

InsertIntoHadoopFsRelationCommand only uses the Spark staging dir in dynamic overwrite mode; otherwise it uses table_location/_temporary, which leads to concurrency conflicts. (Line 171 in b0c831d.)
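The conflict described above can be illustrated outside Spark. The following is a minimal, hypothetical sketch using plain files (not Spark's actual committer code): two jobs stage task output under the same `<table_location>/_temporary` directory, and when the first job commits it removes that shared directory, destroying the second job's in-flight files.

```python
# Hypothetical illustration of the shared _temporary conflict; the directory
# layout mimics FileOutputCommitter but this is not Spark/Hadoop code.
import shutil
import tempfile
from pathlib import Path

table = Path(tempfile.mkdtemp())          # stands in for the table location
temporary = table / "_temporary"

# Job A and job B both stage task output under the same _temporary dir.
(temporary / "job_a").mkdir(parents=True)
(temporary / "job_b").mkdir(parents=True)
(temporary / "job_b" / "part-00000").write_text("job B data")

# Job A finishes first and, like the committer's job cleanup, removes the
# whole _temporary directory -- including job B's still-pending files.
shutil.rmtree(temporary)

job_b_lost = not (temporary / "job_b" / "part-00000").exists()
print(job_b_lost)  # True: job B's uncommitted output is gone
```

A job-scoped staging directory (as this PR proposes) avoids the problem because no two jobs ever share a temporary path.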
viirya
left a comment
The use case looks suspicious to me. Is it a valid one? I'm not sure that InsertIntoHadoopFsRelationCommand guarantees concurrent writing to the same table.

It seems a reasonable requirement to concurrently write to different partitions of the same table. Are there any blocking issues?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
@viirya @wForget @dongjoon-hyun For any partition application, will delete How about reopening this PR? And I think the use case is not suspicious. For example, if I want to recalculate the partition data for the last month, I will run multiple applications in parallel.
@viirya @dongjoon-hyun @wForget After some research, I discovered that the I believe the issue of running multiple partition applications in parallel is similar to the two above. Could we make writing to
```scala
new Path(Option(f.getWorkPath).map(_.toString).getOrElse(path))
case _ => new Path(path)
}
if (forceUseStagingDir && !dynamicPartitionOverwrite) {
```
When spark.sql.hive.convertMetastoreParquet or spark.sql.hive.convertMetastoreOrc is false, the Hive SerDe path is used, and we also call newTaskTempFileAbsPath there, which triggers a rename. I suspect this conflicts with the Hive SerDe logic.
What changes were proposed in this pull request?

Add a `forceUseStagingDir` config to force use of the staging dir when writing. When `forceUseStagingDir` is set to true, I set `committerOutputPath` to the staging dir in `InsertIntoHadoopFsRelationCommand`, and in the `HadoopMapReduceCommitProtocol.newTaskTempFile` method I calculate the absolute dir and call `newTaskTempFileAbsPath`.

Why are the changes needed?
As discussed in SPARK-37210, errors or data loss may occur under some concurrent write scenarios.
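The approach the PR describes can be sketched at the filesystem level. The following is a hypothetical Python illustration (names like `write_with_staging` and the `.spark-staging-*` directory pattern are illustrative, not Spark's actual API): each job writes under its own unique staging directory and only moves finished partitions into the table location at commit time, so concurrent jobs never share a temporary path.

```python
# Hypothetical sketch of per-job staging: unique staging dir per writer,
# rename into place on commit. Not Spark code; plain files for illustration.
import shutil
import tempfile
import uuid
from pathlib import Path

table = Path(tempfile.mkdtemp())          # stands in for the table location

def write_with_staging(partition: str, data: str) -> Path:
    # Unique staging dir per job, echoing the .spark-staging-* convention
    # used for Hive table writes.
    staging = table / f".spark-staging-{uuid.uuid4().hex}"
    out = staging / partition
    out.mkdir(parents=True)
    (out / "part-00000").write_text(data)
    # Commit: replace the target partition with the staged one, then clean up.
    final = table / partition
    if final.exists():
        shutil.rmtree(final)
    shutil.move(str(out), str(final))
    shutil.rmtree(staging)
    return final

# Two "jobs" writing different partitions never touch each other's files.
write_with_staging("dt=2022-07-01", "a")
write_with_staging("dt=2022-07-02", "b")
print(sorted(p.name for p in table.iterdir()))
# ['dt=2022-07-01', 'dt=2022-07-02']
```

Because each writer's intermediate files live under a directory no other job knows about, one job's commit or cleanup cannot delete another job's in-flight output, which is the failure mode described in SPARK-37210.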
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a test case in `InsertSuite`.