@warrenzhu25
Contributor

What changes were proposed in this pull request?

Add a keepReadLock parameter to lockNewBlockForWriting(). When keepReadLock is false, skip lockForReading() to avoid blocking on the read lock and a potential deadlock.

When two tasks try to compute the same RDD with a replication level of 2 while running on only two executors, a deadlock occurs. For details, see [SPARK-45057].

The task thread holds the write lock while waiting for replication to the remote executor, and the shuffle server thread handling the block upload request waits on lockForReading in BlockInfoManager.scala.
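
The wait cycle described above can be pictured as a small wait-for graph. This is only an illustrative reading of the description, not code from the patch: the thread labels (`exec1-task`, `exec2-upload`, and so on) and the `hasCycle` helper are hypothetical, and the sketch assumes that the two executors block each other symmetrically and that the block upload path no longer waits for a read lock once the change is applied.

```scala
object DeadlockSketch {
  // Follow "waits for" edges from `current`; returning to an already visited
  // thread means the threads are stuck in a cycle.
  @annotation.tailrec
  def hasCycle(
      graph: Map[String, String],
      current: String,
      seen: Set[String] = Set.empty): Boolean =
    if (seen(current)) true
    else graph.get(current) match {
      case Some(next) => hasCycle(graph, next, seen + current)
      case None => false
    }

  // Assumed situation before the change (hypothetical thread labels):
  // each task holds the write lock for the block on its own executor and waits
  // for the remote upload, while each upload handler waits in lockForReading
  // for the local task to release that write lock.
  val before: Map[String, String] = Map(
    "exec1-task" -> "exec2-upload",
    "exec2-upload" -> "exec2-task",
    "exec2-task" -> "exec1-upload",
    "exec1-upload" -> "exec1-task"
  )

  // Assumed situation after the change: the upload handlers do not wait for a
  // read lock, so their outgoing edges disappear.
  val after: Map[String, String] = before - "exec1-upload" - "exec2-upload"
}
```

Under these assumptions, `DeadlockSketch.hasCycle(DeadlockSketch.before, "exec1-task")` reports a cycle, while the same call on `DeadlockSketch.after` does not.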

Why are the changes needed?

This saves an unnecessary read lock acquisition and avoids the deadlock described above.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test in BlockInfoManagerSuite.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Sep 23, 2023
Contributor

@mridulm mridulm left a comment

Thanks for fixing this @warrenzhu25, and nice explanation!
Would be good to get more eyes on this.
+CC @JoshRosen as well.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Could you rebase onto the master branch once more, @warrenzhu25?

@warrenzhu25
Contributor Author

> Could you rebase onto the master branch once more, @warrenzhu25?

Rebased.

@dongjoon-hyun
Member

Thank you for updating, @warrenzhu25.

BTW, GitHub Actions seems to have lost the link to your running CI. Could you post your running GitHub Actions link here?

@warrenzhu25
Contributor Author

> Thank you for updating, @warrenzhu25.
>
> BTW, GitHub Actions seems to have lost the link to your running CI. Could you post your running GitHub Actions link here?

Retriggered build.

@warrenzhu25 warrenzhu25 force-pushed the deadlock branch 2 times, most recently from a34518e to 6db6858 Compare September 25, 2023 19:24
@warrenzhu25
Contributor Author

@JoshRosen Could you help take a look?

Contributor

@JoshRosen JoshRosen left a comment

LGTM.

This change looks correct to me and it looks like it preserves the important locking behaviors w.r.t. downstream code: the old code would acquire a read lock only to immediately free it when keepReadLock == false and the new code simply avoids acquiring that lock in the first place.
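
The difference between the two code paths can be illustrated with a toy, single-threaded model. This is a minimal sketch and not Spark's `BlockInfoManager`: the `ToyBlockLocks` class, the `LockOutcome` type, and the String/Long identifiers are hypothetical, and a `WouldBlock` result stands in for the point where a real implementation would wait for the writer to finish.

```scala
import scala.collection.mutable

// Hypothetical outcome type used only in this sketch.
sealed trait LockOutcome
case object WriteLockAcquired extends LockOutcome // caller created the block and holds its write lock
case object AlreadyExists extends LockOutcome     // block exists; no lock was attempted
case object ReadLockAcquired extends LockOutcome  // block exists; caller now holds a read lock
case object WouldBlock extends LockOutcome        // a real implementation would wait here

final class ToyBlockLocks {
  // blockId -> task currently holding the write lock (None once released)
  private val writerOf = mutable.Map.empty[String, Option[Long]]
  // blockId -> number of read locks handed out
  private val readers = mutable.Map.empty[String, Int].withDefaultValue(0)

  def lockNewBlockForWriting(
      blockId: String,
      taskId: Long,
      keepReadLock: Boolean = true): LockOutcome =
    writerOf.get(blockId) match {
      case None =>
        // New block: register it and take the write lock.
        writerOf.update(blockId, Some(taskId))
        WriteLockAcquired
      case Some(_) if !keepReadLock =>
        // Block already exists and the caller does not want a read lock:
        // return immediately without touching the read lock at all.
        AlreadyExists
      case Some(Some(_)) =>
        // Block already exists and is still being written: a read lock
        // request would have to wait for the writer.
        WouldBlock
      case Some(None) =>
        // Block already exists and is not being written: take a read lock.
        readers.update(blockId, readers(blockId) + 1)
        ReadLockAcquired
    }

  def unlockWrite(blockId: String): Unit = writerOf.update(blockId, None)

  def readerCount(blockId: String): Int = readers(blockId)
}
```

In this model, a second caller that passes `keepReadLock = false` while the first task still holds the write lock gets `AlreadyExists` right away, whereas the default `keepReadLock = true` yields `WouldBlock`.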

@mridulm
Contributor

mridulm commented Sep 27, 2023

The test failures are unrelated, but can you retrigger them @warrenzhu25 ?
Ideally, would prefer a clean build before merging.

Member

@Ngone51 Ngone51 left a comment

Good catch 👍

@warrenzhu25
Contributor Author

> The test failures are unrelated, but can you retrigger them @warrenzhu25 ? Ideally, would prefer a clean build before merging.

I retried the build several times, but it still fails and hangs on LBFGSClusterSuite, which should be unrelated.

@mridulm mridulm closed this in 0d6fda5 Sep 28, 2023
mridulm pushed a commit that referenced this pull request Sep 28, 2023
### What changes were proposed in this pull request?
Add a `keepReadLock` parameter to `lockNewBlockForWriting()`. When `keepReadLock` is `false`, skip `lockForReading()` to avoid blocking on the read lock and a potential deadlock.

When two tasks try to compute the same RDD with a replication level of 2 while running on only two executors, a deadlock occurs. For details, see [SPARK-45057].

The task thread holds the write lock while waiting for replication to the remote executor, and the shuffle server thread handling the block upload request waits on `lockForReading` in [BlockInfoManager.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala#L457C24-L457C24).

### Why are the changes needed?
This saves an unnecessary read lock acquisition and avoids the deadlock described above.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a unit test in BlockInfoManagerSuite.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43067 from warrenzhu25/deadlock.

Authored-by: Warren Zhu <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 0d6fda5)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
@mridulm
Contributor

mridulm commented Sep 28, 2023

Looks like multiple PRs are impacted by it, and in this case it is not related.
Merging to master, 3.5, 3.4, 3.3

Thanks for fixing this @warrenzhu25 !
Thanks for reviews @JoshRosen, @Ngone51 :-)

viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
Closes apache#43067 from warrenzhu25/deadlock.

Authored-by: Warren Zhu <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 0d6fda5)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
(cherry picked from commit 68db395)
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
apache#252)

Closes apache#43067 from warrenzhu25/deadlock.

Authored-by: Warren Zhu <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 0d6fda5)

Co-authored-by: Warren Zhu <[email protected]>