[SPARK-45057][CORE] Avoid acquire read lock when keepReadLock is false #43067

warrenzhu25 · 2023-09-23T00:08:41Z

What changes were proposed in this pull request?

Add a keepReadLock parameter to lockNewBlockForWriting(). When keepReadLock is false, skip lockForReading() to avoid blocking on the read lock and a potential deadlock.
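As a rough illustration of that control flow, here is a toy model. It is not Spark's actual `BlockInfoManager`: `ToyBlockLocks`, `ToyBlockState`, `tryLockForReading`, and `readLockAttempts` are hypothetical names, and the blocking wait is replaced by a non-blocking attempt so the example stays deterministic. It only shows that a caller passing `keepReadLock = false` returns without ever requesting the read lock.

```scala
import scala.collection.mutable

// Toy per-block lock state; a stand-in for illustration, not Spark's BlockInfo.
final class ToyBlockState {
  var writerTask: Option[Long] = None
  var readerCount: Int = 0
}

final class ToyBlockLocks {
  private val blocks = mutable.Map.empty[String, ToyBlockState]

  // Counts read-lock requests so the skipped acquisition is observable.
  var readLockAttempts: Int = 0

  // Non-blocking stand-in for lockForReading: None means
  // "a blocking implementation would have to wait here".
  def tryLockForReading(blockId: String): Option[ToyBlockState] = {
    readLockAttempts += 1
    blocks.get(blockId).filter(_.writerTask.isEmpty).map { state =>
      state.readerCount += 1
      state
    }
  }

  // Returns true only if this call registered the block and took the write lock.
  def lockNewBlockForWriting(
      blockId: String,
      taskId: Long,
      keepReadLock: Boolean = true): Boolean = {
    blocks.get(blockId) match {
      case None =>
        // New block: register it and hold it for writing.
        val state = new ToyBlockState
        state.writerTask = Some(taskId)
        blocks(blockId) = state
        true
      case Some(_) if !keepReadLock =>
        // Block already exists and the caller does not want a read lock:
        // return immediately without touching the read lock.
        false
      case Some(_) =>
        // Block already exists and the caller wants a read lock.
        // A blocking implementation would wait here for the writer to finish.
        tryLockForReading(blockId)
        false
    }
  }
}

object ToyBlockLocksDemo {
  def main(args: Array[String]): Unit = {
    val locks = new ToyBlockLocks
    println(locks.lockNewBlockForWriting("rdd_0_0", taskId = 1L))
    println(locks.lockNewBlockForWriting("rdd_0_0", taskId = 2L, keepReadLock = false))
    println(locks.readLockAttempts)
  }
}
```

In this sketch the second call finds the block already registered and returns `false` while `readLockAttempts` stays at zero, which is the behavior the new parameter is meant to provide.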

When two tasks try to compute the same RDD with a replication level of 2 while running on only two executors, a deadlock occurs. For details, see [SPARK-45057].

The task thread holds the write lock and waits for replication to the remote executor, while the shuffle server thread handling the block upload request waits on lockForReading in BlockInfoManager.scala.
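The waiting half of that cycle can be sketched with the JDK's `ReentrantReadWriteLock` as a stand-in. This is an assumption made for illustration only: it is not Spark's lock implementation, and `ReaderBlockedByWriter` and `readerGotLock` are hypothetical names. A timed `tryLock` is used instead of a plain `lock()` so the example terminates instead of hanging.

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}
import java.util.concurrent.locks.ReentrantReadWriteLock

object ReaderBlockedByWriter {
  // Returns true if a second thread obtained the read lock within timeoutMs.
  def readerGotLock(writerHolds: Boolean, timeoutMs: Long): Boolean = {
    val lock = new ReentrantReadWriteLock()
    // The calling thread plays the task thread that holds the write lock.
    if (writerHolds) lock.writeLock().lock()
    val pool = Executors.newSingleThreadExecutor()
    try {
      // The pool thread plays the server thread that asks for a read lock.
      val attempt = pool.submit(new Callable[Boolean] {
        override def call(): Boolean = {
          val acquired = lock.readLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS)
          if (acquired) lock.readLock().unlock()
          acquired
        }
      })
      attempt.get()
    } finally {
      pool.shutdown()
      if (writerHolds) lock.writeLock().unlock()
    }
  }

  def main(args: Array[String]): Unit = {
    println(readerGotLock(writerHolds = true, timeoutMs = 200))
    println(readerGotLock(writerHolds = false, timeoutMs = 200))
  }
}
```

With an untimed read-lock request, and a writer that is itself waiting on work that needs that read lock, neither side can make progress. Skipping the read-lock request removes one edge of that cycle.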

Why are the changes needed?

This saves an unnecessary read lock acquisition and avoids the deadlock described above.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test in BlockInfoManagerSuite.

Was this patch authored or co-authored using generative AI tooling?

No

mridulm

Thanks for fixing this @warrenzhu25, and nice explanation!
Would be good to get more eyes on this.
+CC @JoshRosen as well.

dongjoon-hyun

Could you rebase to master branch once more, @warrenzhu25 ?

warrenzhu25 · 2023-09-24T02:12:58Z

> Could you rebase to master branch once more, @warrenzhu25 ?

Rebased.

dongjoon-hyun · 2023-09-24T02:35:28Z

Thank you for updating, @warrenzhu25 .

BTW, GitHub Action seems to have lost your running CI link. Could you post your running GitHub Action link here?

warrenzhu25 · 2023-09-24T02:49:50Z

> Thank you for updating, @warrenzhu25 .
>
> BTW, GitHub Action seems to have lost your running CI link. Could you post your running GitHub Action link here?

Retriggered build.

warrenzhu25 · 2023-09-26T00:08:40Z

@JoshRosen Could you help take a look?

core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala

JoshRosen

LGTM.

This change looks correct to me and it looks like it preserves the important locking behaviors w.r.t. downstream code: the old code would acquire a read lock only to immediately free it when keepReadLock == false and the new code simply avoids acquiring that lock in the first place.

mridulm · 2023-09-27T02:21:36Z

The test failures are unrelated, but can you retrigger them @warrenzhu25 ?
Ideally, would prefer a clean build before merging.

Ngone51

Good catch 👍

warrenzhu25 · 2023-09-28T19:13:03Z

> The test failures are unrelated, but can you retrigger them @warrenzhu25 ? Ideally, would prefer a clean build before merging.

I tried rebuilding several times, but it still seems to fail and hang on LBFGSClusterSuite, which should be unrelated.

### What changes were proposed in this pull request?

Add a `keepReadLock` parameter to `lockNewBlockForWriting()`. When `keepReadLock` is `false`, skip `lockForReading()` to avoid blocking on the read lock and a potential deadlock.

When two tasks try to compute the same RDD with a replication level of 2 while running on only two executors, a deadlock occurs. For details, see [SPARK-45057].

The task thread holds the write lock and waits for replication to the remote executor, while the shuffle server thread handling the block upload request waits on `lockForReading` in [BlockInfoManager.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala#L457C24-L457C24).

### Why are the changes needed?

This saves an unnecessary read lock acquisition and avoids the deadlock described above.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a unit test in BlockInfoManagerSuite.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43067 from warrenzhu25/deadlock.

Authored-by: Warren Zhu <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 0d6fda5)
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

mridulm · 2023-09-28T23:53:49Z

Looks like multiple PRs are impacted by it, and in this case it is not related.
Merging to master, 3.5, 3.4, 3.3

Thanks for fixing this @warrenzhu25 !
Thanks for reviews @JoshRosen, @Ngone51 :-)


github-actions bot added the CORE label Sep 23, 2023

mridulm approved these changes Sep 23, 2023

dongjoon-hyun reviewed Sep 24, 2023

warrenzhu25 force-pushed the deadlock branch from ea8bb1b to bf9c9c6 on September 24, 2023 02:08

warrenzhu25 force-pushed the deadlock branch from bf9c9c6 to 80e9dca on September 24, 2023 02:43

warrenzhu25 force-pushed the deadlock branch 2 times, most recently from a34518e to 6db6858 on September 25, 2023 19:24

JoshRosen reviewed Sep 26, 2023

JoshRosen approved these changes Sep 26, 2023

warrenzhu25 force-pushed the deadlock branch from 6db6858 to c75d121 on September 26, 2023 19:18

warrenzhu25 force-pushed the deadlock branch from c75d121 to 7600779 on September 27, 2023 02:58

Ngone51 approved these changes Sep 27, 2023

warrenzhu25 force-pushed the deadlock branch from 7600779 to 4579cd9 on September 27, 2023 16:08

Avoid acquire read lock when keepReadLock is false

be8c26f

warrenzhu25 force-pushed the deadlock branch from 4579cd9 to be8c26f on September 27, 2023 23:20

mridulm closed this in 0d6fda5 Sep 28, 2023
