[SPARK-45057][CORE] Avoid acquire read lock when keepReadLock is false #43067
Conversation
Thanks for fixing this @warrenzhu25, and nice explanation!
Would be good to get more eyes on this.
+CC @JoshRosen as well.
dongjoon-hyun left a comment
Could you rebase onto the master branch once more, @warrenzhu25?
Force-pushed ea8bb1b to bf9c9c6.
Rebased.

Thank you for updating, @warrenzhu25. BTW, GitHub Actions seems to have lost your running CI link. Could you post your running GitHub Actions link here?
Force-pushed bf9c9c6 to 80e9dca.
Retriggered build.
Force-pushed a34518e to 6db6858.
@JoshRosen Could you help take a look?
Review comment on core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala (outdated, resolved)
JoshRosen left a comment
LGTM.
This change looks correct to me, and it preserves the important locking behaviors w.r.t. downstream code: the old code would acquire a read lock only to immediately free it when `keepReadLock == false`, while the new code simply avoids acquiring that lock in the first place.
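To make the before/after concrete, here is a minimal, self-contained sketch of that difference. It is an illustration only, assuming a plain JDK `ReentrantReadWriteLock` and made-up method names as stand-ins for `BlockInfoManager`'s per-block bookkeeping, not the actual Spark code:

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock

// Toy stand-in for a single block's lock; the real BlockInfoManager tracks
// per-task reader counts and a write-lock owner instead of using a JDK lock.
object KeepReadLockSketch {
  private val lock = new ReentrantReadWriteLock()

  // Old behavior: when the block already exists, take the read lock even if the
  // caller does not want it, then release it immediately. The lock() call can
  // block behind a writer, which is exactly the wait that enabled the deadlock.
  def lockExistingBlockOld(keepReadLock: Boolean): Unit = {
    lock.readLock().lock()
    if (!keepReadLock) lock.readLock().unlock()
  }

  // New behavior: only touch the read lock when the caller wants to keep it,
  // so a caller passing keepReadLock = false can never block here.
  def lockExistingBlockNew(keepReadLock: Boolean): Unit = {
    if (keepReadLock) lock.readLock().lock()
  }

  def main(args: Array[String]): Unit = {
    lockExistingBlockNew(keepReadLock = false) // returns immediately; no lock taken
    println("new path: no read lock acquired")
  }
}
```

Either way the caller ends up holding no read lock when `keepReadLock == false`; the only difference is that the pointless acquire/release pair, and the possible wait behind a writer, is gone.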
Force-pushed 6db6858 to c75d121.
The test failures are unrelated, but can you retrigger them, @warrenzhu25?
Force-pushed c75d121 to 7600779.
Ngone51 left a comment
Good catch 👍
Force-pushed 7600779 to 4579cd9.
Force-pushed 4579cd9 to be8c26f.
I tried to rebuild several times, but it still seems to be failing and hanging on
Looks like multiple PRs are impacted by it, and in this case it is not related. Thanks for fixing this, @warrenzhu25!
Closes #43067 from warrenzhu25/deadlock. Authored-by: Warren Zhu. Signed-off-by: Mridul Muralidharan. (cherry picked from commit 0d6fda5)
What changes were proposed in this pull request?
Add a `keepReadLock` parameter to `lockNewBlockForWriting()`. When `keepReadLock` is `false`, skip `lockForReading()` so the caller neither blocks on the read lock nor risks the potential deadlock described below.

When two tasks try to compute the same RDD with a replication level of 2 while running on only two executors, a deadlock will happen (see SPARK-45057 for details): the task thread holds the write lock and waits for replication to the remote executor, while the shuffle-server thread handling the block-upload request waits on `lockForReading` in BlockInfoManager.scala.
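To show concretely where the shuffle-server thread stalls, here is a minimal, self-contained sketch. A plain JDK `ReentrantReadWriteLock` stands in for the per-block lock and the thread roles are hypothetical; this is not the actual BlockInfoManager code:

```scala
import java.util.concurrent.TimeUnit
import java.util.concurrent.locks.ReentrantReadWriteLock

object ReaderBehindWriterSketch {
  def main(args: Array[String]): Unit = {
    val blockLock = new ReentrantReadWriteLock()

    // "Task thread": holds the write lock while waiting for remote replication.
    blockLock.writeLock().lock()

    // "Shuffle-server handler thread": with the old code it would call
    // lockForReading() here and wait until the writer released the lock; a
    // timed tryLock is used so this sketch terminates instead of hanging.
    val handler = new Thread(() => {
      val acquired = blockLock.readLock().tryLock(500, TimeUnit.MILLISECONDS)
      println(s"handler acquired read lock: $acquired") // false: stuck behind the writer
    })
    handler.start()
    handler.join()

    blockLock.writeLock().unlock()
  }
}
```

Because both executors are in this state at once, neither writer ever releases its lock, so both handlers wait forever. With `keepReadLock = false` the handler skips the read-lock attempt entirely and the cycle cannot form.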
Why are the changes needed?
This saves an unnecessary read-lock acquisition and avoids the deadlock issue mentioned above.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a unit test in BlockInfoManagerSuite.
Was this patch authored or co-authored using generative AI tooling?
No