[SPARK-26265][Core] Fix deadlock in BytesToBytesMap.MapIterator when locking both BytesToBytesMap.MapIterator and TaskMemoryManager #23272
Conversation
|
cc @cloud-fan |
  }

  @Test
  public void avoidDeadlock() throws InterruptedException {
Hi, @viirya. Since this test case reproduces a deadlock, we need timeout logic. Otherwise, the test will hang (instead of failing) if we hit this issue again later.
I've tried several ways to add timeout logic, but none of them work. The deadlock always hangs both the test and the timeout logic.
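For reference, one way a test could notice a monitor deadlock instead of hanging is to poll the JVM's `ThreadMXBean` from a thread that is not part of the deadlock. This is only an illustrative sketch and is not what this PR does: `DeadlockWatchdogSketch`, `startDeliberateDeadlock`, and `deadlockDetected` are hypothetical names, and the two daemon threads simply take two plain locks in opposite order rather than touching `BytesToBytesMap` or `TaskMemoryManager`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch, not part of the Spark test suite.
final class DeadlockWatchdogSketch {

  // Polls the JVM until it reports threads deadlocked on object monitors,
  // or until the timeout elapses. Returns true if a deadlock was reported.
  static boolean deadlockDetected(long timeoutMillis) {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (System.currentTimeMillis() < deadline) {
      // Returns null when no monitor deadlock exists.
      long[] deadlockedIds = threads.findMonitorDeadlockedThreads();
      if (deadlockedIds != null) {
        return true;
      }
      try {
        Thread.sleep(20);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return false;
  }

  // Starts two daemon threads that take the same two locks in opposite order.
  // The latch guarantees each thread holds its first lock before trying the second.
  static void startDeliberateDeadlock() {
    Object lockA = new Object();
    Object lockB = new Object();
    CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
    Thread t1 = new Thread(() -> lockInOrder(lockA, lockB, bothHoldFirstLock));
    Thread t2 = new Thread(() -> lockInOrder(lockB, lockA, bothHoldFirstLock));
    // Daemon threads do not keep the JVM alive once the test has reported failure.
    t1.setDaemon(true);
    t2.setDaemon(true);
    t1.start();
    t2.start();
  }

  private static void lockInOrder(Object first, Object second, CountDownLatch latch) {
    synchronized (first) {
      latch.countDown();
      try {
        latch.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      synchronized (second) {
        // Never reached: the other thread already holds this lock.
      }
    }
  }

  public static void main(String[] args) {
    startDeliberateDeadlock();
    System.out.println("deadlock detected: " + deadlockDetected(5000L));
  }
}
```

The watchdog only helps if the code under test runs on threads other than the one doing the polling, since a thread blocked on a monitor cannot be interrupted.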
core/src/test/java/org/apache/spark/memory/TestMemoryConsumer.java
|
Have you seen any bug reports caused by this deadlock?
|
Test build #99900 has finished for PR 23272 at commit
|
The original reporter of the JIRA ticket SPARK-26265 hit this bug in their workload.
    assertFalse(iter.hasNext());
  } finally {
    map.free();
    thread.join();
Is this line where the test hangs without the fix?
Without this line, the test still hangs. The test thread deadlocks with the other thread, which is running memoryConsumer.
This line just makes sure that memoryConsumer ends and frees the memory it acquired.
|
Test build #99914 has finished for PR 23272 at commit
|
|
Test build #99915 has finished for PR 23272 at commit
|
|
I think the page is used exclusively by the map and the iterator, so it
could not be freed by another consumer.
On Tue, Dec 11, 2018, 10:23 Wenchen Fan wrote:
In core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java:
> @@ -283,6 +290,9 @@ private void advanceToNextPage() {
}
}
}
+ if (pageToFree != null) {
+ freePage(pageToFree);
Is it possible that this page has already been freed by another consumer?
|
core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java
    try {
      int i;
      for (i = 0; i < 1024; i++) {
Let's use `for (int i = 0; ...` here and on line 708, because `i` is not referenced outside of the for loop.
Never mind. I found that this is the convention in this test suite.
|
Oh, you mean that the page could be freed by another consumer using this map or iterator.
Is that a problem?
I think it should not be the case that more than one consumer frees the same
page at the same time.
On Tue, Dec 11, 2018, 11:34 Wenchen Fan wrote:
In core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java:
> @@ -283,6 +290,9 @@ private void advanceToNextPage() {
}
}
}
+ if (pageToFree != null) {
+ freePage(pageToFree);
`MapIterator.spill` is called by `BytesToBytesMap.spill`, which can be
called by other consumers.
|
|
If you are worried that the page could be freed both by another consumer using the
map iterator and by the map iterator itself: I am not in front of my laptop,
so I can't check right now, but I guess `freePage` already covers that case.
|
|
LGTM. Last question: does the test always reproduce the bug, or does it have some randomness?
|
Test build #99956 has finished for PR 23272 at commit
|
Without the change, I ran the test locally 10 times and it reproduced the bug all 10 times. But I'm not sure it reproduces the bug 100% of the time. I don't think a deadlock like this can always be reproduced.
|
thanks, merging to master! Can you send a new PR for 2.4 without the |
Ok. Thanks. |
|
Test build #99966 has finished for PR 23272 at commit
|
|
I think the failed tests are unrelated. cc @cloud-fan |
## What changes were proposed in this pull request?

Based on the [comment](#23272 (comment)), it seems better to put `freePage` into a `finally` block. This follow-up patch does so.

## How was this patch tested?

Existing tests.

Closes #23294 from viirya/SPARK-26265-followup.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 1b604c1)
Signed-off-by: Hyukjin Kwon <[email protected]>
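As a rough illustration of why the `finally` block matters: if the code that runs under the iterator lock throws, a page that was already detached from the iterator would otherwise never be returned. The sketch below uses hypothetical names (`FinallyFreeSketch`, `advance`, a counter standing in for `freePage`) and is not the actual Spark code.

```java
// Hypothetical sketch; the counter stands in for returning a page to the memory manager.
final class FinallyFreeSketch {
  private int freedPages = 0;
  private Object currentPage = new Object();

  // Detaches the current page under the lock, then frees it outside the lock.
  // The free sits in a finally block so it still runs if the locked section throws.
  void advance(boolean failWhileLocked) {
    Object pageToFree = null;
    try {
      synchronized (this) {
        pageToFree = currentPage;
        currentPage = null;
        if (failWhileLocked) {
          throw new IllegalStateException("simulated failure while advancing");
        }
      }
    } finally {
      if (pageToFree != null) {
        freePage(pageToFree);
      }
    }
  }

  private void freePage(Object page) {
    freedPages++;
  }

  int freedPages() {
    return freedPages;
  }

  public static void main(String[] args) {
    FinallyFreeSketch sketch = new FinallyFreeSketch();
    try {
      sketch.advance(true);
    } catch (IllegalStateException expected) {
      // The failure propagates, but the page has still been freed.
    }
    System.out.println("pages freed: " + sketch.freedPages());
  }
}
```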
…locking both BytesToBytesMap.MapIterator and TaskMemoryManager

## What changes were proposed in this pull request?

In `BytesToBytesMap.MapIterator.advanceToNextPage`, we first lock this `MapIterator` and then `TaskMemoryManager` when freeing a memory page by calling `freePage`. At the same time, another memory consumer may first lock `TaskMemoryManager` and then this `MapIterator` when it acquires memory and causes spilling on this `MapIterator`.

The result is that the thread advancing the `MapIterator` holds the lock on the `MapIterator` object and waits for the lock on `TaskMemoryManager`, while the other consumer holds the lock on `TaskMemoryManager` and waits for the lock on the `MapIterator` object.

To avoid the deadlock, this patch keeps a reference to the page to free and frees it after releasing the lock on `MapIterator`.

## How was this patch tested?

Added a test, and manually ran the test 100 times to make sure there is no deadlock.

Closes apache#23272 from viirya/SPARK-26265.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?

In `BytesToBytesMap.MapIterator.advanceToNextPage`, we first lock this `MapIterator` and then `TaskMemoryManager` when freeing a memory page by calling `freePage`. At the same time, another memory consumer may first lock `TaskMemoryManager` and then this `MapIterator` when it acquires memory and causes spilling on this `MapIterator`.

The result is that the thread advancing the `MapIterator` holds the lock on the `MapIterator` object and waits for the lock on `TaskMemoryManager`, while the other consumer holds the lock on `TaskMemoryManager` and waits for the lock on the `MapIterator` object.

To avoid the deadlock, this patch keeps a reference to the page to free and frees it after releasing the lock on `MapIterator`.

How was this patch tested?

Added a test, and manually ran the test 100 times to make sure there is no deadlock.
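The lock ordering described above can be sketched in isolation. This is a simplified model under stated assumptions, not the Spark implementation: `Manager` and `PageIterator` are hypothetical stand-ins for `TaskMemoryManager` and `MapIterator`, pages are plain objects, and `acquireMemory` models only the "lock the manager, then spill the iterator" path.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of the fix; none of these classes are Spark's.
final class DeferredFreeSketch {

  // Stand-in for TaskMemoryManager: its monitor guards the page bookkeeping.
  static final class Manager {
    private int freedPages = 0;

    synchronized void freePage(Object page) {
      freedPages++;
    }

    // Models another consumer acquiring memory: manager lock first, then iterator lock.
    synchronized void acquireMemory(PageIterator victim) {
      victim.spill();
    }

    synchronized int freedPages() {
      return freedPages;
    }
  }

  // Stand-in for MapIterator.
  static final class PageIterator {
    private final Manager manager;
    private final Deque<Object> pages = new ArrayDeque<>();
    private Object currentPage;

    PageIterator(Manager manager, int pageCount) {
      this.manager = manager;
      for (int i = 0; i < pageCount; i++) {
        pages.add(new Object());
      }
    }

    // Remembers the page to free while holding the iterator lock, and only
    // calls into the manager after that lock has been released.
    boolean advanceToNextPage() {
      Object pageToFree;
      boolean advanced;
      synchronized (this) {
        pageToFree = currentPage;
        currentPage = pages.poll();
        advanced = currentPage != null;
      }
      if (pageToFree != null) {
        manager.freePage(pageToFree); // takes only the manager lock
      }
      return advanced;
    }

    synchronized void spill() {
      // Nothing to spill in this model; only the lock acquisition matters.
    }
  }

  // Runs a reader and a spilling consumer concurrently. Returns true if both
  // finish within the timeout and every page was freed exactly once.
  static boolean runConcurrently(int pageCount, long timeoutMillis) {
    Manager manager = new Manager();
    PageIterator iterator = new PageIterator(manager, pageCount);
    Thread reader = new Thread(() -> {
      while (iterator.advanceToNextPage()) {
        // keep advancing until the pages run out
      }
    });
    Thread spiller = new Thread(() -> {
      for (int i = 0; i < pageCount; i++) {
        manager.acquireMemory(iterator);
      }
    });
    reader.setDaemon(true);
    spiller.setDaemon(true);
    reader.start();
    spiller.start();
    try {
      reader.join(timeoutMillis);
      spiller.join(timeoutMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !reader.isAlive() && !spiller.isAlive() && manager.freedPages() == pageCount;
  }

  public static void main(String[] args) {
    System.out.println("finished without deadlock: " + runConcurrently(100_000, 10_000L));
  }
}
```

Moving the `manager.freePage(pageToFree)` call back inside the `synchronized (this)` block recreates the lock-order inversion: the reader would hold the iterator lock while waiting for the manager, and the spiller would hold the manager lock while waiting for the iterator.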