[SPARK-27341][CORE] fix a deadlock between TaskMemoryManager and UnsafeExternalSorter #24269
Test build #104183 has finished for PR 24269 at commit
Is it possible that lastPage != null on line 550 above when it is set? https://github.com/apache/spark/pull/24269/files#diff-027299fb14327ddcaba457f81ecff32cR550
It's possible, but it doesn't matter. It just means we delay releasing the page a little bit.
Note that both methods are inside synchronized (this).
Ngone51
left a comment
LGTM
attilapiros
left a comment
This seems to be solved twice, as there is another PR dealing with the same problem (different Jira): #24265
small nit, but LGTM otherwise
      numRecords--;
      upstream.loadNext();
    } finally {
      if (pageToFree != null) freePage(pageToFree);
Nit: add braces (according to the Oracle Java style, if statements always use braces, {}).
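For readers following the snippet under review, the pattern it appears to use can be sketched as follows: remember the page while holding the iterator's monitor, and only call freePage after that monitor has been released. This is a minimal stand-alone sketch, not the PR's code. Manager, Iter, the String page type, and the heldOwnLockWhileFreeing flag are hypothetical stand-ins introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, not Spark classes: Manager plays the role of
// TaskMemoryManager and Iter the role of SpillableIterator.
final class DeferredFreeSketch {

    static final class Manager {
        final List<String> freedPages = new ArrayList<>();

        // Freeing a page takes the manager's monitor.
        synchronized void freePage(String page) {
            freedPages.add(page);
        }
    }

    static final class Iter {
        private final Manager manager;
        private String lastPage;
        private int numRecords;
        // Records whether the iterator's own monitor was held during freePage.
        boolean heldOwnLockWhileFreeing;

        Iter(Manager manager, String lastPage, int numRecords) {
            this.manager = manager;
            this.lastPage = lastPage;
            this.numRecords = numRecords;
        }

        void loadNext() {
            String pageToFree = null;
            try {
                synchronized (this) {
                    if (lastPage != null) {
                        // Only remember the page here; do not free it under this lock.
                        pageToFree = lastPage;
                        lastPage = null;
                    }
                    numRecords--;
                }
            } finally {
                if (pageToFree != null) {
                    // The iterator's monitor has been released by this point.
                    heldOwnLockWhileFreeing = Thread.holdsLock(this);
                    manager.freePage(pageToFree);
                }
            }
        }
    }

    public static void main(String[] args) {
        Manager manager = new Manager();
        Iter it = new Iter(manager, "page-0", 3);
        it.loadNext();
        System.out.println("freed=" + manager.freedPages
            + " heldLock=" + it.heldOwnLockWhileFreeing);
    }
}
```

Because the thread holds at most one of the two monitors at any time in this sketch, it cannot take part in a lock-order cycle. The cost is the one mentioned earlier in the thread: the page is released slightly later.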
Ah, I didn't see that PR. Since that PR was opened earlier, I'm closing mine.
Test build #104213 has finished for PR 24269 at commit
What changes were proposed in this pull request?
This is a long-standing bug.
In TaskMemoryManager.acquireExecutionMemory, we may lock TaskMemoryManager and spill MemoryConsumers. UnsafeExternalSorter is a MemoryConsumer and locks its SpillableIterator during spill.

In UnsafeExternalSorter#SpillableIterator.loadNext, we lock SpillableIterator and may call freePage, which locks TaskMemoryManager.

If there are 2 threads doing these 2 locking chains together, a deadlock happens:

thread1: lock TaskMemoryManager and then SpillableIterator
thread2: lock SpillableIterator and then TaskMemoryManager

As an example, PythonUDFExec launches 2 threads for one task, which can trigger this deadlock.
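The two locking chains above can be modeled outside Spark. The sketch below is illustrative only: it replaces the two monitors with ReentrantLock instances (named memoryManagerLock and iteratorLock here, both assumptions) and uses tryLock with a timeout so that the program reports the inversion instead of hanging. The class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative model only: the two locks stand in for the monitors of
// TaskMemoryManager and SpillableIterator; none of this is Spark code.
final class LockOrderInversionDemo {

    // Takes `first`, waits until the other thread also holds its first lock,
    // then tries `second` with a timeout. Returns true if `second` was acquired.
    private static boolean lockInOrder(ReentrantLock first, ReentrantLock second,
                                       CountDownLatch bothHoldFirst,
                                       CountDownLatch bothAttempted) {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            boolean acquired = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                second.unlock();
            }
            // Keep holding `first` until both attempts have resolved.
            bothAttempted.countDown();
            bothAttempted.await();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            first.unlock();
        }
    }

    // Runs both chains concurrently and counts the threads that could not
    // obtain their second lock.
    static int countBlockedThreads() {
        ReentrantLock memoryManagerLock = new ReentrantLock();
        ReentrantLock iteratorLock = new ReentrantLock();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothAttempted = new CountDownLatch(2);
        AtomicInteger blocked = new AtomicInteger();

        // thread1: manager lock, then iterator lock (the spill path).
        Thread thread1 = new Thread(() -> {
            if (!lockInOrder(memoryManagerLock, iteratorLock, bothHoldFirst, bothAttempted)) {
                blocked.incrementAndGet();
            }
        });
        // thread2: iterator lock, then manager lock (the loadNext/freePage path).
        Thread thread2 = new Thread(() -> {
            if (!lockInOrder(iteratorLock, memoryManagerLock, bothHoldFirst, bothAttempted)) {
                blocked.incrementAndGet();
            }
        });
        thread1.start();
        thread2.start();
        try {
            thread1.join();
            thread2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return blocked.get();
    }

    public static void main(String[] args) {
        System.out.println("threads blocked on their second lock: " + countBlockedThreads());
    }
}
```

In this model each thread holds its first lock until both attempts have finished, so neither tryLock can succeed and countBlockedThreads returns 2. With plain synchronized blocks, as in the code described above, the same interleaving would block both threads forever.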
The thread dump when the deadlock happens:
thread 1
thread 2
How was this patch tested?
Manual test: some sync points were inserted in the code to force the code flow that triggers the deadlock.
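The sync-point idea can be illustrated with a small stand-alone harness. This is not the test used for this patch. It is a sketch under stated assumptions: Manager and Iter are hypothetical stand-ins, the CountDownLatch instances act as the sync points, and the freeInsideLock flag switches between freeing under the iterator's monitor and freeing after it is released.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical harness: latches act as sync points that force the
// interleaving in which each thread already holds its first monitor.
final class SyncPointTestSketch {

    static final class Manager {
        synchronized void freePage() { }

        // Models the acquire path: hold the manager monitor, then spill.
        synchronized void acquireAndSpill(Iter it, CountDownLatch managerHeld) {
            managerHeld.countDown(); // sync point: manager monitor is held
            it.spill();              // needs the iterator monitor
        }
    }

    static final class Iter {
        private final Manager manager;
        private final boolean freeInsideLock;

        Iter(Manager manager, boolean freeInsideLock) {
            this.manager = manager;
            this.freeInsideLock = freeInsideLock;
        }

        synchronized void spill() { }

        void loadNext(CountDownLatch iteratorHeld, CountDownLatch managerHeld)
                throws InterruptedException {
            boolean pageToFree = false;
            synchronized (this) {
                iteratorHeld.countDown(); // sync point: iterator monitor is held
                managerHeld.await();      // wait until the other thread holds the manager
                if (freeInsideLock) {
                    manager.freePage();   // second monitor taken while holding the first
                } else {
                    pageToFree = true;    // defer the free until the monitor is released
                }
            }
            if (pageToFree) {
                manager.freePage();
            }
        }
    }

    // Returns true if both threads finish within the join timeouts.
    static boolean completes(boolean freeInsideLock) {
        Manager manager = new Manager();
        Iter it = new Iter(manager, freeInsideLock);
        CountDownLatch iteratorHeld = new CountDownLatch(1);
        CountDownLatch managerHeld = new CountDownLatch(1);

        Thread loader = new Thread(() -> {
            try {
                it.loadNext(iteratorHeld, managerHeld);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread spiller = new Thread(() -> {
            try {
                iteratorHeld.await();
                manager.acquireAndSpill(it, managerHeld);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Daemon threads, so a deadlocked pair cannot keep the JVM alive.
        loader.setDaemon(true);
        spiller.setDaemon(true);
        loader.start();
        spiller.start();
        try {
            loader.join(1000);
            spiller.join(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return !loader.isAlive() && !spiller.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("deferred free completes: " + completes(false));
        System.out.println("free inside lock completes: " + completes(true));
    }
}
```

With the latches in place the interleaving is forced rather than left to timing: the deferred-free variant finishes, while the variant that frees inside the lock leaves both threads blocked until the join timeouts expire.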