Skip to content

Conversation

@venkata91
Copy link
Contributor

@venkata91 venkata91 commented Apr 1, 2019

What changes were proposed in this pull request?

In UnsafeExternalSorter.SpillableIterator#loadNext() takes lock on the UnsafeExternalSorter and calls freePage once the lastPage is consumed which needs to take a lock on TaskMemoryManager. At the same time, there can be another MemoryConsumer using UnsafeExternalSorter as part of sorting can try to allocatePage needs to get lock on TaskMemoryManager which can cause spill to happen which requires lock on UnsafeExternalSorter again causing deadlock. This is a classic deadlock situation happening similar to the SPARK-26265.

To fix this, we can move the freePage call in loadNext outside of Synchronized block similar to the fix in SPARK-26265

How was this patch tested?

Manual tests were being done and will also try to add a test.

@beliefer
Copy link
Contributor

beliefer commented Apr 2, 2019

I checked the code and found UnsafeExternalSorter.split will call the SpillableIterator.split.
SpillableIterator.split need to get lock on UnsafeExternalSorter.
UnsafeExternalSorter.loadNext will call the MemoryConsumer.freePage and MemoryConsumer.freePage will call the TaskMemoryManager.freePage.
TaskMemoryManager.freePage need to get lock on TaskMemoryManager.
I think your PR seems good!

if (lastPage != null) {
freePage(lastPage);
lastPage = null;
MemoryBlock pageToFree = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your PR seems good!

if (nextUpstream != null) {
// Just consumed the last record from in memory iterator
if(lastPage != null) {
pageToFree = lastPage;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add some comment to explain it? You can just copy it from https://github.com/apache/spark/pull/24269/files#diff-027299fb14327ddcaba457f81ecff32cR583

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure @cloud-fan, will add and update the PR

Copy link
Contributor Author

@venkata91 venkata91 Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done @cloud-fan . I didn't know that you also filed a PR simultaneously for the same issue. :) Thanks for graciously closing your PR and accepting this PR.

Copy link
Contributor

@attilapiros attilapiros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for adding some explanation (the comment mentioned by @cloud-fan) otherwise LGTM

@cloud-fan
Copy link
Contributor

ok to test

Copy link
Contributor

@jiangxb1987 jiangxb1987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, please update the comment.

@SparkQA
Copy link

SparkQA commented Apr 2, 2019

Test build #104215 has finished for PR 24265 at commit 6a9de27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 3, 2019

Test build #104255 has finished for PR 24265 at commit 229cfae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in 6c4552c Apr 4, 2019
cloud-fan pushed a commit that referenced this pull request Apr 4, 2019
…rator when locking both UnsafeExternalSorter.SpillableIterator and TaskMemoryManager

## What changes were proposed in this pull request?

In `UnsafeExternalSorter.SpillableIterator#loadNext()` takes lock on the `UnsafeExternalSorter` and calls `freePage` once the `lastPage` is consumed which needs to take a lock on `TaskMemoryManager`. At the same time, there can be another MemoryConsumer using `UnsafeExternalSorter` as part of sorting can try to `allocatePage` needs to get lock on `TaskMemoryManager` which can cause spill to happen which requires lock on `UnsafeExternalSorter` again causing deadlock. This is a classic deadlock situation happening similar to the SPARK-26265.

To fix this, we can move the `freePage` call in `loadNext` outside of `Synchronized` block similar to the fix in SPARK-26265

## How was this patch tested?

Manual tests were being done and will also try to add a test.

Closes #24265 from venkata91/deadlock-sorter.

Authored-by: Venkata krishnan Sowrirajan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6c4552c)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.4/2.3!

cloud-fan pushed a commit that referenced this pull request Apr 4, 2019
…rator when locking both UnsafeExternalSorter.SpillableIterator and TaskMemoryManager

## What changes were proposed in this pull request?

In `UnsafeExternalSorter.SpillableIterator#loadNext()` takes lock on the `UnsafeExternalSorter` and calls `freePage` once the `lastPage` is consumed which needs to take a lock on `TaskMemoryManager`. At the same time, there can be another MemoryConsumer using `UnsafeExternalSorter` as part of sorting can try to `allocatePage` needs to get lock on `TaskMemoryManager` which can cause spill to happen which requires lock on `UnsafeExternalSorter` again causing deadlock. This is a classic deadlock situation happening similar to the SPARK-26265.

To fix this, we can move the `freePage` call in `loadNext` outside of `Synchronized` block similar to the fix in SPARK-26265

## How was this patch tested?

Manual tests were being done and will also try to add a test.

Closes #24265 from venkata91/deadlock-sorter.

Authored-by: Venkata krishnan Sowrirajan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6c4552c)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan added a commit that referenced this pull request Apr 4, 2019
## What changes were proposed in this pull request?

#24265 breaks the lint check, because it has trailing space. (not sure why it passed jenkins). This PR fixes it.

## How was this patch tested?

N/A

Closes #24289 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan added a commit that referenced this pull request Apr 4, 2019
## What changes were proposed in this pull request?

#24265 breaks the lint check, because it has trailing space. (not sure why it passed jenkins). This PR fixes it.

## How was this patch tested?

N/A

Closes #24289 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan added a commit that referenced this pull request Apr 4, 2019
## What changes were proposed in this pull request?

#24265 breaks the lint check, because it has trailing space. (not sure why it passed jenkins). This PR fixes it.

## How was this patch tested?

N/A

Closes #24289 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
…rator when locking both UnsafeExternalSorter.SpillableIterator and TaskMemoryManager

## What changes were proposed in this pull request?

In `UnsafeExternalSorter.SpillableIterator#loadNext()` takes lock on the `UnsafeExternalSorter` and calls `freePage` once the `lastPage` is consumed which needs to take a lock on `TaskMemoryManager`. At the same time, there can be another MemoryConsumer using `UnsafeExternalSorter` as part of sorting can try to `allocatePage` needs to get lock on `TaskMemoryManager` which can cause spill to happen which requires lock on `UnsafeExternalSorter` again causing deadlock. This is a classic deadlock situation happening similar to the SPARK-26265.

To fix this, we can move the `freePage` call in `loadNext` outside of `Synchronized` block similar to the fix in SPARK-26265

## How was this patch tested?

Manual tests were being done and will also try to add a test.

Closes apache#24265 from venkata91/deadlock-sorter.

Authored-by: Venkata krishnan Sowrirajan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6c4552c)
Signed-off-by: Wenchen Fan <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
## What changes were proposed in this pull request?

apache#24265 breaks the lint check, because it has trailing space. (not sure why it passed jenkins). This PR fixes it.

## How was this patch tested?

N/A

Closes apache#24289 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019
…rator when locking both UnsafeExternalSorter.SpillableIterator and TaskMemoryManager

## What changes were proposed in this pull request?

In `UnsafeExternalSorter.SpillableIterator#loadNext()` takes lock on the `UnsafeExternalSorter` and calls `freePage` once the `lastPage` is consumed which needs to take a lock on `TaskMemoryManager`. At the same time, there can be another MemoryConsumer using `UnsafeExternalSorter` as part of sorting can try to `allocatePage` needs to get lock on `TaskMemoryManager` which can cause spill to happen which requires lock on `UnsafeExternalSorter` again causing deadlock. This is a classic deadlock situation happening similar to the SPARK-26265.

To fix this, we can move the `freePage` call in `loadNext` outside of `Synchronized` block similar to the fix in SPARK-26265

## How was this patch tested?

Manual tests were being done and will also try to add a test.

Closes apache#24265 from venkata91/deadlock-sorter.

Authored-by: Venkata krishnan Sowrirajan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6c4552c)
Signed-off-by: Wenchen Fan <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019
## What changes were proposed in this pull request?

apache#24265 breaks the lint check, because it has trailing space. (not sure why it passed jenkins). This PR fixes it.

## How was this patch tested?

N/A

Closes apache#24289 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
…rator when locking both UnsafeExternalSorter.SpillableIterator and TaskMemoryManager

## What changes were proposed in this pull request?

In `UnsafeExternalSorter.SpillableIterator#loadNext()` takes lock on the `UnsafeExternalSorter` and calls `freePage` once the `lastPage` is consumed which needs to take a lock on `TaskMemoryManager`. At the same time, there can be another MemoryConsumer using `UnsafeExternalSorter` as part of sorting can try to `allocatePage` needs to get lock on `TaskMemoryManager` which can cause spill to happen which requires lock on `UnsafeExternalSorter` again causing deadlock. This is a classic deadlock situation happening similar to the SPARK-26265.

To fix this, we can move the `freePage` call in `loadNext` outside of `Synchronized` block similar to the fix in SPARK-26265

## How was this patch tested?

Manual tests were being done and will also try to add a test.

Closes apache#24265 from venkata91/deadlock-sorter.

Authored-by: Venkata krishnan Sowrirajan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6c4552c)
Signed-off-by: Wenchen Fan <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
## What changes were proposed in this pull request?

apache#24265 breaks the lint check, because it has trailing space. (not sure why it passed jenkins). This PR fixes it.

## How was this patch tested?

N/A

Closes apache#24289 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
zhongjinhan pushed a commit to zhongjinhan/spark-1 that referenced this pull request Sep 3, 2019
## What changes were proposed in this pull request?

apache/spark#24265 breaks the lint check, because it has trailing space. (not sure why it passed jenkins). This PR fixes it.

## How was this patch tested?

N/A

Closes #24289 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit af0a4bb)
yoock pushed a commit to yoock/spark-apache that referenced this pull request Jan 14, 2020
## What changes were proposed in this pull request?

apache/spark#24265 breaks the lint check, because it has trailing space. (not sure why it passed jenkins). This PR fixes it.

## How was this patch tested?

N/A

Closes #24289 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…rator when locking both UnsafeExternalSorter.SpillableIterator and TaskMemoryManager

In `UnsafeExternalSorter.SpillableIterator#loadNext()` takes lock on the `UnsafeExternalSorter` and calls `freePage` once the `lastPage` is consumed which needs to take a lock on `TaskMemoryManager`. At the same time, there can be another MemoryConsumer using `UnsafeExternalSorter` as part of sorting can try to `allocatePage` needs to get lock on `TaskMemoryManager` which can cause spill to happen which requires lock on `UnsafeExternalSorter` again causing deadlock. This is a classic deadlock situation happening similar to the SPARK-26265.

To fix this, we can move the `freePage` call in `loadNext` outside of `Synchronized` block similar to the fix in SPARK-26265

Manual tests were being done and will also try to add a test.

Closes apache#24265 from venkata91/deadlock-sorter.

Authored-by: Venkata krishnan Sowrirajan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6c4552c)
Signed-off-by: Wenchen Fan <[email protected]>

Ref: LIHADOOP-58062

RB=2553586
BUG=LIHADOOP-58062
G=spark-reviewers
R=mmuralid
A=mmuralid
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants