@viirya (Member) commented Dec 10, 2018

What changes were proposed in this pull request?

In BytesToBytesMap.MapIterator.advanceToNextPage, we first lock this MapIterator and then TaskMemoryManager when freeing a memory page by calling freePage. At the same time, another memory consumer may first lock TaskMemoryManager and then this MapIterator when it acquires memory and triggers spilling on this MapIterator.

As a result, the thread in advanceToNextPage holds the lock on the MapIterator object and waits for the lock on TaskMemoryManager, while the other consumer holds the lock on TaskMemoryManager and waits for the lock on the MapIterator object.

To avoid this deadlock, this patch keeps a reference to the page to free and frees it after releasing the lock on MapIterator.
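
The lock-ordering idea described above can be sketched roughly as follows. This is a simplified illustration, not the actual Spark code: `SketchMemoryManager`, `SketchIterator`, `acquireBySpilling`, and `spill` are hypothetical stand-ins for `TaskMemoryManager`, `MapIterator`, and the memory-acquisition and spill paths, and pages are modeled as plain `long[]` arrays.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for TaskMemoryManager: every operation takes the manager's lock. */
final class SketchMemoryManager {
    private int freedPages = 0;

    synchronized void freePage(long[] page) {
        freedPages++;
    }

    /** Models a consumer acquiring memory: the manager's lock is held while the iterator spills. */
    synchronized int acquireBySpilling(SketchIterator victim) {
        return victim.spill(); // lock order on this path: manager, then iterator
    }

    synchronized int freedPages() {
        return freedPages;
    }
}

/** Hypothetical stand-in for MapIterator. */
final class SketchIterator {
    private final SketchMemoryManager manager;
    private final Deque<long[]> pages = new ArrayDeque<>();
    private long[] currentPage;

    SketchIterator(SketchMemoryManager manager, int numPages) {
        this.manager = manager;
        for (int i = 0; i < numPages; i++) {
            pages.add(new long[8]);
        }
    }

    /** Moves to the next page; returns false when no pages remain. */
    boolean advanceToNextPage() {
        long[] pageToFree = null;
        boolean advanced = false;
        synchronized (this) {
            // Only remember the page here; do not call into the manager while holding our lock.
            pageToFree = currentPage;
            currentPage = pages.poll();
            advanced = currentPage != null;
        }
        // The iterator's lock has been released, so this call never holds both locks
        // in the iterator-then-manager order.
        if (pageToFree != null) {
            manager.freePage(pageToFree);
        }
        return advanced;
    }

    /** Called with the manager's lock held; drops the queued pages and reports how many. */
    synchronized int spill() {
        int dropped = pages.size();
        pages.clear();
        return dropped;
    }
}
```

With this shape, the only path that holds both locks at once takes them in the manager-then-iterator order, so the two paths in the sketch cannot wait on each other in a cycle.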

How was this patch tested?

Added a test and manually ran it 100 times to make sure there is no deadlock.

@viirya (Member Author) commented Dec 10, 2018

cc @cloud-fan

}

@Test
public void avoidDeadlock() throws InterruptedException {
Member

Hi, @viirya. Since this test case reproduces a deadlock, we need timeout logic. Otherwise, the test will hang (instead of failing) if we hit this issue again later.

@viirya (Member Author) Dec 10, 2018

I've tried several ways to add timeout logic, but none of them work. The deadlock always hangs both the test and the timeout logic.
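
For readers wondering what deadlock detection could look like in general, here is a minimal sketch of a different technique from the ones tried in this thread: polling the JVM for monitor deadlocks with `ThreadMXBean.findMonitorDeadlockedThreads()`. It is not part of this PR. The `DeadlockProbe` class and its methods are hypothetical names, and the sketch assumes the polling happens on a thread that is not itself part of the deadlock.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

/** Hypothetical helper, not from the PR: detects monitor deadlocks instead of waiting on them. */
final class DeadlockProbe {

    /** Polls until deadlocked threads are found or the timeout expires; returns how many were found. */
    static int awaitDeadlock(long timeoutMillis) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            long[] ids = bean.findMonitorDeadlockedThreads(); // null when no deadlock exists
            if (ids != null) {
                return ids.length;
            }
            if (System.currentTimeMillis() >= deadline) {
                return 0;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0;
            }
        }
    }

    /** Starts two daemon threads that take two monitors in opposite order and block forever. */
    static void startDeadlockedDaemons() {
        Object a = new Object();
        Object b = new Object();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        spawn(a, b, bothHoldFirst);
        spawn(b, a, bothHoldFirst);
    }

    private static void spawn(Object first, Object second, CountDownLatch bothHoldFirst) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await(); // wait until the other thread holds its first monitor
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // never reached: the other thread holds this monitor
                }
            }
        });
        t.setDaemon(true); // daemon threads do not keep the JVM alive
        t.start();
    }
}
```

A test could call `DeadlockProbe.awaitDeadlock(...)` from a watchdog thread and fail if it returns a non-zero count. Whether this fits the test in this PR is untested here.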

@cloud-fan (Contributor) commented:

Have you seen any bug report caused by this deadlock?

@SparkQA commented Dec 10, 2018

Test build #99900 has finished for PR 23272 at commit 25e8e06.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Dec 10, 2018

> have you seen any bug report caused by this dead lock?

The original reporter of the JIRA ticket SPARK-26265 hit this bug in their workload.

assertFalse(iter.hasNext());
} finally {
map.free();
thread.join();
Contributor

Is this line where the test hangs without the fix?

Member Author

Even without this line, the test still hangs. The test thread deadlocks with the other thread, which runs memoryConsumer.

@viirya (Member Author) Dec 10, 2018

This line just makes sure that memoryConsumer finishes and frees the memory it acquired.

@SparkQA commented Dec 10, 2018

Test build #99914 has finished for PR 23272 at commit 4c621d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 10, 2018

Test build #99915 has finished for PR 23272 at commit 9d52320.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Dec 11, 2018 via email


try {
int i;
for (i = 0; i < 1024; i++) {
@dongjoon-hyun (Member) Dec 11, 2018

Let's use for (int i = 0; ... here and on line 708, because int i is not referenced outside of the for loop.
Never mind. I found that this is the convention in this test suite.

@viirya (Member Author) commented Dec 11, 2018 via email

@viirya (Member Author) commented Dec 11, 2018 via email

@cloud-fan (Contributor) commented:

LGTM. Last question: does the test always reproduce the bug, or does it have some randomness?

@SparkQA commented Dec 11, 2018

Test build #99956 has finished for PR 23272 at commit 0405527.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Dec 11, 2018

> last question: does the test always reproduce the bug? Or it has some randomness?

Without the change, I ran the test locally 10 times and it reproduced the bug all 10 times. But I'm not sure it reproduces the bug 100% of the time. I don't think we can always reproduce a deadlock like this.

@cloud-fan (Contributor) commented:

Thanks, merging to master!

Can you send a new PR for 2.4 without the synchronized move around?

@viirya (Member Author) commented Dec 11, 2018

> Can you send a new PR for 2.4 without the synchronized move around?

Ok. Thanks.

@asfgit closed this in a3bbca9 Dec 11, 2018

@SparkQA commented Dec 11, 2018

Test build #99966 has finished for PR 23272 at commit 0849083.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Dec 11, 2018

I think the failed tests are unrelated. cc @cloud-fan

asfgit pushed a commit that referenced this pull request Dec 15, 2018
## What changes were proposed in this pull request?

Based on the [comment](#23272 (comment)), it seems better to put `freePage` into a `finally` block. This patch is a follow-up that does so.
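
As a rough illustration of why a `finally` block helps here, the sketch below frees the remembered page even when the synchronized section throws. It is an assumption-laden simplification, not the actual follow-up patch: `FinallyFreeSketch`, its fields, and the `failWhileAdvancing` flag are hypothetical, and a private lock object stands in for `TaskMemoryManager`.

```java
/** Hypothetical sketch: free the page in a finally block placed outside the synchronized block. */
final class FinallyFreeSketch {
    private final Object managerLock = new Object(); // stands in for TaskMemoryManager's lock
    private Object currentPage;
    private int freedPages;

    private void freePage(Object page) {
        synchronized (managerLock) {
            freedPages++;
        }
    }

    int freedPages() {
        synchronized (managerLock) {
            return freedPages;
        }
    }

    /** Advances to a fresh page; failWhileAdvancing simulates an error inside the locked section. */
    boolean advance(boolean failWhileAdvancing) {
        Object pageToFree = null;
        try {
            synchronized (this) {
                pageToFree = currentPage;
                currentPage = new Object();
                if (failWhileAdvancing) {
                    throw new IllegalStateException("simulated failure while advancing");
                }
                return true;
            }
        } finally {
            // Runs on both the normal and the exceptional path, after the
            // iterator's monitor has been released by leaving the synchronized block.
            if (pageToFree != null) {
                freePage(pageToFree);
            }
        }
    }
}
```

Because the `finally` clause wraps the `synchronized` block rather than sitting inside it, the page is freed without holding the iterator's lock, which keeps the lock ordering from the original fix intact in this sketch.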

## How was this patch tested?

Existing tests.

Closes #23294 from viirya/SPARK-26265-followup.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 1b604c1)
Signed-off-by: Hyukjin Kwon <[email protected]>
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…locking both BytesToBytesMap.MapIterator and TaskMemoryManager

## What changes were proposed in this pull request?

In `BytesToBytesMap.MapIterator.advanceToNextPage`, we first lock this `MapIterator` and then `TaskMemoryManager` when freeing a memory page by calling `freePage`. At the same time, another memory consumer may first lock `TaskMemoryManager` and then this `MapIterator` when it acquires memory and triggers spilling on this `MapIterator`.

As a result, the thread in `advanceToNextPage` holds the lock on the `MapIterator` object and waits for the lock on `TaskMemoryManager`, while the other consumer holds the lock on `TaskMemoryManager` and waits for the lock on the `MapIterator` object.

To avoid this deadlock, this patch keeps a reference to the page to free and frees it after releasing the lock on `MapIterator`.

## How was this patch tested?

Added a test and manually ran it 100 times to make sure there is no deadlock.

Closes apache#23272 from viirya/SPARK-26265.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
zhongjinhan pushed a commit to zhongjinhan/spark-1 that referenced this pull request Sep 3, 2019
(cherry picked from commit 6019d9a)
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…locking both BytesToBytesMap.MapIterator and TaskMemoryManager

Closes apache#23272 from viirya/SPARK-26265.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry-picked from commit a3bbca9)

[SPARK-26265][CORE][FOLLOWUP] Put freePage into a finally block

Closes apache#23294 from viirya/SPARK-26265-followup.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry-picked from commit 1b604c1)

Ref: LIHADOOP-43221

RB=1518143
BUG=LIHADOOP-43221
G=superfriends-reviewers
R=fli,mshen,yezhou,edlu
A=fli
@viirya deleted the SPARK-26265 branch December 27, 2023 18:22