
Conversation

@viirya
Member

@viirya viirya commented Dec 17, 2019

What changes were proposed in this pull request?

We should not keep appending keys to BytesToBytesMap until it holds its max capacity of keys.

Why are the changes needed?

BytesToBytesMap.append allows appending keys until the number of keys reaches MAX_CAPACITY. But once the pointer array in the map holds MAX_CAPACITY keys, the next call of lookup will hang forever.
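For context, lookup probes the map's pointer array slot by slot until it either finds the key or hits an empty slot. A minimal sketch of that failure mode (illustrative names only, not Spark's actual code): once every slot is occupied, a probe for an absent key has no empty slot to stop at and loops forever.

```java
// Minimal open-addressing probe loop (illustrative only, not BytesToBytesMap's code).
// If every slot is occupied and the key is absent, neither exit condition is ever
// met, so the loop cycles through the table forever.
static int probe(long[] keys, boolean[] used, long key) {
  int capacity = keys.length;                 // power of two, as in the real map
  int pos = Long.hashCode(key) & (capacity - 1);
  int step = 1;
  while (true) {
    if (!used[pos]) {
      return pos;                             // empty slot: key is absent, safe to insert here
    }
    if (keys[pos] == key) {
      return pos;                             // found the existing key
    }
    pos = (pos + step) & (capacity - 1);      // move to the next probe position
    step++;
  }
}
```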

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually tested by:

```java
@Test
  public void testCapacity() {
    TestMemoryManager memoryManager2 =
            new TestMemoryManager(
                    new SparkConf()
                            .set(package$.MODULE$.MEMORY_OFFHEAP_ENABLED(), true)
                            .set(package$.MODULE$.MEMORY_OFFHEAP_SIZE(), 25600 * 1024 * 1024L)
                            .set(package$.MODULE$.SHUFFLE_SPILL_COMPRESS(), false)
                            .set(package$.MODULE$.SHUFFLE_COMPRESS(), false));
    TaskMemoryManager taskMemoryManager2 = new TaskMemoryManager(memoryManager2, 0);
    final long pageSizeBytes = 8000000 + 8; // 8 bytes for end-of-page marker
    final BytesToBytesMap map = new BytesToBytesMap(taskMemoryManager2, 1024, pageSizeBytes);

    try {
      for (long i = 0; i < BytesToBytesMap.MAX_CAPACITY + 1; i++) {
        final long[] value = new long[]{i};
        boolean succeed = map.lookup(value, Platform.LONG_ARRAY_OFFSET, 8).append(
                value,
                Platform.LONG_ARRAY_OFFSET,
                8,
                value,
                Platform.LONG_ARRAY_OFFSET,
                8);
      }
      map.free();
    } finally {
      map.free();
    }
  }
```

Once the map holds 536870912 keys (MAX_CAPACITY), the next lookup hangs.
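For scale, MAX_CAPACITY is 1 << 29 = 536870912, and the map keeps two 8-byte pointer-array entries per key (record address and hash code), so the key entries alone take about 8 GiB, which is why the snippet above configures 25600 MB of off-heap memory. A back-of-the-envelope check (illustrative arithmetic only):

```java
// Back-of-the-envelope numbers behind the reproduction's memory settings (illustrative).
long maxCapacity = 1L << 29;                        // 536870912 == BytesToBytesMap.MAX_CAPACITY
long pointerEntryBytes = maxCapacity * 2 * 8;       // two longs (address + hash) per key
System.out.println(maxCapacity);                    // 536870912
System.out.println(pointerEntryBytes >> 30);        // 8  -> about 8 GiB just for key entries
System.out.println(25600L * 1024 * 1024 >> 30);     // 25 -> off-heap budget in GiB from the test
```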

@SparkQA

SparkQA commented Dec 17, 2019

Test build #115418 has finished for PR 26914 at commit d5a1ec2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member

We have a job that seems related to this issue; it sometimes hangs in a BroadcastJoin stage (task 131726) for about 2 hours.

```
2019-12-17 04:11:31,072 [619016] - INFO  [Executor task launch worker for task 131726:Logging$class@54] - Code generated in 4163.18468 ms
2019-12-17 04:11:34,701 [622645] - INFO  [dispatcher-event-loop-3:Logging$class@54] - Executor is trying to kill task 2208.0 in stage 16.0 (TID 117693), reason: another attempt succeeded
2019-12-17 04:11:34,712 [622656] - INFO  [Executor task launch worker for task 117693:Logging$class@54] - Executor killed task 2208.0 in stage 16.0 (TID 117693), reason: another attempt succeeded
2019-12-17 04:11:34,717 [622661] - INFO  [dispatcher-event-loop-0:Logging$class@54] - Got assigned task 132494
2019-12-17 04:11:34,717 [622661] - INFO  [Executor task launch worker for task 132494:Logging$class@54] - Running task 13488.1 in stage 16.0 (TID 132494)
2019-12-17 04:11:34,804 [622748] - INFO  [Executor task launch worker for task 132494:Logging$class@54] - Getting 13825 non-empty blocks out of 15000 blocks
2019-12-17 04:11:34,904 [622848] - INFO  [Executor task launch worker for task 132494:Logging$class@54] - Started 1006 remote fetches in 107 ms
2019-12-17 04:11:34,907 [622851] - INFO  [Executor task launch worker for task 132494:Logging$class@54] - Getting 100 non-empty blocks out of 104 blocks
2019-12-17 04:11:34,909 [622853] - INFO  [Executor task launch worker for task 132494:Logging$class@54] - Started 56 remote fetches in 2 ms
2019-12-17 04:11:36,870 [624814] - INFO  [Executor task launch worker for task 132012:Logging$class@54] - Code generated in 4465.326931 ms
2019-12-17 04:11:36,895 [624839] - INFO  [Executor task launch worker for task 132494:Logging$class@54] - Code generated in 227.660186 ms
2019-12-17 04:11:37,311 [625255] - INFO  [dispatcher-event-loop-2:Logging$class@54] - Executor is trying to kill task 10766.1 in stage 16.0 (TID 130906), reason: another attempt succeeded
2019-12-17 04:11:37,323 [625267] - INFO  [Executor task launch worker for task 130906:Logging$class@54] - Executor killed task 10766.1 in stage 16.0 (TID 130906), reason: another attempt succeeded
2019-12-17 04:11:37,327 [625271] - INFO  [dispatcher-event-loop-1:Logging$class@54] - Got assigned task 132680
2019-12-17 04:11:37,327 [625271] - INFO  [Executor task launch worker for task 132680:Logging$class@54] - Running task 3992.1 in stage 16.0 (TID 132680)
2019-12-17 04:11:37,399 [625343] - INFO  [Executor task launch worker for task 132680:Logging$class@54] - Getting 14650 non-empty blocks out of 15000 blocks
2019-12-17 04:11:37,472 [625416] - INFO  [Executor task launch worker for task 132680:Logging$class@54] - Started 1006 remote fetches in 84 ms
2019-12-17 04:11:37,489 [625433] - INFO  [Executor task launch worker for task 132680:Logging$class@54] - Getting 100 non-empty blocks out of 104 blocks
2019-12-17 04:11:37,491 [625435] - INFO  [Executor task launch worker for task 132680:Logging$class@54] - Started 56 remote fetches in 2 ms
2019-12-17 04:11:40,958 [628902] - INFO  [Executor task launch worker for task 132680:Logging$class@54] - Code generated in 1301.014453 ms
2019-12-17 04:11:48,583 [636527] - INFO  [Executor task launch worker for task 131726:UnsafeExternalSorter@209] - Thread 119 spilling sort data of 2.1 GB to disk (0  time so far)
2019-12-17 04:11:52,807 [640751] - INFO  [Executor task launch worker for task 132494:Logging$class@54] - Finished task 13488.1 in stage 16.0 (TID 132494). 10395 bytes result sent to driver
2019-12-17 04:12:07,486 [655430] - INFO  [dispatcher-event-loop-3:Logging$class@54] - Executor is trying to kill task 3992.1 in stage 16.0 (TID 132680), reason: another attempt succeeded
2019-12-17 04:12:07,489 [655433] - INFO  [Executor task launch worker for task 132680:Logging$class@54] - Executor killed task 3992.1 in stage 16.0 (TID 132680), reason: another attempt succeeded
2019-12-17 04:12:47,910 [695854] - INFO  [dispatcher-event-loop-0:Logging$class@54] - Executor is trying to kill task 3949.1 in stage 16.0 (TID 132012), reason: another attempt succeeded
2019-12-17 04:12:47,914 [695858] - INFO  [Executor task launch worker for task 132012:Logging$class@54] - Executor killed task 3949.1 in stage 16.0 (TID 132012), reason: another attempt succeeded
2019-12-17 06:24:45,498 [8613442] - INFO  [Executor task launch worker for task 131726:Logging$class@54] - Finished task 3638.1 in stage 16.0 (TID 131726). 10524 bytes result sent to driver
```

@viirya
Member Author

viirya commented Dec 17, 2019

cc @dongjoon-hyun @cloud-fan

@dongjoon-hyun
Member

Thank you so much for making the validation example in the PR description!

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.
This is logically correct. And, since it's difficult to add logic into safeLookup, this looks like the best location to prevent this bug. I also verified the given example with MAX_CAPACITY = (1 << 21) in both master/2.4.
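A sketch of the idea behind guarding append rather than lookup (illustrative condition and names, companion to the probe sketch earlier in the thread, not the merged diff): refuse an insert that would occupy the last free slot once the table can no longer grow, so every later probe still has an empty slot to terminate on.

```java
// Companion to the probe() sketch above (illustrative, not the merged diff):
// once the table cannot grow any further, refuse inserts that would fill the
// last empty slot, so probe() is guaranteed to terminate for absent keys.
static boolean insert(long[] keys, boolean[] used, int numKeys, boolean canGrow, long key) {
  if (!canGrow && numKeys + 1 >= keys.length) {
    return false;                          // keep at least one slot permanently empty
  }
  int pos = probe(keys, used, key);        // safe: an empty slot is known to exist
  keys[pos] = key;
  used[pos] = true;
  return true;
}
```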

Merged to master/2.4

dongjoon-hyun pushed a commit that referenced this pull request Dec 17, 2019
…lding keys reaching max capacity

### What changes were proposed in this pull request?

We should not keep appending keys to BytesToBytesMap until it holds its max capacity of keys.

### Why are the changes needed?

BytesToBytesMap.append allows appending keys until the number of keys reaches MAX_CAPACITY. But once the pointer array in the map holds MAX_CAPACITY keys, the next call of lookup will hang forever.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manually tested by:
```java
@Test
  public void testCapacity() {
    TestMemoryManager memoryManager2 =
            new TestMemoryManager(
                    new SparkConf()
                            .set(package$.MODULE$.MEMORY_OFFHEAP_ENABLED(), true)
                            .set(package$.MODULE$.MEMORY_OFFHEAP_SIZE(), 25600 * 1024 * 1024L)
                            .set(package$.MODULE$.SHUFFLE_SPILL_COMPRESS(), false)
                            .set(package$.MODULE$.SHUFFLE_COMPRESS(), false));
    TaskMemoryManager taskMemoryManager2 = new TaskMemoryManager(memoryManager2, 0);
    final long pageSizeBytes = 8000000 + 8; // 8 bytes for end-of-page marker
    final BytesToBytesMap map = new BytesToBytesMap(taskMemoryManager2, 1024, pageSizeBytes);

    try {
      for (long i = 0; i < BytesToBytesMap.MAX_CAPACITY + 1; i++) {
        final long[] value = new long[]{i};
        boolean succeed = map.lookup(value, Platform.LONG_ARRAY_OFFSET, 8).append(
                value,
                Platform.LONG_ARRAY_OFFSET,
                8,
                value,
                Platform.LONG_ARRAY_OFFSET,
                8);
      }
      map.free();
    } finally {
      map.free();
    }
  }
```

Once the map holds 536870912 keys (MAX_CAPACITY), the next lookup hangs.

Closes #26914 from viirya/fix-bytemap2.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit b2baaa2)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Member

Thank you so much, @viirya !

@viirya
Member Author

viirya commented Dec 17, 2019

Thank you! @dongjoon-hyun
