@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Dec 15, 2019

What changes were proposed in this pull request?

This PR aims to investigate the OOM issue in the Jenkins Maven build (branch-2.4).

Why are the changes needed?

The Jenkins Maven build (branch-2.4) hits an OOM in BarrierTaskContextSuite:

throw exception on barrier() call timeout *** FAILED ***
  "[SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more CPU cores or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage." did not contain "The coordinator didn't get all barrier sync requests" (BarrierTaskContextSuite.scala:97)
Exception in thread "ExecutorRunner for app-20191215080059-0000/29" java.lang.OutOfMemoryError: Java heap space
Exception in thread "dispatcher-event-loop-12" java.lang.OutOfMemoryError: Java heap space
Exception in thread "ExecutorRunner for app-20191215080059-0000/30" java.lang.OutOfMemoryError: Java heap space
Exception in thread "dispatcher-event-loop-24" Exception in thread "dispatcher-event-loop-5" java.lang.OutOfMemoryError: Java heap space
Exception in thread "dispatcher-event-loop-30" java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
Exception in thread "dispatcher-event-loop-25" Exception in thread "dispatcher-event-loop-9" java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
*** RUN ABORTED ***
  java.lang.OutOfMemoryError: Java heap space

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the Jenkins tests.

@dongjoon-hyun changed the title from "[WIP] Investigate OOM at BarrierTaskContextSuite" to "[WIP][TESTS][test-maven] Investigate OOM at BarrierTaskContextSuite" on Dec 15, 2019
@dongjoon-hyun changed the title from "[WIP][TESTS][test-maven] Investigate OOM at BarrierTaskContextSuite" to "[WIP][TESTS][test-maven][2.4] Investigate OOM at BarrierTaskContextSuite" on Dec 15, 2019
@dongjoon-hyun
Member Author

The first run is a dummy commit on top of branch-2.4.
The second and third runs both test the same revert.

@SparkQA

SparkQA commented Dec 15, 2019

Test build #115365 has finished for PR 26900 at commit 223c0aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

The first run failed again in this PR under Maven for the same reason.

BarrierTaskContextSuite:
- global sync by barrier() call
- support multiple barrier() call within a single task *** FAILED ***
  org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(0, 0) finished unsuccessfully.
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_0_piece0 of broadcast_0

@SparkQA

SparkQA commented Dec 15, 2019

Test build #115366 has finished for PR 26900 at commit 37da74a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

#26841 has been reverted based on this result.

@SparkQA

SparkQA commented Dec 15, 2019

Test build #115367 has finished for PR 26900 at commit 3206af7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
