[SPARK-11334][CORE] Fix bug in Executor allocation manager in running tasks calculation #19580

sitalkedia · 2017-10-26T17:33:36Z

What changes were proposed in this pull request?

We often see Spark jobs get stuck because, with dynamic allocation turned on, the Executor Allocation Manager does not request any executors even though there are pending tasks. Looking at the logic in the Executor Allocation Manager that calculates the number of running tasks, the calculation can go wrong and the number of running tasks can become negative.
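A minimal sketch of how the count can go negative, and of the per-stage bookkeeping visible in the diff hunks further down. The class names, the `onStageSubmitted`/`onStageCompleted` hooks, and the reset of the global counter on stage completion are assumptions made for illustration; only `numRunningTasks`, `stageIdToNumRunningTask`, and the `contains(stageId)` guard come from the diff.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins for the listener's bookkeeping; not Spark's actual classes.

// Single global counter. Resetting it when a stage completes is an assumption.
final class GlobalCounter {
  private var numRunningTasks = 0
  def onTaskStart(): Unit = { numRunningTasks += 1 }
  def onTaskEnd(): Unit = { numRunningTasks -= 1 }
  def onStageCompleted(): Unit = { numRunningTasks = 0 }
  def totalRunningTasks: Int = numRunningTasks
}

// Per-stage counters: events for stages that are not (or no longer) tracked are ignored.
final class PerStageCounter {
  private val stageIdToNumRunningTask = mutable.HashMap.empty[Int, Int]

  def onStageSubmitted(stageId: Int): Unit = {
    stageIdToNumRunningTask(stageId) = 0
  }

  def onTaskStart(stageId: Int): Unit = {
    if (stageIdToNumRunningTask.contains(stageId)) {
      stageIdToNumRunningTask(stageId) = stageIdToNumRunningTask(stageId) + 1
    }
  }

  def onTaskEnd(stageId: Int): Unit = {
    if (stageIdToNumRunningTask.contains(stageId)) {
      stageIdToNumRunningTask(stageId) = stageIdToNumRunningTask(stageId) - 1
    }
  }

  def onStageCompleted(stageId: Int): Unit = {
    stageIdToNumRunningTask.remove(stageId)
    ()
  }

  // Sum over the stages still being tracked; an empty map sums to 0.
  def totalRunningTasks: Int = stageIdToNumRunningTask.values.sum
}
```

Under these assumptions, the sequence task start, stage completed, late task end leaves `GlobalCounter` at -1, while `PerStageCounter` stays at 0 because the late event refers to a stage that is no longer in the map.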

How was this patch tested?

Added a unit test.


sitalkedia · 2017-10-26T17:34:09Z

duplicate of #19534

cc @vanzin

vanzin · 2017-10-26T17:40:59Z

Could you fix the bug number (SPARK-11334)?

SparkQA · 2017-10-26T21:00:48Z

Test build #83089 has finished for PR 19580 at commit f8fcc35.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2017-10-26T22:53:27Z

core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala

     */
-    def totalRunningTasks(): Int = numRunningTasks
+    def totalRunningTasks(): Int = {
+      stageIdToNumRunningTask.values.sum


This needs to be inside allocationManager.synchronized, no?

Never mind, this is called from a synchronized context. Except in your unit tests, that is (which call the private totalRunningTasks you added to the manager).

It'd be nice to make the other method calling this synchronized, just to be paranoid.
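A rough sketch of the locking arrangement discussed in this thread, assuming a listener nested inside the manager that guards its state with the manager's monitor. `AllocationManagerSketch`, `Listener`, and `onTaskStart` are hypothetical names; only `stageIdToNumRunningTask`, `totalRunningTasks`, and the use of `synchronized` come from the diff.

```scala
import scala.collection.mutable

// Hypothetical outer class; the self-alias lets the inner listener lock on the manager.
final class AllocationManagerSketch { manager =>

  final class Listener {
    private val stageIdToNumRunningTask = mutable.HashMap.empty[Int, Int]

    // Mutations take the manager's lock, mirroring allocationManager.synchronized in the diff.
    def onTaskStart(stageId: Int): Unit = manager.synchronized {
      val running = stageIdToNumRunningTask.getOrElse(stageId, 0)
      stageIdToNumRunningTask(stageId) = running + 1
    }

    // Not synchronized itself: callers are expected to already hold the manager's lock.
    def totalRunningTasks(): Int = stageIdToNumRunningTask.values.sum
  }

  val listener = new Listener

  // Manager-side accessor that takes the lock before reading the listener's state,
  // so callers outside a synchronized context (such as tests) still read consistently.
  def totalRunningTasks(): Int = synchronized {
    listener.totalRunningTasks()
  }
}
```

Because the writer and the manager-side reader use the same monitor, a caller that does not already hold the lock cannot observe the map while it is being updated.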

vanzin · 2017-10-26T22:53:49Z

core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala

        s"when it is already pending to be removed!")
      return false
    }
-


nit: no need for this change.

vanzin · 2017-10-26T22:54:11Z

core/src/test/scala/org/apache/spark/ExecutorAllocationManagerSuite.scala

    assert(numExecutorsToAdd(manager) === 1)
  }

+  test("Ignore task end events from completed stages") {


nit: lower case "ignore" to match other tests.

jerryshao · 2017-10-27T01:05:39Z

core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala

      allocationManager.synchronized {
-        numRunningTasks += 1
+        if (stageIdToNumRunningTask.contains(stageId)) {
+          stageIdToNumRunningTask(stageId) = stageIdToNumRunningTask(stageId) + 1


nit: this can be changed to stageIdToNumRunningTask(stageId) += 1
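For illustration, a small sketch of the shorthand suggested here. For a `mutable.HashMap[Int, Int]`, Scala rewrites `counts(stageId) += 1` into `counts.update(stageId, counts(stageId) + 1)`, so it behaves the same as the longer form in the diff. `CompoundUpdateSketch` and `increment` are hypothetical names.

```scala
import scala.collection.mutable

object CompoundUpdateSketch {
  // Increments the count for stageId only if the stage is already tracked.
  def increment(counts: mutable.HashMap[Int, Int], stageId: Int): Unit = {
    if (counts.contains(stageId)) {
      // Shorthand for: counts(stageId) = counts(stageId) + 1
      counts(stageId) += 1
    }
  }
}
```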

jerryshao · 2017-10-27T01:08:26Z

core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala

    (numRunningOrPendingTasks + tasksPerExecutor - 1) / tasksPerExecutor
  }

+  private def totalRunningTasks(): Int = synchronized {


Looks like no one invokes this method?

This is being called from the test.

I'm not sure why we need to add a method that is only used for unit tests. If we want to verify the behavior of totalRunningTasks, I think maxNumExecutorsNeeded can also be used to verify it indirectly.

It's okay to add a method that is used only for unit testing. I am not inclined to use maxNumExecutorsNeeded to verify totalRunningTasks indirectly, for the following reason:

Currently, the test case tests exactly what it is supposed to. If you check maxNumExecutorsNeeded instead, it might not be clear what we are testing.

jerryshao · 2017-10-27T01:09:36Z

core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala

      allocationManager.synchronized {
-        numRunningTasks -= 1
+        if (stageIdToNumRunningTask.contains(stageId)) {
+          stageIdToNumRunningTask(stageId) = stageIdToNumRunningTask(stageId) - 1


SparkQA · 2017-10-27T02:41:51Z

Test build #83099 has finished for PR 19580 at commit e884a96.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-10-27T03:22:43Z

Test build #83101 has finished for PR 19580 at commit 6f22f93.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-10-27T07:05:02Z

Test build #83107 has finished for PR 19580 at commit 8abaa2c.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

jerryshao · 2017-10-31T09:05:43Z

jenkins, retest this please.

jerryshao · 2017-10-31T12:45:59Z

Jenkins, retest this please.

SparkQA · 2017-10-31T16:24:24Z

Test build #83257 has finished for PR 19580 at commit 8abaa2c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2017-10-31T16:49:35Z

Merging to master / 2.2 / 2.1.

vanzin · 2017-10-31T16:50:45Z

failed to merge to 2.2...

[SPARK-22312][CORE] Fix bug in Executor allocation manager in running…

f8fcc35

… tasks calculation

sitalkedia changed the title ~~[SPARK-22312][CORE] Fix bug in Executor allocation manager in running tasks calculation~~ [SPARK-11334][CORE] Fix bug in Executor allocation manager in running tasks calculation Oct 26, 2017

vanzin reviewed Oct 26, 2017


Sital Kedia added 2 commits October 26, 2017 16:15

review comments

e884a96

Make totalRunningTasks synchronized

6f22f93

jerryshao reviewed Oct 27, 2017


review comment

8abaa2c

asfgit closed this in 7986cc0 Oct 31, 2017

da-liii mentioned this pull request Apr 26, 2018

[SPARK-11334][CORE] clear idle executors in executorIdToTaskIds keySet #21166

Closed
