@sitalkedia

What changes were proposed in this pull request?

We often see Spark jobs get stuck when dynamic allocation is turned on: the Executor Allocation Manager does not request any executors even though there are pending tasks. The cause is in the Executor Allocation Manager logic that tracks running tasks: the calculation can go wrong, and the number of running tasks can become negative.
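
To illustrate how the count can go negative, here is a minimal sketch. It is not the actual ExecutorAllocationManager code. It assumes the old bookkeeping was a single counter that is reset when a stage completes, and that a task-end event can arrive after its stage has completed (the scenario named by the "Ignore task end events from completed stages" test in this PR). The class names `GlobalCounter` and `PerStageCounter` are hypothetical.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the old bookkeeping: one global counter.
class GlobalCounter {
  private var numRunningTasks = 0

  def onTaskStart(stageId: Int): Unit = synchronized { numRunningTasks += 1 }

  def onTaskEnd(stageId: Int): Unit = synchronized { numRunningTasks -= 1 }

  // Assumption: the counter is reset once the stage is done.
  def onStageCompleted(stageId: Int): Unit = synchronized { numRunningTasks = 0 }

  def totalRunningTasks: Int = synchronized { numRunningTasks }
}

// Hypothetical stand-in for per-stage bookkeeping, as discussed in this PR.
class PerStageCounter {
  private val stageIdToNumRunningTask = mutable.HashMap.empty[Int, Int]

  def onTaskStart(stageId: Int): Unit = synchronized {
    stageIdToNumRunningTask(stageId) = stageIdToNumRunningTask.getOrElse(stageId, 0) + 1
  }

  def onTaskEnd(stageId: Int): Unit = synchronized {
    // Task-end events for stages that are no longer tracked are ignored.
    if (stageIdToNumRunningTask.contains(stageId)) {
      stageIdToNumRunningTask(stageId) -= 1
    }
  }

  def onStageCompleted(stageId: Int): Unit = synchronized {
    stageIdToNumRunningTask.remove(stageId)
  }

  def totalRunningTasks: Int = synchronized { stageIdToNumRunningTask.values.sum }
}

object RunningTaskSketch {
  def main(args: Array[String]): Unit = {
    val global = new GlobalCounter
    val perStage = new PerStageCounter

    // Two tasks start, one ends, the stage completes, then a late task-end arrives.
    global.onTaskStart(0); global.onTaskStart(0); global.onTaskEnd(0)
    global.onStageCompleted(0)
    global.onTaskEnd(0)

    perStage.onTaskStart(0); perStage.onTaskStart(0); perStage.onTaskEnd(0)
    perStage.onStageCompleted(0)
    perStage.onTaskEnd(0)

    println(s"global counter: ${global.totalRunningTasks}")
    println(s"per-stage sum:  ${perStage.totalRunningTasks}")
  }
}
```

Under these assumptions the global counter ends below zero after the late event, while the per-stage sum stays at zero because the late event refers to a stage that is no longer tracked.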

How was this patch tested?

Added a unit test.

@sitalkedia
Author

duplicate of #19534

cc @vanzin

@vanzin
Contributor

vanzin commented Oct 26, 2017

Could you fix the bug number (SPARK-11334)?

@sitalkedia changed the title from "[SPARK-22312][CORE] Fix bug in Executor allocation manager in running tasks calculation" to "[SPARK-11334][CORE] Fix bug in Executor allocation manager in running tasks calculation" on Oct 26, 2017
@SparkQA

SparkQA commented Oct 26, 2017

Test build #83089 has finished for PR 19580 at commit f8fcc35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

   */
-  def totalRunningTasks(): Int = numRunningTasks
+  def totalRunningTasks(): Int = {
+    stageIdToNumRunningTask.values.sum
Contributor

This needs to be inside allocationManager.synchronized, no?

Contributor

Never mind, this is called from a synchronized context. Except in your unit tests, that is (which call the private totalRunningTasks you added to the manager).

Contributor

It'd be nice to make the other method calling this synchronized, just to be paranoid.
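
The locking arrangement being discussed can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `ManagerSketch` and `ListenerSketch` are hypothetical names, and the manager-level accessor is left public here for brevity, whereas the one in the PR is private. The listener's accessor takes no lock of its own and relies on callers holding the manager's lock; the manager-level wrapper acquires that lock before delegating.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for the manager and its listener.
class ManagerSketch { manager =>

  class ListenerSketch {
    private val stageIdToNumRunningTask = mutable.HashMap.empty[Int, Int]

    // Event handlers lock on the enclosing manager instance.
    def onTaskStart(stageId: Int): Unit = manager.synchronized {
      stageIdToNumRunningTask(stageId) = stageIdToNumRunningTask.getOrElse(stageId, 0) + 1
    }

    // Not synchronized itself: callers are expected to hold the manager's lock.
    def totalRunningTasks(): Int = stageIdToNumRunningTask.values.sum
  }

  val listener = new ListenerSketch

  // Wrapper that takes the manager's lock before reading the listener state,
  // so callers outside a synchronized context (such as tests) stay safe.
  def totalRunningTasks(): Int = synchronized { listener.totalRunningTasks() }
}
```

Because both the event handler and the wrapper lock on the same manager instance, a reader going through the wrapper never observes the map in the middle of an update.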

s"when it is already pending to be removed!")
return false
}

Contributor

nit: no need for this change.

assert(numExecutorsToAdd(manager) === 1)
}

test("Ignore task end events from completed stages") {
Contributor

nit: lower case "ignore" to match other tests.

allocationManager.synchronized {
numRunningTasks += 1
if (stageIdToNumRunningTask.contains(stageId)) {
stageIdToNumRunningTask(stageId) = stageIdToNumRunningTask(stageId) + 1
Contributor

nit: this can be changed to stageIdToNumRunningTask(stageId) += 1
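
For reference, a small sketch of why the shorter form is equivalent. For a mutable map, Scala rewrites `m(k) += 1` to `m(k) = m(k) + 1`, which in turn becomes `m.update(k, m(k) + 1)`. The object and method names below are hypothetical, and since the excerpt does not show what happens for an untracked stage, the sketch simply leaves such stages untouched.

```scala
import scala.collection.mutable

object CompoundUpdateSketch {
  // Verbose form, as in the diff under review.
  def incrementVerbose(counts: mutable.HashMap[Int, Int], stageId: Int): Unit = {
    if (counts.contains(stageId)) {
      counts(stageId) = counts(stageId) + 1
    }
  }

  // Suggested form: the compiler expands this to the same apply/update calls.
  def incrementShort(counts: mutable.HashMap[Int, Int], stageId: Int): Unit = {
    if (counts.contains(stageId)) {
      counts(stageId) += 1
    }
  }
}
```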

(numRunningOrPendingTasks + tasksPerExecutor - 1) / tasksPerExecutor
}

private def totalRunningTasks(): Int = synchronized {
Contributor

It looks like nothing invokes this method?

Author

This is being called from the test.

Contributor

I'm not sure why we need to add a method that is used only by unit tests. If we want to verify the behavior of totalRunningTasks, I think maxNumExecutorsNeeded can also be used to verify it indirectly.

Author

It's okay to add a method that is used only for unit testing. I am not inclined to use maxNumExecutorsNeeded to indirectly verify totalRunningTasks, for the following reason:

Currently, the test case is testing what it is supposed to. If you check for maxNumExecutorsNeeded instead, it might not be clear what we are testing.
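
One way to look at the trade-off between the two testing approaches is the ceiling division quoted in the diff, `(numRunningOrPendingTasks + tasksPerExecutor - 1) / tasksPerExecutor`. The sketch below reproduces only that expression in a hypothetical standalone object; the real method reads its inputs from the manager's state, which is not shown in this thread.

```scala
object CeilingDivisionSketch {
  // Same shape as the expression in the diff: integer ceiling of tasks / tasksPerExecutor.
  def maxNumExecutorsNeeded(numRunningOrPendingTasks: Int, tasksPerExecutor: Int): Int =
    (numRunningOrPendingTasks + tasksPerExecutor - 1) / tasksPerExecutor

  def main(args: Array[String]): Unit = {
    // With 4 tasks per executor, task counts 1 through 4 give the same result.
    (1 to 5).foreach { n =>
      println(s"tasks=$n executors=${maxNumExecutorsNeeded(n, 4)}")
    }
  }
}
```

When `tasksPerExecutor` is greater than 1, several different task counts map to the same executor count, so an assertion on the derived value checks the running-task count less directly than an assertion on the count itself. With `tasksPerExecutor` equal to 1 the two are equivalent.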

allocationManager.synchronized {
numRunningTasks -= 1
if (stageIdToNumRunningTask.contains(stageId)) {
stageIdToNumRunningTask(stageId) = stageIdToNumRunningTask(stageId) - 1
Contributor

ditto.

@SparkQA

SparkQA commented Oct 27, 2017

Test build #83099 has finished for PR 19580 at commit e884a96.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 27, 2017

Test build #83101 has finished for PR 19580 at commit 6f22f93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 27, 2017

Test build #83107 has finished for PR 19580 at commit 8abaa2c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor

jenkins, retest this please.

@jerryshao
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Oct 31, 2017

Test build #83257 has finished for PR 19580 at commit 8abaa2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Oct 31, 2017

Merging to master / 2.2 / 2.1.

@vanzin
Contributor

vanzin commented Oct 31, 2017

failed to merge to 2.2...
