
Conversation

@sitalkedia

What changes were proposed in this pull request?

With dynamic allocation turned on, we often see Spark jobs get stuck because the Executor Allocation Manager does not request any executors even though there are pending tasks. Looking at the logic in the Executor Allocation Manager that tracks running tasks, the calculation can go wrong and the number of running tasks can become negative (a sketch of the race is included after this description).

How was this patch tested?

Added unit test
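
To make the race concrete, here is a minimal standalone sketch (not the actual ExecutorAllocationListener code; the counter and handler names are illustrative) of how the running-task count goes negative when the stage-completed handler resets the counter before a late task-end event arrives:

```scala
// Minimal sketch of the race: onStageCompleted resets the counter to 0, and a
// task-end event that arrives afterwards still decrements it, leaving it negative.
object RunningTaskRaceSketch {
  var numRunningTasks = 0

  def onTaskStart(): Unit = { numRunningTasks += 1 }

  // Mirrors the "reset to 0 when the stage ends" hack discussed below.
  def onStageCompleted(): Unit = { numRunningTasks = 0 }

  def onTaskEnd(): Unit = { numRunningTasks -= 1 }

  def main(args: Array[String]): Unit = {
    onTaskStart()        // one running task
    onStageCompleted()   // stage-completed event arrives first; counter forced to 0
    onTaskEnd()          // late task-end event for the already-finished stage
    println(numRunningTasks)  // -1: with a negative count, no new executors are requested
  }
}
```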

@sitalkedia
Author

cc - @vanzin

@sitalkedia force-pushed the skedia/fix_stuck_job branch from 4f0cffa to f8fcc35 on October 19, 2017 05:31
@jerryshao
Contributor

@sitalkedia, would you please fix the PR title? It seems to be broken now.

@SparkQA

SparkQA commented Oct 19, 2017

Test build #82903 has finished for PR 19534 at commit f8fcc35.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

Do you mean we may first set numRunningTasks to 0 and then hit onTaskEnd, which does numRunningTasks -= 1? Could we simply check stageIdToSpeculativeTaskIndices/stageIdToTaskIndices to see whether the stageId is still valid, to avoid the issue?
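
A standalone sketch of that guard (the map names mirror the listener's existing stageIdToTaskIndices/stageIdToSpeculativeTaskIndices bookkeeping, but the surrounding object is illustrative, not the real listener):

```scala
import scala.collection.mutable

// Only decrement the counter when the stage is still tracked, so task-end events
// for stages that have already been cleaned up become no-ops.
object GuardedTaskEndSketch {
  var numRunningTasks = 0
  val stageIdToTaskIndices = mutable.Map[Int, mutable.Set[Int]]()
  val stageIdToSpeculativeTaskIndices = mutable.Map[Int, mutable.Set[Int]]()

  def onTaskEnd(stageId: Int): Unit = {
    if (stageIdToTaskIndices.contains(stageId) ||
        stageIdToSpeculativeTaskIndices.contains(stageId)) {
      numRunningTasks -= 1
    }
  }
}
```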

@sitalkedia
Author

@jiangxb1987 - yes, that is the issue, and you are right that we can avoid it by checking whether the stageId is valid when we get a task-end event. But I like this approach better because it lets us clean up the hack that sets numRunningTasks to 0 when a stage ends, and it is also in line with the way we do bookkeeping in ExecutorAllocationListener, i.e., keeping an entry per stage. Let me know what you think.
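
A standalone sketch of the per-stage bookkeeping described here (the map name stageIdToNumRunningTask and the handler signatures are illustrative, not the actual patch):

```scala
import scala.collection.mutable

// Each stage keeps its own running-task count; completed stages are simply dropped,
// so stale task-end events are ignored and the total can never go negative.
object PerStageRunningTasksSketch {
  val stageIdToNumRunningTask = mutable.Map[Int, Int]()

  def totalRunningTasks: Int = stageIdToNumRunningTask.values.sum

  def onTaskStart(stageId: Int): Unit = {
    stageIdToNumRunningTask(stageId) = stageIdToNumRunningTask.getOrElse(stageId, 0) + 1
  }

  def onTaskEnd(stageId: Int): Unit = {
    stageIdToNumRunningTask.get(stageId).foreach { n =>
      stageIdToNumRunningTask(stageId) = n - 1
    }
  }

  def onStageCompleted(stageId: Int): Unit = {
    // Replaces the "reset numRunningTasks to 0" hack: just forget the stage.
    stageIdToNumRunningTask -= stageId
  }
}
```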

@sitalkedia
Author

Jenkins retest this please.

@SparkQA

SparkQA commented Oct 19, 2017

Test build #82914 has finished for PR 19534 at commit f8fcc35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sitalkedia changed the title from "[SPARK-22312][CORE] Fix bug in Executor allocation manager in running…" to "[SPARK-22312][CORE] Fix bug in Executor allocation manager in running tasks calculation" on Oct 19, 2017
@jerryshao
Contributor

@sitalkedia I have a very old similar PR, #11205; maybe you can refer to it.

@jiangxb1987
Contributor

@sitalkedia That makes sense. The proposed solutions are quite similar, so we can continue with either PR. WDYT, @jerryshao @sitalkedia?

@sitalkedia
Author

I think the other PR fixes one more issue on top of runningTasks being negative, so we can proceed with that one. What do you think, @jerryshao?

@jerryshao
Contributor

@sitalkedia I'm OK with either.

@vanzin
Contributor

vanzin commented Oct 23, 2017

Let's fix up Saisai's PR then.

@sitalkedia closed this Oct 23, 2017
@jerryshao
Contributor

@sitalkedia, would you please reopen this PR? I think the second issue I fixed before is no longer valid, and for the first issue the fix is no different from the one here.
