-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-22312][CORE] Fix bug in Executor allocation manager in running tasks calculation #19534
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc - @vanzin |
… tasks calculation
4f0cffa to
f8fcc35
Compare
|
@sitalkedia would you please fix the PR title, seems it is broken now. |
|
Test build #82903 has finished for PR 19534 at commit
|
|
Do you mean we may first set |
|
@jiangxb1987 - yes that is the issue and you are right, we can avoid it by checking if the stageId is valid when we get a task end event. But I like this approach better because we can clean up the hack which sets the |
|
Jenkins retest this please. |
|
Test build #82914 has finished for PR 19534 at commit
|
|
@sitalkedia I have a very old similar PR #11205 , maybe you can refer to it. |
|
@sitalkedia That makes sense. The proposed solutions are quite similar, we can choose to continue with either PR, WDYT @jerryshao @sitalkedia ? |
|
I think other PR is fixing one more issue on top of runningTasks being negative, so we can proceed with the other one. What do you think @jerryshao ? |
|
@sitalkedia I'm OK with either. |
|
Let's fix up Saisai's PR then. |
|
@sitalkedia would you please reopen this PR, I think the second issue I fixed before is not valid anymore, for the first issue the fix is no difference compared to here. |
What changes were proposed in this pull request?
We often see the issue of Spark jobs stuck because the Executor Allocation Manager does not ask for any executor even if there are pending tasks in case dynamic allocation is turned on. Looking at the logic in Executor Allocation Manager, which calculates the running tasks, it can happen that the calculation will be wrong and the number of running tasks can become negative.
How was this patch tested?
Added unit test