[SPARK-34329][YARN] When hit ApplicationAttemptNotFoundException, we can't just stop app for all case #31437
|
Test build #134779 has finished for PR 31437 at commit
|
|
cc @tgravescs and @jerryshao FYI |
|
Can you provide more detail here? |
Our case is client mode; this error is thrown on the driver side.
I mean that in client mode the AM's container was preempted, so the RM can't find this attempt, and the client driver backend then receives ApplicationAttemptNotFoundException. |
|
OK, please update the description with those details. The only thing better would be if YARN told us this was a preemption case. Did you look at that at all? It's been a while since I looked into the YARN code. |
Done
Yes, and the original change also applies to yarn-client mode, since this part of the code is only used in yarn-client mode.
Yes.
From the error stack, it only tells us that the attempt can't be found. I found the root cause in YARN's log. Since we can't change YARN's code from the Spark side, all we can do here is retry. I have been looking into YARN's code these days. |
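As a rough illustration of the retry idea in the comment above, here is a minimal Python sketch; it is not Spark's or YARN's actual code. The exception class, `call_with_retry`, `make_flaky_operation`, and the `max_failures` limit are all hypothetical names introduced for the example. It only shows a not-found error being tolerated up to a bounded number of failures instead of ending the app on the first one.

```python
class ApplicationAttemptNotFound(Exception):
    """Hypothetical stand-in for YARN's ApplicationAttemptNotFoundException."""


def call_with_retry(operation, max_failures=3):
    """Call `operation` until it succeeds or has failed `max_failures` times.

    `max_failures` is an assumed limit chosen for illustration only.
    """
    failures = 0
    while True:
        try:
            return operation()
        except ApplicationAttemptNotFound:
            failures += 1
            if failures >= max_failures:
                # Give up: the attempt did not reappear within the limit.
                raise


def make_flaky_operation(fail_times):
    """Build an operation that fails `fail_times` times, then returns "ok"."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise ApplicationAttemptNotFound("attempt not found")
        return "ok"

    return operation
```

A transient failure (for example, a queue that is briefly saturated) is absorbed by the loop, while a persistent one still surfaces to the caller once the limit is reached.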
|
Test build #134883 has finished for PR 31437 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #134901 has finished for PR 31437 at commit
|
|
retest this please |
|
Test build #134908 has finished for PR 31437 at commit
|
|
retest this please |
|
Test build #135292 has finished for PR 31437 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Gentle ping @tgravescs. Any more suggestions? |
|
Sorry, I was out of office for a bit. Did you look at the test failure?
Did you have a chance to look at the YARN code here? I thought YARN was supposed to give an indication that it was going to preempt you, but it's been a while since I did anything in YARN. |
|
The RM cleans up the preempted container's info, then the code in
In yarn-client mode, |
|
Test build #137715 has finished for PR 31437 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
retest this please |
Seems to be a flaky test. |
|
Test build #137717 has finished for PR 31437 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
ping @tgravescs |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
In yarn-client mode, our users hit a case where, because of the YARN queue's settings, an app's container is preempted by a higher-level request under the scheduling framework, and ApplicationAttemptNotFoundException is then thrown. An old PR, #10129, added the behavior that we always close the app directly, without retrying, when we hit ApplicationAttemptNotFoundException.
In yarn-client mode, when the AM's container is preempted, the RM can't find this attempt, and the client driver backend then receives ApplicationAttemptNotFoundException. But in some cases, such as a queue issue caused by peak usage, a retry may succeed.
Not every case that throws ApplicationAttemptNotFoundException rules out a retry, so IMO we should not just catch this exception and stop the app. In this PR, I propose adding a condition so that we stop on ApplicationAttemptNotFoundException only in cluster mode.
Why are the changes needed?
Make jobs more tolerant of container issues.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Not needed.
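The condition proposed in the description can be modeled roughly as follows. This is a Python sketch of the decision logic only, under stated assumptions: `DeployMode`, `Action`, `on_attempt_not_found`, and the bounded `max_failures` limit in client mode are hypothetical and do not come from the patch itself, which is not shown on this page.

```python
from enum import Enum


class DeployMode(Enum):
    CLIENT = "client"
    CLUSTER = "cluster"


class Action(Enum):
    STOP = "stop"
    RETRY = "retry"


def on_attempt_not_found(mode, failure_count, max_failures):
    """Decide what to do after an attempt-not-found error.

    Cluster mode stops immediately, as the description proposes.
    Client mode retries; capping retries at `max_failures` is an
    assumption added here so the sketch cannot loop forever.
    """
    if mode is DeployMode.CLUSTER:
        return Action.STOP
    if failure_count >= max_failures:
        return Action.STOP
    return Action.RETRY
```

For example, `on_attempt_not_found(DeployMode.CLIENT, 1, 3)` yields a retry, while the same call with `DeployMode.CLUSTER` stops right away.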