
Conversation

@AngersZhuuuu
Contributor

@AngersZhuuuu AngersZhuuuu commented Feb 2, 2021

What changes were proposed in this pull request?

In yarn-client mode, some of our users hit a case where, due to the YARN queue's settings, an application's AM container is preempted by a higher-priority request from the scheduling framework, and the driver then receives an ApplicationAttemptNotFoundException. An old PR, #10129, added the behavior of always closing the app directly, without retry, when ApplicationAttemptNotFoundException is hit.

In yarn-client mode, when the AM's container is preempted, the RM can no longer find that attempt, so the client driver backend receives ApplicationAttemptNotFoundException. But in some cases, such as a queue issue caused by peak usage, a retry may succeed.

Since not every case that throws ApplicationAttemptNotFoundException means we should give up, IMO we should not just catch this exception and stop the app. So in this PR, I propose to add a condition so that we only stop the app on ApplicationAttemptNotFoundException in cluster mode.
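The proposed behavior can be sketched in miniature. This is a hypothetical simulation, not Spark's actual YarnClientSchedulerBackend code: on ApplicationAttemptNotFoundException, the monitor stops the app only in cluster mode; in client mode it keeps polling, since the RM may start a new AM attempt after preemption. All class and method names here are invented for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;

public class AttemptMonitorSketch {

    // Stand-in for YARN's ApplicationAttemptNotFoundException.
    static class ApplicationAttemptNotFoundException extends RuntimeException {
        ApplicationAttemptNotFoundException(String msg) { super(msg); }
    }

    // One poll of a simulated RM: null means "attempt doesn't exist in RM".
    static String getAttemptState(Iterator<String> rm) {
        String state = rm.next();
        if (state == null) {
            throw new ApplicationAttemptNotFoundException("attempt doesn't exist in RM");
        }
        return state;
    }

    // Returns the first observed attempt state, or "STOPPED" if the monitor
    // gave up on the application.
    static String monitor(boolean isClusterMode, int maxRetries, Iterator<String> rm) {
        for (int i = 0; i <= maxRetries; i++) {
            try {
                return getAttemptState(rm);
            } catch (ApplicationAttemptNotFoundException e) {
                if (isClusterMode) {
                    // Cluster mode: the driver runs inside the AM, so a lost
                    // attempt is unrecoverable -> stop the app.
                    return "STOPPED";
                }
                // Client mode: retry; a restarted AM attempt may show up.
            }
        }
        return "STOPPED";
    }

    public static void main(String[] args) {
        // Client mode survives two "attempt not found" polls and then sees
        // the restarted attempt.
        Iterator<String> rm = Arrays.<String>asList(null, null, "RUNNING").iterator();
        System.out.println(monitor(false, 5, rm)); // prints RUNNING
    }
}
```

With `isClusterMode = true` the same scripted RM would yield "STOPPED" on the first failed poll, which is the distinction the PR title describes.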

Why are the changes needed?

Make jobs more tolerant of container issues.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Not needed.

@SparkQA

SparkQA commented Feb 2, 2021

Test build #134779 has finished for PR 31437 at commit 1aed1ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions github-actions bot added the YARN label Feb 2, 2021
@HyukjinKwon
Member

cc @tgravescs and @jerryshao FYI

@tgravescs tgravescs changed the title [SPARK-34329][SQL] When hit ApplicationAttemptNotFoundException, we can't just stop app for all case [SPARK-34329][YARN] When hit ApplicationAttemptNotFoundException, we can't just stop app for all case Feb 3, 2021
@tgravescs
Contributor

Can you provide more detail here?
What mode is this running in when you hit this (client or cluster, unmanaged AM or managed)?
When you say it's preempted, I assume you mean this application was killed by the RM due to higher priority? If it was killed, how can it stay alive?

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Feb 3, 2021

Can you provide more detail here?
What mode is this running in when you hit this (client or cluster, unmanaged AM or managed)?

Our case is client mode; this error is thrown on the driver side.

When you say it's preempted, I assume you mean this application was killed by the RM due to higher priority? If it was killed, how can it stay alive?

I mean that in client mode the AM's container was preempted. The RM then can't find this attempt, so the client driver backend receives ApplicationAttemptNotFoundException.

@tgravescs
Contributor

OK, please update the description with those details.
Maybe cluster mode doesn't matter here because the application master would be killed anyway.
The original change, I believe, predates our ability to handle the application master being killed and restarted. I think that is handled OK now. So just to verify this change: the application master gets preempted and killed, the application master gets restarted, and the driver process continues, correct?

The only thing better would be if YARN told us this was a preemption case; did you look at that at all? It's been a while since I looked into the YARN code.

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Feb 4, 2021

OK, please update the description with those details.

Done

Maybe cluster mode doesn't matter here because the application master would be killed anyway.

Yea, and the original change also applies in yarn-client mode, since this part of the code is only used in yarn-client mode.

The original change, I believe, predates our ability to handle the application master being killed and restarted.
I think that is handled OK now. So just to verify this change: the application master gets preempted and killed, the application master gets restarted, and the driver process continues, correct?

yea

The only thing better would be if YARN told us this was a preemption case; did you look at that at all? It's been a while since I looked into the YARN code.

From the error stack, it only tells us that the attempt can't be found. I found the root cause in YARN's log. Since we can't change YARN's code from the Spark side, what we can do here is just retry. I am looking into YARN's code these days.

@SparkQA

SparkQA commented Feb 4, 2021

Test build #134883 has finished for PR 31437 at commit b36cf61.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39469/

@SparkQA

SparkQA commented Feb 4, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39469/

@SparkQA

SparkQA commented Feb 5, 2021

Test build #134901 has finished for PR 31437 at commit db37f5e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 5, 2021

Test build #134908 has finished for PR 31437 at commit db37f5e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 20, 2021

Test build #135292 has finished for PR 31437 at commit db37f5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39871/

@SparkQA

SparkQA commented Feb 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39871/

@AngersZhuuuu
Contributor Author

Gentle ping @tgravescs. Any more suggestions?

@tgravescs
Contributor

Sorry, I was out of office for a bit. Did you look at the test failure?

The only thing better would be if YARN told us this was a preemption case; did you look at that at all? It's been a while since I looked into the YARN code.
From the error stack, it only tells us that the attempt can't be found. I found the root cause in YARN's log. Since we can't change YARN's code from the Spark side, what we can do here is just retry. I am looking into YARN's code these days.

Did you have a chance to look at the YARN code here? I thought YARN was supposed to give an indication that it was going to preempt you, but it's been a while since I did anything in YARN.

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Apr 21, 2021

The RM cleans up the preempted container's info, and then this code in ClientRMService throws ApplicationAttemptNotFoundException:

    RMAppAttempt appAttempt = application.getAppAttempts().get(attemptId);
    if (appAttempt == null) {
      throw new ApplicationAttemptNotFoundException(
          "ApplicationAttempt with id '" + attemptId + "' doesn't exist in RM.");
    }

org.apache.spark.deploy.yarn.Client will receive this exception.

In yarn-client mode, Client runs on the client driver side; when it receives ApplicationAttemptNotFoundException, it only means the AM's container was lost. In this situation, retrying and restarting an AM attempt is OK.
In cluster mode, Client is wrapped in YarnClusterApplication and started on the user client side.
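The RM-side lookup quoted above can be reproduced in miniature. This is a toy sketch with hypothetical names, not YARN's real classes: once preemption cleanup removes the attempt from the RM's map, the same lookup that previously succeeded throws ApplicationAttemptNotFoundException, which is what the client driver then sees.

```java
import java.util.HashMap;
import java.util.Map;

public class RmAttemptLookupSketch {

    // Stand-in for YARN's ApplicationAttemptNotFoundException.
    static class ApplicationAttemptNotFoundException extends RuntimeException {
        ApplicationAttemptNotFoundException(String msg) { super(msg); }
    }

    // Simplified stand-in for the RM's per-application attempt map that the
    // quoted ClientRMService code consults.
    static Map<String, String> appAttempts = new HashMap<>();

    static String getAttemptReport(String attemptId) {
        String attempt = appAttempts.get(attemptId);
        if (attempt == null) {
            throw new ApplicationAttemptNotFoundException(
                "ApplicationAttempt with id '" + attemptId + "' doesn't exist in RM.");
        }
        return attempt;
    }

    public static void main(String[] args) {
        appAttempts.put("appattempt_000001", "RUNNING");
        System.out.println(getAttemptReport("appattempt_000001")); // prints RUNNING

        // Preemption: the RM cleans up the attempt's info...
        appAttempts.remove("appattempt_000001");
        try {
            getAttemptReport("appattempt_000001");
        } catch (ApplicationAttemptNotFoundException e) {
            // ...and the next poll from the client driver gets this exception.
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```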

@SparkQA

SparkQA commented Apr 21, 2021

Test build #137715 has finished for PR 31437 at commit 69abde6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42243/

@SparkQA

SparkQA commented Apr 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42243/

@AngersZhuuuu
Contributor Author

retest this please

@AngersZhuuuu
Contributor Author

Sorry, I was out of office for a bit. Did you look at the test failure?

Seems like a flaky test.

@SparkQA

SparkQA commented Apr 21, 2021

Test build #137717 has finished for PR 31437 at commit 69abde6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42245/

@SparkQA

SparkQA commented Apr 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42245/

@AngersZhuuuu
Contributor Author

ping @tgravescs

@github-actions

github-actions bot commented Sep 2, 2021

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 2, 2021
@github-actions github-actions bot closed this Sep 3, 2021