[SPARK-13852][YARN]handle the InterruptedException caused by YARN HA switch #11692

WangTaoTheTonic · 2016-03-14T08:30:36Z

When sc stops, it interrupts the thread used to monitor the application status.
If YARN is switching over at that moment, the thread throws an InterruptedException, because the retry logic contains a sleep.
So if YARN switches between active and standby, sc.stop returns YarnApplicationState.FAILED, as the InterruptedException is not caught.

SparkQA · 2016-03-14T08:52:19Z

Test build #53060 has finished for PR 11692 at commit 9d6de23.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-03-14T09:01:16Z

yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala

            logError(s"Application $appId not found.")
            return (YarnApplicationState.KILLED, FinalApplicationStatus.KILLED)
          case NonFatal(e) =>
+            if (e.isInstanceOf[InterruptedException]


Shouldn't these just be additional case statements above?

We can only move InterruptedException above, not an exception caused by it.
Should we move just InterruptedException, or leave both here? Which option do you think is better?

Hm does this not work?

case e: InterruptedException => ... case e: Exception if e.getCause.isInstanceOf[InterruptedException] => ...

Then the code will be separated into two parts. I am not sure that's better.

Ah sure, you can almost combine the two conditions with | in Scala but not quite in this case, but you can at least do ...

case e: Exception if e.isInstanceOf[InterruptedException] || e.getCause.isInstanceOf[InterruptedException] =>
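To see what the combined guard does with the cases discussed in this thread, here is a minimal standalone sketch. `InterruptMatch` and `describe` are hypothetical names used only for illustration, not Spark code. Note that `isInstanceOf` on a null cause evaluates to false in Scala, so an exception without a cause simply falls through to the default case.

```scala
import java.lang.reflect.UndeclaredThrowableException

// Hypothetical helper (not part of Spark): classifies a throwable the way
// the combined guard suggested above would.
object InterruptMatch {
  def describe(t: Throwable): String = t match {
    // Matches an InterruptedException itself, or any Exception whose direct
    // cause is an InterruptedException. A null cause yields false here.
    case e: Exception if e.isInstanceOf[InterruptedException] ||
        e.getCause.isInstanceOf[InterruptedException] =>
      "interrupted"
    case _ =>
      "other"
  }
}
```

Only the direct cause is inspected, so an InterruptedException buried two or more levels deep would not be recognized by this guard.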

WangTaoTheTonic · 2016-03-14T14:15:16Z

@srowen thanks for your comments. I've changed it, please check.

SparkQA · 2016-03-14T14:36:56Z

Test build #53072 has finished for PR 11692 at commit 27203d8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-03-14T14:38:13Z

OK by me

srowen · 2016-03-15T10:25:13Z

@vanzin @jerryshao does that sound right?

tgravescs · 2016-03-15T14:01:09Z

@WangTaoTheTonic Can you please clarify exactly what is going on here?
Does this happen in client or cluster mode or both?

You are saying if YARN RM fails over from active to standby then our client logic can no longer connect to RM and gets an InterruptedException? Who is interrupting the monitor thread? If it's the Spark context, then how do you know it's success and not failure?

I'm not sure reporting success is the right thing to do here if we don't know the real status that is why I want to understand exactly what is going on.

tgravescs · 2016-03-15T14:18:49Z

Also, which sleep are you referring to? The place you put the try/catch isn't around the Thread.sleep(interval) in monitorApplication; it's only around getApplicationReport.

WangTaoTheTonic · 2016-03-15T15:46:01Z

Hi @tgravescs, it happened when sc stopped normally in client mode. sc.stop stops the DAGScheduler -> stops the task scheduler -> stops the scheduler backend -> interrupts the monitor thread, which at that point is inside retry logic that sleeps between attempts (not the sleep here) while waiting for the RM switch.

The sleep throws an InterruptedException when it is interrupted, so we need to catch it; for now it is treated as NonFatal(e) and the application is logged as failed.

WangTaoTheTonic · 2016-03-15T16:04:30Z

As for the other concern about the final application status returned, we don't need to worry too much, as it is barely used by the code that invokes this.

tgravescs · 2016-03-15T16:11:47Z

So you are saying that if the Spark context in YARN client mode is exiting cleanly while the RM is switching to the standby node, the call to getApplicationReport in YARN can internally retry and sleep; since sc.stop() was called, it ends up calling the scheduler backend's stop, which interrupts the monitoring thread.

The MonitorThread has a catch for InterruptedException and should be printing the info message "Interrupting monitor thread". Are you seeing this?

InterruptedException is not matched by NonFatal, so that case shouldn't be catching it:

From scaladoc:
Extractor of non-fatal Throwables. Will not match fatal errors like VirtualMachineError (for example, OutOfMemoryError and StackOverflowError, subclasses of VirtualMachineError), ThreadDeath, LinkageError, InterruptedException, ControlThrowable.
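A quick way to check the scaladoc's claim is a small sketch like the one below. `NonFatalCheck` is a hypothetical helper, not part of Spark. It also suggests that a RuntimeException which merely wraps an InterruptedException would still be matched by NonFatal, since the extractor only looks at the outer type.

```scala
import java.lang.reflect.UndeclaredThrowableException
import scala.util.control.NonFatal

// Hypothetical helper (not part of Spark): reports whether the NonFatal
// extractor matches a given throwable.
object NonFatalCheck {
  def isNonFatal(t: Throwable): Boolean = t match {
    case NonFatal(_) => true
    case _ => false
  }

  // A bare InterruptedException is treated as fatal by NonFatal.
  val bare: Boolean = isNonFatal(new InterruptedException("sleep interrupted"))

  // A runtime exception wrapping it is an ordinary non-fatal exception.
  val wrapped: Boolean =
    isNonFatal(new UndeclaredThrowableException(new InterruptedException("sleep interrupted")))
}
```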

tgravescs · 2016-03-15T16:15:09Z

Also is this just what is being printed by the client or is the YARN final status (in RM) actually failed?

WangTaoTheTonic · 2016-03-15T16:30:33Z

I've only observed an exception caused by InterruptedException, not InterruptedException itself, so I think it is wrapped internally. The status in the RM is OK, as it is decided by the ApplicationMaster, not the Spark client.

As far as I recall, I didn't see the message "Interrupting monitor thread", but I'm not 100% sure. I will try to reproduce it and confirm.

jerryshao · 2016-03-16T01:50:06Z

@WangTaoTheTonic, would you please elaborate on the specific problem you hit when the InterruptedException is thrown?

From my understanding, it only throws some exceptions saying that the application failed, is that right?

WangTaoTheTonic · 2016-03-16T02:33:08Z

@tgravescs I reproduced it, and the error message looks like this:

16/03/16 10:29:33 INFO YarnClientSchedulerBackend: Shutting down all executors
16/03/16 10:29:33 ERROR Client: Failed to contact YARN for application application_1457924833801_0003.
java.lang.reflect.UndeclaredThrowableException
    at com.sun.proxy.$Proxy10.getApplicationReport(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:431)
    at org.apache.spark.deploy.yarn.Client.getApplicationReport(Client.scala:221)
    at org.apache.spark.deploy.yarn.Client.monitorApplication(Client.scala:882)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:144)
Caused by: java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:155)
    ... 5 more
16/03/16 10:29:33 INFO YarnClientSchedulerBackend: Asking each executor to shut down
16/03/16 10:29:33 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state FAILED!
16/03/16 10:29:33 INFO SparkContext: SparkContext already stopped.
16/03/16 10:29:33 INFO YarnClientSchedulerBackend: Stopped

There's no "Interrupting monitor thread" and the exception is UndeclaredThrowableException caused by InterruptedException.
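The wrapping seen in this stack trace can be reproduced with a plain JDK dynamic proxy, as a rough sketch: when an invocation handler throws a checked exception that the interface method does not declare, the proxy rethrows it inside an UndeclaredThrowableException. `ReportService` and the handler below are hypothetical stand-ins, not Hadoop's actual RetryInvocationHandler.

```scala
import java.lang.reflect.{InvocationHandler, Method, Proxy, UndeclaredThrowableException}

object ProxyWrapDemo {
  // Hypothetical stand-in for a client protocol; getReport declares no
  // checked exceptions.
  trait ReportService {
    def getReport(): String
  }

  // Simulates a retry handler whose sleep was interrupted.
  private val handler = new InvocationHandler {
    override def invoke(proxy: AnyRef, method: Method, args: Array[AnyRef]): AnyRef =
      throw new InterruptedException("sleep interrupted")
  }

  def newProxy(): ReportService =
    Proxy.newProxyInstance(
      classOf[ReportService].getClassLoader,
      Array[Class[_]](classOf[ReportService]),
      handler
    ).asInstanceOf[ReportService]

  // Calls through the proxy and reports how the exception arrived.
  def callAndClassify(): String =
    try {
      newProxy().getReport()
    } catch {
      case e: UndeclaredThrowableException =>
        s"wrapped:${e.getCause.getClass.getSimpleName}"
      case _: InterruptedException =>
        "direct"
    }
}
```

In this sketch the caller never sees a bare InterruptedException, which is consistent with the log above showing no "Interrupting monitor thread" message.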

WangTaoTheTonic · 2016-03-16T02:39:37Z

@jerryshao Yes, the problem is that the client-side log shows an exception and reports the app as failed. More details are in the log I pasted.

jerryshao · 2016-03-16T03:06:40Z

But from my understanding, this exception does no harm to your application, since your application is about to finish anyway, and it may only happen occasionally.

Also, is it related to RM HA? From my understanding, this InterruptedException will be thrown in any case where the code runs into a sleep, whether in HA or not. Is there anything special about RM HA?

WangTaoTheTonic · 2016-03-16T03:27:35Z

The application finished successfully (the RM UI also shows the success state), but the log shows it failed; that's the problem, I think.

Yeah, you're right, the sleep method can throw InterruptedException. This PR is trying to fix the problem we found during RM HA switching.

What I am trying to say is that interrupting the monitor thread should not print the failure message in the log.

jerryshao · 2016-03-16T04:59:21Z

I see your point, so the real issue is only the log output.

But marking the state as FINISHED with SUCCEEDED is open to question, since here we don't know the real state of the application. If it finished with a failure, say the job was aborted due to continuous stage failures, should we still mark it as SUCCEEDED? So I'm conservative about this change, because:

This issue happens rarely and does no harm to the result of the application.
We cannot get the real exit state of the application, so marking it as SUCCEEDED is open to question.

WangTaoTheTonic · 2016-03-16T07:05:58Z

The added log just says "app is finished", not "success". If sc stops because of continuous stage failures, the returned FinalApplicationStatus is not used by other code; only the returned YarnApplicationState is used.

tgravescs · 2016-03-16T13:54:09Z

So inside of hadoop in the getApplicationReport call, it was in RetryInvocationHandler which was doing a sleep and got an interrupted exception. That ended up throwing a java.lang.reflect.UndeclaredThrowableException up to monitorApplication which is why it was handled by the NonFatal catch.

I need to look at it a bit closer.

WangTaoTheTonic · 2016-03-18T16:03:25Z

so, how about it guys?

tgravescs · 2016-03-18T17:50:21Z

I have had time to look further to see what we should be doing, but as I read the exception you listed above, the fix you are proposing here won't work. It's not getting an InterruptedException back to the monitorApplication routine; it's getting an UndeclaredThrowableException.

WangTaoTheTonic · 2016-03-19T16:23:57Z

Have you tried to reproduce the scenario and see what happens? The UndeclaredThrowableException will be caught by e.getCause.isInstanceOf[InterruptedException], I think.

tgravescs · 2016-03-22T19:20:51Z

yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala

+          case e: Exception if (e.isInstanceOf[InterruptedException]
+            || e.getCause.isInstanceOf[InterruptedException]) =>
+            logInfo("The reporter thread is interrupted, we assume app is finished.")
+            return (YarnApplicationState.FINISHED, FinalApplicationStatus.SUCCEEDED)


How about we change the status to UNDEFINED?

Since we really don't know the status, returning UNDEFINED in this case seems to make more sense. Hopefully the user would then go look more closely at the RM or the Spark job details.

tgravescs · 2016-03-22T19:22:12Z

Sorry, I didn't read the diff closely enough; I thought you were just catching that type. I don't have an RM HA setup to quickly test it.

I think there are cases this can return the wrong thing still (success when failure). Like if you control-c out of spark-shell during this same time period. Yes it probably doesn't make much difference, but at the same time I don't see it printing the exception to the log on shutdown as that big of a deal either.

Is this actually causing you an issue or just annoying in the log?

WangTaoTheTonic · 2016-03-23T14:42:10Z

It will not affect the actual result, but an error stack trace and a log showing failure will confuse users and make it easy to believe that the application failed and needs to be submitted again.

Like I said above, we return a 2-tuple in which the last element (FinalApplicationStatus) is not used; the first one says the application is finished, so it is OK.

vanzin · 2016-12-05T23:02:58Z

Wow this is old. @WangTaoTheTonic I'm not sure this is the right fix. The code in YarnClientSchedulerBackend is already catching InterruptedException:

      try {
        val (state, _) = client.monitorApplication(appId.get, logApplicationReport = false)
        logError(s"Yarn application has already exited with state $state!")
        allowInterrupt = false
        sc.stop()
      } catch {
        case e: InterruptedException => logInfo("Interrupting monitor thread")
      }

It seems it just needs to be tweaked to also handle the UndeclaredThrowableException case.
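One possible shape for that tweak is sketched below, under stated assumptions: `MonitorCatchSketch` and `runMonitor` are hypothetical, and the `monitor` argument stands in for the call to client.monitorApplication. This is not the fix that was merged. It would also only help if the exception actually propagates out of monitorApplication; as noted earlier in the thread, the NonFatal case in Client.scala handles it first, so that code would presumably need to change as well.

```scala
import java.lang.reflect.UndeclaredThrowableException

// Hypothetical sketch of the monitor thread's try/catch; not Spark code.
object MonitorCatchSketch {
  // `monitor` plays the role of client.monitorApplication and returns the
  // final state as a string for simplicity.
  def runMonitor(monitor: () => String): String =
    try {
      val state = monitor()
      s"exited with state $state"
    } catch {
      // Existing behaviour: a direct interrupt is logged quietly.
      case _: InterruptedException =>
        "Interrupting monitor thread"
      // Assumed addition: an interrupt wrapped by a dynamic proxy is
      // treated the same way.
      case e: UndeclaredThrowableException
          if e.getCause.isInstanceOf[InterruptedException] =>
        "Interrupting monitor thread"
    }
}
```

Any other UndeclaredThrowableException is deliberately left uncaught here, so genuine failures would still surface.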

HyukjinKwon · 2017-02-09T12:53:07Z

Hi @vanzin, would this then be a soft suggestion to close this if there is no objection for about, say, a week?

vanzin · 2017-02-09T17:55:11Z

Either close or make the right fix. As it is, the PR is not doing the right thing.

HyukjinKwon · 2017-02-11T12:09:16Z

Let me try to propose to close this after a week if the author seems not active on this.

handle the InterruptedException caused by YARN HA switch

9d6de23

WangTaoTheTonic changed the title ~~[SPARK-13852]handle the InterruptedException caused by YARN HA switch~~ [SPARK-13852][YARN]handle the InterruptedException caused by YARN HA switch Mar 14, 2016

srowen reviewed Mar 14, 2016

put it in a seperate case

27203d8

tgravescs reviewed Mar 22, 2016

HyukjinKwon mentioned this pull request Feb 15, 2017

[BUILD] Close stale PRs #16937

Closed

asfgit closed this in ed338f7 Feb 17, 2017
