Skip to content

Conversation

@WangTaoTheTonic
Copy link
Contributor

when sc stops, it will interrupt thread using to monitor app status.
the thread will throw an InterruptedException if YARN is switch as there is a sleep method in retry logic.
If YARN is switch between active and standby, sc.stop will return YarnApplicationState.FAILED as the InterruptedException is not caught.

@WangTaoTheTonic WangTaoTheTonic changed the title [SPARK-13852]handle the InterruptedException caused by YARN HA switch [SPARK-13852][YARN]handle the InterruptedException caused by YARN HA switch Mar 14, 2016
@SparkQA
Copy link

SparkQA commented Mar 14, 2016

Test build #53060 has finished for PR 11692 at commit 9d6de23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

logError(s"Application $appId not found.")
return (YarnApplicationState.KILLED, FinalApplicationStatus.KILLED)
case NonFatal(e) =>
if (e.isInstanceOf[InterruptedException]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't these just be additional case statements above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can only move InterruptedException to above but not exception caused by it.
just move InterruptedException or leave these two here, which option do you think is better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm does this not work?

case e: InterruptedException => ...
case e: Exception if e.getCause.isInstanceOf[InterruptedException] => ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then the code segments will be seperated into two parts. i am not sure it's better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sure, you can almost combine the two conditions with | in Scala but not quite in this case, but you can at least do ...

case e: Exception if e.isInstanceOf[InterruptedException] ||e.getCause.isInstanceOf[InterruptedException] =>

@WangTaoTheTonic
Copy link
Contributor Author

@srowen thanks for your comments. I've changed it, please check.

@SparkQA
Copy link

SparkQA commented Mar 14, 2016

Test build #53072 has finished for PR 11692 at commit 27203d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Mar 14, 2016

OK by me

@srowen
Copy link
Member

srowen commented Mar 15, 2016

@vanzin @jerryshao does that sound right?

@tgravescs
Copy link
Contributor

@WangTaoTheTonic Can you please clarify exactly what is going on here?
Does this happen in client or cluster mode or both?

You are saying if YARN RM fails over from active to standby then our client logic can no longer connect to RM and gets an interrupted Exception? Who is interrupting the monitor thread? If its the spark context then how do you know its success and not failure?

I'm not sure reporting success is the right thing to do here if we don't know the real status that is why I want to understand exactly what is going on.

@tgravescs
Copy link
Contributor

Also which sleep are you referring to because the place you put the try/catch isn't around the Thread.sleep(interval) in monitorApplication, its only around getApplicationReport

@WangTaoTheTonic
Copy link
Contributor Author

hi @tgravescs , it happened when sc stop normally in client mode. sc.stop will stop dagscheduler -> stop taskscheduler -> stop scheduler backend -> interrupt the monitor thread, in which it will enter into a retry logic where sleep intervals occurs(which is not the sleep here) waiting for RM's switching.

The sleep methods will throw an InterruptedException when it is interrupted, so we need to catch it because it will log the application failed as treated as NonFatal(e), for now.

@WangTaoTheTonic
Copy link
Contributor Author

for another concern about the final application status returned, we don't need too much worry as it is barely used by the codes who invoke this.

@tgravescs
Copy link
Contributor

So you are saying that if spark context in yarn client mode is cleanly exiting while the RM is switching to the standby node, the call to getApplicationReport in Yarn can internally retry and sleep, since sc.stop() was called it ends up calling the scheduler backend stop interrupting the monitoring thread.

The MonitorThread has a catch for interruptedException and should be printing an info message " Interrupting monitor thread", you are seeing this?

Interrupted Exception is not a NonFatal error so it shouldn't be catching it:

From scaladoc:
Extractor of non-fatal Throwables. Will not match fatal errors like VirtualMachineError (for example, OutOfMemoryError and StackOverflowError, subclasses of VirtualMachineError), ThreadDeath, LinkageError, InterruptedException, ControlThrowable.

@tgravescs
Copy link
Contributor

Also is this just what is being printed by the client or is the YARN final status (in RM) actually failed?

@WangTaoTheTonic
Copy link
Contributor Author

I've only observed exception caused by InterruptedException but not itself directly, thought it should be wrapped internally. The status in RM is ok as it is decided by ApplicationMaster not spark client.

In my recall i didn't see the message "Interrupting monitor thread" but not 100% sure. I will try to reproduce it and confirm.

@jerryshao
Copy link
Contributor

@WangTaoTheTonic , would you please elaborate specific problem you met when InterruptedException is thrown?

From my understanding, it will only throw some exceptions mentioned that this application is failed, is that right?

@WangTaoTheTonic
Copy link
Contributor Author

@tgravescs I reproduce it and the error message like:

16/03/16 10:29:33 INFO YarnClientSchedulerBackend: Shutting down all executors
16/03/16 10:29:33 ERROR Client: Failed to contact YARN for application application_1457924833801_0003.
java.lang.reflect.UndeclaredThrowableException
at com.sun.proxy.$Proxy10.getApplicationReport(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:431)
at org.apache.spark.deploy.yarn.Client.getApplicationReport(Client.scala:221)
at org.apache.spark.deploy.yarn.Client.monitorApplication(Client.scala:882)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:144)
Caused by: java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:155)
... 5 more
16/03/16 10:29:33 INFO YarnClientSchedulerBackend: Asking each executor to shut down
16/03/16 10:29:33 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state FAILED!
16/03/16 10:29:33 INFO SparkContext: SparkContext already stopped.
16/03/16 10:29:33 INFO YarnClientSchedulerBackend: Stopped

There's no "Interrupting monitor thread" and the exception is UndeclaredThrowableException caused by InterruptedException.

@WangTaoTheTonic
Copy link
Contributor Author

@jerryshao yes, the problem is that client side's log will throw exception and show app is failed. more details are in log i pasted.

@jerryshao
Copy link
Contributor

But from my understanding, this exception does no harm to your application, since your application is about to finish itself, also this may happen occasionally.

Also does it relate to RM HA, from my understanding, this InterruptedException will be thrown in any case where the code run into sleep, no matter in HA or not. Is there any special thing for RM HA?

@WangTaoTheTonic
Copy link
Contributor Author

the application is finished successfully(RM UI also show success state) but log shows it failed, that's the problem i think.

yeah you're right sleep method can throw InterruptedException. this pr is trying to fix the problem we find in RM HA switching.

what i am trying to say is that interrupting a monitor thread should not print the failed message in log.

@jerryshao
Copy link
Contributor

I see your point, so the real issue should only be the log issue.

But marking the state as FINISHED with SUCCEED should be open to question, since here we don't know the real state of application, if it is finished with failure, say job is aborted due to continuous stage failure, should we still mark this as SUCCEED? So here I'm conservative to this change, because:

  1. This issue is happened rarely and does no harm to the result of application.
  2. We cannot get the real exit state of application, so marking as SUCCEED should be open to question.

@WangTaoTheTonic
Copy link
Contributor Author

the added log just says "app is finished" but not "success". if sc stops because continuous stage failure, the returned FinalApplicationStatus is not used by other codes. only YarnApplicationState returned is used.

@tgravescs
Copy link
Contributor

So inside of hadoop in the getApplicationReport call, it was in RetryInvocationHandler which was doing a sleep and got an interrupted exception. That ended up throwing a java.lang.reflect.UndeclaredThrowableException up to monitorApplication which is why it was handled by the NonFatal catch.

I need to look at it a bit closer.

@WangTaoTheTonic
Copy link
Contributor Author

so, how about it guys?

@tgravescs
Copy link
Contributor

I have had time to look further to see what we should be doing, but as I read the exception you listed above, the fix you are proposing here won't work. Its not getting an InterruptedException back to the monitorApplication routine, its getting an UndeclaredThrowableException.

@WangTaoTheTonic
Copy link
Contributor Author

have you tried to reproduce the scenaro and see what happend? The UndeclaredThrowableException will be caught by e.getCause.isInstanceOf[InterruptedException], i think.

case e: Exception if (e.isInstanceOf[InterruptedException]
|| e.getCause.isInstanceOf[InterruptedException]) =>
logInfo("The reporter thread is interrupted, we assume app is finished.")
return (YarnApplicationState.FINISHED, FinalApplicationStatus.SUCCEEDED)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about we change the status to be UNDEFINED.

Since we really don't know the status it seems like returning the undefined in this case make more sense. Hopefully the user would then go look more at the RM or spark job details.

@tgravescs
Copy link
Contributor

sorry I didn't read the diff close enough thought you were just catching that type. I don't have an RM HA setup to quickly test it.

I think there are cases this can return the wrong thing still (success when failure). Like if you control-c out of spark-shell during this same time period. Yes it probably doesn't make much difference, but at the same time I don't see it printing the exception to the log on shutdown as that big of a deal either.

Is this actually causing you an issue or just annoying in the log?

@WangTaoTheTonic
Copy link
Contributor Author

It will not impact on actuall result, but a error stacktrace and log showing failure will make user confused and easy to believe that the application is failed and needed to be submitted again.

Like I said above, we return 2-tuple in which the last one (FinalApplicationStatus) is not used, the first one said application is finished so it is ok.

@vanzin
Copy link
Contributor

vanzin commented Dec 5, 2016

Wow this is old. @WangTaoTheTonic I'm not sure this is the right fix. The code in YarnClientSchedulerBackend is already catching InterruptedException:

      try {
        val (state, _) = client.monitorApplication(appId.get, logApplicationReport = false)
        logError(s"Yarn application has already exited with state $state!")
        allowInterrupt = false
        sc.stop()
      } catch {
        case e: InterruptedException => logInfo("Interrupting monitor thread")
      }

It seems it just needs to be tweaked to also handled the UndeclaredThrowableException case.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Feb 9, 2017

Hi @vanzin, would this be then a soft-suggestion for closing this if there is no objection for about , way, a week?

@vanzin
Copy link
Contributor

vanzin commented Feb 9, 2017

Either close or make the right fix. As it is, the PR is not doing the right thing.

@HyukjinKwon
Copy link
Member

Let me try to propose to close this after a week if the author seems not active on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants