-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-13852][YARN]handle the InterruptedException caused by YARN HA switch #11692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #53060 has finished for PR 11692 at commit
|
| logError(s"Application $appId not found.") | ||
| return (YarnApplicationState.KILLED, FinalApplicationStatus.KILLED) | ||
| case NonFatal(e) => | ||
| if (e.isInstanceOf[InterruptedException] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't these just be additional case statements above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can only move InterruptedException to above but not exception caused by it.
just move InterruptedException or leave these two here, which option do you think is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm does this not work?
case e: InterruptedException => ...
case e: Exception if e.getCause.isInstanceOf[InterruptedException] => ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then the code segments will be seperated into two parts. i am not sure it's better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sure, you can almost combine the two conditions with | in Scala but not quite in this case, but you can at least do ...
case e: Exception if e.isInstanceOf[InterruptedException] ||e.getCause.isInstanceOf[InterruptedException] =>
|
@srowen thanks for your comments. I've changed it, please check. |
|
Test build #53072 has finished for PR 11692 at commit
|
|
OK by me |
|
@vanzin @jerryshao does that sound right? |
|
@WangTaoTheTonic Can you please clarify exactly what is going on here? You are saying if YARN RM fails over from active to standby then our client logic can no longer connect to RM and gets an interrupted Exception? Who is interrupting the monitor thread? If its the spark context then how do you know its success and not failure? I'm not sure reporting success is the right thing to do here if we don't know the real status that is why I want to understand exactly what is going on. |
|
Also which sleep are you referring to because the place you put the try/catch isn't around the Thread.sleep(interval) in monitorApplication, its only around getApplicationReport |
|
hi @tgravescs , it happened when sc stop normally in client mode. sc.stop will stop dagscheduler -> stop taskscheduler -> stop scheduler backend -> interrupt the monitor thread, in which it will enter into a retry logic where sleep intervals occurs(which is not the sleep here) waiting for RM's switching. The sleep methods will throw an InterruptedException when it is interrupted, so we need to catch it because it will log the application failed as treated as NonFatal(e), for now. |
|
for another concern about the final application status returned, we don't need too much worry as it is barely used by the codes who invoke this. |
|
So you are saying that if spark context in yarn client mode is cleanly exiting while the RM is switching to the standby node, the call to getApplicationReport in Yarn can internally retry and sleep, since sc.stop() was called it ends up calling the scheduler backend stop interrupting the monitoring thread. The MonitorThread has a catch for interruptedException and should be printing an info message " Interrupting monitor thread", you are seeing this? Interrupted Exception is not a NonFatal error so it shouldn't be catching it: From scaladoc: |
|
Also is this just what is being printed by the client or is the YARN final status (in RM) actually failed? |
|
I've only observed exception caused by InterruptedException but not itself directly, thought it should be wrapped internally. The status in RM is ok as it is decided by ApplicationMaster not spark client. In my recall i didn't see the message "Interrupting monitor thread" but not 100% sure. I will try to reproduce it and confirm. |
|
@WangTaoTheTonic , would you please elaborate specific problem you met when From my understanding, it will only throw some exceptions mentioned that this application is failed, is that right? |
|
@tgravescs I reproduce it and the error message like:
There's no "Interrupting monitor thread" and the exception is UndeclaredThrowableException caused by InterruptedException. |
|
@jerryshao yes, the problem is that client side's log will throw exception and show app is failed. more details are in log i pasted. |
|
But from my understanding, this exception does no harm to your application, since your application is about to finish itself, also this may happen occasionally. Also does it relate to RM HA, from my understanding, this |
|
the application is finished successfully(RM UI also show success state) but log shows it failed, that's the problem i think. yeah you're right sleep method can throw InterruptedException. this pr is trying to fix the problem we find in RM HA switching. what i am trying to say is that interrupting a monitor thread should not print the failed message in log. |
|
I see your point, so the real issue should only be the log issue. But marking the state as
|
|
the added log just says "app is finished" but not "success". if sc stops because continuous stage failure, the returned |
|
So inside of hadoop in the getApplicationReport call, it was in RetryInvocationHandler which was doing a sleep and got an interrupted exception. That ended up throwing a java.lang.reflect.UndeclaredThrowableException up to monitorApplication which is why it was handled by the NonFatal catch. I need to look at it a bit closer. |
|
so, how about it guys? |
|
I have had time to look further to see what we should be doing, but as I read the exception you listed above, the fix you are proposing here won't work. Its not getting an InterruptedException back to the monitorApplication routine, its getting an UndeclaredThrowableException. |
|
have you tried to reproduce the scenaro and see what happend? The |
| case e: Exception if (e.isInstanceOf[InterruptedException] | ||
| || e.getCause.isInstanceOf[InterruptedException]) => | ||
| logInfo("The reporter thread is interrupted, we assume app is finished.") | ||
| return (YarnApplicationState.FINISHED, FinalApplicationStatus.SUCCEEDED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about we change the status to be UNDEFINED.
Since we really don't know the status it seems like returning the undefined in this case make more sense. Hopefully the user would then go look more at the RM or spark job details.
|
sorry I didn't read the diff close enough thought you were just catching that type. I don't have an RM HA setup to quickly test it. I think there are cases this can return the wrong thing still (success when failure). Like if you control-c out of spark-shell during this same time period. Yes it probably doesn't make much difference, but at the same time I don't see it printing the exception to the log on shutdown as that big of a deal either. Is this actually causing you an issue or just annoying in the log? |
|
It will not impact on actuall result, but a error stacktrace and log showing failure will make user confused and easy to believe that the application is failed and needed to be submitted again. Like I said above, we return 2-tuple in which the last one ( |
|
Wow this is old. @WangTaoTheTonic I'm not sure this is the right fix. The code in YarnClientSchedulerBackend is already catching InterruptedException: It seems it just needs to be tweaked to also handled the |
|
Hi @vanzin, would this be then a soft-suggestion for closing this if there is no objection for about , way, a week? |
|
Either close or make the right fix. As it is, the PR is not doing the right thing. |
|
Let me try to propose to close this after a week if the author seems not active on this. |
when sc stops, it will interrupt thread using to monitor app status.
the thread will throw an InterruptedException if YARN is switch as there is a sleep method in retry logic.
If YARN is switch between active and standby, sc.stop will return YarnApplicationState.FAILED as the InterruptedException is not caught.