[SPARK-16522][MESOS] Spark application throws exception on exit. #14175
Conversation
|
Test build #62225 has finished for PR 14175 at commit
|
| reason: String): Unit = { | ||
| stateLock.synchronized { | ||
| removeExecutor(taskId, SlaveLost(reason)) | ||
| if (!stopCalled) { |
I don't think we should be adding the guard here. It's the parent class that's incorrectly making a request to the driverEndpoint despite the driverEndpoint being shut down. So it's the parent class that should add the guard.
If we add the guard in the parent class, namely CoarseGrainedSchedulerBackend, what is the appropriate behavior of the guard? Should it silently ignore all message requests after stop() is called and log warnings, or throw an exception? If the latter, then the call to removeExecutor has to be wrapped in a try.
Since the call to removeExecutor() is made in MesosCoarseGrainedSchedulerBackend, I think the current fix is simpler and reasonable.
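The subclass-side guard discussed here can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `ParentBackendSketch`, `MesosBackendSketch`, `endpointAlive`, and the `removed` list are hypothetical stand-ins, and the reason is a plain `String` rather than Spark's `SlaveLost`. Only `stateLock`, `stopCalled`, `removeExecutor`, and `stop()` are names taken from the thread.

```scala
// Hypothetical stand-in for the parent backend; not Spark's real class.
class ParentBackendSketch {
  private var endpointAlive = true
  var removed: List[String] = Nil

  // Assumed contract: must not be called after stop(), because the
  // endpoint that would receive the request is gone by then.
  def removeExecutor(id: String, reason: String): Unit = {
    if (!endpointAlive) {
      throw new IllegalStateException("driver endpoint already stopped")
    }
    removed = id :: removed
  }

  def stop(): Unit = { endpointAlive = false }
}

// Hypothetical stand-in for the Mesos backend: the guard lives in the caller.
class MesosBackendSketch extends ParentBackendSketch {
  private val stateLock = new Object
  private var stopCalled = false

  def executorTerminated(taskId: String, reason: String): Unit =
    stateLock.synchronized {
      // Skip the call once stop() has run, instead of letting it throw.
      if (!stopCalled) removeExecutor(taskId, reason)
    }

  override def stop(): Unit = stateLock.synchronized {
    stopCalled = true
    super.stop()
  }
}
```

With this shape, a termination event that arrives after `stop()` is dropped quietly, while one that arrives before it is still forwarded to the parent.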
OK, you've convinced me. But please add a clarifying comment to super.removeExecutor() specifying that it should not be called after super.stop() is called.
comment added.
|
Test build #62514 has finished for PR 14175 at commit
|
| reason: String): Unit = { | ||
| stateLock.synchronized { | ||
| removeExecutor(taskId, SlaveLost(reason)) | ||
| // Do not call removeExecutor() after this scheduler backend was stopped because |
The comment needs to be on the super class's removeExecutor method. All clients need to be aware of when they're allowed to call it.
Not only removeExecutor(), but also other methods, like reviveOffers(), killTask(), ..., should not be called after the backend is stopped. If you prefer adding the comment in the parent class, it would be more complete to add it to every method that can hit this case. However, I don't think that is necessary: an exception is thrown in such a case, which tells the caller that the call is not valid. That is how this issue was found.
@mgummelt, what's your opinion?
I'm fine with this going in as-is just to get the problem solved, but I do still think that classes should try to ensure that their public methods are callable without state consideration, so I would rather we had fixed this in the parent. Let's try to maintain that going forward.
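For comparison, a parent-side guard of the "ignore and warn" kind raised earlier in the thread might look like the sketch below. It is an assumption-laden illustration: `StateSafeBackendSketch`, `stopped`, and the in-memory `warnings` and `removed` buffers are hypothetical, and a real backend would use its logger rather than a buffer.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.collection.mutable.ArrayBuffer

// Hypothetical parent class whose public method is safe to call in any state.
class StateSafeBackendSketch {
  private val stopped = new AtomicBoolean(false)

  // Stand-ins for a logger and for the real removal work.
  val warnings = ArrayBuffer.empty[String]
  val removed = ArrayBuffer.empty[String]

  def removeExecutor(id: String, reason: String): Unit = {
    if (stopped.get()) {
      // After stop(): record a warning and ignore the request.
      warnings += s"Ignoring removeExecutor($id) after stop(): $reason"
    } else {
      // Simplified: a real implementation would also have to close the
      // window between this check and the actual request.
      removed += id
    }
  }

  def stop(): Unit = stopped.set(true)
}
```

The trade-off is the one debated above: callers no longer need their own flag, but an invalid call is no longer surfaced as an exception.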
What about filing a separate JIRA issue for better state handling in CoarseGrainedSchedulerBackend after stop() is called?
|
@sun-rui ping |
|
Can you add a regression test? Then LGTM |
|
Sure, will add it |
|
@sun-rui are you going to get to this? Otherwise I'll take it over. |
|
@mgummelt, will do it soon |
|
thanks! |
|
@mgummelt, regression test case added. I'm not sure it is what you expected. |
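A regression check for this kind of bug could be shaped like the sketch below: stop the backend, deliver a late status update from another thread, and confirm that nothing is thrown. This is not the test added in the PR; `StoppableBackendSketch`, `RegressionCheckSketch`, `statusUpdate`, and `lateUpdateFailure` are hypothetical names used only for illustration.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical backend with the stopCalled guard already in place.
class StoppableBackendSketch {
  private val stateLock = new Object
  private var stopCalled = false
  private var endpointAlive = true

  private def removeExecutor(id: String): Unit =
    if (!endpointAlive) throw new IllegalStateException("endpoint stopped")

  def statusUpdate(taskId: String): Unit = stateLock.synchronized {
    if (!stopCalled) removeExecutor(taskId)
  }

  def stop(): Unit = stateLock.synchronized {
    stopCalled = true
    endpointAlive = false
  }
}

object RegressionCheckSketch {
  // Returns the exception raised by a late status update, if any.
  def lateUpdateFailure(): Option[Throwable] = {
    val backend = new StoppableBackendSketch
    val failure = new AtomicReference[Option[Throwable]](None)
    backend.stop()
    // Simulate a scheduler callback arriving on another thread after stop().
    val t = new Thread(new Runnable {
      def run(): Unit =
        try backend.statusUpdate("task-0")
        catch { case e: Throwable => failure.set(Some(e)) }
    })
    t.start()
    t.join()
    failure.get()
  }
}
```

Removing the `if (!stopCalled)` check from `statusUpdate` makes `lateUpdateFailure()` return the `IllegalStateException`, which is the failure such a test is meant to catch.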
|
Test build #63312 has finished for PR 14175 at commit
|
|
rebased to master |
|
Test build #63321 has finished for PR 14175 at commit
|
|
Test build #63327 has finished for PR 14175 at commit
|
| } | ||
| }.start | ||
|
|
||
| backend.stop |
include parens for methods with side effects
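As a small illustration of this convention (the `Counter` class is a hypothetical example, not code from the PR): side-effecting methods are declared and called with parentheses, while pure accessors omit them.

```scala
class Counter {
  private var n = 0

  // Mutates state, so it is declared and invoked with parentheses.
  def increment(): Unit = { n += 1 }

  // Pure accessor, so no parentheses.
  def value: Int = n
}
```

Applied to the line under review, the call would be written as `backend.stop()`.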
ok.
|
A couple minor style issues then LGTM. Will ping the committer when style issues are resolved. |
|
Test build #63345 has finished for PR 14175 at commit
|
|
@srowen LGTM. Can you merge this into master/2.0? |
|
Thanks @sun-rui ! |
|
Merged to master, but it doesn't pick cleanly into 2.0, and the conflict in the tests wasn't entirely trivial. You can open another PR if it's important. |
|
ok, will submit another PR for 2.0 branch |
|
@sun-rui Let me know if you are unable to do so. We need this in 2.0 |
Spark applications running on Mesos throw exception upon exit. For details, refer to https://issues.apache.org/jira/browse/SPARK-16522. I am not sure if there is any better fix, so wait for review comments. Manual test. Observed that the exception is gone upon application exit. Author: Sun Rui. Closes apache#14175 from sun-rui/SPARK-16522.
This is backport of #14175 to branch 2.0. Author: Sun Rui. Closes #14575 from sun-rui/SPARK-16522-branch-2.0.
What changes were proposed in this pull request?
Spark applications running on Mesos throw an exception on exit. For details, refer to https://issues.apache.org/jira/browse/SPARK-16522.
I am not sure whether there is a better fix, so I will wait for review comments.
How was this patch tested?
Tested manually. Observed that the exception no longer appears when the application exits.