Skip to content

Conversation

@sun-rui
Copy link
Contributor

@sun-rui sun-rui commented Jul 13, 2016

What changes were proposed in this pull request?

Spark applications running on Mesos throw exception upon exit. For details, refer to https://issues.apache.org/jira/browse/SPARK-16522.

I am not sure if there is any better fix, so wait for review comments.

How was this patch tested?

Manual test. Observed that the exception is gone upon application exit.

@SparkQA
Copy link

SparkQA commented Jul 13, 2016

Test build #62225 has finished for PR 14175 at commit 6fe96e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

reason: String): Unit = {
stateLock.synchronized {
removeExecutor(taskId, SlaveLost(reason))
if (!stopCalled) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be adding the guard here. It's the parent class that's incorrectly making a request to the driverEndpoint despite the driverEndpoint being shut down. So it's the parent class that should add the guard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add the guard in the parent class, namely CoarseGrainedSchedulerBackend, what's the appropriate behavior of the guard? Silently ignore all message requests after stop() is called and log warnings, or throw an exception? If latter, then the call to removeExecutor has to be wrapped with a try.
Since the call to removeExecutor() is done in MesosCoarseGrainedSchedulerBackend, I think current fix is simpler and reasonable.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK you've convinced me. But please add a clarifying comment to super.removeExecutor() specifying that it should not becalled after super.stop() is called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment added.

@SparkQA
Copy link

SparkQA commented Jul 19, 2016

Test build #62514 has finished for PR 14175 at commit b90ded8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

reason: String): Unit = {
stateLock.synchronized {
removeExecutor(taskId, SlaveLost(reason))
// Do not call removeExecutor() after this scheduler backend was stopped because

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment needs to be on the super class's removeExecutor method. All clients need to be aware of when they're allowed to call it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not only removeExecutor(), but also other methods, like reviveOffers(), killTask(), ..., should not be called after stopped. If you prefer adding comment in the parent class, then it seems it is more complete to add comment to all methods that may encounter such case. However, I don't think it is necessary to do so, as exceptions will be thrown in such case notifying the caller it is not valid to do such calls, just as why this issue was found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgummelt, what's your opinion?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this going in as-is just to the get the problem solved, but I do still think that classes should try to ensure that their public methods are callable w/o state consideration, so I would have rather we fixed this in the parent. Let's try to maintain that going forward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about submitting another JIRA issue on better handling of state management after stop() is called for CoarseGrainedSchedulerBackend?

@mgummelt
Copy link

@sun-rui ping

@mgummelt
Copy link

Can you add a regression test?

Then LGTM

@sun-rui
Copy link
Contributor Author

sun-rui commented Jul 27, 2016

Sure, will add it

@mgummelt
Copy link

mgummelt commented Aug 3, 2016

@sun-rui are you going to get to this? Otherwise I'll take it over.

@sun-rui
Copy link
Contributor Author

sun-rui commented Aug 4, 2016

@mgummelt, will do it soon

@mgummelt
Copy link

mgummelt commented Aug 4, 2016

thanks!

@sun-rui
Copy link
Contributor Author

sun-rui commented Aug 6, 2016

@mgummelt, regression test case added. Not sure it is the expected one.

@SparkQA
Copy link

SparkQA commented Aug 6, 2016

Test build #63312 has finished for PR 14175 at commit 1a8b8e6.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@sun-rui
Copy link
Contributor Author

sun-rui commented Aug 7, 2016

rebased to master

@SparkQA
Copy link

SparkQA commented Aug 7, 2016

Test build #63321 has finished for PR 14175 at commit f848530.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 7, 2016

Test build #63327 has finished for PR 14175 at commit 3526f53.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}.start

backend.stop
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

include parens for methods with side effects

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

@mgummelt
Copy link

mgummelt commented Aug 8, 2016

A couple minor style issues then LGTM. Will ping the committer when style issues are resolved.

@SparkQA
Copy link

SparkQA commented Aug 8, 2016

Test build #63345 has finished for PR 14175 at commit 5348138.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgummelt
Copy link

mgummelt commented Aug 8, 2016

@srowen LGTM. Can you merge this into master/2.0?

@mgummelt
Copy link

mgummelt commented Aug 8, 2016

Thanks @sun-rui !

@asfgit asfgit closed this in af710e5 Aug 9, 2016
@srowen
Copy link
Member

srowen commented Aug 9, 2016

Merged to master, but it doesn't pick cleanly into 2.0, and the conflict in the tests wasn't entirely trivial. You can open another PR if it's important.

@sun-rui
Copy link
Contributor Author

sun-rui commented Aug 9, 2016

ok, will submit another PR for 2.0 branch

@mgummelt
Copy link

mgummelt commented Aug 9, 2016

@sun-rui Let me know if you are unable to do so. We need this in 2.0

sun-rui added a commit to sun-rui/spark that referenced this pull request Aug 10, 2016
Spark applications running on Mesos throw exception upon exit. For details, refer to https://issues.apache.org/jira/browse/SPARK-16522.

I am not sure if there is any better fix, so wait for review comments.

Manual test. Observed that the exception is gone upon application exit.

Author: Sun Rui <[email protected]>

Closes apache#14175 from sun-rui/SPARK-16522.
asfgit pushed a commit that referenced this pull request Aug 10, 2016
This is backport of #14175 to branch 2.0

Author: Sun Rui <[email protected]>

Closes #14575 from sun-rui/SPARK-16522-branch-2.0.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants