[SPARK-3005] Fix spark driver hang in mesos fine-grained mode #1940
Conversation
|
Can one of the admins verify this patch? |
|
I commented on the JIRA - but we already have code that handles the fact that cancellation is not supported in Mesos. It's likely this is related to some other type of error. |
|
Can one of the admins verify this patch? |
|
@xuzhongxing I followed the conversation on the JIRA and it looks like we still don't have a good idea of why the Spark driver is hanging. Although we have a fix that makes the problem go away, the root cause is probably deeper, and the behavior you observed in fine-grained mode is just a symptom. My guess is that when an |
|
@andrewor14 I think you're right that there's a deeper problem here. I haven't tested this but here's what I think is going on: (1) In TaskSchedulerImpl.cancelTasks(), the killTask call throws an unsupported operation exception, as is logged (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L194). As a result, tsm.abort() never gets called. So, the TaskSetManager still thinks everything is hunky dory. Regardless of what's going on with (3), I think the right way to fix this is to move the tsm.abort() call here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L196 up to before we try to kill the task. That way, regardless of whether killTask() is successful, we'll mark the task set as aborted and send all the appropriate events. Also, whoever fixes this should definitely add a unit test!! It would be great to add a short unit test to show the problem first, so it's easier for others to reproduce, and then deal with the fix. |
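For concreteness, here is a minimal sketch of the reordering described above, assuming method and field names along the lines of the linked TaskSchedulerImpl code; it illustrates the proposed ordering only and is not the actual patch:

```scala
// Sketch: abort the task set before attempting the best-effort kill, so an
// UnsupportedOperationException from killTask can no longer prevent the abort.
override def cancelTasks(stageId: Int, interruptThread: Boolean): Unit = synchronized {
  logInfo("Cancelling stage " + stageId)
  activeTaskSets.find(_._2.stageId == stageId).foreach { case (_, tsm) =>
    // Notify the TaskSetManager (and thus the DAGScheduler) first, so the job fails
    // cleanly even if the backend cannot kill individual tasks (e.g. Mesos fine-grained).
    tsm.abort("Stage %s cancelled".format(stageId))

    // Then make a best-effort attempt to kill any tasks that are already running.
    // This call may throw on backends that do not support killing tasks.
    tsm.runningTasksSet.foreach { tid =>
      backend.killTask(tid, taskIdToExecutorId(tid), interruptThread)
    }
    logInfo("Stage %d was cancelled".format(stageId))
  }
}
```

Whether this exact ordering is right would need the unit test mentioned above; the point is simply that the abort no longer depends on killTask succeeding.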
|
@kayousterhout thanks for the thorough analysis. Do you have any thoughts on just defining killTasks to be "best effort"? I think that would generally simplify the code a lot here. |
|
This seems like it could be ok -- my only concern is about the semantics of |
|
Yeah I think we should just change it to say that the kill request has been acknowledged, but since killing is asynchronous and best-effort, it may not have stopped executing. The semantics are already somewhat weird, because right now users will get that message even if tasks from their job are still running (since it's asynchronous). |
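A minimal sketch of how that best-effort contract could be spelled out on the backend interface (the doc wording here is illustrative, not taken from the codebase):

```scala
private[spark] trait SchedulerBackend {
  // ... other members omitted ...

  /**
   * Request that a task be killed. This is best-effort and asynchronous: returning
   * normally only means the kill request was acknowledged, not that the task has
   * stopped executing, and some backends may not support killing tasks at all.
   */
  def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit =
    throw new UnsupportedOperationException
}
```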
|
Just chiming in about the two different fixes for killTask: this PR does nothing, while Brenden's PR #2453 calls the Mesos driver's killTask. |
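For reference, the change in #2453 is roughly along these lines (a sketch assuming the fine-grained MesosSchedulerBackend keeps its Mesos SchedulerDriver in a field named `driver`): forward the kill to Mesos rather than throwing.

```scala
import org.apache.mesos.Protos.TaskID

override def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit = {
  // Ask Mesos to kill the task; the resulting status update arrives asynchronously
  // via statusUpdate(), so the kill is inherently best-effort.
  driver.killTask(TaskID.newBuilder().setValue(taskId.toString).build())
}
```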
|
Hey @xuzhongxing I think this is resolved in #2453. Would you mind closing this issue? |
|
Please close this PR as it is no longer needed.
https://issues.apache.org/jira/browse/SPARK-3005