[SPARK-4991][CORE] Worker should reconnect to Master when Master actor restart #3825
Conversation
Test build #24859 has started for PR 3825 at commit …

Test build #24859 has finished for PR 3825 at commit …

Test PASSed.

Wouldn't it be better to ensure that actors like Master and DAGScheduler never die due to uncaught exceptions?

More specifically, I guess I'm suggesting that we wrap the …

Hi @JoshRosen, it'll be a little weird to wrap the …

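For context, the "wrap" idea being discussed is roughly the following. This is a minimal sketch, not Spark's actual code; the trait name `NeverDyingActor` and the logging are made up for illustration:

```scala
import akka.actor.Actor

// Hypothetical trait sketching the idea: catch everything thrown while
// handling a message so the actor is never restarted by its supervisor
// (a restart would clear the actor's accumulated state).
trait NeverDyingActor extends Actor {

  // Normal message handling, supplied by the concrete actor.
  def handleMessage: Receive

  override def receive: Receive = {
    case msg =>
      try {
        if (handleMessage.isDefinedAt(msg)) handleMessage(msg)
      } catch {
        case t: Throwable =>
          // Log and swallow instead of letting the exception escape
          // and trigger an actor restart.
          println(s"Error while handling $msg, ignoring: $t")
      }
  }
}
```
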
I think a better way is to use Akka's persistence feature to recover the actor's state when the actor restarts. Anyway, this PR still has its value: when an unregistered worker sends a heartbeat to the Master, the Master should not just ignore it; at the very least it should tell the worker that it has been disconnected. What do you think @JoshRosen?

I'd rather use fewer Akka features than more, since this will make it easier to replace Akka with our own RPC layer in the future. Therefore, I'd much prefer not to allow exceptions to trigger actor restarts / state clearing. I think that adding an experimental Akka feature like persistence would be a huge risk for little obvious gain. I'm also not sure that the "heartbeat from unknown worker" case can ever occur if we don't clear the master's state, because I think that workers only begin sending heartbeats once a master has ack'd their registration, in which case the master would know that it was a previously-registered worker and instruct it to reconnect.

It doesn't seem to me that usage of the newer Akka persistence API is called for, but it does seem that wrapping the Master actor in a supervision strategy is worth considering: http://doc.akka.io/docs/akka/2.3.4/general/supervision.html#supervision

@markhamstra From that page:

1. Resume the subordinate, keeping its accumulated internal state
2. Restart the subordinate, clearing out its accumulated internal state
3. Stop the subordinate permanently
4. Escalate the failure, thereby failing itself

It sounds like we want approach 1, resuming the subordinate without losing its state, so I'd be in favor of reworking this PR to use that type of supervision strategy. I don't think that extending our actors to support restart necessarily makes sense, since it adds a lot of complexity (for instance, I don't think that this PR handles loss of the …

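For reference, a minimal sketch of approach 1 in Akka terms, assuming a hypothetical `MasterSupervisor` parent actor (Spark's DAGScheduler uses a similar dedicated supervisor): the parent's supervision strategy resumes the child on failure rather than restarting it, so the child keeps its state.

```scala
import akka.actor.{Actor, ActorRef, OneForOneStrategy, Props, SupervisorStrategy}
import akka.actor.SupervisorStrategy.Resume

// Hypothetical parent actor that supervises the Master actor and
// resumes it on failure instead of restarting it.
class MasterSupervisor(masterProps: Props) extends Actor {

  override val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy() {
      // Resume (rather than the default Restart) so the Master's
      // internal state -- registered workers, applications, etc. --
      // is not cleared when an exception escapes its receive method.
      case _: Exception => Resume
    }

  private val master: ActorRef = context.actorOf(masterProps, "Master")

  // Forward everything to the supervised Master.
  override def receive: Receive = {
    case msg => master.forward(msg)
  }
}
```
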
@JoshRosen your thinking is that Master will be in good shape even though an exception has been thrown? If you can guarantee that, then resuming the actor while keeping the accumulated state should do the job. Otherwise, things get more complicated. Within the lengthy process of handling exceptions thrown within the DAGScheduler (#186), we ended up taking the conservative approach of restarting the whole system instead of trying to restart the DAGScheduler actor with fixed or reconstructed state. I haven't dug into the details of this PR yet, so I can't say for certain, but there are probably lessons to be learned from that DAGScheduler epic PR.

Something else that we'll need to consider at some point, if other actors start requiring supervision strategies other than the default, is what the overall structure of the supervision hierarchy should be. Right now, only the DAGScheduler has another level of supervision, but perhaps Spark actors outside the DAGScheduler should also be handled under one or more levels of common supervision.

@JoshRosen, if we want to use the supervision mechanism, we need to add another actor level as the parent of the current Master actor. I don't know whether that is suitable.

@liyezhang556520 That's been done already in the DAGScheduler. If we need another level of supervision for Master or other actors, we should consider whether these actors need separate supervisors or whether they can be combined in the supervision hierarchy.

@markhamstra, thanks for the reminder. I'll update this PR and try to introduce supervision.

I think this has timed out. Would you close this PR for now?

When the Master akka actor encounters an exception, the Master actor restarts (an Akka actor restart, not a JVM restart), as in the case of SPARK-4989. All old information on the Master (including workers, applications, etc.) is cleared, but the workers are not aware of this at all. The resulting cluster state is: the master is up and all workers are up, but the master does not know that the workers exist and ignores all worker heartbeats, because no worker is registered. As a result, the whole cluster is unavailable.

In this PR, the Master tells the worker that its connection has been lost, so that the worker will register with the Master again.
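A minimal sketch of that idea, loosely modeled on the Master's heartbeat handling; the `Heartbeat`/`ReconnectWorker` message names and the `WorkerInfo` bookkeeping are illustrative assumptions, not necessarily the PR's exact code:

```scala
import akka.actor.Actor
import scala.collection.mutable

// Hypothetical message and state types for the sketch.
case class Heartbeat(workerId: String)
case class ReconnectWorker(masterUrl: String)
class WorkerInfo(var lastHeartbeat: Long)

class Master(masterUrl: String) extends Actor {
  private val idToWorker = mutable.HashMap.empty[String, WorkerInfo]

  override def receive: Receive = {
    case Heartbeat(workerId) =>
      idToWorker.get(workerId) match {
        case Some(info) =>
          // Known worker: just record that it is alive.
          info.lastHeartbeat = System.currentTimeMillis()
        case None =>
          // Unknown worker, e.g. because the Master actor restarted and
          // lost its state: tell the worker to reconnect so it
          // re-registers, instead of silently dropping the heartbeat.
          sender() ! ReconnectWorker(masterUrl)
      }
  }
}
```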