@liyezhang556520
Contributor

When the Master Akka actor encounters an exception, the Master restarts (an Akka actor restart, not a JVM restart), as in the case of SPARK-4989, and all of the master's old state (workers, applications, etc.) is cleared. However, the workers are not aware of this at all. The cluster then ends up in a state where the master is up and all workers are up, but the master does not know the workers exist and ignores all of their heartbeats because none of them are registered. As a result, the whole cluster is unavailable.

In this PR, the master tells the worker that the connection has been lost, so that the worker will re-register with the master.
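
A minimal sketch of the idea, assuming a simplified master actor; apart from MasterDisconnected (the message this patch adds), the names below are illustrative rather than the actual Spark code:

```scala
import akka.actor.Actor

import scala.collection.mutable

case class Heartbeat(workerId: String)
case class MasterDisconnected(masterUrl: String)

class MasterSketch(masterUrl: String) extends Actor {
  // Worker ids this (possibly freshly restarted) master currently knows about.
  private val knownWorkers = mutable.Set[String]()

  def receive: Receive = {
    case Heartbeat(workerId) =>
      if (!knownWorkers.contains(workerId)) {
        // The worker is not registered (e.g. the master restarted and lost its
        // state). Instead of silently dropping the heartbeat, tell the worker
        // so it can re-register with this master.
        sender() ! MasterDisconnected(masterUrl)
      }
  }
}
```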

@SparkQA

SparkQA commented Dec 29, 2014

Test build #24859 has started for PR 3825 at commit e9c99e3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 29, 2014

Test build #24859 has finished for PR 3825 at commit e9c99e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class MasterDisconnected(masterUrl: String) extends DeployMessage

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24859/

@JoshRosen
Contributor

Wouldn't it be better to ensure that actors like Master and DAGScheduler never die due to uncaught exceptions?

@JoshRosen
Contributor

More specifically, I guess I'm suggesting that we wrap the receive and receiveWithLogging methods of our actors in try-catch blocks to log any exceptions that bubble up to the top of the actors.
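
For illustration, a minimal sketch of that suggestion, assuming a plain Akka actor; the trait and method names here are hypothetical, not the wrapper Spark actually uses:

```scala
import akka.actor.{Actor, ActorLogging}

trait CatchAllReceive extends Actor with ActorLogging {
  // The actor's real message handler, provided by the concrete actor.
  protected def receiveWithLogging: Receive

  // Wrap it so an uncaught exception is logged rather than escalated to the
  // supervisor, whose default strategy would restart the actor and clear its
  // accumulated state.
  final def receive: Receive = {
    case msg =>
      try {
        if (receiveWithLogging.isDefinedAt(msg)) receiveWithLogging(msg)
        else unhandled(msg)
      } catch {
        case e: Exception => log.error(e, s"Error handling message: $msg")
      }
  }
}
```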

@liyezhang556520
Contributor Author

Hi @JoshRosen, it'd be a little weird to wrap the receive or receiveWithLogging method in try-catch blocks, and it also conflicts with Akka's fault-tolerance design, which expects exceptions to be handled by a supervisor. If you think it's fine to add a try-catch block to the receive method, then I'll make an additional commit.

@liyezhang556520
Contributor Author

I think a better way is to use Akka's persistence feature to recover the actor's state when the actor restarts.

Anyway, this PR still has value: when an unregistered worker sends a heartbeat to the master, the master should not just ignore it; it should at least tell the worker that it has been disconnected. What do you think @JoshRosen ?
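
For illustration, a hedged sketch of the worker side of that behaviour; apart from MasterDisconnected, the message and field names are hypothetical and the real Worker does much more than this:

```scala
import akka.actor.{Actor, ActorSelection}

case class MasterDisconnected(masterUrl: String)
case class RegisterWorker(workerId: String)

class WorkerSketch(workerId: String, master: ActorSelection) extends Actor {
  def receive: Receive = {
    case MasterDisconnected(_) =>
      // The master no longer knows about this worker (e.g. it restarted and
      // lost its state), so register again rather than heartbeating forever.
      master ! RegisterWorker(workerId)
  }
}
```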

@JoshRosen
Contributor

I'd rather use fewer Akka features than more, since this will make it easier to replace Akka with our own RPC layer in the future. Therefore, I'd much prefer to not allow exceptions to trigger actor restarts / state clearing. I think that adding an experimental Akka feature like persistence would be a huge risk for little obvious gain.

I'm not sure the "heartbeat from unknown worker" case can ever occur if we don't clear the master's state, because I think workers only begin sending heartbeats once a master has ack'd their registration, in which case the master would know that it was a previously-registered worker and instruct it to reconnect.

@markhamstra
Contributor

It doesn't seem to me that usage of the newer Akka persistence API is called for, but it does seem that wrapping the receive in a try-catch is trying to do the job for which Akka's SupervisorStrategy is intended. I can't recommend the hand-rolled try-catch approach.

http://doc.akka.io/docs/akka/2.3.4/general/supervision.html#supervision

@JoshRosen
Contributor

@markhamstra From that page:

Depending on the nature of the work to be supervised and the nature of the failure, the supervisor has a choice of the following four options:

  1. Resume the subordinate, keeping its accumulated internal state
  2. Restart the subordinate, clearing out its accumulated internal state
  3. Stop the subordinate permanently
  4. Escalate the failure, thereby failing itself

It sounds like we want approach 1, resuming the subordinate without losing its state, so I'd be in favor of reworking this PR to use that type of supervision strategy. I don't think that extending our actors to support restart necessarily makes sense, since it adds a lot of complexity (for instance, I don't think that this PR handles loss of the appInfo or driverInfo data structures).
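
For illustration, a hedged sketch of what option 1 could look like: a parent actor that supervises the Master and resumes it on failure. The actor and strategy details here are hypothetical, not an actual Spark change:

```scala
import akka.actor.{Actor, ActorRef, OneForOneStrategy, Props, SupervisorStrategy}
import akka.actor.SupervisorStrategy.Resume

class MasterSupervisor(masterProps: Props) extends Actor {
  // Resume the child on any exception so its accumulated state (workers,
  // applications, drivers) is kept; the default strategy would restart the
  // child and wipe that state.
  override val supervisorStrategy: SupervisorStrategy = OneForOneStrategy() {
    case _: Exception => Resume
  }

  private val master: ActorRef = context.actorOf(masterProps, "Master")

  // Forward all other traffic to the supervised child.
  def receive: Receive = {
    case msg => master forward msg
  }
}
```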

@markhamstra
Contributor

@JoshRosen your thinking is that the Master will still be in good shape even though an exception has been thrown? If you can guarantee that, then resuming the actor while keeping the accumulated state should do the job. Otherwise, things get more complicated. Within the lengthy process of handling exceptions thrown within the DAGScheduler (#186), we ended up taking the conservative approach of restarting the whole system rather than trying to restart the DAGScheduler actor with fixed or reconstructed state. I haven't dug into the details of this PR yet, so I can't say for certain, but there are probably lessons to be learned from that epic DAGScheduler PR.

Something else that we'll need to consider at some point if other actors start requiring supervision strategies other than the default is what the overall structure of the supervision hierarchy should be. Right now, only the DAGScheduler has another level of supervision, but perhaps Spark actors from outside the DAGScheduler should also be handled under one or more levels of common supervision.

@liyezhang556520
Contributor Author

@JoshRosen, if we want to use the supervision mechanism, we need to add another actor level as the parent of the current Master actor. I'm not sure whether that is suitable.

@markhamstra
Contributor

@liyezhang556520 That's been done already in the DAGScheduler. If we need another level of supervision for Master or other actors, we should consider whether these actors need separate Supervisors or whether they can be combined in the supervision hierarchy.

@liyezhang556520
Contributor Author

@markhamstra, thanks for the reminder. I'll update this PR and try to introduce the supervision.

@srowen
Member

srowen commented May 18, 2015

I think this has timed out. Would you close this PR for now?
