@liyezhang556520
Contributor

When the Master Akka actor encounters an exception, the Master restarts (an Akka actor restart, not a JVM restart), as in the case of SPARK-4989, and all of the master's old state (workers, applications, etc.) is cleared. However, the workers are not aware of this at all. The cluster then ends up in a state where the master is up and all workers are up, but the master does not know the workers exist and ignores all of their heartbeats because none of them are registered. As a result, the whole cluster is unavailable.

In this PR, the master tells the worker that the connection has been lost, so that the worker will re-register with the master.
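
A minimal sketch of the idea, assuming a simplified master actor; apart from MasterDisconnected (the message this patch adds), the names below are illustrative rather than the actual Spark code:

```scala
import akka.actor.Actor

import scala.collection.mutable

case class Heartbeat(workerId: String)
case class MasterDisconnected(masterUrl: String)

class MasterSketch(masterUrl: String) extends Actor {
  // Worker ids this (possibly freshly restarted) master currently knows about.
  private val knownWorkers = mutable.Set[String]()

  def receive: Receive = {
    case Heartbeat(workerId) =>
      if (!knownWorkers.contains(workerId)) {
        // The worker is not registered (e.g. the master restarted and lost its
        // state). Instead of silently dropping the heartbeat, tell the worker
        // so it can re-register with this master.
        sender() ! MasterDisconnected(masterUrl)
      }
  }
}
```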

@SparkQA

SparkQA commented Dec 29, 2014

Test build #24859 has started for PR 3825 at commit e9c99e3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 29, 2014

Test build #24859 has finished for PR 3825 at commit e9c99e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class MasterDisconnected(masterUrl: String) extends DeployMessage

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24859/

@JoshRosen
Contributor

Wouldn't it be better to ensure that actors like Master and DAGScheduler never die due to uncaught exceptions?

@JoshRosen
Contributor

More specifically, I guess I'm suggesting that we wrap the receive and receiveWithLogging methods of our actors in try-catch blocks to log any exceptions that bubble up to the top of the actors.
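
For illustration, a minimal sketch of that suggestion, assuming a plain Akka actor; the trait and method names here are hypothetical, not the wrapper Spark actually uses:

```scala
import akka.actor.{Actor, ActorLogging}

trait CatchAllReceive extends Actor with ActorLogging {
  // The actor's real message handler, provided by the concrete actor.
  protected def receiveWithLogging: Receive

  // Wrap it so an uncaught exception is logged rather than escalated to the
  // supervisor, whose default strategy would restart the actor and clear its
  // accumulated state.
  final def receive: Receive = {
    case msg =>
      try {
        if (receiveWithLogging.isDefinedAt(msg)) receiveWithLogging(msg)
        else unhandled(msg)
      } catch {
        case e: Exception => log.error(e, s"Error handling message: $msg")
      }
  }
}
```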

@liyezhang556520
Contributor Author

Hi @JoshRosen, it'd be a little weird to wrap the receive or receiveWithLogging method in try-catch blocks, and it also conflicts with Akka's fault-tolerance design, which expects exceptions to be handled by a supervisor. If you think it's fine to add a try-catch block to the receive method, then I'll make an additional commit.

@liyezhang556520
Contributor Author

I think a better way is to use Akka's persistence feature to recover the actor's state when the actor restarts.

Anyway, this PR still has value: when an unregistered worker sends a heartbeat to the master, the master should not just ignore it; it should at least tell the worker that it has been disconnected. What do you think @JoshRosen ?
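
For illustration, a hedged sketch of the worker side of that behaviour; apart from MasterDisconnected, the message and field names are hypothetical and the real Worker does much more than this:

```scala
import akka.actor.{Actor, ActorSelection}

case class MasterDisconnected(masterUrl: String)
case class RegisterWorker(workerId: String)

class WorkerSketch(workerId: String, master: ActorSelection) extends Actor {
  def receive: Receive = {
    case MasterDisconnected(_) =>
      // The master no longer knows about this worker (e.g. it restarted and
      // lost its state), so register again rather than heartbeating forever.
      master ! RegisterWorker(workerId)
  }
}
```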

@JoshRosen
Contributor

I'd rather use fewer Akka features than more, since this will make it easier to replace Akka with our own RPC layer in the future. Therefore, I'd much prefer to not allow exceptions to trigger actor restarts / state clearing. I think that adding an experimental Akka feature like persistence would be a huge risk for little obvious gain.

I'm not sure the "heartbeat from unknown worker" case can ever occur if we don't clear the master's state, because I think workers only begin sending heartbeats once a master has ack'd their registration, in which case the master would know that it was a previously-registered worker and instruct it to reconnect.

@markhamstra
Contributor

It doesn't seem to me that usage of the newer Akka persistence API is called for, but it does seem that wrapping the receive in a try-catch is trying to do the job for which Akka's SupervisorStrategy is intended. I can't recommend the hand-rolled try-catch approach.

http://doc.akka.io/docs/akka/2.3.4/general/supervision.html#supervision

@JoshRosen
Contributor

@markhamstra From that page:

Depending on the nature of the work to be supervised and the nature of the failure, the supervisor has a choice of the following four options:

  1. Resume the subordinate, keeping its accumulated internal state
  2. Restart the subordinate, clearing out its accumulated internal state
  3. Stop the subordinate permanently
  4. Escalate the failure, thereby failing itself

It sounds like we want approach 1, resuming the subordinate without losing its state, so I'd be in favor of reworking this PR to use that type of supervision strategy. I don't think that extending our actors to support restart necessarily makes sense, since it adds a lot of complexity (for instance, I don't think that this PR handles loss of the appInfo or driverInfo data structures).
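
For illustration, a hedged sketch of what option 1 could look like: a parent actor that supervises the Master and resumes it on failure. The actor and strategy details here are hypothetical, not an actual Spark change:

```scala
import akka.actor.{Actor, ActorRef, OneForOneStrategy, Props, SupervisorStrategy}
import akka.actor.SupervisorStrategy.Resume

class MasterSupervisor(masterProps: Props) extends Actor {
  // Resume the child on any exception so its accumulated state (workers,
  // applications, drivers) is kept; the default strategy would restart the
  // child and wipe that state.
  override val supervisorStrategy: SupervisorStrategy = OneForOneStrategy() {
    case _: Exception => Resume
  }

  private val master: ActorRef = context.actorOf(masterProps, "Master")

  // Forward all other traffic to the supervised child.
  def receive: Receive = {
    case msg => master forward msg
  }
}
```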

@markhamstra
Contributor

@JoshRosen your thinking is that the Master will still be in good shape even though an exception has been thrown? If you can guarantee that, then resuming the actor while keeping the accumulated state should do the job. Otherwise, things get more complicated. Within the lengthy process of handling exceptions thrown within the DAGScheduler (#186), we ended up taking the conservative approach of restarting the whole system rather than trying to restart the DAGScheduler actor with fixed or reconstructed state. I haven't dug into the details of this PR yet, so I can't say for certain, but there are probably lessons to be learned from that epic DAGScheduler PR.

Something else that we'll need to consider at some point if other actors start requiring supervision strategies other than the default is what the overall structure of the supervision hierarchy should be. Right now, only the DAGScheduler has another level of supervision, but perhaps Spark actors from outside the DAGScheduler should also be handled under one or more levels of common supervision.

@liyezhang556520
Contributor Author

@JoshRosen, if we want to use the supervision mechanism, we need to add another actor level as the parent of the current Master actor. I'm not sure whether that is suitable.

@markhamstra
Contributor

@liyezhang556520 That's been done already in the DAGScheduler. If we need another level of supervision for Master or other actors, we should consider whether these actors need separate Supervisors or whether they can be combined in the supervision hierarchy.

@liyezhang556520
Contributor Author

@markhamstra, thanks for the reminder. I'll update this PR and try to introduce the supervision.

@srowen
Member

srowen commented May 18, 2015

I think this has timed out. Would you close this PR for now?
