[SPARK-11131] [core] Fix race in worker registration protocol. #9138
Conversation
Because the registration RPC was not really an RPC, but a bunch of disconnected messages, it was possible for other messages to be sent before the reply to the registration arrived, and that would confuse the Worker. Especially in local-cluster mode, the worker was susceptible to receiving an executor request before it received the message from the master saying registration succeeded. On top of the above, the change also fixes a ClassCastException when the registration fails, which also affects the executor registration protocol. Because the `ask` was issued with a specific return type, if the error message (of a different type) was returned instead, the code would just die with an exception. This is fixed by having a common base trait for these reply messages.
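To illustrate the shape of the fix, here is a minimal sketch, assuming Spark's internal `org.apache.spark.rpc` API; the message and helper names (`RegisterWorker`, `markRegistered`, and so on) are simplified stand-ins, not the exact classes in the patch. Registration becomes a single `ask` typed against a common base trait, so either reply deserializes cleanly, and the worker only starts accepting work once the reply has actually arrived.

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

// Hypothetical request message (simplified).
case class RegisterWorker(id: String, host: String, port: Int)

// Common base trait: ask[RegisterWorkerResponse] can now receive either the
// success or the failure reply without a ClassCastException.
sealed trait RegisterWorkerResponse
case class RegisteredWorker(masterUrl: String) extends RegisterWorkerResponse
case class RegisterWorkerFailed(message: String) extends RegisterWorkerResponse

// Worker side (sketch): registration is one real RPC instead of two
// disconnected one-way messages.
def registerWithMaster(master: RpcEndpointRef): Unit = {
  master.ask[RegisterWorkerResponse](RegisterWorker(workerId, host, port)).onComplete {
    case Success(RegisteredWorker(url))     => markRegistered(url)  // safe to accept LaunchExecutor now
    case Success(RegisterWorkerFailed(msg)) => logError(s"Registration failed: $msg")
    case Failure(e)                         => logError(s"Registration RPC failed: $e")
  }
}
```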
Test build #43800 has finished for PR 9138 at commit
pyspark failure, sigh. Is there a bug tracking these flaky tests? retest this please
Test build #43808 has finished for PR 9138 at commit
/cc @andrewor14 @zsxwing I think you're the people most familiar with this code.
This is kind of scary. We should always send a response; otherwise we might get random future timeout exceptions.
Perhaps, but not really related to the problem at hand.
What I'm saying is that, in general, in receiveAndReply you should always reply, because that's what the caller expects. Otherwise it's really confusing when we get future timeouts, because they're hard to debug.
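A minimal sketch of that discipline, assuming Spark's internal `RpcCallContext` API and reusing the hypothetical message types from the sketch above: every branch of `receiveAndReply` calls `context.reply`, including the failure branches, so the caller's `ask` future always completes instead of timing out.

```scala
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case RegisterWorker(id, host, port) =>
    if (state == RecoveryState.STANDBY) {
      context.reply(RegisterWorkerFailed("Master is in standby"))  // reply, don't drop
    } else if (idToWorker.contains(id)) {
      context.reply(RegisterWorkerFailed("Duplicate worker ID"))   // reply on failure too
    } else {
      registerWorker(id, host, port)
      context.reply(RegisteredWorker(masterUrl))
    }
}
```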
I agree with you, I'm just saying that's a bug in itself, and I'd rather not make fixing it part of this change, because it might affect other parts of the code where the workers retry connections to different masters.
Wait, actually that's a problem. Right now we have a send and an ask. The send actually won't be received by anyone, because we only handle this message in receiveAndReply.
I might have missed a send. But the lack of a reply here is not necessarily a bug, and is definitely not related to this one.
Actually, nevermind. I see what you mean. I'm introducing the need to reply by switching to ask. Let me take a look...
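For readers following along, the `send`/`ask` distinction being worked out here, sketched against Spark's internal `RpcEndpoint` API (handler bodies and message names are illustrative): messages delivered with `send` are dispatched to `receive`, while messages delivered with `ask` are dispatched to `receiveAndReply`; a message that only has a match in the other handler is effectively dropped.

```scala
override def receive: PartialFunction[Any, Unit] = {
  // Messages delivered via endpointRef.send(...) land here. There is no
  // context to reply through, so these must be fire-and-forget.
  case Heartbeat(workerId) =>
    updateLastHeartbeat(workerId)
}

override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  // Messages delivered via endpointRef.ask(...) land here. The caller holds a
  // Future that only completes once context.reply(...) is called.
  case msg: RegisterWorker =>
    context.reply(handleRegistration(msg))
}
```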
@vanzin just so I understand, it's these two lines that are racing, right?
@andrewor14 correct, that's what causes the race.
Test build #43876 has finished for PR 9138 at commit
Also go back to using case objects since they seem to work; I'll make the change to case classes where needed in the corresponding PR (not yet sent).
Test build #43884 has finished for PR 9138 at commit
LGTM. I'm merging this into master. Thanks @vanzin
@vanzin just found an issue with this change. Now if the master receives … See the log here:
To echo @vanzin on SPARK-12267: the cause of SPARK-12267 is not this PR but #9210.