[SPARK-11131] [core] Fix race in worker registration protocol. #9138
Conversation
Because the registration RPC was not really an RPC, but a bunch of disconnected messages, it was possible for other messages to be sent before the reply to the registration arrived, and that would confuse the Worker. Especially in local-cluster mode, the worker was susceptible to receiving an executor request before it received the message from the master saying registration succeeded. On top of the above, the change also fixes a ClassCastException when the registration fails, which also affects the executor registration protocol. Because the `ask` was issued with a specific return type, if the error message (of a different type) was returned instead, the code would just die with an exception. This is fixed by having a common base trait for these reply messages.
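To illustrate the shape of the fix, here is a minimal sketch, assuming Spark's internal `org.apache.spark.rpc` API; the message and helper names (`RegisterWorker`, `markRegistered`, and so on) are simplified stand-ins, not the exact classes in the patch. Registration becomes a single `ask` typed against a common base trait, so either reply deserializes cleanly, and the worker only starts accepting work once the reply has actually arrived.

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

// Hypothetical request message (simplified).
case class RegisterWorker(id: String, host: String, port: Int)

// Common base trait: ask[RegisterWorkerResponse] can now receive either the
// success or the failure reply without a ClassCastException.
sealed trait RegisterWorkerResponse
case class RegisteredWorker(masterUrl: String) extends RegisterWorkerResponse
case class RegisterWorkerFailed(message: String) extends RegisterWorkerResponse

// Worker side (sketch): registration is one real RPC instead of two
// disconnected one-way messages.
def registerWithMaster(master: RpcEndpointRef): Unit = {
  master.ask[RegisterWorkerResponse](RegisterWorker(workerId, host, port)).onComplete {
    case Success(RegisteredWorker(url))     => markRegistered(url)  // safe to accept LaunchExecutor now
    case Success(RegisterWorkerFailed(msg)) => logError(s"Registration failed: $msg")
    case Failure(e)                         => logError(s"Registration RPC failed: $e")
  }
}
```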
Test build #43800 has finished for PR 9138 at commit
pyspark failure, sigh. Is there a bug tracking these flaky tests? retest this please
Test build #43808 has finished for PR 9138 at commit
/cc @andrewor14 @zsxwing I think you're the people most familiar with this code.
This is kind of scary. We should always send a response; otherwise we might get random future timeout exceptions.
Perhaps, but not really related to the problem at hand.
What I'm saying is that, in general, in receiveAndReply you should always reply, because that's what the caller expects. Otherwise it's really confusing when we get future timeouts, because they're hard to debug.
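A minimal sketch of that discipline, assuming Spark's internal `RpcCallContext` API and reusing the hypothetical message types from the sketch above: every branch of `receiveAndReply` calls `context.reply`, including the failure branches, so the caller's `ask` future always completes instead of timing out.

```scala
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case RegisterWorker(id, host, port) =>
    if (state == RecoveryState.STANDBY) {
      context.reply(RegisterWorkerFailed("Master is in standby"))  // reply, don't drop
    } else if (idToWorker.contains(id)) {
      context.reply(RegisterWorkerFailed("Duplicate worker ID"))   // reply on failure too
    } else {
      registerWorker(id, host, port)
      context.reply(RegisteredWorker(masterUrl))
    }
}
```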
I agree with you, I'm just saying that's a bug in itself, and I'd rather not make fixing it part of this change, because it might affect other parts of the code where the workers retry connections to different masters.
Wait, actually that's a problem. Right now we have a send and an ask. The send actually won't be received by anyone, because we only handle this message in receiveAndReply.
I might have missed a send. But the lack of a reply here is not necessarily a bug, and is definitely not related to this one.
Actually, nevermind. I see what you mean. I'm introducing the need to reply by switching to ask. Let me take a look...
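For readers following along, the `send`/`ask` distinction being worked out here, sketched against Spark's internal `RpcEndpoint` API (handler bodies and message names are illustrative): messages delivered with `send` are dispatched to `receive`, while messages delivered with `ask` are dispatched to `receiveAndReply`; a message that only has a match in the other handler is effectively dropped.

```scala
override def receive: PartialFunction[Any, Unit] = {
  // Messages delivered via endpointRef.send(...) land here. There is no
  // context to reply through, so these must be fire-and-forget.
  case Heartbeat(workerId) =>
    updateLastHeartbeat(workerId)
}

override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  // Messages delivered via endpointRef.ask(...) land here. The caller holds a
  // Future that only completes once context.reply(...) is called.
  case msg: RegisterWorker =>
    context.reply(handleRegistration(msg))
}
```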
@vanzin just so I understand, it's these two lines that are racing, right?
@andrewor14 correct, that's what causes the race.
Test build #43876 has finished for PR 9138 at commit
Also go back to using case objects since they seem to work; I'll make the change to case classes where needed in the corresponding PR (not yet sent).
Test build #43884 has finished for PR 9138 at commit
LGTM. I'm merging this into master. Thanks @vanzin
@vanzin just found an issue with this change. Now if the master receives … See the log here:
To echo @vanzin on SPARK-12267: the cause of SPARK-12267 is not this PR but #9210.