-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-21991][LAUNCHER] Fix race condition in LauncherServer#acceptConnections #19217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
ok to test |
|
Test build #81741 has finished for PR 19217 at commit
|
|
Looks reasonable to me. @vanzin was this OK with you? |
ash211
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting this started @nivox ! I'm seeing it on a cluster too so am interested in getting this merged in to Apache.
| timeout.run(); | ||
| } | ||
| synchronized (clients) { | ||
| clients.add(clientConnection); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are now adding to clients before starting the clientThread instead of after. What's the expected ordering for these two operations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding the connection to clients before starting the thread should be safer since otherwise the thread could have a chance of running and terminating before the connection is added to clients. This would cause the addition of an already terminated connection to the clients list which nobody would ever cleanup (i.e. memory leak).
This situation is really unlikely but still a possibility.
Changing the order of operation shouldn't affect any logic since the clients list is only used for cleanup in the close method.
| }; | ||
| ServerConnection clientConnection = new ServerConnection(client, timeout); | ||
| Thread clientThread = factory.newThread(clientConnection); | ||
| synchronized (timeout) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're no longer synchronizing on the timeout here, but I didn't see anywhere else synchronizing on it either (including ServerConnection). Given it doesn't escape this method, I'm not sure how multiple threads could ever access timeout at once, so it makes sense to remove this synchronization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was exactly my reasoning. I couldn't find any reason why the synchronisation was needed. The removal was only a means to make the code simpler removing the cognitive cost suggested by the presence of a synchronise keyword. The actual fix doesn't depend on it.
| // 0 is used for testing to avoid issues with clock resolution / thread scheduling, | ||
| // and force an immediate timeout. | ||
| if (timeoutMs > 0) { | ||
| timeoutTimer.schedule(timeout, timeoutMs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on not calling getConnectionTimeout() multiple times
|
The code LGTM, but the PR title should describe the fix, not the bug. |
|
How about |
|
That sounds better. |
|
@nivox can you please update the PR title when you get the chance? |
|
Merging to master / 2.2 / 2.1. |
…nnections ## What changes were proposed in this pull request? This patch changes the order in which _acceptConnections_ starts the client thread and schedules the client timeout action ensuring that the latter has been scheduled before the former get a chance to cancel it. ## How was this patch tested? Due to the non-deterministic nature of the patch I wasn't able to add a new test for this issue. Author: Andrea zito <[email protected]> Closes #19217 from nivox/SPARK-21991. (cherry picked from commit 6ea8a56) Signed-off-by: Marcelo Vanzin <[email protected]>
…nnections ## What changes were proposed in this pull request? This patch changes the order in which _acceptConnections_ starts the client thread and schedules the client timeout action ensuring that the latter has been scheduled before the former get a chance to cancel it. ## How was this patch tested? Due to the non-deterministic nature of the patch I wasn't able to add a new test for this issue. Author: Andrea zito <[email protected]> Closes #19217 from nivox/SPARK-21991. (cherry picked from commit 6ea8a56) Signed-off-by: Marcelo Vanzin <[email protected]>
…nnections ## What changes were proposed in this pull request? This patch changes the order in which _acceptConnections_ starts the client thread and schedules the client timeout action ensuring that the latter has been scheduled before the former get a chance to cancel it. ## How was this patch tested? Due to the non-deterministic nature of the patch I wasn't able to add a new test for this issue. Author: Andrea zito <[email protected]> Closes #19217 from nivox/SPARK-21991. (cherry picked from commit 6ea8a56)
|
Merged to 2.0 also. I have a feeling that maven builds might fail because I noticed some trailing whitespace when looking at the raw patch... |
will send a followup PR |
…nnections ## What changes were proposed in this pull request? This patch changes the order in which _acceptConnections_ starts the client thread and schedules the client timeout action ensuring that the latter has been scheduled before the former get a chance to cancel it. ## How was this patch tested? Due to the non-deterministic nature of the patch I wasn't able to add a new test for this issue. Author: Andrea zito <[email protected]> Closes apache#19217 from nivox/SPARK-21991. (cherry picked from commit 6ea8a56) Signed-off-by: Marcelo Vanzin <[email protected]>
What changes were proposed in this pull request?
This patch changes the order in which acceptConnections starts the client thread and schedules the client timeout action ensuring that the latter has been scheduled before the former get a chance to cancel it.
How was this patch tested?
Due to the non-deterministic nature of the patch I wasn't able to add a new test for this issue.