[SPARK-23020][core] Fix another race in the in-process launcher test. #20462
First the bad news: there's an unfixable race in the launcher code. (By unfixable I mean it would take a lot more effort than this change to fix it.) The good news is that it should only affect super short-lived applications, such as the one run by the flaky test, so it's possible to work around it in our test.

The fix also uncovered an issue with the recently added "closeAndWait()" method; closing the connection could still cause data loss, so this change waits a while for the connection to finish by itself, and closes the socket if that times out. The existing connection timeout is reused so that, if desired, it's possible to control how long to wait.

As part of that I also restored the old behavior that disconnect() would force a disconnection from the child app; the "wait for data to arrive" approach is only taken when disposing of the handle.

I tested this by inserting a bunch of sleeps in the test and the socket handling code in the launcher library; with those I was able to reproduce the error from the Jenkins jobs. With the changes, even with all the sleeps still in place, all tests pass.
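For readers who want a concrete picture of the "wait a while, then close the socket on timeout" behavior described above, here is a minimal sketch. It is not the code from this change: the class `WaitThenCloseSketch` and its methods are hypothetical names chosen for illustration, and the connection is modeled with a pair of flags instead of a real socket.

```java
/**
 * Hypothetical stand-in for a launcher connection. The class and method names
 * are assumptions made for this sketch, not Spark's actual implementation.
 */
final class WaitThenCloseSketch {
  private boolean closed = false;
  private boolean closedByRemote = false;

  /** Simulates the remote (child app) side closing the connection on its own. */
  synchronized void remoteClosed() {
    if (!closed) {
      closedByRemote = true;
      closed = true;
    }
    notifyAll();
  }

  /** Forces the local side closed. */
  synchronized void close() {
    closed = true;
    notifyAll();
  }

  synchronized boolean isClosed() { return closed; }

  synchronized boolean wasClosedByRemote() { return closedByRemote; }

  /** Waits up to timeoutMs for the remote side to close; forces a close on timeout. */
  synchronized void waitForClose(long timeoutMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    try {
      while (!closed) {
        long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
        if (remainingMs <= 0) {
          break; // timed out waiting for the remote side
        }
        wait(remainingMs);
      }
    } catch (InterruptedException e) {
      // Preserve the interrupt status and fall through to the forced close.
      Thread.currentThread().interrupt();
    }
    if (!closed) {
      close();
    }
  }

  public static void main(String[] args) {
    WaitThenCloseSketch conn = new WaitThenCloseSketch();
    new Thread(conn::remoteClosed).start();
    conn.waitForClose(5_000);
    System.out.println("closed by remote: " + conn.wasClosedByRemote());
  }
}
```

The point of the sketch is the ordering: the local side only closes the connection itself if the remote side has not done so before the timeout expires.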
@cloud-fan @sameeragarwal hopefully the last one.
Test build #86893 has finished for PR 20462 at commit
  @Override
- public synchronized void addListener(Listener l) {
+ public void addListener(Listener l) {
why remove synchronized here?
I should add this one back.
  }

- disconnect();
+ dispose();
Can we add more documentation to disconnect and dispose, so that people can clearly understand the difference between them and better understand changes like this?
disconnect() is actually a public method and already documented in the SparkAppHandle interface.
Sorry, I'm still not able to figure out the difference between them after reading the doc. Do you mind leaving a short description here? Thanks!
I updated the documentation for dispose.
  }

  /**
   * Close the connection and wait for any buffered data to be processed before returning.
According to the doc, should we still call it closeAndWait?
That wouldn't be accurate anymore, because the wait happens first now. waitAndClose() is an option but also not totally accurate. Open to suggestions.
Shall we update the doc?
Hmm, I think I may have thought of a somewhat simple way to fix the race without needing the workaround in the test. Let me try that. If that doesn't work I'll update the javadoc.
My idea was getting too complicated for a fix to a rare race, so I'll just update the doc here and leave that race for another time.
  }

- disconnect();
+ dispose();
If we call disconnect here, we close the connection and then wait for the close to finish in dispose. If we call dispose directly, we also close the connection and wait (in waitForClose). What is the actual difference here?
The order in which the connection is closed. waitForClose will wait for the connection to be closed by the remote side (the finished app) before closing it itself, like disconnect does.
ah i see
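To make the ordering difference from this thread concrete, here is a deliberately simplified, single-threaded model. Everything in it is an assumption for illustration: `HandleOrderSketch`, `childSends`, and the in-memory queue are hypothetical and stand in for the real socket and handle classes, and the timeout that bounds the wait in the actual change is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/**
 * Hypothetical model of an app handle. The method names mirror the discussion
 * above; the behavior is a simplification, not Spark's implementation.
 */
final class HandleOrderSketch {
  // Messages the child app has sent that the launcher has not processed yet.
  private final Queue<String> inFlight = new ArrayDeque<>();
  // Messages the launcher actually processed.
  private final List<String> received = new ArrayList<>();
  private boolean open = true;

  /** Simulates the child app writing a message to the connection. */
  void childSends(String msg) {
    if (open) {
      inFlight.add(msg);
    }
  }

  /** Closes the local side right away; anything still in flight is lost. */
  void disconnect() {
    open = false;
    inFlight.clear();
  }

  /** Lets pending data from the child arrive first, then closes the local side. */
  void dispose() {
    while (!inFlight.isEmpty()) {
      received.add(inFlight.poll());
    }
    open = false;
  }

  List<String> received() {
    return received;
  }

  public static void main(String[] args) {
    HandleOrderSketch forced = new HandleOrderSketch();
    forced.childSends("FINISHED");
    forced.disconnect();

    HandleOrderSketch graceful = new HandleOrderSketch();
    graceful.childSends("FINISHED");
    graceful.dispose();

    System.out.println("after disconnect(): " + forced.received());
    System.out.println("after dispose():    " + graceful.received());
  }
}
```

In this model, a final state update sent just before the child exits survives only when the local close happens after the pending data has been read, which is the data-loss concern raised in the description.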
Test build #86904 has finished for PR 20462 at commit
Test build #86942 has finished for PR 20462 at commit
LGTM, merging to master! Let's see how it goes. If it's good, we can backport it to 2.3, thanks!
LGTM. The fix here makes sense to me, and I see how it breaks the test. I'm just wondering, do we need to document this at all for users, e.g. just clearly describe it in JIRA? I realize most users will never hit this, as it's only super short apps, but just say that it's possible for very short apps to never enter the FINISHED state and instead go to LOST, even though the app finished successfully?
Is there a release notes / ki kinda thing for Spark releases? We can easily put a comment in the attached bug, but not sure how visible that is.
bq. Is there a release notes / ki kinda thing for Spark releases?
Not that I know of -- I was just thinking of putting it in the JIRA; I think that is the best thing users have to search. I know it's not great, but it's something. The current bug description doesn't hint at this at all.
I'll just file a new bug to track a possible future fix for the race, and that can serve as documentation, I guess.
Author: Marcelo Vanzin <[email protected]>
Closes apache#20462 from vanzin/SPARK-23020.
(cherry picked from commit 969eda4)