Conversation

@vanzin (Contributor) commented Jan 31, 2018

First the bad news: there's an unfixable race in the launcher code.
(By unfixable I mean it would take a lot more effort than this change
to fix it.) The good news is that it should only affect super short
lived applications, such as the one run by the flaky test, so it's
possible to work around it in our test.

The fix also uncovered an issue with the recently added "closeAndWait()"
method; closing the connection would still possibly cause data loss,
so this change waits a while for the connection to finish itself, and
closes the socket if that times out. The existing connection timeout
is reused so that if desired it's possible to control how long to wait.

As part of that I also restored the old behavior that disconnect() would
force a disconnection from the child app; the "wait for data to arrive"
approach is only taken when disposing of the handle.

I tested this by inserting a bunch of sleeps in the test and the socket
handling code in the launcher library; with those I was able to reproduce
the error from the Jenkins jobs. With the changes, even with all the
sleeps still in place, all tests pass.
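To make the wait-then-close behavior concrete, here is a minimal sketch of the pattern described above: wait up to the (reused) connection timeout for the remote side to close the connection, and only force the socket shut if that times out. All names below are illustrative, not the actual launcher library classes.

// Minimal sketch only -- not the actual Spark launcher code. It assumes a
// reader thread that calls markClosed() once the remote side hangs up, and a
// hypothetical connectionTimeoutMs field standing in for the reused
// connection timeout mentioned above.
import java.io.IOException;
import java.net.Socket;

class ConnectionSketch {

  private final Socket socket;
  private final long connectionTimeoutMs;
  private boolean closed = false;

  ConnectionSketch(Socket socket, long connectionTimeoutMs) {
    this.socket = socket;
    this.connectionTimeoutMs = connectionTimeoutMs;
  }

  /** Called by the reader thread when the remote side closes its end. */
  synchronized void markClosed() {
    closed = true;
    notifyAll();
  }

  /**
   * Wait up to the configured timeout for the finished app to close the
   * connection (so no buffered state updates are lost); if the timeout
   * expires, force the socket closed.
   */
  synchronized void waitForClose() throws IOException {
    long deadline = System.currentTimeMillis() + connectionTimeoutMs;
    while (!closed) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        break;
      }
      try {
        wait(remaining);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        break;
      }
    }
    if (!closed) {
      socket.close();  // give up waiting and force the disconnection
    }
  }
}

The forced close is only a fallback: in the normal case the finished app closes its end first, so no buffered state updates are dropped.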

@vanzin (Contributor Author) commented Jan 31, 2018

@cloud-fan @sameeragarwal hopefully the last one.

@SparkQA commented Feb 1, 2018

Test build #86893 has finished for PR 20462 at commit daa5b70.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@Override
- public synchronized void addListener(Listener l) {
+ public void addListener(Listener l) {
Contributor:

why remove synchronized here?

Contributor Author:

I should add this one back.
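For context, a minimal sketch (illustrative names only, not the real launcher classes) of why listener registration and the state-change path need to share the handle's lock:

// Minimal sketch only -- illustrative names, not the real launcher classes.
// It shows why addListener() and the state-change path need to share a lock:
// a listener registered concurrently with a state change should either get
// the callback or observe the final state afterwards, never miss both.
import java.util.ArrayList;
import java.util.List;

class HandleSketch {

  interface Listener {
    void stateChanged(String newState);
  }

  private final List<Listener> listeners = new ArrayList<>();
  private String state = "UNKNOWN";

  public synchronized void addListener(Listener l) {
    listeners.add(l);
  }

  public synchronized String getState() {
    return state;
  }

  synchronized void setState(String newState) {
    state = newState;
    for (Listener l : listeners) {
      l.stateChanged(newState);
    }
  }
}

Dropping synchronized here would also allow the listener list to be modified while setState() iterates over it.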

}

- disconnect();
+ dispose();
Contributor:

Can we add more documentation to disconnect and dispose, so that people can clearly understand the difference between them and better understand changes like this?

Contributor Author:

disconnect() is actually a public method and already documented in the SparkAppHandle interface.

Contributor:

Sorry, I'm still not able to figure out the difference between them after reading the doc. Do you mind leaving a short description here? Thanks!

Contributor Author:

I updated the documentation for dispose.

}

/**
* Close the connection and wait for any buffered data to be processed before returning.
Contributor:

according to the document, shall we still call it closeAndWait?

Contributor Author:

That wouldn't be accurate anymore, because the wait happens first now. waitAndClose() is an option but also not totally accurate. Open to suggestions.

Contributor:

Shall we update the documentation?

Contributor Author:

Hmm, I think I may have thought of a somewhat simple way to fix the race without needing the workaround in the test. Let me try that. If that doesn't work I'll update the javadoc.

Contributor Author (@vanzin, Feb 1, 2018):

My idea was getting too complicated for a fix to a rare race, so I'll just update the doc here and leave that race for another time.

}

- disconnect();
+ dispose();
Contributor:

If we call disconnect here, we close the connection and then wait for the close to finish in dispose. If we call dispose directly, we also close and wait for the connection (in waitForClose). What's the actual difference here?

Contributor Author:

The order in which the connection is closed. waitForClose() will wait for the connection to be closed by the remote side (the finished app) before closing it locally, instead of forcing the close right away like disconnect() does.
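In other words, the difference is only in which side initiates the close; a hedged sketch with hypothetical names (not the real launcher code):

// Hypothetical names; only illustrates the ordering difference discussed above.
import java.io.IOException;

class DisposeVsDisconnectSketch {

  /** Minimal stand-in for the launcher connection. */
  interface Connection {
    void forceClose() throws IOException;   // close the socket right away
    void waitForClose() throws IOException; // wait for the remote side, then close
  }

  private final Connection connection;

  DisposeVsDisconnectSketch(Connection connection) {
    this.connection = connection;
  }

  /** disconnect(): force the disconnection; late data from the child app may be dropped. */
  public void disconnect() throws IOException {
    connection.forceClose();
  }

  /** dispose(): let the finished app close its end first, forcing the close only on timeout. */
  void dispose() throws IOException {
    connection.waitForClose();
  }
}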

Contributor:

ah i see

@SparkQA commented Feb 1, 2018

Test build #86904 has finished for PR 20462 at commit b967775.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 1, 2018

Test build #86942 has finished for PR 20462 at commit 82c276f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

LGTM, merging to master! Let's see how it goes. If it's good, we can backport it to 2.3, thanks!

@asfgit asfgit closed this in 969eda4 Feb 2, 2018
@squito (Contributor) commented Feb 2, 2018

lgtm

The fix here makes sense to me; I see how the race breaks the test. I'm just wondering: do we need to document this at all for users, e.g. just clearly describe it in JIRA? I realize most users will never hit this, as it only affects super-short apps, but we could note that for very short apps it's possible they never enter the FINISHED state and instead go to LOST, even though the app finished successfully.

@vanzin (Contributor Author) commented Feb 2, 2018

do we need to doc this at all for users

Is there a release notes / known-issues kind of thing for Spark releases?

We can easily put a comment in the attached bug, but not sure how visible that is.

@squito (Contributor) commented Feb 2, 2018

Is there a release notes / known-issues kind of thing for Spark releases?

Not that I know of -- I was just thinking of putting it in the JIRA; I think that's the best thing users have to search. I know it's not great, but it's something. The current bug description doesn't hint at this at all.

@vanzin (Contributor Author) commented Feb 2, 2018

I'll just file a new bug to track a possible future fix for the race, and that can serve as documentation, I guess.

@vanzin vanzin deleted the SPARK-23020 branch February 6, 2018 18:22
vanzin pushed a commit to vanzin/spark that referenced this pull request Mar 6, 2018

Author: Marcelo Vanzin <[email protected]>

Closes apache#20462 from vanzin/SPARK-23020.

(cherry picked from commit 969eda4)