Conversation

@vanzin (Contributor) commented Jan 31, 2018

First the bad news: there's an unfixable race in the launcher code.
(By unfixable I mean it would take a lot more effort than this change
to fix it.) The good news is that it should only affect super short
lived applications, such as the one run by the flaky test, so it's
possible to work around it in our test.

The fix also uncovered an issue with the recently added "closeAndWait()"
method; closing the connection would still possibly cause data loss,
so this change waits a while for the connection to finish itself, and
closes the socket if that times out. The existing connection timeout
is reused so that if desired it's possible to control how long to wait.

As part of that I also restored the old behavior that disconnect() would
force a disconnection from the child app; the "wait for data to arrive"
approach is only taken when disposing of the handle.

I tested this by inserting a bunch of sleeps in the test and the socket
handling code in the launcher library; with those I was able to reproduce
the error from the Jenkins jobs. With the changes, even with all the
sleeps still in place, all tests pass.
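To make the wait-then-close behavior concrete, here is a minimal sketch of the pattern described above: wait up to the (reused) connection timeout for the remote side to close the connection, and only force the socket shut if that times out. All names below are illustrative, not the actual launcher library classes.

// Minimal sketch only -- not the actual Spark launcher code. It assumes a
// reader thread that calls markClosed() once the remote side hangs up, and a
// hypothetical connectionTimeoutMs field standing in for the reused
// connection timeout mentioned above.
import java.io.IOException;
import java.net.Socket;

class ConnectionSketch {

  private final Socket socket;
  private final long connectionTimeoutMs;
  private boolean closed = false;

  ConnectionSketch(Socket socket, long connectionTimeoutMs) {
    this.socket = socket;
    this.connectionTimeoutMs = connectionTimeoutMs;
  }

  /** Called by the reader thread when the remote side closes its end. */
  synchronized void markClosed() {
    closed = true;
    notifyAll();
  }

  /**
   * Wait up to the configured timeout for the finished app to close the
   * connection (so no buffered state updates are lost); if the timeout
   * expires, force the socket closed.
   */
  synchronized void waitForClose() throws IOException {
    long deadline = System.currentTimeMillis() + connectionTimeoutMs;
    while (!closed) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        break;
      }
      try {
        wait(remaining);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        break;
      }
    }
    if (!closed) {
      socket.close();  // give up waiting and force the disconnection
    }
  }
}

The forced close is only a fallback: in the normal case the finished app closes its end first, so no buffered state updates are dropped.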

@vanzin (Contributor Author) commented Jan 31, 2018

@cloud-fan @sameeragarwal hopefully the last one.

@SparkQA commented Feb 1, 2018

Test build #86893 has finished for PR 20462 at commit daa5b70.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@Override
- public synchronized void addListener(Listener l) {
+ public void addListener(Listener l) {
Contributor:

why remove synchronized here?

Contributor Author:

I should add this one back.
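For context, a minimal sketch (illustrative names only, not the real launcher classes) of why listener registration and the state-change path need to share the handle's lock:

// Minimal sketch only -- illustrative names, not the real launcher classes.
// It shows why addListener() and the state-change path need to share a lock:
// a listener registered concurrently with a state change should either get
// the callback or observe the final state afterwards, never miss both.
import java.util.ArrayList;
import java.util.List;

class HandleSketch {

  interface Listener {
    void stateChanged(String newState);
  }

  private final List<Listener> listeners = new ArrayList<>();
  private String state = "UNKNOWN";

  public synchronized void addListener(Listener l) {
    listeners.add(l);
  }

  public synchronized String getState() {
    return state;
  }

  synchronized void setState(String newState) {
    state = newState;
    for (Listener l : listeners) {
      l.stateChanged(newState);
    }
  }
}

Dropping synchronized here would also allow the listener list to be modified while setState() iterates over it.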

}

- disconnect();
+ dispose();
Contributor:

Can we add more documentation to disconnect and dispose, so that people can clearly understand the difference between them and better understand changes like this?

Contributor Author:

disconnect() is actually a public method and already documented in the SparkAppHandle interface.

Contributor:

Sorry, I'm still not able to figure out the difference between them after reading the doc. Do you mind leaving a short description here? Thanks!

Contributor Author:

I updated the documentation for dispose.

}

/**
* Close the connection and wait for any buffered data to be processed before returning.
Contributor:

according to the document, shall we still call it closeAndWait?

Contributor Author:

That wouldn't be accurate anymore, because the wait happens first now. waitAndClose() is an option but also not totally accurate. Open to suggestions.

Contributor:

Shall we update the documentation?

Contributor Author:

Hmm, I think I may have thought of a somewhat simple way to fix the race without needing the workaround in the test. Let me try that. If that doesn't work I'll update the javadoc.

Contributor Author (@vanzin, Feb 1, 2018):

My idea was getting too complicated for a fix to a rare race, so I'll just update the doc here and leave that race for another time.

}

- disconnect();
+ dispose();
Contributor:

If we call disconnect here, we close the connection and then wait for the close to finish in dispose. If we call dispose directly, we also close and wait for the connection (in waitForClose). What's the actual difference here?

Contributor Author:

The order in which the connection is closed. waitForClose() will wait for the connection to be closed by the remote side (the finished app) before closing it locally, instead of forcing the close right away like disconnect() does.
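In other words, the difference is only in which side initiates the close; a hedged sketch with hypothetical names (not the real launcher code):

// Hypothetical names; only illustrates the ordering difference discussed above.
import java.io.IOException;

class DisposeVsDisconnectSketch {

  /** Minimal stand-in for the launcher connection. */
  interface Connection {
    void forceClose() throws IOException;   // close the socket right away
    void waitForClose() throws IOException; // wait for the remote side, then close
  }

  private final Connection connection;

  DisposeVsDisconnectSketch(Connection connection) {
    this.connection = connection;
  }

  /** disconnect(): force the disconnection; late data from the child app may be dropped. */
  public void disconnect() throws IOException {
    connection.forceClose();
  }

  /** dispose(): let the finished app close its end first, forcing the close only on timeout. */
  void dispose() throws IOException {
    connection.waitForClose();
  }
}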

Contributor:

ah i see

@SparkQA commented Feb 1, 2018

Test build #86904 has finished for PR 20462 at commit b967775.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 1, 2018

Test build #86942 has finished for PR 20462 at commit 82c276f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

LGTM, merging to master! Let's see how it goes. If it's good, we can backport it to 2.3, thanks!

@asfgit asfgit closed this in 969eda4 Feb 2, 2018
@squito (Contributor) commented Feb 2, 2018

lgtm

The fix here makes sense to me; I see how the race breaks the test. I'm just wondering: do we need to document this at all for users, e.g. just clearly describe it in JIRA? I realize most users will never hit this, as it only affects super-short apps, but we could note that for very short apps it's possible they never enter the FINISHED state and instead go to LOST, even though the app finished successfully.

@vanzin (Contributor Author) commented Feb 2, 2018

do we need to doc this at all for users

Is there a release notes / known-issues kind of thing for Spark releases?

We can easily put a comment in the attached bug, but not sure how visible that is.

@squito (Contributor) commented Feb 2, 2018

Is there a release notes / known-issues kind of thing for Spark releases?

Not that I know of -- I was just thinking of putting it in the JIRA; I think that's the best thing users have to search. I know it's not great, but it's something. The current bug description doesn't hint at this at all.

@vanzin (Contributor Author) commented Feb 2, 2018

I'll just file a new bug to track a possible future fix for the race, and that can serve as documentation, I guess.

@vanzin vanzin deleted the SPARK-23020 branch February 6, 2018 18:22
vanzin pushed a commit to vanzin/spark that referenced this pull request Mar 6, 2018

Author: Marcelo Vanzin <[email protected]>

Closes apache#20462 from vanzin/SPARK-23020.

(cherry picked from commit 969eda4)