nivox commented Sep 13, 2017

What changes were proposed in this pull request?

This patch changes the order in which acceptConnections starts the client thread and schedules the client timeout action, ensuring that the latter has been scheduled before the former gets a chance to cancel it.
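
A minimal sketch of why the ordering matters, assuming the timeout is a `java.util.TimerTask` scheduled on a `java.util.Timer` (the reviewed snippets suggest this but do not state it). The class and method names are hypothetical and are not taken from the launcher code. `Timer.schedule` is documented to throw `IllegalStateException` when the task was already cancelled, so a client thread that cancels the timeout first could make the scheduling call fail; scheduling first makes the later `cancel()` harmless.

```java
import java.util.Timer;
import java.util.TimerTask;

final class TimeoutOrderingDemo {

  /** Builds a no-op task standing in for the client timeout action (hypothetical). */
  static TimerTask newTimeoutTask() {
    return new TimerTask() {
      @Override
      public void run() {
        // In a real server this is where the idle client would be dropped.
      }
    };
  }

  /** Old ordering, with the client thread winning the race: cancel, then schedule. */
  static boolean scheduleAfterCancelThrows() {
    Timer timer = new Timer("demo-timer", true);
    try {
      TimerTask timeout = newTimeoutTask();
      timeout.cancel();                  // the client thread got there first
      timer.schedule(timeout, 10_000L);  // rejected: the task is already cancelled
      return false;
    } catch (IllegalStateException e) {
      return true;
    } finally {
      timer.cancel();
    }
  }

  /** New ordering: schedule first, so a later cancel() simply unschedules the task. */
  static boolean cancelAfterScheduleIsSafe() {
    Timer timer = new Timer("demo-timer", true);
    try {
      TimerTask timeout = newTimeoutTask();
      timer.schedule(timeout, 10_000L);
      return timeout.cancel();           // true: a pending execution was prevented
    } finally {
      timer.cancel();
    }
  }

  public static void main(String[] args) {
    System.out.println("schedule after cancel throws: " + scheduleAfterCancelThrows());
    System.out.println("cancel after schedule is safe: " + cancelAfterScheduleIsSafe());
  }
}
```

In the sketch the "race" is forced by calling `cancel()` on the same thread; in the server it would only happen when the client thread runs before the accept loop reaches the scheduling call.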

How was this patch tested?

Due to the non-deterministic nature of the race this patch fixes, I wasn't able to add a new test for this issue.

vanzin (Contributor) commented Sep 13, 2017

ok to test

SparkQA commented Sep 14, 2017

Test build #81741 has finished for PR 19217 at commit 72af1aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen (Member) commented Oct 24, 2017

Looks reasonable to me. @vanzin was this OK with you?

ash211 (Contributor) left a comment

Thanks for getting this started, @nivox! I'm seeing it on a cluster too, so I'm interested in getting this merged into Apache.

  timeout.run();
}
synchronized (clients) {
  clients.add(clientConnection);
Contributor

We are now adding to clients before starting the clientThread instead of after. What's the expected ordering for these two operations?

Author

Adding the connection to clients before starting the thread should be safer, since otherwise the thread could run and terminate before the connection is added to clients. That would add an already-terminated connection to the clients list, which nobody would ever clean up (i.e. a memory leak).

This situation is really unlikely but still a possibility.

Changing the order of operations shouldn't affect any logic, since the clients list is only used for cleanup in the close method.
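
The leak described above can be sketched as follows. This is an illustration under assumptions, not the launcher code: `Connection` is a hypothetical stand-in that removes itself from the shared list when it finishes, and the race is made deterministic by joining the thread before registering it.

```java
import java.util.ArrayList;
import java.util.List;

final class ClientRegistrationDemo {

  /** Hypothetical stand-in for a connection that deregisters itself when it ends. */
  static final class Connection implements Runnable {
    private final List<Connection> clients;

    Connection(List<Connection> clients) {
      this.clients = clients;
    }

    @Override
    public void run() {
      synchronized (clients) {
        clients.remove(this);
      }
    }
  }

  /** Runs the connection on its own thread and waits for it to terminate. */
  private static void runToCompletion(Connection connection) {
    Thread thread = new Thread(connection);
    thread.start();
    try {
      thread.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  /** Register first, then start: the connection can always find and remove itself. */
  static int registerThenStart() {
    List<Connection> clients = new ArrayList<>();
    Connection connection = new Connection(clients);
    synchronized (clients) {
      clients.add(connection);
    }
    runToCompletion(connection);
    return clients.size();
  }

  /** Start first, with the thread winning the race: a stale entry is left behind. */
  static int startThenRegister() {
    List<Connection> clients = new ArrayList<>();
    Connection connection = new Connection(clients);
    runToCompletion(connection);   // the thread terminates before registration
    synchronized (clients) {
      clients.add(connection);     // nothing will ever remove this entry
    }
    return clients.size();
  }

  public static void main(String[] args) {
    System.out.println("register then start leaves " + registerThenStart() + " entries");
    System.out.println("start then register leaves " + startThenRegister() + " entries");
  }
}
```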

};
ServerConnection clientConnection = new ServerConnection(client, timeout);
Thread clientThread = factory.newThread(clientConnection);
synchronized (timeout) {
Contributor

We're no longer synchronizing on the timeout here, but I didn't see anywhere else synchronizing on it either (including ServerConnection). Given it doesn't escape this method, I'm not sure how multiple threads could ever access timeout at once, so it makes sense to remove this synchronization.

Author

This was exactly my reasoning: I couldn't find any reason why the synchronisation was needed. Removing it was only meant to simplify the code by eliminating the cognitive cost that the presence of a synchronized block suggests. The actual fix doesn't depend on it.

// 0 is used for testing to avoid issues with clock resolution / thread scheduling,
// and force an immediate timeout.
if (timeoutMs > 0) {
  timeoutTimer.schedule(timeout, timeoutMs);
Contributor

+1 on not calling getConnectionTimeout() multiple times
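
A small sketch of the single-read pattern, assuming the timeout comes from some supplier that could return a different value on each call. `armTimeout` and `SingleTimeoutRead` are hypothetical names introduced for illustration; reading the value once means the `> 0` check and the delay passed to `schedule` are guaranteed to agree.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.function.LongSupplier;

final class SingleTimeoutRead {

  /**
   * Reads the timeout once and uses that same value for both the check and
   * the schedule call. A value of 0 runs the task immediately, mirroring the
   * testing shortcut described in the snippet's comment.
   */
  static long armTimeout(Timer timeoutTimer, TimerTask timeout, LongSupplier connectionTimeout) {
    long timeoutMs = connectionTimeout.getAsLong();
    if (timeoutMs > 0) {
      timeoutTimer.schedule(timeout, timeoutMs);
    } else {
      timeout.run();
    }
    return timeoutMs;
  }

  public static void main(String[] args) {
    Timer timer = new Timer("sketch-timer", true);
    TimerTask task = new TimerTask() {
      @Override
      public void run() {
        System.out.println("timeout action ran");
      }
    };
    long used = armTimeout(timer, task, () -> 0L);
    System.out.println("delay used: " + used);
    timer.cancel();
  }
}
```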

vanzin (Contributor) commented Oct 24, 2017

The code LGTM, but the PR title should describe the fix, not the bug.

ash211 (Contributor) commented Oct 24, 2017

How about [SPARK-21991][LAUNCHER] Fix race condition in LauncherServer#acceptConnections ?

vanzin (Contributor) commented Oct 24, 2017

That sounds better.

ash211 (Contributor) commented Oct 24, 2017

@nivox can you please update the PR title when you get the chance?

nivox changed the title from "[SPARK-21991][LAUNCHER] LauncherServer acceptConnections thread sometime dies if machine has very high load" to "[SPARK-21991][LAUNCHER] Fix race condition in LauncherServer#acceptConnections" on Oct 25, 2017

nivox (Author) commented Oct 25, 2017

@vanzin @ash211 I just modified the title of the PR as per your suggestion

vanzin (Contributor) commented Oct 25, 2017

Merging to master / 2.2 / 2.1.

asfgit pushed a commit that referenced this pull request Oct 25, 2017
…nnections

## What changes were proposed in this pull request?
This patch changes the order in which _acceptConnections_ starts the client thread and schedules the client timeout action, ensuring that the latter has been scheduled before the former gets a chance to cancel it.

## How was this patch tested?
Due to the non-deterministic nature of the race this patch fixes, I wasn't able to add a new test for this issue.

Author: Andrea zito <[email protected]>

Closes #19217 from nivox/SPARK-21991.

(cherry picked from commit 6ea8a56)
Signed-off-by: Marcelo Vanzin <[email protected]>
@asfgit asfgit closed this in 6ea8a56 Oct 25, 2017
vanzin (Contributor) commented Oct 25, 2017

Merged to 2.0 also.

I have a feeling that maven builds might fail because I noticed some trailing whitespace when looking at the raw patch...

ash211 (Contributor) commented Oct 25, 2017

========================================================================
Running Java style checks
========================================================================
Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[238] (regexp) RegexpSingleline: No trailing whitespace allowed.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[247] (regexp) RegexpSingleline: No trailing whitespace allowed.
[error] running /home/ubuntu/spark/dev/lint-java ; received return code 1

Will send a follow-up PR.

ash211 (Contributor) commented Oct 25, 2017

#19574

MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018