
Conversation

@andrewor14
Contributor

The problem is that ContextCleaner may clean variables that belong to a different SparkContext. This can happen if the SparkContext to which the cleaner belongs stops, and a new one is started immediately afterwards in the same JVM. In this case, if the cleaner is in the middle of cleaning a broadcast, for instance, it will do so through SparkEnv.get.blockManager, which could be one that belongs to a different SparkContext.

@JoshRosen and I suspect that this is the cause of many flaky tests, most notably the JavaAPISuite. We were able to reproduce the failure locally (though it is not deterministic and very hard to reproduce).
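To make the scenario concrete, here is a minimal sketch of the sequence described above (illustrative only: the object name, local master, and broadcast contents are made up, and this is not a test from the patch):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CleanerRaceSketch {
  def main(args: Array[String]): Unit = {
    // First application: creates broadcast_0 in its own ID space.
    val sc1 = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("first"))
    val b1 = sc1.broadcast(Array.fill(1000)(1))
    sc1.stop() // sc1's ContextCleaner thread may still be running at this point

    // Second application, started immediately afterwards in the same JVM.
    val sc2 = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("second"))
    val b2 = sc2.broadcast(Array.fill(1000)(2)) // also broadcast_0: IDs are per-context
    // If sc1's leftover cleaner now cleans its broadcast_0, it does so through the
    // global SparkEnv.get.blockManager, which now belongs to sc2.
    sc2.stop()
  }
}
```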

@SparkQA

SparkQA commented Mar 3, 2015

Test build #28224 has started for PR 4869 at commit 29168c0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 3, 2015

Test build #28224 has finished for PR 4869 at commit 29168c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28224/

@andrewor14 andrewor14 changed the title [SPARK-6132] ContextCleaner may live across SparkContexts [SPARK-6132] ContextCleaner race condition across SparkContexts Mar 3, 2015
@andrewor14
Contributor Author

I just confirmed locally that this fix is effective. I ran the JavaAPISuite 100 times back-to-back on my local machine. Before this patch, 5 of the 100 runs failed and the mean time between failures was about 17 runs. Every failed run hit the `Failed to get broadcast_0_piece0 of broadcast_0` exception, where broadcast_0_piece0 is the first broadcast block fetched in each test. After this patch, we no longer observed that exception and none of the runs failed.

@JoshRosen
Contributor

In this case, if the cleaner is in the middle of cleaning a broadcast, for instance, it will do so through SparkEnv.get.blockManager, which could be one that belongs to a different SparkContext.

In the old code, the race could happen even if we weren't in the middle of a cleanup task when the SparkContext was stopped; there's roughly a 100 millisecond window in which this race can occur. One potential race looks something like this (a simplified sketch of the cleaner's polling loop follows the list):

  • The original SparkContext's ContextCleaner thread is blocked in a referenceQueue.remove() call. This is called with a 100ms timeout, hence the 100ms window for a race.
  • SparkContext.stop() is called on the original context
  • A new SparkContext is created
  • A job begins running with the new SparkContext and creates new broadcast variables. These broadcast variables' IDs can overlap with the ones created by the old context, since broadcast IDs are only unique within a SparkContext, not globally or within a JVM.
  • The old ContextCleaner finally unblocks from the referenceQueue.remove call. Because the old SparkContext was destroyed, the RDDs and broadcasts that it created may have become garbage-collected, which means that this referenceQueue.remove call might actually return an old broadcast variable cleanup task.
  • This cleanup task runs in doCleanupBroadcast(), which calls methods on the original SparkContext's components, namely broadcastManager.unbroadcast(broadcastId, true, blocking).
  • Through a chain of calls, this ends up calling the static TorrentBroadcast.unpersist() method, which calls SparkEnv.get.blockManager.master.removeBroadcast, causing it to remove the broadcast's blocks from the new SparkContext. Recall that SparkEnv is (effectively) global.
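The simplified polling loop referenced above might look like this. This is an assumed shape for illustration, not Spark's actual ContextCleaner code; the class and method names are hypothetical:

```scala
import java.lang.ref.{Reference, ReferenceQueue}

// Assumed shape of a ContextCleaner-style polling loop; not the actual Spark code.
class PollingCleanerSketch(handleCleanupTask: Reference[_] => Unit) {
  val referenceQueue = new ReferenceQueue[AnyRef]
  @volatile private var stopped = false

  private val cleaningThread = new Thread("cleaner-sketch") {
    override def run(): Unit = {
      while (!stopped) {
        // Blocks for up to 100 ms: this is the window in which stop() can be
        // called (and a new SparkContext started) before the next check.
        val ref = Option(referenceQueue.remove(100))
        // Without re-checking `stopped` under a lock shared with stop(), the
        // dequeued cleanup task runs against whatever SparkEnv.get now returns.
        ref.foreach(handleCleanupTask)
      }
    }
  }

  def start(): Unit = cleaningThread.start()
  def stop(): Unit = { stopped = true }
}
```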

This was a really subtle race condition.

@JoshRosen
Contributor

@andrewor14 Do you think that there's any risk of a cleanup task hanging indefinitely and thus preventing the SparkContext from being stopped? That's the only problem that I could anticipate here.

Overall, this fix looks good to me. Thanks for tracking down this race!

@andrewor14
Contributor Author

@JoshRosen and I discussed this further offline. sc.stop() will not actually hang indefinitely, because we use the Akka timeout when awaiting the BlockManager* messages. Semantically, it is also more correct to ensure that the context cleaner thread does not outlive the SparkContext that created it.
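Roughly, the idea is the following. This is a minimal sketch of the stop/cleanup synchronization described in this thread; the class and method names are hypothetical and this is not the literal patch:

```scala
// Hedged sketch of a synchronized cleaner: stop() and each cleanup task take the
// same lock, so an in-flight task finishes before stop() returns, and no new task
// runs on behalf of a stopped context.
class SynchronizedCleanerSketch {
  @volatile private var stopped = false
  private val cleaningLock = new Object

  def runCleanupTask(task: () => Unit): Unit = cleaningLock.synchronized {
    if (!stopped) {
      task() // e.g. a broadcast cleanup; never runs after stop() has returned
    }
  }

  def stop(): Unit = cleaningLock.synchronized {
    stopped = true // waits here for any in-flight cleanup task to complete
  }
}
```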

We considered a few alternatives:

(1) Use a global broadcast ID counter per JVM instead of per application (see the sketch after these alternatives). This ensures that broadcast ID spaces never conflict across applications, but it doesn't fix the root cause of the problem, which is that the context cleaner thread is still leaked across applications. It also changes the semantics of the broadcast ID, which may not be a safe change to backport to older branches.

(2) Refactor the broadcast factories to take in a SparkEnv instead of getting it from SparkEnv.get. This requires breaking a developer API, which precludes us from backporting any potential fix to older branches.

(3) Introduce some identifier in SparkEnv, and clean only if the env is the one that we expect. This requires passing the identifier everywhere in the code and seems like an invasive change. As with (1), it doesn't fix the root cause.
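For reference, alternative (1) would amount to something like this (purely hypothetical; this is not how Spark assigns broadcast IDs):

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical JVM-global broadcast ID counter, as in alternative (1); not in Spark.
object GlobalBroadcastIds {
  private val nextBroadcastId = new AtomicLong(0L)

  // IDs would never collide across SparkContexts in the same JVM, but the
  // leaked cleaner thread itself would remain.
  def newId(): Long = nextBroadcastId.getAndIncrement()
}
```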

In summary, we will merge this current patch as is since it fixes flaky test suites that have been failing throughout the project and slowing development.

@JoshRosen
Contributor

There aren't great alternatives here because the root problem is that we have a bunch of global shared state, so it's hard to avoid synchronization without doing a huge refactoring. Therefore, this looks good to me.

I think a short hang during SparkContext.stop() is pretty unlikely to happen in practice; if it does turn out to be a problem during testing, then we can revisit this and try to consider more involved approaches to safely interrupting active cleanup tasks.

@andrewor14
Contributor Author

Alright, I'm going to merge this into master since tests are still failing non-deterministically. I will backport it to 1.3 after the release, and eventually to older branches as well, but I'd like to see how it behaves in master for a little while first.

@asfgit asfgit closed this in fe63e82 Mar 3, 2015
@andrewor14 andrewor14 deleted the cleaner-masquerade branch March 3, 2015 21:45
asfgit pushed a commit that referenced this pull request Mar 13, 2015
The problem is that `ContextCleaner` may clean variables that belong to a different `SparkContext`. This can happen if the `SparkContext` to which the cleaner belongs stops, and a new one is started immediately afterwards in the same JVM. In this case, if the cleaner is in the middle of cleaning a broadcast, for instance, it will do so through `SparkEnv.get.blockManager`, which could be one that belongs to a different `SparkContext`.

JoshRosen and I suspect that this is the cause of many flaky tests, most notably the `JavaAPISuite`. We were able to reproduce the failure locally (though it is not deterministic and very hard to reproduce).

Author: Andrew Or <[email protected]>

Closes #4869 from andrewor14/cleaner-masquerade and squashes the following commits:

29168c0 [Andrew Or] Synchronize ContextCleaner stop
asfgit pushed a commit that referenced this pull request Mar 22, 2015
asfgit pushed a commit that referenced this pull request Mar 22, 2015
@andrewor14
Contributor Author

Just to give @JoshRosen and myself a pat on the back: we haven't seen a single failure of JavaAPISuite in the past week, whereas it used to be one of the most common flaky tests. Most likely, this patch did in fact fix the underlying issue!

@srowen
Member

srowen commented Mar 26, 2015

Build and test fixes warm my heart. Excellent!
