
Conversation

@andrewor14
Contributor

The problem is that ContextCleaner may clean variables that belong to a different SparkContext. This can happen if the SparkContext to which the cleaner belongs stops, and a new one is started immediately afterwards in the same JVM. In this case, if the cleaner is in the middle of cleaning a broadcast, for instance, it will do so through SparkEnv.get.blockManager, which could be one that belongs to a different SparkContext.

@JoshRosen and I suspect that this is the cause of many flaky tests, most notably the JavaAPISuite. We were able to reproduce the failure locally (though it is not deterministic and very hard to reproduce).
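To make the scenario concrete, here is a minimal sketch of the sequence described above (illustrative only: the object name, local master, and broadcast contents are made up, and this is not a test from the patch):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CleanerRaceSketch {
  def main(args: Array[String]): Unit = {
    // First application: creates broadcast_0 in its own ID space.
    val sc1 = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("first"))
    val b1 = sc1.broadcast(Array.fill(1000)(1))
    sc1.stop() // sc1's ContextCleaner thread may still be running at this point

    // Second application, started immediately afterwards in the same JVM.
    val sc2 = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("second"))
    val b2 = sc2.broadcast(Array.fill(1000)(2)) // also broadcast_0: IDs are per-context
    // If sc1's leftover cleaner now cleans its broadcast_0, it does so through the
    // global SparkEnv.get.blockManager, which now belongs to sc2.
    sc2.stop()
  }
}
```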

@SparkQA

SparkQA commented Mar 3, 2015

Test build #28224 has started for PR 4869 at commit 29168c0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 3, 2015

Test build #28224 has finished for PR 4869 at commit 29168c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28224/

@andrewor14 andrewor14 changed the title [SPARK-6132] ContextCleaner may live across SparkContexts [SPARK-6132] ContextCleaner race condition across SparkContexts Mar 3, 2015
@andrewor14
Contributor Author

I just confirmed locally that this fix is effective. I ran the JavaAPISuite 100 times back-to-back on my local machine. Before this patch, 5 of the 100 runs failed and the mean time between failures was about 17 runs. Every failed run hit the `Failed to get broadcast_0_piece0 of broadcast_0` exception, where broadcast_0_piece0 is the first broadcast block fetched in each test. After this patch, we no longer observed that exception and none of the runs failed.

@JoshRosen
Contributor

In this case, if the cleaner is in the middle of cleaning a broadcast, for instance, it will do so through SparkEnv.get.blockManager, which could be one that belongs to a different SparkContext.

In the old code, the race could happen even if we weren't in the middle of a cleanup task when the SparkContext was stopped; there's roughly a 100 millisecond window in which this race can occur. One potential race looks something like this (a simplified sketch of the cleaner's polling loop follows the list):

  • The original SparkContext's ContextCleaner thread is blocked in a referenceQueue.remove() call. This is called with a 100ms timeout, hence the 100ms window for a race.
  • SparkContext.stop() is called on the original context
  • A new SparkContext is created
  • A job begins running with the new SparkContext and creates new broadcast variables. These broadcast variables' IDs can overlap with the ones created by the old context, since broadcast IDs are only unique within a SparkContext, not globally or within a JVM.
  • The old ContextCleaner finally unblocks from the referenceQueue.remove call. Because the old SparkContext was destroyed, the RDDs and broadcasts that it created may have become garbage-collected, which means that this referenceQueue.remove call might actually return an old broadcast variable cleanup task.
  • This cleanup task runs in doCleanupBroadcast(), which calls methods on the original SparkContext's components, namely broadcastManager.unbroadcast(broadcastId, true, blocking).
  • Through a chain of calls, this ends up calling the static TorrentBroadcast.unpersist() method, which calls SparkEnv.get.blockManager.master.removeBroadcast, causing it to remove the broadcast's blocks from the new SparkContext. Recall that SparkEnv is (effectively) global.
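The simplified polling loop referenced above might look like this. This is an assumed shape for illustration, not Spark's actual ContextCleaner code; the class and method names are hypothetical:

```scala
import java.lang.ref.{Reference, ReferenceQueue}

// Assumed shape of a ContextCleaner-style polling loop; not the actual Spark code.
class PollingCleanerSketch(handleCleanupTask: Reference[_] => Unit) {
  val referenceQueue = new ReferenceQueue[AnyRef]
  @volatile private var stopped = false

  private val cleaningThread = new Thread("cleaner-sketch") {
    override def run(): Unit = {
      while (!stopped) {
        // Blocks for up to 100 ms: this is the window in which stop() can be
        // called (and a new SparkContext started) before the next check.
        val ref = Option(referenceQueue.remove(100))
        // Without re-checking `stopped` under a lock shared with stop(), the
        // dequeued cleanup task runs against whatever SparkEnv.get now returns.
        ref.foreach(handleCleanupTask)
      }
    }
  }

  def start(): Unit = cleaningThread.start()
  def stop(): Unit = { stopped = true }
}
```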

This was a really subtle race condition.

@JoshRosen
Contributor

@andrewor14 Do you think that there's any risk of a cleanup task hanging indefinitely and thus preventing the SparkContext from being stopped? That's the only problem that I could anticipate here.

Overall, this fix looks good to me. Thanks for tracking down this race!

@andrewor14
Contributor Author

@JoshRosen and I discussed this further offline. sc.stop() will not actually hang indefinitely, because we use the Akka timeout when awaiting the BlockManager* messages. Semantically, it is also more correct to ensure that the context cleaner thread does not outlive the SparkContext that created it.
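Roughly, the idea is the following. This is a minimal sketch of the stop/cleanup synchronization described in this thread; the class and method names are hypothetical and this is not the literal patch:

```scala
// Hedged sketch of a synchronized cleaner: stop() and each cleanup task take the
// same lock, so an in-flight task finishes before stop() returns, and no new task
// runs on behalf of a stopped context.
class SynchronizedCleanerSketch {
  @volatile private var stopped = false
  private val cleaningLock = new Object

  def runCleanupTask(task: () => Unit): Unit = cleaningLock.synchronized {
    if (!stopped) {
      task() // e.g. a broadcast cleanup; never runs after stop() has returned
    }
  }

  def stop(): Unit = cleaningLock.synchronized {
    stopped = true // waits here for any in-flight cleanup task to complete
  }
}
```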

We considered a few alternatives:

(1) Use a global broadcast ID counter per JVM instead of per application (see the sketch after these alternatives). This ensures that broadcast ID spaces never conflict across applications, but it doesn't fix the root cause of the problem, which is that the context cleaner thread is still leaked across applications. It also changes the semantics of the broadcast ID, which may not be a safe change to backport to older branches.

(2) Refactor the broadcast factories to take in a SparkEnv instead of getting it from SparkEnv.get. This requires breaking a developer API, which precludes us from backporting any potential fix to older branches.

(3) Introduce some identifier in SparkEnv, and clean only if the env is the one that we expect. This requires passing the identifier everywhere in the code and seems like an invasive change. As with (1), it doesn't fix the root cause.
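For reference, alternative (1) would amount to something like this (purely hypothetical; this is not how Spark assigns broadcast IDs):

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical JVM-global broadcast ID counter, as in alternative (1); not in Spark.
object GlobalBroadcastIds {
  private val nextBroadcastId = new AtomicLong(0L)

  // IDs would never collide across SparkContexts in the same JVM, but the
  // leaked cleaner thread itself would remain.
  def newId(): Long = nextBroadcastId.getAndIncrement()
}
```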

In summary, we will merge this current patch as is since it fixes flaky test suites that have been failing throughout the project and slowing development.

@JoshRosen
Contributor

There aren't great alternatives here because the root problem is that we have a bunch of global shared state, so it's hard to avoid synchronization without doing a huge refactoring. Therefore, this looks good to me.

I think a short hang during SparkContext.stop() is pretty unlikely to happen in practice; if it does turn out to be a problem during testing, then we can revisit this and try to consider more involved approaches to safely interrupting active cleanup tasks.

@andrewor14
Contributor Author

Alright, I'm going to merge this into master since tests are still failing non-deterministically. I will backport it to 1.3 after the release, and eventually to older branches as well, but I'd like to see how it behaves in master for a little while first.

@asfgit asfgit closed this in fe63e82 Mar 3, 2015
@andrewor14 andrewor14 deleted the cleaner-masquerade branch March 3, 2015 21:45
asfgit pushed a commit that referenced this pull request Mar 13, 2015
The problem is that `ContextCleaner` may clean variables that belong to a different `SparkContext`. This can happen if the `SparkContext` to which the cleaner belongs stops, and a new one is started immediately afterwards in the same JVM. In this case, if the cleaner is in the middle of cleaning a broadcast, for instance, it will do so through `SparkEnv.get.blockManager`, which could be one that belongs to a different `SparkContext`.

JoshRosen and I suspect that this is the cause of many flaky tests, most notably the `JavaAPISuite`. We were able to reproduce the failure locally (though it is not deterministic and very hard to reproduce).

Author: Andrew Or <[email protected]>

Closes #4869 from andrewor14/cleaner-masquerade and squashes the following commits:

29168c0 [Andrew Or] Synchronize ContextCleaner stop
asfgit pushed a commit that referenced this pull request Mar 22, 2015
asfgit pushed a commit that referenced this pull request Mar 22, 2015
@andrewor14
Contributor Author

Just to give @JoshRosen and myself a pat on the back: we haven't seen a single failure of JavaAPISuite in the past week, whereas it used to be one of the most common flaky tests. Most likely, this patch did in fact fix the underlying issue!

@srowen
Member

srowen commented Mar 26, 2015

Build and test fixes warm my heart. Excellent!
