[SPARK-6132] ContextCleaner race condition across SparkContexts #4869
Conversation
Test build #28224 has started for PR 4869 at commit

Test build #28224 has finished for PR 4869 at commit

Test PASSed.
I just confirmed locally that this fix is effective. I ran the previously flaky `JavaAPISuite` tests and could no longer reproduce the failure with this patch applied.
In the old code, the race could happen even if we weren't in the middle of a cleanup task when the SparkContext was stopped; there's roughly a 100 millisecond window in which this race can occur. One potential race looks something like this: (1) the first SparkContext is stopped while its cleaner thread is between cleanup tasks; (2) a new SparkContext starts in the same JVM and replaces the JVM-global SparkEnv; (3) the old cleaner wakes up, dequeues a reference, and cleans it through SparkEnv.get.blockManager, which now belongs to the new context.
This was a really subtle race condition.
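A minimal sketch of the scenario in Scala (the local-mode setup and broadcast value here are illustrative, not from the original report):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch of the race; assumes reference tracking (the
// ContextCleaner) is enabled, which is the default.
val sc1 = new SparkContext(new SparkConf().setMaster("local").setAppName("first"))
sc1.broadcast(Seq(1, 2, 3)) // immediately unreferenced, so eligible for GC
System.gc()                 // may enqueue the dead broadcast for sc1's cleaner
sc1.stop()                  // before this patch, sc1's cleaner thread could outlive stop()

// A second context started right away replaces the JVM-global SparkEnv.
val sc2 = new SparkContext(new SparkConf().setMaster("local").setAppName("second"))
// If sc1's leaked cleaner processes its queue now, it resolves the block
// manager via SparkEnv.get.blockManager, which belongs to sc2, not sc1.
```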
@andrewor14 Do you think that there's any risk of a cleanup task hanging indefinitely and thus preventing the SparkContext from being stopped? That's the only problem that I could anticipate here. Overall, this fix looks good to me. Thanks for tracking down this race!
@JoshRosen and I discussed more about this offline. We considered a few alternatives:
(1) Use a global broadcast ID counter per JVM instead of per application. This ensures no conflict in broadcast ID spaces across applications, but it doesn't actually fix the root cause of the problem, which is that the context cleaner thread is still leaked across applications. It also changes the semantics of the broadcast ID, which may not be a safe change to back port to older branches.
(2) Refactor the broadcast factories to take in a
(3) Introduce some identifier in
In summary, we will merge this current patch as is, since it fixes the flaky test suites that have been failing throughout the project and slowing development.
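For concreteness, alternative (1) would amount to something like the following JVM-global counter (a hypothetical sketch; Spark's actual broadcast ID handling lives elsewhere):

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical sketch of alternative (1): broadcast IDs drawn from a counter
// shared across every SparkContext in the JVM, so two applications never
// occupy the same ID space even if a cleaner thread leaks across contexts.
object GlobalBroadcastIds {
  private val nextBroadcastId = new AtomicLong(0L)
  def next(): Long = nextBroadcastId.getAndIncrement()
}
```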
There aren't great alternatives here because the root problem is that we have a bunch of global shared state, so it's kind of hard to avoid synchronization here without doing a huge refactoring. Therefore, this looks good to me. I think a short hang during `SparkContext.stop()` while the cleaner finishes an in-flight task is an acceptable trade-off.
Alright, I'm going to merge this into master since tests are still failing non-deterministically. I will back port it to 1.3 later after the release. I will also back port this to older branches eventually, but I'd like to see how it behaves in master for a little while first. |
The problem is that `ContextCleaner` may clean variables that belong to a different `SparkContext`. This can happen if the `SparkContext` to which the cleaner belongs stops, and a new one is started immediately afterwards in the same JVM. In this case, if the cleaner is in the middle of cleaning a broadcast, for instance, it will do so through `SparkEnv.get.blockManager`, which could be one that belongs to a different `SparkContext`. JoshRosen and I suspect that this is the cause of many flaky tests, most notably the `JavaAPISuite`. We were able to reproduce the failure locally (though it is not deterministic and very hard to reproduce).

Author: Andrew Or <[email protected]>

Closes #4869 from andrewor14/cleaner-masquerade and squashes the following commits:

29168c0 [Andrew Or] Synchronize ContextCleaner stop
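The fix itself is small: it synchronizes `ContextCleaner.stop()` with the cleaning loop. A minimal sketch of the approach, inferred from the commit message above (the names here are illustrative, not the exact Spark internals):

```scala
// Illustrative sketch: stop() and the cleaning loop share one lock, so
// stop() blocks until any in-flight cleanup task has finished. A cleaner
// belonging to a stopped SparkContext therefore never reaches into the
// SparkEnv of a newly started context.
class CleanerSketch {
  @volatile private var stopped = false

  def stop(): Unit = synchronized {
    stopped = true
  }

  private def keepCleaning(): Unit = {
    while (!stopped) {
      synchronized {
        if (!stopped) {
          // Dequeue one garbage-collected reference (the real loop polls a
          // reference queue with a timeout) and clean it through this
          // context's still-valid SparkEnv.
        }
      }
    }
  }
}
```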
Just to give @JoshRosen and myself a pat on our own backs, we haven't seen a single failure of the `JavaAPISuite` since this patch went in.

Build and test fixes warm my heart. Excellent!