Skip to content

Conversation

@JoshRosen
Copy link
Contributor

This patch addresses a race condition in DAGScheduler by properly synchronizing accesses to its cacheLocs map.

This map is accessed by the getCacheLocs and clearCacheLocs() methods, which can be called by separate threads, since DAGScheduler's getPreferredLocs() method is called by SparkContext and indirectly calls getCacheLocs(). If this map is cleared by the DAGScheduler event processing thread while a user thread is submitting a job and computing preferred locations, then this can cause the user thread to throw "NoSuchElementException: key not found" errors.

Most accesses to DAGScheduler's internal state do not need synchronization because that state is only accessed from the event processing loop's thread. An alternative approach to fixing this bug would be to refactor this code so that SparkContext sends the DAGScheduler a message in order to get the list of preferred locations. However, this would involve more extensive changes to this code and would be significantly harder to backport to maintenance branches since some of the related code has undergone significant refactoring (e.g. the introduction of EventLoop). Since cacheLocs is the only state that's accessed in this way, adding simple synchronization seems like a better short-term fix.

See #3345 for additional context.

@SparkQA
Copy link

SparkQA commented Feb 17, 2015

Test build #27651 has started for PR 4660 at commit 12d64ba.

  • This patch merges cleanly.

@JoshRosen JoshRosen changed the title [SPARK-4544] Properly synchronize accesses to DAGScheduler cacheLocs map [SPARK-4454] Properly synchronize accesses to DAGScheduler cacheLocs map Feb 17, 2015
@JoshRosen
Copy link
Contributor Author

/cc @pwendell @markhamstra

@SparkQA
Copy link

SparkQA commented Feb 17, 2015

Test build #27651 has finished for PR 4660 at commit 12d64ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27651/
Test PASSed.

@pwendell
Copy link
Contributor

This LGTM Josh - thanks for giving this such a thorough look.

asfgit pushed a commit that referenced this pull request Feb 18, 2015
This patch addresses a race condition in DAGScheduler by properly synchronizing accesses to its `cacheLocs` map.

This map is accessed by the `getCacheLocs` and `clearCacheLocs()` methods, which can be called by separate threads, since DAGScheduler's `getPreferredLocs()` method is called by SparkContext and indirectly calls `getCacheLocs()`.  If this map is cleared by the DAGScheduler event processing thread while a user thread is submitting a job and computing preferred locations, then this can cause the user thread to throw "NoSuchElementException: key not found" errors.

Most accesses to DAGScheduler's internal state do not need synchronization because that state is only accessed from the event processing loop's thread.  An alternative approach to fixing this bug would be to refactor this code so that SparkContext sends the DAGScheduler a message in order to get the list of preferred locations.  However, this would involve more extensive changes to this code and would be significantly harder to backport to maintenance branches since some of the related code has undergone significant refactoring (e.g. the introduction of EventLoop).  Since `cacheLocs` is the only state that's accessed in this way, adding simple synchronization seems like a better short-term fix.

See #3345 for additional context.

Author: Josh Rosen <[email protected]>

Closes #4660 from JoshRosen/SPARK-4454 and squashes the following commits:

12d64ba [Josh Rosen] Properly synchronize accesses to DAGScheduler cacheLocs map.

(cherry picked from commit d46d624)
Signed-off-by: Patrick Wendell <[email protected]>
@asfgit asfgit closed this in d46d624 Feb 18, 2015
@JoshRosen
Copy link
Contributor Author

I pushed a a51fc7e hotfix to roll back a bit of code cleanup that I did which added a getOrElseUpdate call, since this might have a performance impact as this code is called O(num tasks) times.

@JoshRosen JoshRosen deleted the SPARK-4454 branch February 18, 2015 01:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants