[SPARK-28917][CORE] Synchronize access to RDD mutable state. #25951
Conversation
RDD dependencies, partitions, and storageLevel can be simultaneously accessed and mutated by user threads and Spark's scheduler threads, so access must be thread-safe. In particular, as partitions and dependencies are lazily initialized, before this change they could get initialized multiple times, leaving the scheduler with an inconsistent view of the pending stages and causing it to get stuck.
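For context, the shape of the fix is roughly the following (a trimmed-down sketch, not the exact RDD.scala code; the field and lock names mirror the patch):

```scala
import org.apache.spark.Partition

// Trimmed-down sketch of the pattern this PR applies in RDD.scala: a shared
// lock guards lazily-initialized state so that user threads and scheduler
// threads initialize it exactly once.
abstract class LazyRddState {
  // Serializable lock object; executors may hold a reference to it when an
  // RDD is serialized into a task. (The deprecation of this constructor is
  // discussed further down in this thread.)
  private val stateLock = new Integer(0)

  @volatile private var partitions_ : Array[Partition] = _

  protected def getPartitions: Array[Partition]

  final def partitions: Array[Partition] = {
    // Double-checked locking: the volatile read is cheap once the field is
    // set; the lock ensures getPartitions runs at most once.
    if (partitions_ == null) {
      stateLock.synchronized {
        if (partitions_ == null) {
          partitions_ = getPartitions
        }
      }
    }
    partitions_
  }
}
```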
    def debugSelf(rdd: RDD[_]): Seq[String] = stateLock.synchronized {
      import Utils.bytesToString

      val persistence = if (storageLevel != StorageLevel.NONE) storageLevel.description else ""
Hi, @squito. Shall we use getStorageLevel instead of accessing storageLevel here? Then it seems we wouldn't need stateLock.synchronized for this debugSelf.
good point, thanks
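For reference, the suggested shape would look roughly like this (a hypothetical revision adapted from the diff above; the remaining debug fields are omitted):

```scala
// Hypothetical revision: reading through the synchronized getter means
// debugSelf itself no longer needs a stateLock.synchronized block.
def debugSelf(rdd: RDD[_]): Seq[String] = {
  import Utils.bytesToString

  val persistence =
    if (rdd.getStorageLevel != StorageLevel.NONE) rdd.getStorageLevel.description
    else ""
  Seq(persistence) // remaining debug output omitted from this sketch
}
```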
      /** Get the RDD's current storage level, or StorageLevel.NONE if none is set. */
    - def getStorageLevel: StorageLevel = storageLevel
    + def getStorageLevel: StorageLevel = stateLock.synchronized { storageLevel }
Synchronizing calls on simple getters always raises flags for me. Is this really safe?
For example, in one of the places that calls this (getOrCompute): what happens if the storage level changes after this value is returned, but before the call to blockManager.getOrElseUpdate actually does its thing?
I will be honest, I don't really have a good understanding of how the mutability of storageLevel could be an actual problem. I do know of an instance where moving a .cache() before the creation of a Future which submitted a job fixed a hanging job. Unfortunately I have very little info about what else was going on in that job.
I have audited the uses of storageLevel, like getOrCompute, and I think it's OK -- there it's passing in a local copy, and anyway that whole function runs on the executor, where the storageLevel is immutable.
Another tricky one is DAGScheduler.getCacheLocs, which reads both the storage level and rdd.partitions. That gets exposed via SparkContext.getPreferredLocs, and is used in CoalescedRDD and PartitionerAwareUnionRDD. But I still can't come up with an explanation for the behavior we see. I will try to poke some more.
I could also submit a change which just covers partitions and dependencies for now, since that is clear, though I'd like to understand the problem w/ storageLevel as well.
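For illustration, the "local copy" pattern described above looks roughly like this (a hypothetical helper, not the actual getOrCompute code):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object StorageLevelSnapshot {
  // The mutable field is read once through the synchronized getter, and all
  // later logic uses that snapshot, so a concurrent persist() cannot change
  // what this method observes mid-flight.
  def describePersistence(rdd: RDD[_]): String = {
    val level = rdd.getStorageLevel // synchronized read; snapshot taken here
    if (level != StorageLevel.NONE) s"RDD ${rdd.id} persisted at ${level.description}"
    else s"RDD ${rdd.id} is not persisted"
  }
}
```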
Yeah, what is this racing with? By itself, I'm not sure this forces something to set this before something else needs to read it.
Test build #111485 has finished for PR 25951 at commit
      // Kind of ugly: need to register RDDs with the cache and map output tracker here
      // since we can't do it in the RDD constructor because # of partitions is unknown
    - logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
    + logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + s") as input to " +
You could use interpolation for consistency
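That is, something like the following (hypothetical; the tail of the message after "as input to" is an assumption, since the diff above is truncated):

```scala
// Hypothetical: the same log line written with interpolation throughout.
// `shuffleDep` in the tail is assumed from surrounding DAGScheduler context.
logInfo(s"Registering RDD ${rdd.id} (${rdd.getCreationSite}) as input to " +
  s"shuffle ${shuffleDep.shuffleId}")
```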
Retest this please.

Test build #111633 has finished for PR 25951 at commit

I spent a while trying to figure out where a race on

Test build #111698 has finished for PR 25951 at commit

Test build #111713 has finished for PR 25951 at commit

Test build #111751 has finished for PR 25951 at commit
vanzin left a comment:
Looks good.
| test("reference partitions inside a task") { | ||
| // Run a simple job which just makes sure there is no failure if we touch rdd.partitions | ||
| // inside a task. This requires the stateLock to be serializable. This is very convoluted | ||
| // use case, its just a check for backwards-compatibility after the fix for SPARK-28917. |
it's
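For reference, a sketch of how the full test body might look (simplified; assumes the suite mixes in LocalSparkContext and its `sc` variable, as other core suites do):

```scala
test("reference partitions inside a task") {
  sc = new SparkContext("local", "test")
  val rdd1 = sc.parallelize(1 to 10, 1)
  // Referencing rdd1 inside the closure serializes the RDD -- including its
  // stateLock -- into the task, so the lock object must be serializable.
  val rdd2 = rdd1.map { x => rdd1.partitions(0); x }
  rdd2.count()
}
```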
Test build #111788 has finished for PR 25951 at commit

Merging to master / 2.4.
RDD dependencies and partitions can be simultaneously accessed and mutated by user threads and Spark's scheduler threads, so access must be thread-safe. In particular, as partitions and dependencies are lazily initialized, before this change they could get initialized multiple times, leaving the scheduler with an inconsistent view of the pending stages and causing it to get stuck. Tested with existing unit tests. Closes #25951 from squito/SPARK-28917. Authored-by: Imran Rashid <[email protected]> Signed-off-by: Marcelo Vanzin <[email protected]> (cherry picked from commit 0da667d) Signed-off-by: Marcelo Vanzin <[email protected]>
RDD dependencies and partitions can be simultaneously accessed and mutated by user threads and Spark's scheduler threads, so access must be thread-safe. In particular, as partitions and dependencies are lazily initialized, before this change they could get initialized multiple times, leaving the scheduler with an inconsistent view of the pending stages and causing it to get stuck. Tested with existing unit tests. Closes apache#25951 from squito/SPARK-28917. Authored-by: Imran Rashid <[email protected]> Signed-off-by: Marcelo Vanzin <[email protected]>
     * The use of Integer is simply so this is serializable -- executors may reference the shared
     * fields (though they should never mutate them, that only happens on the driver).
     */
    private val stateLock = new Integer(0)
The Integer constructor has been deprecated already, see https://docs.oracle.com/javase/9/docs/api/java/lang/Integer.html. This yields the warning:

RDD.scala:240: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information.

Is it possible to replace it with something else?
I tried to eliminate the warning in #27399
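One way to keep the lock serializable without the deprecated constructor is an anonymous Serializable instance (a sketch; not necessarily what #27399 does):

```scala
// Anonymous serializable object: no deprecated API, still safe to ship to
// executors when the RDD is serialized into a task.
private val stateLock = new Serializable {}
```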