[SPARK-16861][PYSPARK][CORE] Refactor PySpark accumulator API on top of Accumulator V2 #14467
Conversation
This reverts commit 3c1ea65.
…deprecated-accumulator-api
Test build #63140 has finished for PR 14467 at commit
: JList[Array[Byte]] = synchronized {
override def copy(): PythonAccumulatorV2 = {
  val newAcc = new PythonAccumulatorV2(serverHost, serverPort)
  newAcc._acc = this._acc
Probably a dumb question but does this need to be copied/cloned?
So the new accumulator API uses the copy method inside of copyAndReset, which is used inside of writeReplace during serialization. More generally, you want to ship a "clean" accumulator to the workers rather than something which may already have some accumulated values inside of it.
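For reference, the plumbing described here looks roughly like this (a paraphrased sketch, not the exact Spark source):

// A simplified sketch of how AccumulatorV2 wires copy(), copyAndReset(),
// and writeReplace() together (paraphrased, not the exact Spark source).
abstract class SketchAccumulatorV2[IN, OUT] extends Serializable {
  def copy(): SketchAccumulatorV2[IN, OUT]
  def reset(): Unit

  // Ship a "clean" accumulator: clone it, then wipe the clone's state.
  def copyAndReset(): SketchAccumulatorV2[IN, OUT] = {
    val copyAcc = copy()
    copyAcc.reset()
    copyAcc
  }

  // Invoked by Java serialization when the driver sends the accumulator to
  // the workers; it substitutes the fresh copy produced by copyAndReset().
  protected def writeReplace(): Any = copyAndReset()
}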
OK but this keeps the original accumulator's data -- it even makes both point to the same instance of the data. I suspect I'm missing something.
Yeah, copy() could be used elsewhere besides inside of copyAndReset() - although I don't see any blindingly obvious places where it is used inside of Spark. I'll switch this to be more like CollectionAccumulator: have a specialized copyAndReset and do a deep copy on copy.
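Something along these lines, mirroring CollectionAccumulator's approach (a sketch; the class and field names are illustrative):

import java.util.{ArrayList => JArrayList, Collections, List => JList}

// Sketch of the proposed change: a cheap copyAndReset that skips the deep
// copy, plus a copy() that actually gives the new accumulator its own
// backing list, mirroring CollectionAccumulator.
class PythonAccumulatorSketch(serverHost: String, serverPort: Int) {
  private val _acc: JList[Array[Byte]] =
    Collections.synchronizedList(new JArrayList[Array[Byte]]())

  // Workers only ever need a fresh, empty accumulator - no point copying
  // data that is about to be reset away.
  def copyAndReset(): PythonAccumulatorSketch =
    new PythonAccumulatorSketch(serverHost, serverPort)

  // Any other caller of copy() gets an independent backing list.
  def copy(): PythonAccumulatorSketch = {
    val newAcc = new PythonAccumulatorSketch(serverHost, serverPort)
    newAcc._acc.addAll(this._acc)
    newAcc
  }
}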
… synchronized. Not that I think that's coming anytime soon - but it isn't guaranteed not to happen in the new API.
…oid the deep copy when not needed
Test build #63182 has finished for PR 14467 at commit
Test build #63186 has finished for PR 14467 at commit
Failure seems unrelated, Jenkins retest this please.
Test build #63190 has finished for PR 14467 at commit
Test build #63195 has finished for PR 14467 at commit
Test build #63205 has finished for PR 14467 at commit
jenkins retest this please.
Test build #63248 has finished for PR 14467 at commit
self._jvm.java.util.ArrayList(),
self._jvm.PythonAccumulatorParam(host, port))
self._javaAccumulator = self._jvm.PythonAccumulatorV2(host, port)
self._jsc.sc().register(self._javaAccumulator)
I cannot fully understand why an accumulator is created for every instance of SparkContext. I see it is used when the attribute _jrdd is called, but that still doesn't clear things up for me :(
So in general you have one SparkContext and many RDDs. The accumulator here doesn't represent a specific accumulator; rather, it's the general mechanism that all of the Python accumulators are built on top of. The design is certainly a bit confusing if you try to think of it as a regular accumulator - I found it helped to look at how the Scala-side "merge" is implemented.
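For reference, the mechanism can be sketched like this (names are illustrative, not the real API): workers append pickled updates as byte arrays, and the driver-side "merge" streams them to the Python driver process, which routes each update to the user's per-variable Accumulator objects.

import java.io.{BufferedOutputStream, DataOutputStream}
import java.net.Socket
import java.util.{ArrayList => JArrayList, Collections, List => JList}
import scala.collection.JavaConverters._

// Conceptual sketch: one JVM accumulator per SparkContext acts as a
// transport for all Python accumulators rather than holding one variable.
class PythonAccumulatorTransport(serverHost: String, serverPort: Int) {
  private val updates: JList[Array[Byte]] =
    Collections.synchronizedList(new JArrayList[Array[Byte]]())

  // Worker side: each finished task contributes its pickled updates.
  def add(update: Array[Byte]): Unit = updates.add(update)

  // Driver side: stream the accumulated byte arrays to the Python driver,
  // which dispatches each update to the matching Python Accumulator.
  def flushToPythonDriver(): Unit = {
    val socket = new Socket(serverHost, serverPort)
    val out = new DataOutputStream(new BufferedOutputStream(socket.getOutputStream))
    out.writeInt(updates.size)
    for (array <- updates.asScala) {
      out.writeInt(array.length)
      out.write(array)
    }
    out.flush()
    socket.close()
  }
}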
val out = new DataOutputStream(new BufferedOutputStream(socket.getOutputStream, bufferSize))
out.writeInt(val2.size)
for (array <- val2.asScala) {
out.writeInt(otherPythonAccumulator._acc.size)
This is looking good. There is one more possible thread-safety issue here. Here we access the size of another (synchronized) list, then iterate over it. Both could be a problem if the list is changed somewhere during this process. I think we'd want to explicitly synchronize on otherPythonAccumulator._acc for this whole block, because that's what Collections.synchronizedList is doing in all other cases to guard access to it.
Either that or confirm that this would never be called in a multi-threaded context.
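Concretely, the suggestion amounts to something like this (a self-contained sketch of the pattern; writeUpdates is an illustrative helper, not code from the PR):

import java.io.DataOutputStream
import java.util.{List => JList}
import scala.collection.JavaConverters._

object SyncSketch {
  // Hold the list's monitor across both the size read and the iteration.
  // Collections.synchronizedList only guards individual method calls, so a
  // compound read-then-iterate needs an explicit synchronized block on the
  // list itself.
  def writeUpdates(acc: JList[Array[Byte]], out: DataOutputStream): Unit =
    acc.synchronized {
      out.writeInt(acc.size)
      for (array <- acc.asScala) {
        out.writeInt(array.length)
        out.write(array)
      }
    }
}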
So this code path is only taken during merging on the driver side - and there is no reason to merge the same accumulated value into two different accumulators at the same time. You can also see the merge logic inside of DAGScheduler.scala & TaskMetrics (although not applicable here since the Python accumulator isn't a task metric) and verify that the updates are merged one at a time.
I see, I'm wondering about the value of future-proofing this with more consistent synchronization, vs the downside (code complexity, performance). It doesn't make the code more complex, and if this is really only accessed by a single thread, the lock overhead is virtually 0 in this context. Is it better to just synchronize for future correctness?
Sure, I can do that - on the other hand, we already don't do this inside of CollectionAccumulator (it uses a raw ArrayList without any synchronization), so if this assumption were to change we would also break all collection accumulators.
I've gone ahead and added the requested synchronization :)
Test build #64030 has finished for PR 14467 at commit
jenkins retest this please
Test build #64087 has finished for PR 14467 at commit
Just a heads up - I'm going on vacation next week so I will be slower responding to comments.
Back from vacation - if anyone has review bandwidth I'd love to get this cleanup in :)
Test build #65119 has finished for PR 14467 at commit
Test build #65274 has finished for PR 14467 at commit
private class PythonAccumulatorParam(@transient private val serverHost: String, serverPort: Int)
  extends AccumulatorParam[JList[Array[Byte]]] {
private[spark] class PythonAccumulatorV2(@transient private val serverHost: String, serverPort: Int)
  extends AccumulatorV2[JList[Array[Byte]], JList[Array[Byte]]] {
Ah, can we use CollectionAccumulator? It's a specialization for accumulating a list of things, so it might be good to leverage.
So, not really - the type signatures don't quite match. We do an add of a list of values on the worker threads, but CollectionAccumulator expects add to take the element type being accumulated into the list. We could use merge or add a separate add function, but it would be the only place where merge is called on the workers, and I'm hesitant to do that since at that point we really aren't looking much like the accumulators.
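To make the mismatch concrete, here is a paraphrase of the two shapes (illustrative traits, not the real Spark classes):

import java.util.{List => JList}

// CollectionAccumulator-style: add() takes one element of the list's
// element type.
trait CollectionStyle[T] {
  def add(v: T): Unit
  def value: JList[T]
}

// What the Python bridge does: add() receives a whole batch of pickled
// updates per task, i.e. the input type is itself a list.
trait PythonBridgeStyle {
  def add(values: JList[Array[Byte]]): Unit
  def value: JList[Array[Byte]]
}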
ping @srowen
Yeah, CollectionAccumulator would need you to add Array[Byte] to get out List[Array[Byte]]. But actually, the only thing that's added is a singleton list of one Array[Byte] so the usage is already expecting to add one element, not a list.
Then you don't need a custom implementation at all. Just CollectionAccumulator[Array[Byte]]?
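That is, something like the following sketch (sendToPythonDriver is a hypothetical helper standing in for the socket I/O; the real change would still need that logic):

import java.util.{List => JList}
import org.apache.spark.util.{AccumulatorV2, CollectionAccumulator}

// Sketch of the suggestion: reuse CollectionAccumulator[Array[Byte]] for
// the accumulation plumbing and only override merge() to forward the
// incoming updates to the Python driver.
class PythonAccumulatorSketchV2(serverHost: String, serverPort: Int)
  extends CollectionAccumulator[Array[Byte]] {

  override def merge(other: AccumulatorV2[Array[Byte], JList[Array[Byte]]]): Unit = {
    // Instead of concatenating lists, push the other side's updates to Python.
    sendToPythonDriver(serverHost, serverPort, other.value)
  }

  private def sendToPythonDriver(host: String, port: Int, vals: JList[Array[Byte]]): Unit = {
    // open a socket to (host, port), write vals.size, then each array -
    // see the merge snippet earlier in the thread
  }
}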
I gave this a shot locally -- haven't tested it though. Does this make sense or do you see a reason this won't work?
That would probably work - but at that point we aren't really getting anything from using the CollectionAccumulator base trait are we?
Also, we do still need most of the custom logic (since it's for copying the data from the JVM back to Python during the "merge" step).
You're not having to reimplement all the other methods. That seems like a win? This thing is fundamentally accumulating a collection of things too. @davies?
I don't know about the merge logic. I assume that something here is required to send the data back to the Python driver process in order for the accumulator to work, but I don't know this well. At least, that can stay as-is for now. I didn't actually change it much at all in the branch above, it's mostly indentation changes.
As you pointed out it isn't a big change so I'll go ahead and swap it.
Test build #65394 has finished for PR 14467 at commit
Test build #65678 has finished for PR 14467 at commit
/**
 * Value function - not expected to be called for Python.
 */
def value: Unit = {
Sorry for being naive, but I'm not getting why there's an empty function here?
Test build #65728 has finished for PR 14467 at commit
That LGTM. I'll leave it open a bit for any more comments.
Merged to master |
What changes were proposed in this pull request?
Move the internals of the PySpark accumulator API off the old deprecated API and onto the new accumulator API.
How was this patch tested?
The existing PySpark accumulator tests (both unit tests and the doc tests at the start of accumulator.py).