[SPARK-13186][Streaming]Migrate away from SynchronizedMap #11104

huaxingao · 2016-02-06T23:11:06Z

trait SynchronizedMap in package mutable is deprecated: Synchronization via traits is deprecated as it is inherently unreliable. Change to java.util.concurrent.ConcurrentHashMap instead.

huaxingao · 2016-02-06T23:16:20Z

@holdenk
Could you please review? Thanks!!

holdenk · 2016-02-06T23:54:31Z

external/kafka/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala

 import org.apache.spark.streaming.{Milliseconds, StreamingContext}

+import java.util.concurrent.ConcurrentHashMap
+import scala.collection.convert.decorateAsScala._


I think this needs improved import ordering https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports

What is decorateAsScala needed for?

So based on my reading, I think decorateAsScala is being used in place of the standard Java collection conversions to allow the updates to the underlying Java backing type. However, this breaks the concurrency guarantees, so it doesn't really buy us anything (but if we were in a situation where concurrency didn't matter and we just wanted to interact with some Java types it could be nice).

of course @huaxingao feel free to correct my understanding if I'm off base :)

@holdenk Thanks for your comments. Yes, that's why I have decorateAsScala there.

I am working on changing the code to use Java API for +=, put and getOrElseUpdate. Do we also need concurrency guarantee for ++ and --?

I haven't dug into the code enough to say there is any place where it would be safe to not have concurrency guarantees, so it's probably easier to just use the safe methods: for ++ you can use addAll (putAll on a map) and there is remove as well.
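For illustration, a minimal sketch of the Java-side counterparts of `++=` and `-=` on a `ConcurrentHashMap`. The object name, map contents, and value types are assumptions made up for this example, not code from the PR. Note that on a map the bulk method is `putAll`, and that it is not a single atomic step: concurrent readers may observe only some of the entries while it runs.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

object BulkOpsSketch {
  // Hypothetical map for illustration only.
  val counts = new ConcurrentHashMap[String, Int]()

  def run(): Unit = {
    // Counterpart of `map ++= other`: each entry is inserted thread-safely,
    // but putAll as a whole is not one atomic operation.
    counts.putAll(Map("a" -> 1, "b" -> 2).asJava)
    // Counterpart of `map -= key`.
    counts.remove("a")
  }
}
```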

holdenk · 2016-02-07T00:01:47Z

Thanks for the PR and getting started on this :)

So the first minor thing we can update easily is import ordering: many of the files should follow the style guide https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports

The next is that the asScala conversion that's used in many places removes the concurrency guarantees:

If the wrapped map is synchronized (e.g. from java.util.Collections.synchronizedMap),
it is your responsibility to wrap all
non-atomic operations with underlying.synchronized.
This includes get, as java.util.Map's API does not allow for an
atomic get when null values may be present.

I think rather than wrapping in underlying.synchronized it might be easier to do the operation on the Java API as we did in #11059 (although it's a bit more painful to code this way).
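A hedged sketch of the difference being described, under the assumption of a simple "register once" use case. The object, method names, and types below are hypothetical and do not come from the PR or from #11059. A check-then-act sequence through the Scala view is two separate calls, while `putIfAbsent` on the Java type is a single atomic call.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

object JavaApiSketch {
  // Hypothetical map for illustration; key and value types are assumptions.
  val underlying = new ConcurrentHashMap[String, String]()

  // Check-then-act through the Scala view: two separate calls, so another
  // thread can insert the key between `contains` and the update.
  def registerViaView(key: String, value: String): Unit = {
    val view = underlying.asScala
    if (!view.contains(key)) view(key) = value
  }

  // One atomic call on the Java type: returns null when the key was absent
  // (and `value` was stored), otherwise returns the existing value.
  def registerViaJavaApi(key: String, value: String): String =
    underlying.putIfAbsent(key, value)
}
```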

We should also coordinate with @ted-yu, who I believe is working on corresponding Scala style rules to prevent people from using the unsafe Scala version.

holdenk · 2016-02-07T00:02:36Z

extras/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala

    val awsCredentials = KinesisTestUtils.getAWSCredentials()
-    val collectedData = new mutable.HashMap[Time, (Array[SequenceNumberRanges], Seq[Int])]
-      with mutable.SynchronizedMap[Time, (Array[SequenceNumberRanges], Seq[Int])]
+    val collectedData = new ConcurrentHashMap[Time, (Array[SequenceNumberRanges], Seq[Int])].asScala


This conversion doesn't result in a thread safe hashmap sadly. See the comment in the PR for more details.

holdenk · 2016-02-07T00:03:41Z

@huaxingao did a first quick pass :)

huaxingao · 2016-02-08T19:21:00Z

@holdenk
Could you please review one more time?
I changed to the Java API except for the getOrElseUpdate in KafkaStreamSuite.scala. I couldn't find a Java equivalent that can be done in one line, so I used a synchronized block.
Thank you very much for your help!!

holdenk · 2016-02-08T19:48:51Z

Sure I'll take another look.

holdenk · 2016-02-08T19:51:02Z

external/kafka/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala

-        val count = result.getOrElseUpdate(kv._1, 0) + kv._2
-        result.put(kv._1, count)
+        result.synchronized {
+          val count = result.getOrElseUpdate(kv._1, 0) + kv._2


I think putIfAbsent on the underlying Java type might do what you are looking for here. https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ConcurrentHashMap.html#putIfAbsent(K,%20V)

@holdenk
Thanks for your quick reply.
I initially changed
val count = result.getOrElseUpdate(kv._1, 0) + kv._2
to
result.putIfAbsent(kv._1, 0)
val count = result.get(kv._1) + kv._2
but the test failed for me. I guess a different thread can come in between the two lines, so the concurrency is not guaranteed any more. So I used a synchronized block instead.

So you would probably want to try val count = result.putIfAbsent(kv._1, 0) + kv._2 - although looking at the original code it had a race condition. If we're going to put a synchronized block around the update we could just use a regular mutable.HashMap.
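As a hedged aside on the race mentioned here: `putIfAbsent` returns the previous value (or null when the key was absent) and does not store the sum, so a following `put` is still a separate step. One lock-free alternative, sketched below under the assumption of a Java 7 compatible API, retries with `replace(key, old, new)` until no other thread has changed the entry in between. The helper name `addCount` and the boxed `java.lang.Long` value type are assumptions for illustration, not code from this PR.

```scala
import java.util.concurrent.ConcurrentHashMap

object AtomicCountSketch {
  // Hypothetical helper: atomically add `delta` to the count stored under `key`.
  def addCount(
      counts: ConcurrentHashMap[String, java.lang.Long],
      key: String,
      delta: Long): Unit = {
    var done = false
    while (!done) {
      // Try to install the first count; null means the key was absent.
      val prev = counts.putIfAbsent(key, java.lang.Long.valueOf(delta))
      if (prev == null) {
        done = true
      } else {
        // Succeeds only if the entry still holds `prev`; otherwise retry.
        done = counts.replace(key, prev, java.lang.Long.valueOf(prev.longValue + delta))
      }
    }
  }
}
```

The trade-off is more code than a synchronized block, which is one reason the simpler approach can be preferable in a test suite.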

+1 for using synchronized + mutable.HashMap. In addition, toMap in ret.toMap.foreach can be removed. Hence I would recommend changing the code to

val result = new mutable.HashMap[String, Long]()
stream.map(_._2).countByValue().foreachRDD { r =>
  r.collect().foreach { kv =>
    result.synchronized {
      val count = result.getOrElseUpdate(kv._1, 0) + kv._2
      result.put(kv._1, count)
    }
  }
}

And also change assert to assert(result.synchronized { sent === result })
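The pattern recommended above can be sketched outside of Spark as follows. This is a minimal, self-contained illustration that assumes plain threads stand in for the streaming batches; the object name, the `add` and `snapshot` helpers, and the thread counts are all hypothetical. The key point is that both the update and the read used for the assertion go through the same `result.synchronized` lock.

```scala
import scala.collection.mutable

object SynchronizedCountSketch {
  // Plain HashMap; every access goes through result.synchronized.
  val result = new mutable.HashMap[String, Long]()

  // Read-modify-write under the lock, mirroring the suggested test code.
  def add(key: String, n: Long): Unit = result.synchronized {
    val count = result.getOrElseUpdate(key, 0L) + n
    result.put(key, count)
  }

  // Copy under the lock so the caller compares an immutable snapshot.
  def snapshot(): Map[String, Long] = result.synchronized { result.toMap }

  def main(args: Array[String]): Unit = {
    // Four writer threads, each adding 1 a thousand times to the same key.
    val threads = (1 to 4).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = (1 to 1000).foreach(_ => add("a", 1L))
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    println(snapshot())
  }
}
```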

@holdenk @zsxwing
I tried val count = result.putIfAbsent(kv._1, 0) + kv._2, but the test failed for me. So I will change to mutable.HashMap and put it in a synchronized block.
Is it OK to use mutable.HashMap and a synchronized block in this file only, but use java.util.concurrent.ConcurrentHashMap in other files (StreamingListenerSuite, KinesisStreamTests and FileInputDStream)? Or is it better to use mutable.HashMap and a synchronized block for all the files that have SynchronizedMap?

Using ConcurrentHashMap in other files looks fine. I don't see any potential issues.

holdenk · 2016-02-08T19:59:45Z

So I think the putIfAbsent API might do what you're looking for there.

zsxwing · 2016-02-09T00:01:53Z

By the way, is there any PR removing SynchronizedSet?

huaxingao · 2016-02-09T00:07:06Z

@zsxwing
Thanks for the comments. I didn't see a PR for removing SynchronizedSet. I will work on this.

zsxwing · 2016-02-09T00:12:47Z

streaming/src/test/scala/org/apache/spark/streaming/StreamingListenerSuite.scala

-import scala.collection.mutable.{ArrayBuffer, HashMap, SynchronizedBuffer, SynchronizedMap}
+import java.util.concurrent.ConcurrentHashMap
+
+import scala.collection.convert.decorateAsScala._


Please use import scala.collection.JavaConverters._

zsxwing · 2016-02-09T00:24:22Z

LGTM except some nits. Thanks, @huaxingao

huaxingao · 2016-02-09T00:53:00Z

Fixed the problems. Thank you all very much for your help!!

zsxwing · 2016-02-09T00:59:49Z

ok to test

SparkQA · 2016-02-09T01:18:18Z

Test build #50949 has finished for PR 11104 at commit a56f280.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2016-02-09T01:44:45Z

Sorry for the file line length problem. Fixed.

SparkQA · 2016-02-09T01:53:02Z

Test build #50950 has finished for PR 11104 at commit 5668a79.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-02-09T08:56:32Z

@huaxingao one more style issue and it also needs to be updated with the latest master since there are now merge conflicts.

SparkQA · 2016-02-10T04:52:37Z

Test build #51018 has finished for PR 11104 at commit cde889e.

This patch fails PySpark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

srowen · 2016-02-12T10:43:44Z

@huaxingao can you rebase this?

huaxingao · 2016-02-12T18:10:37Z

@srowen
Will do. My local branch is messed up. If I can't figure out how to fix it, I will close this PR and submit a new one. Also, one of the Python streaming tests failed with java.net.BindException. I am still trying to figure out the problem.

srowen · 2016-02-17T12:00:16Z

@huaxingao OK, do you want to close this and try another PR? I can try to take it on too.
I think this is mostly correct, except we'll need more synchronization in places where the wrapped map is iterated over.

huaxingao · 2016-02-18T08:52:09Z

@srowen @holdenk
I will close this PR and submit a new one. Thanks!

[SPARK-13186][Streaming]Migrate away from SynchronizedMap

24e75ae

holdenk reviewed Feb 6, 2016
holdenk reviewed Feb 7, 2016
huaxingao added 3 commits February 7, 2016 23:11

fix import ordering and concurrency issue

782020b

fix import and concurrency issue

a56f280

fix file line length problem

5668a79

holdenk reviewed Feb 8, 2016
zsxwing reviewed Feb 9, 2016
fix style error

cde889e

huaxingao closed this Feb 18, 2016
