@huaxingao
Contributor

trait SynchronizedMap in package mutable is deprecated: Synchronization via traits is deprecated as it is inherently unreliable. Change to java.util.concurrent.ConcurrentHashMap instead.

@huaxingao
Contributor Author

@holdenk
Could you please review? Thanks!!

import org.apache.spark.streaming.{Milliseconds, StreamingContext}

import java.util.concurrent.ConcurrentHashMap
import scala.collection.convert.decorateAsScala._
Member

What is decorateAsScala needed for?

Contributor

So based on my reading, I think decorateAsScala is being used in place of the standard Java collection conversions to allow updates to the underlying Java backing type. However, this breaks the concurrency guarantees, so it doesn't really buy us anything (though if we were in a situation where concurrency didn't matter and we just wanted to interact with some Java types, it could be nice).

Contributor

of course @huaxingao feel free to correct my understanding if I'm off base :)

Contributor Author

@holdenk Thanks for your comments. Yes, that's why I have decorateAsScala there.

I am working on changing the code to use the Java API for +=, put and getOrElseUpdate. Do we also need a concurrency guarantee for ++ and --?

Contributor

I haven't dug into the code enough to say whether there is any place where it would be safe to not have concurrency guarantees, so it's probably easier to just use the safe methods: for ++ you can use addAll and there is remove as well.
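
A minimal sketch of the ++ and -- replacements on the Java type, assuming the collection is a java.util.concurrent.ConcurrentHashMap. On java.util.Map the bulk insert is putAll (addAll belongs to java.util.Collection). The object name, keys and values are hypothetical.

```scala
import java.util.concurrent.ConcurrentHashMap

object BulkUpdateSketch {
  def run(): ConcurrentHashMap[String, String] = {
    val batches = new ConcurrentHashMap[String, String]()
    batches.put("t0", "stale")

    // Stand-in for `batches ++= incoming`: putAll copies every entry across.
    // Each individual put is thread-safe, but the batch as a whole is not atomic.
    val incoming = new java.util.HashMap[String, String]()
    incoming.put("t1", "first")
    incoming.put("t2", "second")
    batches.putAll(incoming)

    // Stand-in for `batches --= keys`: remove each key on the Java map directly.
    // Removing a key that is not present is a no-op.
    Seq("t0", "missing").foreach(k => batches.remove(k))
    batches
  }
}
```

Other threads may observe the map between individual puts or removes, so this only fits call sites that do not need the whole batch to appear at once.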

@holdenk
Contributor

holdenk commented Feb 7, 2016

Thanks for the PR and getting started on this :)

So the first minor thing we can update easily is the import ordering: in many of the files it should follow the style guide https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports

The next is that the asScala conversion that's used in many places removes the concurrency guarantees:

If the wrapped map is synchronized (e.g. from java.util.Collections.synchronizedMap),
it is your responsibility to wrap all
non-atomic operations with underlying.synchronized.
This includes get, as java.util.Map's API does not allow for an
atomic get when null values may be present.

I think rather than wrapping in underlying.synchronized it might be easier to do the operation on the Java API as we did in #11059 (although it's a bit more painful to code this way).

We should also coordinate with @ted-yu, who I believe is working on corresponding Scala style rules to prevent people from using the unsafe Scala version.
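
The quoted Scaladoc note can be illustrated with a small sketch, assuming a map created by java.util.Collections.synchronizedMap. Single calls such as get or put are synchronized by the wrapper, but a check-then-act sequence is not, so it has to take the map's own lock. The getOrInsert helper is hypothetical.

```scala
import java.util.{Collections, HashMap => JHashMap}

object SynchronizedWrapperSketch {
  // Collections.synchronizedMap locks on the returned map object itself.
  val underlying: java.util.Map[String, Integer] =
    Collections.synchronizedMap(new JHashMap[String, Integer]())

  // get followed by put is a compound operation, so both calls run
  // under the same lock that the wrapper uses internally.
  def getOrInsert(key: String, default: Int): Int = underlying.synchronized {
    val existing = underlying.get(key)
    if (existing == null) {
      underlying.put(key, Integer.valueOf(default))
      default
    } else {
      existing.intValue
    }
  }
}
```

Every compound operation on the map needs the same treatment, which is the burden the comment above suggests avoiding by calling atomic methods on the Java API instead.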

val awsCredentials = KinesisTestUtils.getAWSCredentials()
val collectedData = new mutable.HashMap[Time, (Array[SequenceNumberRanges], Seq[Int])]
with mutable.SynchronizedMap[Time, (Array[SequenceNumberRanges], Seq[Int])]
val collectedData = new ConcurrentHashMap[Time, (Array[SequenceNumberRanges], Seq[Int])].asScala
Contributor

This conversion doesn't result in a thread-safe hash map, sadly. See the comment in the PR for more details.

@holdenk
Contributor

holdenk commented Feb 7, 2016

@huaxingao did a first quick pass :)

@huaxingao
Contributor Author

@holdenk
Could you please review one more time?
I changed to the Java API everywhere except for the getOrElseUpdate in KafkaStreamSuite.scala. I couldn't find a Java equivalent that can be done in one line, so I used a synchronized block.
Thank you very much for your help!!

@holdenk
Contributor

holdenk commented Feb 8, 2016

Sure I'll take another look.

val count = result.getOrElseUpdate(kv._1, 0) + kv._2
result.put(kv._1, count)
result.synchronized {
val count = result.getOrElseUpdate(kv._1, 0) + kv._2
Contributor

I think putIfAbsent on the underlying Java type might do what you are looking for here. https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ConcurrentHashMap.html#putIfAbsent(K,%20V)

Contributor Author

@holdenk
Thanks for your quick reply.
I initially changed
val count = result.getOrElseUpdate(kv._1, 0) + kv._2
to
result.putIfAbsent(kv._1, 0)
val count = result.get(kv._1) + kv._2
but the test failed for me. I guess a different thread can come in between the two lines, so the update is no longer atomic. So I used a synchronized block instead.

Contributor

So you would probably want to try val count = result.putIfAbsent(kv._1, 0) + kv._2, although looking at the original code, it had a race condition. If we're going to put a synchronized block around the update, we could just use a regular mutable.HashMap.
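
A minimal sketch of why the one-liner is fragile, assuming result is a java.util.concurrent.ConcurrentHashMap: putIfAbsent returns the previous value, which is null when the key was absent, so adding to its return value cannot work on the first insertion. The hypothetical addCount helper below performs the increment with only atomic ConcurrentMap calls (putIfAbsent and replace), retrying when another thread wins the race.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.annotation.tailrec

object AtomicCountSketch {
  val result = new ConcurrentHashMap[String, java.lang.Long]()

  // Adds `delta` to the count stored under `key` without an external lock.
  @tailrec
  def addCount(key: String, delta: Long): Unit = {
    // putIfAbsent returns the previous value, or null if the key was absent.
    val previous = result.putIfAbsent(key, java.lang.Long.valueOf(delta))
    if (previous != null) {
      // Key already present: swap in the new total only if nobody changed it
      // in the meantime, otherwise start over with the fresh value.
      val updated = java.lang.Long.valueOf(previous.longValue + delta)
      if (!result.replace(key, previous, updated)) addCount(key, delta)
    }
  }
}
```

On Java 8 and later, ConcurrentHashMap.merge can express the same update in a single call. Whether either form is simpler than a synchronized block around a plain map is a judgment call for the test in question.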

Member

+1 for using synchronized + mutable.HashMap. In addition, toMap in ret.toMap.foreach can be removed. Hence I would recommend changing the code to

    val result = new mutable.HashMap[String, Long]()
    stream.map(_._2).countByValue().foreachRDD { r =>
      r.collect().foreach { kv =>
        result.synchronized {
          val count = result.getOrElseUpdate(kv._1, 0) + kv._2
          result.put(kv._1, count)
        }
      }
    }

Member

And also change assert to assert(result.synchronized { sent === result })
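
The synchronized + mutable.HashMap approach can be sketched without Spark, using plain threads as a stand-in for the streaming callbacks. The object and method names are hypothetical. The point is that writers and readers take the same lock, which is also why the assert above reads result inside result.synchronized.

```scala
import scala.collection.mutable

object LockedCounterSketch {
  val result = new mutable.HashMap[String, Long]()

  // Read-modify-write under the map's own monitor, as in the suggested change.
  def record(word: String, n: Long): Unit = result.synchronized {
    val count = result.getOrElseUpdate(word, 0L) + n
    result.put(word, count)
  }

  // Readers take the same lock and copy, so they never see a half-applied update.
  def snapshot(): Map[String, Long] = result.synchronized { result.toMap }

  def main(args: Array[String]): Unit = {
    // Four writer threads, each recording the same key 1000 times.
    val threads = (1 to 4).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = (1 to 1000).foreach(_ => record("a", 1L))
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    println(snapshot())
  }
}
```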

Contributor Author

@holdenk @zsxwing
I tried val count = result.putIfAbsent(kv._1, 0) + kv._2, but the test failed for me. So I will change to mutable.HashMap and put the update in a synchronized block.
Is it OK to use mutable.HashMap and a synchronized block in this file only, but use java.util.concurrent.ConcurrentHashMap in the other files (StreamingListenerSuite, KinesisStreamTests and FileInputDStream)? Or is it better to use mutable.HashMap and a synchronized block for all the files that have SynchronizedMap?

Member

Using ConcurrentHashMap in other files looks fine. I don't see any potential issues.

@holdenk
Contributor

holdenk commented Feb 8, 2016

So I think the putIfAbsent API might do what you're looking for there.

@zsxwing
Member

zsxwing commented Feb 9, 2016

By the way, is there any PR removing SynchronizedSet?

@huaxingao
Contributor Author

@zsxwing
Thanks for the comments. I didn't see a PR for removing SynchronizedSet. I will work on this.

import scala.collection.mutable.{ArrayBuffer, HashMap, SynchronizedBuffer, SynchronizedMap}
import java.util.concurrent.ConcurrentHashMap

import scala.collection.convert.decorateAsScala._
Member

Please use import scala.collection.JavaConverters._

@zsxwing
Member

zsxwing commented Feb 9, 2016

LGTM except some nits. Thanks, @huaxingao

@huaxingao
Contributor Author

Fixed the problems. Thank you all very much for your help!!

@zsxwing
Member

zsxwing commented Feb 9, 2016

ok to test

@SparkQA

SparkQA commented Feb 9, 2016

Test build #50949 has finished for PR 11104 at commit a56f280.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

Sorry for the file line length problem. Fixed.

@SparkQA

SparkQA commented Feb 9, 2016

Test build #50950 has finished for PR 11104 at commit 5668a79.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Feb 9, 2016

@huaxingao one more style issue and it also needs to be updated with the latest master since there are now merge conflicts.

@SparkQA

SparkQA commented Feb 10, 2016

Test build #51018 has finished for PR 11104 at commit cde889e.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Feb 12, 2016

@huaxingao can you rebase this?

@huaxingao
Contributor Author

@srowen
Will do. I have my local branch messed up. If I can't figure out how to fix it, I will close this PR and submit a new one. Also, one of the Python streaming tests failed with java.net.BindException. I am still trying to figure out the problem.

@srowen
Member

srowen commented Feb 17, 2016

@huaxingao OK, do you want to close this and try another PR? I can try to take it on too.
I think this is mostly correct, except we'll need more synchronization in places where the wrapped map is iterated over.
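
One way to read the iteration concern, sketched under the assumption that the map is a java.util.concurrent.ConcurrentHashMap: its iterators are weakly consistent, so a loop over the live map may or may not observe updates made while it runs. Copying the entries first gives the loop a stable view. The object and method names are hypothetical.

```scala
import java.util.concurrent.ConcurrentHashMap

object IterationSketch {
  val received = new ConcurrentHashMap[String, Integer]()

  // Copy the entries first, then iterate over the private copy so that
  // concurrent writers cannot change what the loop sees halfway through.
  def sumOfValues(): Int = {
    val copy = new java.util.HashMap[String, Integer](received)
    var total = 0
    val it = copy.values().iterator()
    while (it.hasNext) total += it.next().intValue
    total
  }
}
```

The copy itself is not an atomic snapshot. If a test needs an exact point-in-time view, writers and the reader would have to share a lock instead.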

@huaxingao
Contributor Author

@srowen @holdenk
I will close this PR and submit a new one. Thanks!

@huaxingao huaxingao closed this Feb 18, 2016