[SPARK-2251] fix concurrency issues in random sampler #1229
Conversation
Could dropping this implicit break source and binary compatibility? I think we'd like to avoid asking people to make code changes to upgrade to a bug-fix release, even if the APIs are marked as developer. Can you just leave the existing argument and ignore it?
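The PR removed an implicit Random parameter, and the review suggests keeping it for compatibility. A minimal, hypothetical sketch of that idea (the class and parameter names are illustrative, not the actual Spark source):

~~~scala
import java.util.Random

// Sketch only: the old implicit Random argument stays in the signature so
// existing callers keep compiling and linking, but the sampler ignores it
// and owns its own generator, which can be re-seeded per partition.
class BernoulliSampler[T](lb: Double, ub: Double)(implicit ignored: Random = new Random) {
  private val rng = new Random()          // per-instance generator, not shared

  def setSeed(seed: Long): Unit = rng.setSeed(seed)

  def sample(items: Iterator[T]): Iterator[T] =
    items.filter(_ => { val x = rng.nextDouble(); x >= lb && x < ub })
}
~~~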
The following code is very likely to throw an exception:

~~~
val rdd = sc.parallelize(0 until 111, 10).sample(false, 0.1)
rdd.zip(rdd).count()
~~~

because the same random number generator is used in compute partitions.

Author: Xiangrui Meng <[email protected]>

Closes apache#1229 from mengxr/fix-sample and squashes the following commits:

f1ee3d7 [Xiangrui Meng] fix concurrency issues in random sampler
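The gist of the fix is to stop sharing one generator across partitions that may be computed concurrently. A hedged sketch of that idea using plain RDD operations (it assumes a SparkContext `sc`, as in the snippet above, and is not the PR's actual implementation, which clones the sampler and re-seeds it per partition):

~~~scala
import java.util.Random

val baseSeed = 42L      // illustrative seed
val fraction = 0.1

// Each compute of a partition builds its own Random, seeded from the base
// seed and the partition index, so concurrent tasks never share state and
// re-computation of a partition yields the same sample.
val sampled = sc.parallelize(0 until 111, 10).mapPartitionsWithIndex { (idx, iter) =>
  val rng = new Random(baseSeed + idx)
  iter.filter(_ => rng.nextDouble() < fraction)
}

sampled.zip(sampled).count()   // no longer races on a shared generator
~~~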
Hi all, I'm getting a similar problem using k-means clustering with Spark 1.5.1. The stack trace is below. Any clue? Thank you in advance. Some good links are:
My RDD comes from an HBase table, which is growing. When I suspend the row insertion, the problem doesn't happen. If the RDD is cached, should the problem still occur? Is there any way to "freeze" the RDD at some point so it can be used without trouble? Regards, PA
@pauloangelo It sounds like your RDD is not immutable then, in which case many bets are off. RDDs are generally expected to yield the same data every time they are computed.
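For what it's worth, one way to approximate "freezing" such an RDD is to materialize it once so later actions do not re-read the changing table. A rough sketch under stated assumptions (`hbaseRdd` and the checkpoint directory are illustrative, not from this thread):

~~~scala
import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // assumed reliable path

val snapshot = hbaseRdd.persist(StorageLevel.MEMORY_AND_DISK)
snapshot.checkpoint()   // request checkpointing before the first action on this RDD
snapshot.count()        // materializes the data as of this point in time
~~~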