
Conversation

@darabos (Contributor) commented Jul 8, 2015

https://issues.apache.org/jira/browse/SPARK-8893

What does `sc.parallelize(1 to 3).repartition(p).collect` return? I would expect `Array(1, 2, 3)` regardless of p. But if p < 1, it returns `Array()`. I think it should throw an `IllegalArgumentException` instead.

I think the case is pretty clear for p < 0. But the behavior for p = 0 is also error-prone. In fact, that's how I found this strange behavior. I used `rdd.repartition(a/b)` with positive a and b, but a/b was rounded down to zero and the results surprised me. I'd prefer an exception over unexpected (corrupt) results.
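The integer-division pitfall described above is easy to reproduce in plain Scala. The guard below is only an illustrative sketch of the proposed check (the helper name is made up, and no Spark is involved); it throws `IllegalArgumentException` directly, as this patch originally did:

```scala
// Sketch of the pitfall: integer division silently truncates toward zero,
// so repartition(a / b) can accidentally request zero partitions.
object RepartitionPitfall {
  // Illustrative guard throwing IllegalArgumentException, as first proposed.
  def checkNumPartitions(p: Int): Int =
    if (p > 0) p
    else throw new IllegalArgumentException(
      s"Number of partitions ($p) must be positive.")

  def main(args: Array[String]): Unit = {
    val (a, b) = (3, 4)
    println(a / b)                  // prints 0, not 0.75: truncating division
    println(checkNumPartitions(4))  // 4: a valid count passes through
    try checkNumPartitions(a / b)   // 0 is rejected loudly instead of silently
    catch { case e: IllegalArgumentException => println(e.getMessage) }
  }
}
```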

Review comment from a Member:
Although I think you could use `require()` here, the change itself LGTM. I don't see a reason to allow repartitioning to 0 partitions.

@darabos (Contributor, Author) replied:

> Although I think you could use `require()` here, the change itself LGTM. I don't see a reason to allow repartitioning to 0 partitions.

Thanks! I've switched to require().
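For context, Scala's `require` throws an `IllegalArgumentException` whose message is prefixed with `requirement failed:`, which is exactly the form that shows up in the test failure later in this thread. A minimal illustration (the method name here is made up for the demo):

```scala
// Scala's require adds the "requirement failed: " prefix to the message
// of the IllegalArgumentException it throws.
object RequireDemo {
  def viaRequire(p: Int): Unit =
    require(p > 0, s"Number of partitions ($p) must be positive.")

  def main(args: Array[String]): Unit = {
    try viaRequire(0)
    catch {
      // prints: requirement failed: Number of partitions (0) must be positive.
      case e: IllegalArgumentException => println(e.getMessage)
    }
  }
}
```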

@SparkQA commented Jul 8, 2015

Test build #1009 has finished for PR 7285 at commit d5e3df8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@darabos (Contributor, Author) commented Jul 8, 2015

`org.apache.spark.rdd.PairRDDFunctionsSuite` and `org.apache.spark.JavaAPISuite` trigger the checks. I'll try to do something.

@srowen (Member) commented Jul 8, 2015

Ah, I think this may have to be a check higher up, on the argument to `repartition`? This looks too low-level. An RDD with 0 partitions is OK, just not repartitioning a (non-empty) RDD to 0 partitions.

```
[info] - zero-partition RDD *** FAILED *** (22 milliseconds)
[info]   java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
[info]   at scala.Predef$.require(Predef.scala:233)
[info]   at org.apache.spark.HashPartitioner.<init>(Partitioner.scala:79)
[info]   at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
[info]   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290)
[info]   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290)
[info]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
[info]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
[info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
[info]   at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:289)
[info]   at org.apache.spark.rdd.PairRDDFunctionsSuite$$anonfun$27.apply$mcV$sp(PairRDDFunctionsSuite.scala:388)
[info]   at org.apache.spark.rdd.PairRDDFunctionsSuite$$anonfun$27.apply(PairRDDFunctionsSuite.scala:381)
[info]   at org.apache.spark.rdd.PairRDDFunctionsSuite$$anonfun$27.apply(PairRDDFunctionsSuite.scala:381)
```

@andrewor14 (Contributor) commented:
+1 to @srowen's suggestion

@srowen (Member) commented Jul 13, 2015

@darabos are you going to update this one? I think it can be an easy fix.

darabos added 2 commits on July 13, 2015. Commit message:

> There are valid cases where 0-size HashPartitioners are created (such as running groupByKey on an empty RDD). As long as they don't call getPartition there is nothing wrong with this. getPartition will try to divide by zero when it is called in this case, so there is no risk of silent mistakes. For negative partition counts getPartition would return bogus results, so the assertion against that remains.

@darabos (Contributor, Author) commented Jul 13, 2015

> Ah, I think this may have to be a check higher up, on the argument to repartition? this looks too low level. An RDD with 0 partitions is OK, just not repartitioning a (non-empty) RDD to 0 partitions.

`repartition` just calls `coalesce`, which just creates a `CoalescedRDD`, and that is where I put the assertion. That assertion is fine, I think: it was not triggered during the tests, and it does not interfere with zero-partition RDDs. (Okay, it would prevent repartitioning an empty RDD into zero partitions; I've added a condition now to allow that.)
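The condition described here can be sketched in plain Scala; the helper and parameter names below are illustrative, not the actual `CoalescedRDD` code. The idea: a target of zero partitions is allowed only when the parent RDD is itself empty of partitions.

```scala
// Sketch of the assertion described above: allow maxPartitions == 0 only
// when the parent RDD already has zero partitions (names are illustrative).
object CoalesceGuardSketch {
  def checkCoalesce(maxPartitions: Int, parentPartitions: Int): Unit =
    require(maxPartitions > 0 || parentPartitions == 0,
      s"Number of partitions ($maxPartitions) must be positive.")

  def main(args: Array[String]): Unit = {
    checkCoalesce(2, 5)   // fine: positive target
    checkCoalesce(0, 0)   // fine: empty RDD coalesced to zero partitions
    try checkCoalesce(0, 5)  // rejected: non-empty RDD to zero partitions
    catch { case e: IllegalArgumentException => println(e.getMessage) }
  }
}
```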

The tests triggered the other assertion, in `HashPartitioner`. The failures make it clear that there are valid cases where zero-size `HashPartitioner`s are created (such as running `groupByKey` on an empty RDD). As long as they don't call `getPartition`, there is nothing wrong with this. `getPartition` will try to divide by zero when it is called in this case, so it will be detected without my assertion.

For negative partition counts, though, `getPartition` would silently return bogus (positive) results, so I kept the assertion against negative partition counts. I admit it's a bit silly. Let me know what you think.
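To make those failure modes concrete, here is a simplified plain-Scala reconstruction of the key-to-partition arithmetic (a `nonNegativeMod`-style helper; this is a sketch, not the Spark source). A zero modulus fails loudly with an `ArithmeticException`, while a negative modulus can produce a plausible-looking positive index:

```scala
// Simplified sketch of hash-partitioning arithmetic.
object HashPartitionSketch {
  // Maps x into [0, mod) for positive mod, mirroring a nonNegativeMod helper.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  def main(args: Array[String]): Unit = {
    println(nonNegativeMod(5, 3))   // 2: a valid partition index
    println(nonNegativeMod(5, -3))  // 2: looks valid, but -3 partitions is nonsense
    try nonNegativeMod(5, 0)        // division by zero: at least this fails loudly
    catch { case e: ArithmeticException => println(e.getMessage) }
  }
}
```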

@srowen (Member) commented Jul 13, 2015

The check might be able to live a little lower down than `repartition`, yes; the point is that `HashPartitioner` should accept 0 partitions. Looks good for a re-test.

@SparkQA commented Jul 13, 2015

Test build #1058 has finished for PR 7285 at commit decba82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Jul 13, 2015

I think this is OK, and an improvement. Hm: if an RDD is empty, should it be OK to repartition it to 0 partitions? That seems theoretically OK. Maybe not worth specially allowing. I think this change would prohibit it.

@asfgit asfgit closed this in 0115516 Jul 16, 2015
