
Conversation

@darabos (Contributor) commented Jul 8, 2015

https://issues.apache.org/jira/browse/SPARK-8893

What does `sc.parallelize(1 to 3).repartition(p).collect` return? I would expect `Array(1, 2, 3)` regardless of p. But if p < 1, it returns `Array()`. I think it should throw an `IllegalArgumentException` instead.

I think the case is pretty clear for p < 0. But the behavior for p = 0 is also error-prone. In fact, that's how I found this strange behavior. I used `rdd.repartition(a/b)` with positive a and b, but a/b was rounded down to zero and the results surprised me. I'd prefer an exception over unexpected (corrupt) results.
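The integer-division pitfall described above is easy to reproduce in plain Scala. The guard below is only an illustrative sketch of the proposed check (the helper name is made up, and no Spark is involved); it throws `IllegalArgumentException` directly, as this patch originally did:

```scala
// Sketch of the pitfall: integer division silently truncates toward zero,
// so repartition(a / b) can accidentally request zero partitions.
object RepartitionPitfall {
  // Illustrative guard throwing IllegalArgumentException, as first proposed.
  def checkNumPartitions(p: Int): Int =
    if (p > 0) p
    else throw new IllegalArgumentException(
      s"Number of partitions ($p) must be positive.")

  def main(args: Array[String]): Unit = {
    val (a, b) = (3, 4)
    println(a / b)                  // prints 0, not 0.75: truncating division
    println(checkNumPartitions(4))  // 4: a valid count passes through
    try checkNumPartitions(a / b)   // 0 is rejected loudly instead of silently
    catch { case e: IllegalArgumentException => println(e.getMessage) }
  }
}
```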

Review comment from a Member:
Although I think you could use `require()` here, the change itself LGTM. I don't see a reason to allow repartitioning to 0 partitions.

@darabos (Contributor, Author) replied:

> Although I think you could use `require()` here, the change itself LGTM. I don't see a reason to allow repartitioning to 0 partitions.

Thanks! I've switched to require().
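For context, Scala's `require` throws an `IllegalArgumentException` whose message is prefixed with `requirement failed:`, which is exactly the form that shows up in the test failure later in this thread. A minimal illustration (the method name here is made up for the demo):

```scala
// Scala's require adds the "requirement failed: " prefix to the message
// of the IllegalArgumentException it throws.
object RequireDemo {
  def viaRequire(p: Int): Unit =
    require(p > 0, s"Number of partitions ($p) must be positive.")

  def main(args: Array[String]): Unit = {
    try viaRequire(0)
    catch {
      // prints: requirement failed: Number of partitions (0) must be positive.
      case e: IllegalArgumentException => println(e.getMessage)
    }
  }
}
```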

@SparkQA commented Jul 8, 2015

Test build #1009 has finished for PR 7285 at commit d5e3df8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@darabos (Contributor, Author) commented Jul 8, 2015

`org.apache.spark.rdd.PairRDDFunctionsSuite` and `org.apache.spark.JavaAPISuite` trigger the checks. I'll try to do something.

@srowen (Member) commented Jul 8, 2015

Ah, I think this may have to be a check higher up, on the argument to `repartition`? This looks too low-level. An RDD with 0 partitions is OK, just not repartitioning a (non-empty) RDD to 0 partitions.

```
[info] - zero-partition RDD *** FAILED *** (22 milliseconds)
[info]   java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
[info]   at scala.Predef$.require(Predef.scala:233)
[info]   at org.apache.spark.HashPartitioner.<init>(Partitioner.scala:79)
[info]   at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
[info]   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290)
[info]   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290)
[info]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
[info]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
[info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
[info]   at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:289)
[info]   at org.apache.spark.rdd.PairRDDFunctionsSuite$$anonfun$27.apply$mcV$sp(PairRDDFunctionsSuite.scala:388)
[info]   at org.apache.spark.rdd.PairRDDFunctionsSuite$$anonfun$27.apply(PairRDDFunctionsSuite.scala:381)
[info]   at org.apache.spark.rdd.PairRDDFunctionsSuite$$anonfun$27.apply(PairRDDFunctionsSuite.scala:381)
```

@andrewor14 (Contributor) commented:
+1 to @srowen's suggestion

@srowen (Member) commented Jul 13, 2015

@darabos are you going to update this one? I think it can be an easy fix.

darabos added 2 commits on July 13, 2015. Commit message:

> There are valid cases where 0-size HashPartitioners are created (such as running groupByKey on an empty RDD). As long as they don't call getPartition there is nothing wrong with this. getPartition will try to divide by zero when it is called in this case, so there is no risk of silent mistakes. For negative partition counts getPartition would return bogus results, so the assertion against that remains.

@darabos (Contributor, Author) commented Jul 13, 2015

> Ah, I think this may have to be a check higher up, on the argument to repartition? this looks too low level. An RDD with 0 partitions is OK, just not repartitioning a (non-empty) RDD to 0 partitions.

`repartition` just calls `coalesce`, which just creates a `CoalescedRDD`, and that is where I put the assertion. That assertion is fine, I think: it was not triggered during the tests, and it does not interfere with zero-partition RDDs. (Okay, it would prevent repartitioning an empty RDD into zero partitions; I've added a condition now to allow that.)
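The condition described here can be sketched in plain Scala; the helper and parameter names below are illustrative, not the actual `CoalescedRDD` code. The idea: a target of zero partitions is allowed only when the parent RDD is itself empty of partitions.

```scala
// Sketch of the assertion described above: allow maxPartitions == 0 only
// when the parent RDD already has zero partitions (names are illustrative).
object CoalesceGuardSketch {
  def checkCoalesce(maxPartitions: Int, parentPartitions: Int): Unit =
    require(maxPartitions > 0 || parentPartitions == 0,
      s"Number of partitions ($maxPartitions) must be positive.")

  def main(args: Array[String]): Unit = {
    checkCoalesce(2, 5)   // fine: positive target
    checkCoalesce(0, 0)   // fine: empty RDD coalesced to zero partitions
    try checkCoalesce(0, 5)  // rejected: non-empty RDD to zero partitions
    catch { case e: IllegalArgumentException => println(e.getMessage) }
  }
}
```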

The tests triggered the other assertion, in `HashPartitioner`. The failures make it clear that there are valid cases where zero-size `HashPartitioner`s are created (such as running `groupByKey` on an empty RDD). As long as they don't call `getPartition`, there is nothing wrong with this. `getPartition` will try to divide by zero when it is called in this case, so it will be detected without my assertion.

For negative partition counts, though, `getPartition` would silently return bogus (positive) results, so I kept the assertion against negative partition counts. I admit it's a bit silly. Let me know what you think.
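To make those failure modes concrete, here is a simplified plain-Scala reconstruction of the key-to-partition arithmetic (a `nonNegativeMod`-style helper; this is a sketch, not the Spark source). A zero modulus fails loudly with an `ArithmeticException`, while a negative modulus can produce a plausible-looking positive index:

```scala
// Simplified sketch of hash-partitioning arithmetic.
object HashPartitionSketch {
  // Maps x into [0, mod) for positive mod, mirroring a nonNegativeMod helper.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  def main(args: Array[String]): Unit = {
    println(nonNegativeMod(5, 3))   // 2: a valid partition index
    println(nonNegativeMod(5, -3))  // 2: looks valid, but -3 partitions is nonsense
    try nonNegativeMod(5, 0)        // division by zero: at least this fails loudly
    catch { case e: ArithmeticException => println(e.getMessage) }
  }
}
```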

@srowen (Member) commented Jul 13, 2015

The check might be able to live a little lower down than `repartition`, yes; the point is that `HashPartitioner` should accept 0 partitions. Looks good for a re-test.

@SparkQA commented Jul 13, 2015

Test build #1058 has finished for PR 7285 at commit decba82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Jul 13, 2015

I think this is OK, and an improvement. Hm: if an RDD is empty, should it be OK to repartition it to 0 partitions? That seems theoretically OK. Maybe not worth specially allowing. I think this change would prohibit it.

@asfgit asfgit closed this in 0115516 Jul 16, 2015
