Fix #SPARK-1149 Bad partitioners can cause Spark to hang #44
Conversation
Can one of the admins verify this patch?
require(partitions.forall(rddPartitions.contains, "partition index out of range")), more intuitive?
Is require(partitions.forall(rddPartitions.contains(_)), "partition index out of range") more readable?
Oops, I see the parentheses are mismatched in my earlier comment.
Jenkins, this is ok to test
Merged build triggered.
Merged build started.
Merged build finished.
All automated tests passed.
Do you mind explaining a bit more the case where these two will not match? I'm just wondering if it makes more sense to check this invariant inside the getPartitions function of ShuffleRDD.scala, but maybe there are other code paths where this could get messed up that don't go through that.
val partitioner = new Partitioner {
override def numPartitions: Int = 2
override def getPartition(key: Any): Int = key.hashCode() % 2
}
val pairs = sc.parallelize(Array((1, 2), (3, 4), (5, 6), (-1, 7)))
val shuffled = pairs.partitionBy(partitioner)
shuffled.count
This throws an error here, which is the expected result.
val pairs = sc.parallelize(Array((1, 2), (3, 4), (5, 6), (6, 7)))
val shuffled = pairs.partitionBy(partitioner)
shuffled.lookup(-1)
The error is recorded in the log, but Spark hangs.
I just scanned the code; this issue (partitions not matching rdd.partitions.map(_.index)) can only happen when the computation relies on the correctness of the partitioner.
In the current implementation there are only two such cases.
The first is lookup, where the computation relies on the correctness of getPartition():
def lookup(key: K): Seq[V] = {
self.partitioner match {
case Some(p) =>
val index = p.getPartition(key)
def process(it: Iterator[(K, V)]): Seq[V] = {
val buf = new ArrayBuffer[V]
for ((k, v) <- it if k == key) {
buf += v
}
buf
}
val res = self.context.runJob(self, process _, Array(index), false)
res(0)
case None =>
self.filter(_._1 == key).map(_._2).collect()
}
}
The other case is ShuffleMapTask:
// Write the map output to its associated buckets.
for (elem <- rdd.iterator(split, context)) {
val pair = elem.asInstanceOf[Product2[Any, Any]]
val bucketId = dep.partitioner.getPartition(pair._1)
shuffle.writers(bucketId).write(pair)
}
I'm not sure which fix is better: adding a check in SparkContext, or adding specific checks in these two places separately.
My feeling is that without reading the code, it is hard to see why partitions would not match rdd.partitions (and looking at how SparkContext runs the job only adds to the confusion, because partitions is derived exactly from "0 until rdd.partitions.size"):
/**
* Run a job on all partitions in an RDD and return the results in an array.
*/
def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
runJob(rdd, func, 0 until rdd.partitions.size, false)
}
I don't understand what you mean.
An improperly designed Partitioner causes the partition index to go out of range:
val partitioner = new Partitioner {
override def numPartitions: Int = 2
override def getPartition(key: Any): Int = key.hashCode() % 2
}
partitioner.getPartition(-1) returns -1.
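For comparison, a well-behaved partitioner has to keep every index inside [0, numPartitions). Here is a minimal sketch of the same partitioner with a non-negative modulo (the name safePartitioner is hypothetical, assuming the same spark-shell context as the snippets above):
// Sketch: keep the partition index in [0, numPartitions), even for keys
// whose hashCode is negative (e.g. -1).
val safePartitioner = new Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = {
    val mod = key.hashCode() % numPartitions
    if (mod < 0) mod + numPartitions else mod  // non-negative modulo
  }
}
safePartitioner.getPartition(-1)  // 1, instead of -1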
@witgo so in the case you mentioned, why not put this check in the constructor of ShuffleRDD? It seems more natural to check it there rather than inside runJob.
The correctness of the partitioner depends on the input key. In the current example, keys >= 0 cause no problem, so this cannot be detected in the constructor.
Hi guys, I think it is better to make sure Spark doesn't hang when an incorrect partition index is given, because there may be other code paths that run a job. Given the two places @CodingCat found, I think it shouldn't be too hard to fix those. @witgo do you mind doing that instead? One thing for sure is we shouldn't add a check per key -- that can be too expensive.
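To illustrate one of the two targeted fixes (a sketch only, not the change in this patch), lookup() could validate the index returned by getPartition before calling runJob:
def lookup(key: K): Seq[V] = {
  self.partitioner match {
    case Some(p) =>
      val index = p.getPartition(key)
      // Hypothetical guard (not in this patch): fail fast when a custom
      // partitioner returns an index outside [0, numPartitions) instead of
      // letting the job hang.
      require(index >= 0 && index < p.numPartitions,
        "getPartition returned out-of-range index " + index + " for key " + key)
      def process(it: Iterator[(K, V)]): Seq[V] = {
        val buf = new ArrayBuffer[V]
        for ((k, v) <- it if k == key) {
          buf += v
        }
        buf
      }
      val res = self.context.runJob(self, process _, Array(index), false)
      res(0)
    case None =>
      self.filter(_._1 == key).map(_._2).collect()
  }
}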
Yes, adding too many require statements is unwise. We should guarantee an error is thrown when appropriate, and leave the rest for the developer to resolve.
Merged build triggered.
Merged build started.
Merged build finished.
All automated tests passed.
This check as written is going to have quadratic complexity. If you have 100 partitions for example, you're going to create a list of length 100 at the top and then check for all 100 partitions whether they're in that list, getting 10,000 operations. Can't you just check that all the indices in partitions are between 0 and rdd.partitions.size? I don't think RDDs can have non-contiguous partition numbers, though there might have been some stuff in the past with partition pruning that I may be misremembering.
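Roughly, the difference looks like this (a standalone sketch, where rdd and partitions stand for the runJob arguments):
// Membership check: builds the index list (size m) and scans it once per
// requested partition (p of them), roughly p * m comparisons.
val rddPartitions = rdd.partitions.map(_.index)
val slowCheck = partitions.forall(rddPartitions.contains(_))

// Range check: relies on indices being contiguous; one bounds test per
// requested partition, roughly p comparisons.
val fastCheck = partitions.forall(i => i >= 0 && i < rdd.partitions.size)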
Sure. Would this be better?
require(partitions.toSet.diff(rdd.partitions.map(_.index).toSet).isEmpty, "partition index out of range")
Or this?
val partitionRange = (0 until rdd.partitions.size)
require(partitions.forall(partitionRange.contains(_)), "partition index out of range")
Merged build triggered.
Merged build started.
Merged build finished.
All automated tests passed.
nit: You could just do partitions.forall(partitionRange.contains)
I think my version is more readable.
Two questions:
- Does Spark guarantee that an RDD has contiguous partition indices? (I think so, since https://spark-project.atlassian.net/browse/SPARK-911 is still there.)
- Should we put the check here, or check inside the specific APIs? (The current issue looks more like a bug in lookup(): it forgets to check the return value of getPartition before using it.)
- It can be changed to:
require(partitions.toSet.diff(rdd.partitions.map(_.index).toSet).isEmpty, "partition index out of range")
- A custom Partitioner causes a lot of problems:
val partitioner = new Partitioner {
override def numPartitions: Int = 2
override def getPartition(key: Any): Int = key.hashCode() % 2
}
val pairs = sc.parallelize(Array((1, 2), (3, 4), (5, 6), (-1, 7)))
val shuffled = pairs.partitionBy(partitioner)
shuffled.count
Will this code throw an IndexOutOfRange exception?
Take a look at use of PartitionPruningRDD ..
java.lang.ArrayIndexOutOfBoundsException
Even PartitionPruningRDD ensures a contiguous index space... I think so:
class PruneDependency[T](rdd: RDD[T], @transient partitionFilterFunc: Int => Boolean)
extends NarrowDependency[T](rdd) {
@transient
val partitions: Array[Partition] = rdd.partitions
.filter(s => partitionFilterFunc(s.index)).zipWithIndex
.map { case(split, idx) => new PartitionPruningRDDPartition(idx, split) : Partition }
override def getParents(partitionId: Int) = {
List(partitions(partitionId).asInstanceOf[PartitionPruningRDDPartition].parentSplit.index)
}
}
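A quick illustration of that, assuming PartitionPruningRDD.create is available in this version (the RDD values here are arbitrary):
import org.apache.spark.rdd.PartitionPruningRDD

// Prune a 4-partition RDD down to two of its parent partitions.
val rdd = sc.parallelize(1 to 100, 4)
val pruned = PartitionPruningRDD.create(rdd, idx => idx == 1 || idx == 3)
// The surviving partitions are re-numbered contiguously from 0.
pruned.partitions.map(_.index)  // Array(0, 1), not Array(1, 3)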
Yes, you're right. The code has been modified to:
require(partitions.toSet.diff(rdd.partitions.map(_.index).toSet).isEmpty, "partition index out of range")
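With this require at the top of runJob, the earlier reproduction should fail fast instead of hanging; roughly (expected behaviour sketched, not a verbatim log):
val pairs = sc.parallelize(Array((1, 2), (3, 4), (5, 6), (6, 7)))
val shuffled = pairs.partitionBy(partitioner)  // the bad partitioner from the examples above
shuffled.lookup(-1)
// java.lang.IllegalArgumentException: requirement failed: partition index out of range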
Merged build triggered.