From 0fc12d712a7544ec9a522a7fe7d53e697a3f91f2 Mon Sep 17 00:00:00 2001
From: Madhu Siddalingaiah
Date: Thu, 20 Nov 2014 15:13:43 -0500
Subject: [PATCH 1/4] Documentation: add description for repartitionAndSortWithinPartitions

---
 docs/programming-guide.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 49f319ba775e..e90d4bffa19c 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -934,6 +934,12 @@ for details.
  Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them.
  This always shuffles all data over the network.
+
+ repartitionAndSortWithinPartitions(partitioner)
+ Repartition the RDD according to the given partitioner and, within each resulting partition,
+ sort records by their keys. This is more efficient than calling repartition and then sorting within
+ each partition because it can push the sorting down into the shuffle machinery.
+

 ### Actions

From 332f7a29b3e8ad47dd5a083e6509bdb98e2e555f Mon Sep 17 00:00:00 2001
From: Madhu Siddalingaiah
Date: Mon, 1 Dec 2014 08:47:16 -0500
Subject: [PATCH 2/4] Documentation: replace with

---
 docs/programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 5f44ba79aa92..ded7332a477d 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -935,7 +935,7 @@ for details.
  This always shuffles all data over the network.
- repartitionAndSortWithinPartitions(partitioner)
+ repartitionAndSortWithinPartitions(partitioner)
  Repartition the RDD according to the given partitioner and, within each resulting partition,
  sort records by their keys. This is more efficient than calling repartition and then sorting within
  each partition because it can push the sorting down into the shuffle machinery.
From cbccbfe8b3872ca6885960d33e4a7b19046e6e0c Mon Sep 17 00:00:00 2001
From: Madhu Siddalingaiah
Date: Mon, 1 Dec 2014 08:51:16 -0500
Subject: [PATCH 3/4] Documentation: replace with (again)

---
 docs/programming-guide.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index ded7332a477d..5e0d5c15d706 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -935,9 +935,9 @@ for details.
  This always shuffles all data over the network.
- repartitionAndSortWithinPartitions(partitioner)
+ repartitionAndSortWithinPartitions(partitioner)
  Repartition the RDD according to the given partitioner and, within each resulting partition,
- sort records by their keys. This is more efficient than calling repartition and then sorting within
+ sort records by their keys. This is more efficient than calling repartition and then sorting within
  each partition because it can push the sorting down into the shuffle machinery.

From 79e679fd4aee8284c774267e4d21e764a39f62cb Mon Sep 17 00:00:00 2001
From: Madhu Siddalingaiah
Date: Wed, 17 Dec 2014 10:00:02 -0500
Subject: [PATCH 4/4] [DOC]: improve documentation

---
 core/src/main/scala/org/apache/spark/Partition.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/Partition.scala b/core/src/main/scala/org/apache/spark/Partition.scala
index 27892dbd2a0b..dd3f28e4197e 100644
--- a/core/src/main/scala/org/apache/spark/Partition.scala
+++ b/core/src/main/scala/org/apache/spark/Partition.scala
@@ -18,11 +18,11 @@
 package org.apache.spark

 /**
- * A partition of an RDD.
+ * An identifier for a partition in an RDD.
  */
 trait Partition extends Serializable {
   /**
-   * Get the split's index within its parent RDD
+   * Get the partition's index within its parent RDD
    */
   def index: Int
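
Reviewer note: the description added in patches 1-3 may be easier to check against a concrete result. The sketch below is not Spark code. It is a plain-collections model of what `repartitionAndSortWithinPartitions` is documented to return: records grouped by a partitioner, then ordered by key inside each partition. The names `RepartitionAndSortSketch`, `partitionOf`, and `repartitionAndSort` are hypothetical and exist only for this illustration. It models the result only, not the efficiency gain, since Spark performs the sort inside the shuffle rather than as a separate pass.

```scala
// Plain-collections model of the documented behaviour; no Spark dependency.
object RepartitionAndSortSketch {

  // Hypothetical stand-in for a hash partitioner: maps a key to a
  // partition index in [0, numPartitions), also for negative hash codes.
  def partitionOf[K](key: K, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // Assign every record to a partition, then sort each partition by key.
  // Element i of the result models the contents of partition i.
  def repartitionAndSort[K: Ordering, V](
      records: Seq[(K, V)],
      numPartitions: Int): Vector[Seq[(K, V)]] = {
    val grouped = records.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions) { i =>
      grouped.getOrElse(i, Seq.empty[(K, V)]).sortBy(_._1)
    }
  }

  def main(args: Array[String]): Unit = {
    val records = Seq(5 -> "e", 2 -> "b", 4 -> "d", 1 -> "a", 3 -> "c")
    val parts = repartitionAndSort(records, numPartitions = 2)
    parts.zipWithIndex.foreach { case (p, i) => println(s"partition $i: $p") }
  }
}
```

With an actual pair RDD whose key type has an `Ordering`, the corresponding call would be `rdd.repartitionAndSortWithinPartitions(new HashPartitioner(2))`, using `org.apache.spark.HashPartitioner`.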