
Commit 6d8c4c6

gatorsmile authored and rxin committed
[SPARK-11914][SQL] Support coalesce and repartition in Dataset APIs
This PR adds the two common methods `coalesce` and `repartition` to the Dataset APIs. After reading the comments on SPARK-9999, I am unclear about the plan for supporting repartitioning in the Dataset APIs. Currently, both the RDD and DataFrame APIs give users the flexibility to control the number of partitions. Most traditional RDBMSs expose the number of partitions, the partitioning columns, and the table partitioning methods to DBAs for performance tuning and storage planning, and these parameters can largely affect query performance. Since the actual performance depends on the workload type, I think it is almost impossible to automatically discover the best partitioning strategy for every scenario. I am wondering whether the Dataset APIs plan to hide these controls from users? Feel free to reject my PR if it does not match the plan. Thank you for your answers. marmbrus rxin cloud-fan

Author: gatorsmile <[email protected]>

Closes #9899 from gatorsmile/coalesce.

(cherry picked from commit 238ae51)
Signed-off-by: Reynold Xin <[email protected]>
1 parent 3f15ad7 commit 6d8c4c6
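
For context, a minimal usage sketch of the new Dataset-level methods (not part of this commit; it assumes a Spark 1.6-style SQLContext with its implicits imported and a hypothetical case class `Person`):

import sqlContext.implicits._

case class Person(name: String, age: Int)

// Build a small typed Dataset from a local collection.
val ds = Seq(Person("alice", 30), Person("bob", 25)).toDS()

// repartition(n) shuffles the data into exactly n partitions.
val shuffled = ds.repartition(10)

// coalesce(n) reduces the partition count without a shuffle (narrow dependency).
val narrowed = shuffled.coalesce(2)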

2 files changed: 34 additions, 0 deletions


sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

Lines changed: 19 additions & 0 deletions
@@ -152,6 +152,25 @@ class Dataset[T] private[sql](
    */
   def count(): Long = toDF().count()
 
+  /**
+   * Returns a new [[Dataset]] that has exactly `numPartitions` partitions.
+   * @since 1.6.0
+   */
+  def repartition(numPartitions: Int): Dataset[T] = withPlan {
+    Repartition(numPartitions, shuffle = true, _)
+  }
+
+  /**
+   * Returns a new [[Dataset]] that has exactly `numPartitions` partitions.
+   * Similar to coalesce defined on an [[RDD]], this operation results in a narrow dependency, e.g.
+   * if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
+   * the 100 new partitions will claim 10 of the current partitions.
+   * @since 1.6.0
+   */
+  def coalesce(numPartitions: Int): Dataset[T] = withPlan {
+    Repartition(numPartitions, shuffle = false, _)
+  }
+
   /* *********************** *
    *  Functional Operations  *
    * *********************** */
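
The doc comment above contrasts coalesce (narrow dependency, no shuffle) with repartition (full shuffle). A rough illustration of that contrast, again assuming the implicits of a Spark 1.6 SQLContext (illustrative sketch only, not taken from this commit):

val ds = (1 to 1000).toDS()

// repartition performs a full shuffle into exactly 100 partitions.
val wide = ds.repartition(100)

// coalesce avoids a shuffle: each of the 10 new partitions claims roughly 10 existing ones.
val narrow = wide.coalesce(10)

// The resulting partition counts can be checked through the underlying RDD.
assert(wide.rdd.partitions.length == 100)
assert(narrow.rdd.partitions.length == 10)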

sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala

Lines changed: 15 additions & 0 deletions
@@ -52,6 +52,21 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
     assert(ds.takeAsList(1).get(0) == item)
   }
 
+  test("coalesce, repartition") {
+    val data = (1 to 100).map(i => ClassData(i.toString, i))
+    val ds = data.toDS()
+
+    assert(ds.repartition(10).rdd.partitions.length == 10)
+    checkAnswer(
+      ds.repartition(10),
+      data: _*)
+
+    assert(ds.coalesce(1).rdd.partitions.length == 1)
+    checkAnswer(
+      ds.coalesce(1),
+      data: _*)
+  }
+
   test("as tuple") {
     val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")
     checkAnswer(
