[SPARK-15110][SparkR] Implement repartitionByColumn for SparkR DataFrames #12887
Conversation
Test build #57712 has finished for PR 12887 at commit
R/pkg/R/DataFrame.R (Outdated)

    #' df <- read.json(sqlContext, path)
    #' newDF <- repartitionByColumn(df, df$col1, df$col2)
    #'}
    setMethod("repartitionByColumn",
should this just be repartition with a Column parameter, instead of a different name?
Hi @felixcheung,
thanks for your prompt response.
That was my first try too; however, there already exists a definition of repartition, and if I try the following:

    setGeneric("repartition",
               function(x, col, ...) {
                 standardGeneric("repartition")
               })

it fails with:

    unused argument (numPartitions = c("numeric", ""))
    Error : unable to load R code in package 'SparkR'
Basically you would need to remove that from the signature line and add a default value in the function line, something like:

    setMethod("repartition",
              signature(x = "SparkDataFrame"),
              function(x, numPartitions = NULL, col = NULL)

and then check which one of numPartitions or col is set, that they are the right type (since types are no longer specified in the signature), that they are not both set, and so on.
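For concreteness, here is a rough sketch of how that single method could dispatch on its arguments. It uses SparkR's internal callJMethod/numToInt/dataFrame helpers, and the exact checks are illustrative rather than the final merged code:

    setMethod("repartition",
              signature(x = "SparkDataFrame"),
              function(x, numPartitions = NULL, col = NULL, ...) {
                if (!is.null(numPartitions) && is.numeric(numPartitions)) {
                  if (!is.null(col) && class(col) == "Column") {
                    # both a partition count and columns are specified
                    cols <- list(col, ...)
                    jcol <- lapply(cols, function(c) { c@jc })
                    sdf <- callJMethod(x@sdf, "repartition",
                                       numToInt(numPartitions), jcol)
                  } else {
                    # only a partition count is specified
                    sdf <- callJMethod(x@sdf, "repartition",
                                       numToInt(numPartitions))
                  }
                } else if (!is.null(col) && class(col) == "Column") {
                  # only columns are specified
                  cols <- list(col, ...)
                  jcol <- lapply(cols, function(c) { c@jc })
                  sdf <- callJMethod(x@sdf, "repartition", jcol)
                } else {
                  stop("Please specify a numeric numPartitions and/or Column objects")
                }
                dataFrame(sdf)
              })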
Yes, that is one of the possible options.
We are not enforcing types via the signature, but we have to do some checks instead.
Whichever you prefer, I'm fine with it too.
I think @felixcheung's proposal is good - better not to introduce a new keyword if the existing one suffices.
Test build #57714 has finished for PR 12887 at commit
Test build #57715 has finished for PR 12887 at commit
Test build #57790 has finished for PR 12887 at commit
Test build #57801 has finished for PR 12887 at commit
R/pkg/R/DataFrame.R (Outdated)

    #'
    #' Return a new SparkDataFrame that has exactly numPartitions partitions.
    #' There are two different options for repartition
    #' Option 1
roxygen2 by default strips out all the whitespace and newlines.
If you want, you could put these into \item entries or end lines with \cr:
http://stackoverflow.com/questions/9267584/when-documenting-in-roxygen-how-do-i-make-an-itemized-list-in-details
http://r-pkgs.had.co.nz/man.html
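As a small illustrative sketch (not the exact wording that ended up in the PR), the options could be documented like this:

    #' The following options for repartitioning are possible:
    #' \itemize{
    #'   \item{"Option 1"} {Pass a target number of partitions.}\cr
    #'   \item{"Option 2"} {Pass one or more Column objects to partition by.}
    #' }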
    # @seealso coalesce
    # @export
    - setGeneric("repartition", function(x, numPartitions) { standardGeneric("repartition") })
    + setGeneric("repartition", function(x, ...) { standardGeneric("repartition") })
Can this be function(x, numPartitions, ...) ?
nvm, we want numPartitions to be optional.
Test build #57814 has finished for PR 12887 at commit
Test build #57816 has finished for PR 12887 at commit
@shivaram, @felixcheung, are you fine with a default number of partitions of 200, or do you prefer an error message?
Test build #57818 has finished for PR 12887 at commit
Let's do the same thing as the Scala / Python API.
This is what I see in Python: they raise an error!
Test build #57821 has finished for PR 12887 at commit
From the Python comment and examples it looks like numPartitions is not required if the columns are specified. Can we match that behavior?
Ah yes - I missed that. I think the logic is fine and matches the Python API. LGTM. One minor thing: could we add test cases for all 3 scenarios detailed in the description? I think we only have the column-only case covered right now.
Sure!
Test build #57833 has finished for PR 12887 at commit
LGTM
R/pkg/R/DataFrame.R (Outdated)

    #' the given columns into `numPartitions`.}
    #' \item{"Option 2"} {Return a new SparkDataFrame that has exactly `numPartitions`.}
    #' \item{"Option 3"} {Return a new SparkDataFrame partitioned by the given columns,
    #'                    preserving the existing number of partitions.}
If numPartitions is not specified, the number of partitions will be spark.sql.shuffle.partitions. Could you double-check the docs for Scala and Python?
It seems that Python raises an error:
https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L434
As far as I understand, Scala requires the parameters by signature; I do not see a repartition overload with empty or default parameters:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2162
We support:

    repartition(N)
    repartition(N, col1, col2)
    repartition(col1, col2)

For the third case, the number of partitions is spark.sql.shuffle.partitions; it does not preserve the existing number of partitions.
Have I misunderstood something?
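Translated into SparkR's single repartition method, the three forms would look roughly like this (df and its columns are hypothetical, and the argument names follow the signature discussed above):

    # Option 1: only a partition count
    df1 <- repartition(df, numPartitions = 8L)

    # Option 2: a partition count plus partitioning columns
    df2 <- repartition(df, 8L, df$col1, df$col2)

    # Option 3: only a column; the partition count falls back to the value of
    # spark.sql.shuffle.partitions (200 by default)
    df3 <- repartition(df, col = df$col1)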
I think @NarineK is referring to the Scala doc for repartition, which says:

    /**
     * Returns a new [[Dataset]] partitioned by the given partitioning expressions preserving
     * the existing number of partitions. The resulting Dataset is hash partitioned.
     ...
     * This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
     */
    def repartition(partitionExprs: Column*): Dataset[T] = withTypedPlan {
This one is not correct; please update it. See the doc for RepartitionByExpression:

    /**
     * This method repartitions data using [[Expression]]s into `numPartitions`, and receives
     * information about the number of partitions during execution. Used when a specific ordering or
     * distribution is expected by the consumer of the query result. Use [[Repartition]] for RDD-like
     * `coalesce` and `repartition`.
     * If `numPartitions` is not specified, the number of partitions will be the number set by
     * `spark.sql.shuffle.partitions`.
     */

    >>> spark.range(0, 100, 1, 1).repartition(col("id")).rdd.getNumPartitions()
    200
Yes, @shivaram, I did refer to the Scala doc in Dataset.scala.
In reality, for the repartition(col1, col2) case, the logical plan internally uses spark.sql.shuffle.partitions:

    /**
     * This method repartitions data using [[Expression]]s into `numPartitions`, and receives
     * information about the number of partitions during execution. Used when a specific ordering or
     * distribution is expected by the consumer of the query result. Use [[Repartition]] for RDD-like
     * `coalesce` and `repartition`.
     * If `numPartitions` is not specified, the number of partitions will be the number set by
     * `spark.sql.shuffle.partitions`.
     */
    case class RepartitionByExpression(
        partitionExpressions: Seq[Expression],
        child: LogicalPlan,
        numPartitions: Option[Int] = None) extends RedistributeData {
      numPartitions match {
        case Some(n) => require(n > 0, "numPartitions must be greater than 0.")
        case None => // Ok
      }
    }
So, do you want me to update Dataset.scala too, or only the doc in R?
@NarineK It would be great if you could update that too.
Sure!
Dear Jenkins, please test!
Jenkins, retest this please
Test build #57854 has finished for PR 12887 at commit
LGTM
Merging this into master and the 2.0 branch, thanks!
…rames

## What changes were proposed in this pull request?

Implement repartitionByColumn on DataFrame. This will allow us to run R functions on each partition identified by column groups with dapply() method.

## How was this patch tested?

Unit tests

Author: NarineK <[email protected]>

Closes #12887 from NarineK/repartitionByColumns.

(cherry picked from commit 22226fc)
Signed-off-by: Davies Liu <[email protected]>
Test build #57906 has finished for PR 12887 at commit
What changes were proposed in this pull request?
Implement repartitionByColumn on DataFrame.
This will allow us to run R functions on each partition identified by column groups with the dapply() method, as sketched below.
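For illustration, a minimal sketch of how the two fit together; the data, schema, and session setup here are hypothetical examples rather than code from this PR:

    library(SparkR)
    sparkR.session()  # assumes a Spark 2.x-style SparkR session

    df <- createDataFrame(data.frame(key = c(1, 1, 2, 2),
                                     value = c(10, 20, 30, 40)))

    # Repartition by the grouping column so that rows with the same key land
    # in the same partition, then apply an R function to each partition.
    parted <- repartition(df, col = df$key)
    doubled <- dapply(parted,
                      function(pdf) {
                        # pdf is a local R data.frame holding one partition
                        data.frame(key = pdf$key, value = pdf$value * 2)
                      },
                      structType(structField("key", "double"),
                                 structField("value", "double")))
    head(collect(doubled))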
How was this patch tested?
Unit tests