[SPARK-15110][SparkR] Implement repartitionByColumn for SparkR DataFrames #12887
Conversation
Test build #57712 has finished for PR 12887 at commit
R/pkg/R/DataFrame.R (Outdated)

    #' df <- read.json(sqlContext, path)
    #' newDF <- repartitionByColumn(df, df$col1, df$col2)
    #'}
    setMethod("repartitionByColumn",
should this just be repartition with a Column parameter, instead of a different name?
Hi @felixcheung,
thanks for your prompt response.
That was my first try too; however, there already exists a definition of repartition, and if I try the following:

    setGeneric("repartition",
               function(x, col, ...) {
                 standardGeneric("repartition")
               })

it fails with:

    unused argument (numPartitions = c("numeric", ""))
    Error : unable to load R code in package 'SparkR'
Basically you would need to remove that from the signature line and add a default value in the function line, something like:

    setMethod("repartition",
              signature(x = "SparkDataFrame"),
              function(x, numPartitions = NULL, col = NULL)

and then check which one of numPartitions or col is set, that they are the right type (since types are no longer specified in the signature), that they are not both set, and so on.
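For concreteness, here is a rough sketch of how that single method could dispatch on its arguments. It uses SparkR's internal callJMethod/numToInt/dataFrame helpers, and the exact checks are illustrative rather than the final merged code:

    setMethod("repartition",
              signature(x = "SparkDataFrame"),
              function(x, numPartitions = NULL, col = NULL, ...) {
                if (!is.null(numPartitions) && is.numeric(numPartitions)) {
                  if (!is.null(col) && class(col) == "Column") {
                    # both a partition count and columns are specified
                    cols <- list(col, ...)
                    jcol <- lapply(cols, function(c) { c@jc })
                    sdf <- callJMethod(x@sdf, "repartition",
                                       numToInt(numPartitions), jcol)
                  } else {
                    # only a partition count is specified
                    sdf <- callJMethod(x@sdf, "repartition",
                                       numToInt(numPartitions))
                  }
                } else if (!is.null(col) && class(col) == "Column") {
                  # only columns are specified
                  cols <- list(col, ...)
                  jcol <- lapply(cols, function(c) { c@jc })
                  sdf <- callJMethod(x@sdf, "repartition", jcol)
                } else {
                  stop("Please specify a numeric numPartitions and/or Column objects")
                }
                dataFrame(sdf)
              })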
Yes, that is one of the possible options.
We are not enforcing types via the signature, but we have to do some checks instead.
Whichever you prefer, I'm fine with it too.
I think @felixcheung's proposal is good - better not to introduce a new keyword if the existing one suffices.
Test build #57714 has finished for PR 12887 at commit
Test build #57715 has finished for PR 12887 at commit
Test build #57790 has finished for PR 12887 at commit
Test build #57801 has finished for PR 12887 at commit
R/pkg/R/DataFrame.R (Outdated)

    #'
    #' Return a new SparkDataFrame that has exactly numPartitions partitions.
    #' There are two different options for repartition
    #' Option 1
roxygen2 by default strips out all the whitespace and newlines.
If you want, you could put these into \item entries or end lines with \cr:
http://stackoverflow.com/questions/9267584/when-documenting-in-roxygen-how-do-i-make-an-itemized-list-in-details
http://r-pkgs.had.co.nz/man.html
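As a small illustrative sketch (not the exact wording that ended up in the PR), the options could be documented like this:

    #' The following options for repartitioning are possible:
    #' \itemize{
    #'   \item{"Option 1"} {Pass a target number of partitions.}\cr
    #'   \item{"Option 2"} {Pass one or more Column objects to partition by.}
    #' }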
    # @seealso coalesce
    # @export
    - setGeneric("repartition", function(x, numPartitions) { standardGeneric("repartition") })
    + setGeneric("repartition", function(x, ...) { standardGeneric("repartition") })
Can this be function(x, numPartitions, ...) ?
nvm, we want numPartitions to be optional.
Test build #57814 has finished for PR 12887 at commit
Test build #57816 has finished for PR 12887 at commit
@shivaram, @felixcheung, are you fine with a default number of partitions of 200, or do you prefer an error message?
Test build #57818 has finished for PR 12887 at commit
Let's do the same thing as the Scala / Python API.
This is what I see in Python: they raise an error!
Test build #57821 has finished for PR 12887 at commit
From the Python comment and examples it looks like numPartitions is not required if the columns are specified. Can we match that behavior?
Ah yes - I missed that. I think the logic is fine and matches the Python API. LGTM. One minor thing: could we add test cases for all 3 scenarios detailed in the description? I think we only have the column-only case covered right now.
Sure!
Test build #57833 has finished for PR 12887 at commit
LGTM
R/pkg/R/DataFrame.R (Outdated)

    #' the given columns into `numPartitions`.}
    #' \item{"Option 2"} {Return a new SparkDataFrame that has exactly `numPartitions`.}
    #' \item{"Option 3"} {Return a new SparkDataFrame partitioned by the given columns,
    #'                    preserving the existing number of partitions.}
If numPartitions is not specified, the number of partitions will be spark.sql.shuffle.partitions. Could you double-check the docs for Scala and Python?
It seems that Python raises an error:
https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L434
As far as I understand, Scala requires the parameters by signature; I do not see a repartition overload with empty or default parameters:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2162
We support:

    repartition(N)
    repartition(N, col1, col2)
    repartition(col1, col2)

For the third case, the number of partitions is spark.sql.shuffle.partitions; it does not preserve the existing number of partitions.
Have I misunderstood something?
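Translated into SparkR's single repartition method, the three forms would look roughly like this (df and its columns are hypothetical, and the argument names follow the signature discussed above):

    # Option 1: only a partition count
    df1 <- repartition(df, numPartitions = 8L)

    # Option 2: a partition count plus partitioning columns
    df2 <- repartition(df, 8L, df$col1, df$col2)

    # Option 3: only a column; the partition count falls back to the value of
    # spark.sql.shuffle.partitions (200 by default)
    df3 <- repartition(df, col = df$col1)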
I think @NarineK is referring to the Scala doc for repartition, which says:

    /**
     * Returns a new [[Dataset]] partitioned by the given partitioning expressions preserving
     * the existing number of partitions. The resulting Dataset is hash partitioned.
     ...
     * This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
     */
    def repartition(partitionExprs: Column*): Dataset[T] = withTypedPlan {
This one is not correct; please update it. See the doc for RepartitionByExpression:

    /**
     * This method repartitions data using [[Expression]]s into `numPartitions`, and receives
     * information about the number of partitions during execution. Used when a specific ordering or
     * distribution is expected by the consumer of the query result. Use [[Repartition]] for RDD-like
     * `coalesce` and `repartition`.
     * If `numPartitions` is not specified, the number of partitions will be the number set by
     * `spark.sql.shuffle.partitions`.
     */

    >>> spark.range(0, 100, 1, 1).repartition(col("id")).rdd.getNumPartitions()
    200
Yes, @shivaram, I did refer to the Scala doc in Dataset.scala.
In reality, for the repartition(col1, col2) case, the logical plan internally uses spark.sql.shuffle.partitions:

    /**
     * This method repartitions data using [[Expression]]s into `numPartitions`, and receives
     * information about the number of partitions during execution. Used when a specific ordering or
     * distribution is expected by the consumer of the query result. Use [[Repartition]] for RDD-like
     * `coalesce` and `repartition`.
     * If `numPartitions` is not specified, the number of partitions will be the number set by
     * `spark.sql.shuffle.partitions`.
     */
    case class RepartitionByExpression(
        partitionExpressions: Seq[Expression],
        child: LogicalPlan,
        numPartitions: Option[Int] = None) extends RedistributeData {
      numPartitions match {
        case Some(n) => require(n > 0, "numPartitions must be greater than 0.")
        case None => // Ok
      }
    }
So, do you want me to update Dataset.scala too, or only the doc in R?
@NarineK It would be great if you could update that too.
Sure!
Dear Jenkins, please test!
Jenkins, retest this please
Test build #57854 has finished for PR 12887 at commit
LGTM
Merging this into master and the 2.0 branch, thanks!
…rames

## What changes were proposed in this pull request?

Implement repartitionByColumn on DataFrame. This will allow us to run R functions on each partition identified by column groups with dapply() method.

## How was this patch tested?

Unit tests

Author: NarineK <[email protected]>

Closes #12887 from NarineK/repartitionByColumns.

(cherry picked from commit 22226fc)
Signed-off-by: Davies Liu <[email protected]>
Test build #57906 has finished for PR 12887 at commit
What changes were proposed in this pull request?
Implement repartitionByColumn on DataFrame.
This will allow us to run R functions on each partition identified by column groups with the dapply() method, as sketched below.
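For illustration, a minimal sketch of how the two fit together; the data, schema, and session setup here are hypothetical examples rather than code from this PR:

    library(SparkR)
    sparkR.session()  # assumes a Spark 2.x-style SparkR session

    df <- createDataFrame(data.frame(key = c(1, 1, 2, 2),
                                     value = c(10, 20, 30, 40)))

    # Repartition by the grouping column so that rows with the same key land
    # in the same partition, then apply an R function to each partition.
    parted <- repartition(df, col = df$key)
    doubled <- dapply(parted,
                      function(pdf) {
                        # pdf is a local R data.frame holding one partition
                        data.frame(key = pdf$key, value = pdf$value * 2)
                      },
                      structType(structField("key", "double"),
                                 structField("value", "double")))
    head(collect(doubled))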
How was this patch tested?
Unit tests