[SPARK-12922][SparkR][WIP] Implement gapply() on DataFrame in SparkR #12836
Conversation
Test build #57511 has finished for PR 12836 at commit
child: LogicalPlan) extends UnaryNode with ObjectProducer
...
object MapPartitionsInR {
Please move it back.
Thanks, will fix!
inputSchema: StructType,
outputSchema: StructType) extends ((Any, Iterator[Any]) => TraversableOnce[Any]) {
...
def apply(key: Any, iter: Iterator[Any]): TraversableOnce[Any] = {
Weird style. Need to follow the common style.
Test build #57512 has finished for PR 12836 at commit
R/pkg/R/DataFrame.R (outdated)

#' gapply
#'
#' Apply a function to each group of a DataFrame.
In the description, we need to explain what a group is; otherwise, users will not know how to use it.
case logical.MapPartitionsInR(f, p, b, is, os, objAttr, child) =>
  execution.MapPartitionsExec(
    execution.r.MapPartitionsRWrapper(f, p, b, is, os), objAttr, planLater(child)) :: Nil
case logical.MapGroupsR(f, p, b, is, os, key, value, grouping, data, objAttr, child) =>
MapGroupsInR?
Renamed MapGroupsR to MapGroupsPartitionsInR.
Or maybe MapGroupsInR is better. Not sure. @sun-rui ?
SERIALIZED_R_DATA_SCHEMA
} else {
  schema
}
One line?
In order to keep it consistent with dapply, I haven't made it one line:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L142
But we can make it one line, maybe in both cases?
I think you should change both.
Test build #57517 has finished for PR 12836 at commit
Test build #60391 has finished for PR 12836 at commit
Test build #60392 has finished for PR 12836 at commit
Addressed your comments @sun-rui, please let me know if you have more comments.
@NarineK, there is one comment left unaddressed.
Test build #60574 has finished for PR 12836 at commit
#' column of the SparkDataFrame. The function `func` takes as argument
#' a key - grouping columns and a data frame - a local R data.frame.
#' The output of `func` is a local R data.frame.
#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
Minor comment: it would be good to clarify how this schema can be constructed, i.e. something like "The output schema is usually the schema for the key along with the schema of the output R data frame." We can also highlight this in the programming guide.
The output schema is purely based on the output data frame; if the key is included in the output, then we need to include the key in the schema.
Basically, the schema has to match what we want to output. If we want to output only the average in the above example, we could have:
schema <- structType(structField("avg", "double"))
What really matters is the data type: it has to be double in the above example; it cannot be string or character, unless we explicitly convert it to, e.g., string in the R function. The name doesn't matter either; I could have "hello" instead of "avg".
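To make the schema-matching point concrete, here is a minimal sketch (the SparkDataFrame `df` and its `key`/`value` columns are illustrative, not taken from this PR's tests, and a Spark 2.0-style SparkR session is assumed):

```r
library(SparkR)

# Illustrative data: an integer key and a numeric value.
df <- createDataFrame(data.frame(key = c(1L, 1L, 2L), value = c(1.5, 2.5, 4.0)))

# Output only the average: a single double column. The column name is free
# ("avg", "hello", ...); only the declared type has to match what the
# R function actually returns.
schema <- structType(structField("avg", "double"))
result <- gapply(df, "key",
                 function(key, x) data.frame(avg = mean(x$value)),
                 schema)
head(collect(result))
```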
I could have in the documentation something like:
"The schema has to correspond to the output SparkDataFrame. It has to be defined for each output column with the preferred output column name and corresponding data type."
How does this sound?
Yeah, that's fine. Also, in the example below where we construct the schema, you can add a comment line which looks like: "Here our output contains 2 columns, the key which is an integer and the mean which is a double."
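Applied to the example, the suggested comment might read like this (a sketch; `df` and the column names are illustrative):

```r
# Here our output contains 2 columns: the key, which is an integer,
# and the mean, which is a double.
schema <- structType(structField("key", "integer"),
                     structField("mean", "double"))
result <- gapply(df, "key",
                 function(key, x) data.frame(key, mean(x$value)),
                 schema)
```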
@NarineK Thanks again for the updates to this PR, and thanks @sun-rui for reviewing. The code changes LGTM; the refactoring of worker.R is especially useful for readability. I just had a couple of minor questions on the API and examples. Also, since we are close to RC1, my vote would be to merge this PR right now and continue making any updates to examples/docs in follow-up PRs. @NarineK Would you be able to update the programming guide for gapply? #13660 is doing it for
@shivaram, LGTM
Test build #60621 has finished for PR 12836 at commit
Merging this to master and branch-2.0
## What changes were proposed in this pull request?
gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API. Please let me know what you think and if you have any ideas to improve it. Thank you!
## How was this patch tested?
Unit tests.
1. Primitive test with different column types
2. Add a boolean column
3. Compute average by a group
Author: Narine Kokhlikyan <[email protected]>
Author: NarineK <[email protected]>
Closes #12836 from NarineK/gapply2.
(cherry picked from commit 7c6c692)
Signed-off-by: Shivaram Venkataraman <[email protected]>
Hi @vectorijk,
@NarineK Cool~ I think it is better to open a separate PR to track
@vectorijk, should I open the pull request for the same JIRA - https://issues.apache.org/jira/browse/SPARK-15672 - or should I create a new JIRA for gapply's programming guide?
@NarineK I am not quite sure. Maybe you could create a new JIRA for gapply's programming guide.
Thanks for the quick response. I'll create one.
@shivaram, @sun-rui, I was wondering if someone created a JIRA for the issue described here:
@NarineK Not as far as I know
No, go ahead and submit one :)
What changes were proposed in this pull request?
gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.
Please let me know what you think and if you have any ideas to improve it.
Thank you!
How was this patch tested?
Unit tests.
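For instance, the "compute average by a group" case could be exercised roughly as follows (a sketch assuming a running SparkR session with a local Spark installation; this is not the actual test code from the PR, and the data and column names are illustrative):

```r
library(SparkR)
sparkR.session()  # assumes a local Spark installation

local_df <- data.frame(key = c(1L, 1L, 2L), value = c(10, 20, 30))
df <- createDataFrame(local_df)

schema <- structType(structField("key", "integer"),
                     structField("avg", "double"))
actual <- collect(gapply(df, "key",
                         function(key, x) data.frame(key, mean(x$value)),
                         schema))

# Compare with plain R's per-group averages.
expected <- aggregate(value ~ key, local_df, mean)
stopifnot(all.equal(sort(actual$avg), sort(expected$value)))
```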