Skip to content

Conversation

@NarineK
Copy link
Contributor

@NarineK NarineK commented May 2, 2016

What changes were proposed in this pull request?

gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.

Please, let me know what do you think and if you have any ideas to improve it.

Thank you!

How was this patch tested?

Unit tests.

  1. Primitive test with different column types
  2. Add a boolean column
  3. Compute average by a group

@SparkQA
Copy link

SparkQA commented May 2, 2016

Test build #57511 has finished for PR 12836 at commit 19dcb2d.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class MapGroupsR(
    • case class MapPartitionsInR(

child: LogicalPlan) extends UnaryNode with ObjectProducer


object MapPartitionsInR {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move it back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will fix!

inputSchema: StructType,
outputSchema: StructType) extends ((Any, Iterator[Any]) => TraversableOnce[Any]) {

def apply(key: Any, iter: Iterator[Any]): TraversableOnce[Any] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird style. Need to follow the common style.

@SparkQA
Copy link

SparkQA commented May 2, 2016

Test build #57512 has finished for PR 12836 at commit 9c5473f.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


#' gapply
#'
#' Apply a function to each group of a DataFrame.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the description, we need to explain what is a group; Otherwise, users will not know how to use it.

@NarineK NarineK changed the title [SPARK-12922][SparkR] Implement gapply() on DataFrame in SparkR [WIP] [SPARK-12922][SparkR][WIP] Implement gapply() on DataFrame in SparkR May 2, 2016
case logical.MapPartitionsInR(f, p, b, is, os, objAttr, child) =>
execution.MapPartitionsExec(
execution.r.MapPartitionsRWrapper(f, p, b, is, os), objAttr, planLater(child)) :: Nil
case logical.MapGroupsR(f, p, b, is, os, key, value, grouping, data, objAttr, child) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MapGroupsInR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed MapGroupsR to MapGroupsPartitionsInR.
Or maybe MapGroupsInR is better. Not sure. @sun-rui ?

SERIALIZED_R_DATA_SCHEMA
} else {
schema
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to keep it consistent with dapply, I haven't made it one line:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L142

But we can make it one line, maybe in both cases ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should change both

@SparkQA
Copy link

SparkQA commented May 2, 2016

Test build #57517 has finished for PR 12836 at commit 66ca64e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 13, 2016

Test build #60391 has finished for PR 12836 at commit 1aa368d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 13, 2016

Test build #60392 has finished for PR 12836 at commit 91e1944.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@NarineK
Copy link
Contributor Author

NarineK commented Jun 14, 2016

Addressed your comments @sun-rui, please let me know if you have more comments.

@sun-rui
Copy link
Contributor

sun-rui commented Jun 15, 2016

@NarineK, there is one comment left un-addressed

@SparkQA
Copy link

SparkQA commented Jun 15, 2016

Test build #60574 has finished for PR 12836 at commit 4d1cc6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' column of the SparkDataFrame. The function `func` takes as argument
#' a key - grouping columns and a data frame - a local R data.frame.
#' The output of `func` is a local R data.frame.
#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment: It will be good to clarify how this schema can be constructed. i.e. something like The output schema is usually the the schema for the key along with the schema of the output R data frame. We can also highlight this in the programming guide

Copy link
Contributor Author

@NarineK NarineK Jun 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output schema is purely based on the output dataframe, if key is included in the output then we need to include the key in the schema.
Basically, the schema has to match to what we want to output. If we want to output only the average in the above example, we could have:
schema <- structType(structField("avg", "double")),
what really matters is the data-type - it has to be double in above example, it cannot be string or character .... unless otherwise we explicitly convert it into e.g. string in the R function. The name doesn't matter either. I could have "hello", instead "avg'.

Copy link
Contributor Author

@NarineK NarineK Jun 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could have in the documentation smth like:

"The schema has to correspond to output SparkDataFrame. It has to be defined for each output column with preferred output column name and corresponding data type."

How does this sound ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah thats fine. Also in the example below where we construct schema you can add a comment line which looks like Here our output contains 2 columns, the key which is a integer and the mean which is a double.

@shivaram
Copy link
Contributor

@NarineK Thanks again for the updates to this PR and thanks @sun-rui for reviewing. The code changes LGTM -- the refactoring of worker.R is especially useful for readability.

I just had a couple of minor questions on the API, examples. Also, since we are close to RC1, my vote would be to merge this PR right now and continue making any updates to examples / docs in follow up PRs.
@sun-rui Let me know if this sounds good and I can merge this later today / tomm.

@NarineK Would you be able to update the programming guide for gapply ? #13660 is doing it for dapply but we can do gapply in a separate PR.

@NarineK
Copy link
Contributor Author

NarineK commented Jun 15, 2016

Thanks, @shivaram and @sun-rui. Yes, I can work on programming guide for gapply.

@sun-rui
Copy link
Contributor

sun-rui commented Jun 16, 2016

@shivaram, LGTM

@SparkQA
Copy link

SparkQA commented Jun 16, 2016

Test build #60621 has finished for PR 12836 at commit fe36d24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Copy link
Contributor

Merging this to master and branch-2.0

@asfgit asfgit closed this in 7c6c692 Jun 16, 2016
asfgit pushed a commit that referenced this pull request Jun 16, 2016
## What changes were proposed in this pull request?

gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.

Please, let me know what do you think and if you have any ideas to improve it.

Thank you!

## How was this patch tested?
Unit tests.
1. Primitive test with different column types
2. Add a boolean column
3. Compute average by a group

Author: Narine Kokhlikyan <[email protected]>
Author: NarineK <[email protected]>

Closes #12836 from NarineK/gapply2.

(cherry picked from commit 7c6c692)
Signed-off-by: Shivaram Venkataraman <[email protected]>
@vectorijk
Copy link
Contributor

@NarineK Which way do you want to include programming guide for gapply, in separate PR or in #13660?

@NarineK
Copy link
Contributor Author

NarineK commented Jun 17, 2016

Hi @vectorijk,
Thanks for asking, i think in a separate PR. Do you think including in #13660 would be better ?

@vectorijk
Copy link
Contributor

@NarineK Cool~ I think it is better to open a separate PR to track gapply programming guide.

@NarineK
Copy link
Contributor Author

NarineK commented Jun 19, 2016

@vectorijk, should I do the pull request for the same jira - https://issues.apache.org/jira/browse/SPARK-15672, or should I create a new jira for gapply's programming guide?

@vectorijk
Copy link
Contributor

@NarineK I am not quite sure. Maybe you could create a new JIRA for gapply's programming guide.

@NarineK
Copy link
Contributor Author

NarineK commented Jun 19, 2016

Thanks for the quick response. I'll create one.

@NarineK
Copy link
Contributor Author

NarineK commented Jul 18, 2016

@shivaram, @sun-rui , I was wondering if someone created a jira for the issue described here:
#12836 (comment)

@shivaram
Copy link
Contributor

@NarineK Not as far as I know

@sun-rui
Copy link
Contributor

sun-rui commented Jul 19, 2016

no, go ahead to submit one:)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.