[SPARK-12922][SparkR][WIP] Implement gapply() on DataFrame in SparkR #12836
@@ -61,6 +61,7 @@ exportMethods("arrange",
               "filter",
               "first",
               "freqItems",
+              "gapply",
               "group_by",
               "groupBy",
               "head",
@@ -1180,7 +1180,7 @@ dapplyInternal <- function(x, func, schema) {
 #' func should have only one parameter, to which a data.frame corresponds
 #' to each partition will be passed.
 #' The output of func should be a data.frame.
-#' @param schema The schema of the resulting DataFrame after the function is applied.
+#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
 #'               It must match the output of func.
 #' @family SparkDataFrame functions
 #' @rdname dapply

@@ -1266,6 +1266,86 @@ setMethod("dapplyCollect",
             ldf
           })
+
+#' gapply
+#'
+#' Group the SparkDataFrame using the specified columns and apply the R function to each
+#' group.
+#'
+#' @param x A SparkDataFrame
+#' @param cols Grouping columns
+#' @param func A function to be applied to each group partition specified by grouping
+#'             column of the SparkDataFrame. The function `func` takes as argument
+#'             a key - grouping columns and a data frame - a local R data.frame.
+#'             The output of `func` is a local R data.frame.
+#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
Contributor: Minor comment: it will be good to clarify how this schema can be constructed, i.e. something like

Contributor (Author): The output schema is purely based on the output data frame; if the key is included in the output, then we need to include the key in the schema.

Contributor (Author): I could have in the documentation something like: "The schema has to correspond to the output SparkDataFrame. It has to be defined for each output column with the preferred output column name and corresponding data type." How does this sound?

Contributor: Yeah, that's fine. Also in the example below where we construct
+#'             The schema must match the output of `func`. It has to be defined for each
+#'             output column with the preferred output column name and corresponding data type.
+#' @family SparkDataFrame functions
+#' @rdname gapply
+#' @name gapply
+#' @export
+#' @examples
+#'
+#' \dontrun{
+#' Computes the arithmetic mean of the second column by grouping
+#' on the first and third columns. Output the grouping values and the average.
+#'
+#' df <- createDataFrame(
+#'   list(list(1L, 1, "1", 0.1), list(1L, 2, "1", 0.2), list(3L, 3, "3", 0.3)),
+#'   c("a", "b", "c", "d"))
+#'
+#' Here our output contains three columns, the key which is a combination of two
+#' columns with data types integer and string, and the mean which is a double.
+#' schema <- structType(structField("a", "integer"), structField("c", "string"),
+#'   structField("avg", "double"))
+#' df1 <- gapply(
+#'   df,
+#'   list("a", "c"),
+#'   function(key, x) {
+#'     y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
+#'   },
+#'   schema)
+#' collect(df1)
+#'
+#' Result
+#' ------
+#' a c avg
+#' 3 3 3.0
+#' 1 1 1.5
+#'
+#' Fits linear models on the iris dataset by grouping on the 'Species' column and
+#' using 'Sepal_Length' as a target variable, 'Sepal_Width', 'Petal_Length'
+#' and 'Petal_Width' as training features.
+#'
+#' df <- createDataFrame(iris)
+#' schema <- structType(structField("(Intercept)", "double"),
Contributor: Similar to above, do the column names also have to match? i.e. is

Contributor (Author): The names do not have to match; we can give any name we want. Instead of "(Intercept)" I could have "(MyIntercept)". The data type is important.
+#'   structField("Sepal_Width", "double"), structField("Petal_Length", "double"),
+#'   structField("Petal_Width", "double"))
+#' df1 <- gapply(
+#'   df,
+#'   list(df$"Species"),
+#'   function(key, x) {
+#'     m <- suppressWarnings(lm(Sepal_Length ~
+#'       Sepal_Width + Petal_Length + Petal_Width, x))
+#'     data.frame(t(coef(m)))
+#'   }, schema)
+#' collect(df1)
+#'
+#' Result
+#' ------
+#' Model (Intercept) Sepal_Width Petal_Length Petal_Width
+#' 1     0.699883    0.3303370   0.9455356   -0.1697527
+#' 2     1.895540    0.3868576   0.9083370   -0.6792238
+#' 3     2.351890    0.6548350   0.2375602    0.2521257
+#'
+#' }
+setMethod("gapply",
+          signature(x = "SparkDataFrame"),
+          function(x, cols, func, schema) {
+            grouped <- do.call("groupBy", c(x, cols))
+            gapply(grouped, func, schema)
+          })

 ############################## RDD Map Functions ##################################
 # All of the following functions mirror the existing RDD map functions,           #
 # but allow for use with DataFrames by first converting to an RRDD before calling #
@@ -2145,6 +2145,71 @@ test_that("repartition by columns on DataFrame", {
   expect_equal(nrow(df1), 2)
 })
+test_that("gapply() on a DataFrame", {
Member: You need to write a new test case for

Contributor (Author): Added a new test which was used for our previous group-apply showcase for customers.

Member: This new test case is reasonable.
+  df <- createDataFrame(
+    list(list(1L, 1, "1", 0.1), list(1L, 2, "1", 0.2), list(3L, 3, "3", 0.3)),
+    c("a", "b", "c", "d"))
+  expected <- collect(df)
+  df1 <- gapply(df, list("a"), function(key, x) { x }, schema(df))
+  actual <- collect(df1)
+  expect_identical(actual, expected)
+
+  # Computes the sum of the second column by grouping on the first and third columns
+  # and checks if the sum is larger than 2
+  schema <- structType(structField("a", "integer"), structField("e", "boolean"))
+  df2 <- gapply(
+    df,
+    list(df$"a", df$"c"),
+    function(key, x) {
+      y <- data.frame(key[1], sum(x$b) > 2)
+    },
+    schema)
+  actual <- collect(df2)$e
+  expected <- c(TRUE, TRUE)
+  expect_identical(actual, expected)
+
+  # Computes the arithmetic mean of the second column by grouping
+  # on the first and third columns. Output the grouping values and the average.
+  schema <- structType(structField("a", "integer"), structField("c", "string"),
+                       structField("avg", "double"))
+  df3 <- gapply(
+    df,
+    list("a", "c"),
+    function(key, x) {
+      y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
+    },
+    schema)
+  actual <- collect(df3)
+  actual <- actual[order(actual$a), ]
+  rownames(actual) <- NULL
+  expected <- collect(select(df, "a", "b", "c"))
+  expected <- data.frame(aggregate(expected$b, by = list(expected$a, expected$c), FUN = mean))
+  colnames(expected) <- c("a", "c", "avg")
+  expected <- expected[order(expected$a), ]
+  rownames(expected) <- NULL
+  expect_identical(actual, expected)
+
+  # Groups by `Sepal_Length` and computes the average for `Sepal_Width`
+  irisDF <- suppressWarnings(createDataFrame(iris))
+  schema <- structType(structField("Sepal_Length", "double"), structField("Avg", "double"))
+  df4 <- gapply(
+    irisDF,
+    cols = list("Sepal_Length"),
+    function(key, x) {
+      y <- data.frame(key, mean(x$Sepal_Width), stringsAsFactors = FALSE)
+    },
+    schema)
+  actual <- collect(df4)
+  actual <- actual[order(actual$Sepal_Length), ]
+  rownames(actual) <- NULL
+  agg_local_df <- data.frame(aggregate(iris$Sepal.Width, by = list(iris$Sepal.Length), FUN = mean),
+                             stringsAsFactors = FALSE)
+  colnames(agg_local_df) <- c("Sepal_Length", "Avg")
+  expected <- agg_local_df[order(agg_local_df$Sepal_Length), ]
+  rownames(expected) <- NULL
+  expect_identical(actual, expected)
+})

 test_that("Window functions on a DataFrame", {
   setHiveContext(sc)
   df <- createDataFrame(list(list(1L, "1"), list(2L, "2"), list(1L, "1"), list(2L, "2")),
Contributor: Minor comment: It would be good to say what the function will get as its input. Right now it's the key and a data frame with the grouping columns?

Contributor (Author): Yes, the key and the data.frame with the grouping columns.

Contributor (Author): done!
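To make that input contract concrete, here is a plain base-R sketch (no Spark involved; illustrative only) of what gapply conceptually does per group: `func` receives the grouping key and a local data.frame holding that group's rows.

```r
# Illustrative local analogue of gapply's per-group contract, using base R.
# Split rows by the grouping column, then call func(key, group_df) per group.
df <- data.frame(a = c(1L, 1L, 3L), b = c(1, 2, 3))
func <- function(key, x) data.frame(a = key, avg = mean(x$b))

groups <- split(df, df$a)
result <- do.call(rbind, lapply(names(groups), function(k) {
  func(as.integer(k), groups[[k]])
}))
# result has one row per key: avg = 1.5 for a == 1, avg = 3 for a == 3
```

In the real `gapply`, the splitting happens on the executors and the per-group data.frames are deserialized partitions, but the shape of `func`'s arguments is the same.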