Conversation

@thunterdb
Contributor

@thunterdb thunterdb commented Apr 15, 2016

What changes were proposed in this pull request?

This PR adds a new function to SparkR, sparkLapply(list, func), which implements a distributed version of lapply using Spark as a backend.

TODO:

  • check documentation
  • check tests

Trivial example in SparkR:

sparkLapply(1:5, function(x) { 2 * x })

Output:

[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 6

[[4]]
[1] 8

[[5]]
[1] 10

Here is a slightly more complex example that performs distributed training of multiple models. Under the hood, Spark broadcasts the dataset to the workers.

library("MASS")
data(menarche)
families <- c("gaussian", "poisson")
train <- function(family) { glm(Menarche ~ Age, family = family, data = menarche) }
results <- sparkLapply(families, train)
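Since the result is collected back to the driver as an ordinary R list of fitted glm objects, it can be inspected with plain base R. As a small sketch (assuming the call above succeeded):

```r
# results is a plain R list of glm fits returned to the driver, so the
# fitted coefficients can be pulled out with ordinary lapply.
coefs <- lapply(results, coef)
```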

How was this patch tested?

This PR was tested in SparkR. I am unfamiliar with R and SparkR, so any feedback on style, testing, etc. will be much appreciated.

cc @falaki @davies

@thunterdb
Contributor Author

Some other changes got merged; removing them from this PR.

@SparkQA

SparkQA commented Apr 15, 2016

Test build #55960 has finished for PR 12426 at commit 651954f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#'
#' @param list the list of elements
#' @param func a function that takes one argument.
#' @noRd
Member

If this is an "exported" function then it should not have @noRd - please see something like this

Contributor Author

Sorry, I missed this comment

@felixcheung
Member

Could we have some tests for this?

@thunterdb
Contributor Author

@felixcheung do you have an example I could follow for testing in R?

@thunterdb
Contributor Author

Forget my last comment, I found the other tests.

@SparkQA

SparkQA commented Apr 18, 2016

Test build #56111 has finished for PR 12426 at commit 2f7c60f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

test_that("sparkLapply should perform simple transforms", {
  doubled <- sparkLapply(1:10, function(x) { 2 * x })
  expect_equal(doubled, as.list(2 * 1:10))
})
Contributor

new line

Contributor Author

Done

@mengxr
Contributor

mengxr commented Apr 18, 2016

@thunterdb What is the story about function serialization? If there are limitations, we should document them.

@felixcheung
Member

@thunterdb please check out my earlier comment on the code doc format, thanks

@shivaram
Contributor

@mengxr Regarding function serialization, there is a subsection in https://docs.google.com/document/d/1oegI3OjmK_a-ME4m7sdL4ZlzY7wkXzfaX69GqQqK0VI/edit#heading=h.ei763k8tkz8o that discusses what we assume at a high level. I think that might be useful to add to the documentation (see also the notes about some known issues / bugs).
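As a hedged sketch of what closure serialization has to handle (the exact assumptions are spelled out in the linked design doc): any variable the function references from its enclosing environment must be captured, serialized, and shipped to the workers along with the function itself.

```r
# factor is not an argument of the function, so it must be captured from
# the enclosing environment and serialized to the workers together with
# the function body.
factor <- 3
scaled <- sparkLapply(1:4, function(x) { factor * x })
# Objects that cannot be serialized from R (e.g. open connections or
# JVM-backed handles) cannot be captured this way.
```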

#'}
sparkLapply <- function(list, func) {
  sc <- get(".sparkRjsc", envir = .sparkREnv)
  rdd <- parallelize(sc, list, length(list))
Member

I'm guessing people could get confused about when to call this vs. when to call the newly proposed dapply (#12493). Perhaps we need to explain this more and check class(list) in the event someone passes a Spark DataFrame to this function.

Contributor

dapply and spark.lapply have different semantics. No need to check class(list) here, as a DataFrame can be treated as a list of columns. parallelize() will issue a warning for a DataFrame here: https://github.com/apache/spark/blob/master/R/pkg/R/context.R#L110

Member

@felixcheung felixcheung Apr 20, 2016

It actually fails here instead: https://github.com/apache/spark/blob/master/R/pkg/R/context.R#L116 (a Spark DataFrame does not satisfy is.data.frame).

@SparkQA

SparkQA commented Apr 25, 2016

Test build #56903 has finished for PR 12426 at commit 2ad7b89.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

FWIW the error from Jenkins is Error in namespaceExport(ns, exports) : undefined exports: sparkLapply

#' @examples
#' Here is a trivial example that doubles the values in a list
#' \dontrun{
#' doubled <- sparkLapply(1:10, function(x) { 2 * x })
Member

Here, too.

@felixcheung
Member

@thunterdb re: roxygen2 doc - please add:
@rdname spark.lapply
@return for return type and value
@export

@SparkQA

SparkQA commented Apr 27, 2016

Test build #57178 has finished for PR 12426 at commit 2433f25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' doubled <- spark.lapply(1:10, function(x) { 2 * x })
#'}
spark.lapply <- function(list, func) {
  sc <- get(".sparkRjsc", envir = .sparkREnv)
Contributor

One minor thing: all the existing functions like parallelize take a Spark context as the first argument. We've discussed removing this in the past (see #9192) but we didn't reach a resolution on it.

So, to be consistent, it'd be better to take sc as the first argument here?

Contributor Author

Sure, I thought it was part of the design but I am happy to do that as it simplifies that piece of code.
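With sc taken as the first argument, the body becomes a short sketch like the following (a hypothetical form assuming SparkR's internal parallelize, map, and collect RDD helpers; the merged code may differ in details):

```r
# Hypothetical revision taking the Spark context explicitly, for
# consistency with parallelize() and the other context functions.
spark.lapply <- function(sc, list, func) {
  rdd <- parallelize(sc, list, length(list))  # distribute the list elements
  results <- map(rdd, func)                   # apply func on the workers
  collect(results)                            # gather the results to the driver
}
```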

@SparkQA

SparkQA commented Apr 29, 2016

Test build #57286 has finished for PR 12426 at commit 378b437.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

This looks pretty good to me. @mengxr @felixcheung any other comments?

@mengxr
Contributor

mengxr commented Apr 29, 2016

LGTM2. Merged into master. Thanks!

@asfgit asfgit closed this in 769a909 Apr 29, 2016
@shivaram
Contributor

@mengxr - We should add details about this in the SparkR programming guide. Can you add this to the QA/docs JIRA we have for 2.0?

@thunterdb
Contributor Author

@shivaram @felixcheung @dongjoon-hyun thank you for your comments on my first R pull request!

Also, I put a note in the ticket about updating the documentation.
