Conversation

@thunterdb
Contributor

@thunterdb thunterdb commented Apr 15, 2016

What changes were proposed in this pull request?

This PR adds a new function to SparkR, sparkLapply(list, func), which implements a distributed version of lapply using Spark as a backend.

TODO:

  • check documentation
  • check tests

Trivial example in SparkR:

sparkLapply(1:5, function(x) { 2 * x })

Output:

[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 6

[[4]]
[1] 8

[[5]]
[1] 10

Here is a slightly more complex example that performs distributed training of multiple models. Under the hood, Spark broadcasts the dataset to the workers.

library("MASS")
data(menarche)
families <- c("gaussian", "poisson")
train <- function(family) { glm(Menarche ~ Age, family = family, data = menarche) }
results <- sparkLapply(families, train)
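Since the result is collected back to the driver as an ordinary R list of fitted glm objects, it can be inspected with plain base R. As a small sketch (assuming the call above succeeded):

```r
# results is a plain R list of glm fits returned to the driver, so the
# fitted coefficients can be pulled out with ordinary lapply.
coefs <- lapply(results, coef)
```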

How was this patch tested?

This PR was tested in SparkR. I am unfamiliar with R and SparkR, so any feedback on style, testing, etc. will be much appreciated.

cc @falaki @davies

@thunterdb
Contributor Author

Some other changes got merged; removing them from this PR.

@SparkQA

SparkQA commented Apr 15, 2016

Test build #55960 has finished for PR 12426 at commit 651954f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#'
#' @param list the list of elements
#' @param func a function that takes one argument.
#' @noRd
Member

If this is an "exported" function then it should not have @noRd - please see something like this

Contributor Author

Sorry, I missed this comment

@felixcheung
Member

Could we have some tests for this?

@thunterdb
Contributor Author

@felixcheung do you have an example I could follow for testing in R?

@thunterdb
Contributor Author

Forget my last comment, I found the other tests.

@SparkQA

SparkQA commented Apr 18, 2016

Test build #56111 has finished for PR 12426 at commit 2f7c60f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

test_that("sparkLapply should perform simple transforms", {
  doubled <- sparkLapply(1:10, function(x) { 2 * x })
  expect_equal(doubled, as.list(2 * 1:10))
})
Contributor

new line

Contributor Author

Done

@mengxr
Contributor

mengxr commented Apr 18, 2016

@thunterdb What is the story about function serialization? If there are limitations, we should document them.

@felixcheung
Member

@thunterdb please check out my earlier comment on the code doc format, thanks

@shivaram
Contributor

@mengxr Regarding function serialization, there is a subsection in https://docs.google.com/document/d/1oegI3OjmK_a-ME4m7sdL4ZlzY7wkXzfaX69GqQqK0VI/edit#heading=h.ei763k8tkz8o that discusses what we assume at a high level. I think that might be useful to add to the documentation (see also the notes about some known issues / bugs).
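As a hedged sketch of what closure serialization has to handle (the exact assumptions are spelled out in the linked design doc): any variable the function references from its enclosing environment must be captured, serialized, and shipped to the workers along with the function itself.

```r
# factor is not an argument of the function, so it must be captured from
# the enclosing environment and serialized to the workers together with
# the function body.
factor <- 3
scaled <- sparkLapply(1:4, function(x) { factor * x })
# Objects that cannot be serialized from R (e.g. open connections or
# JVM-backed handles) cannot be captured this way.
```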

#'}
sparkLapply <- function(list, func) {
  sc <- get(".sparkRjsc", envir = .sparkREnv)
  rdd <- parallelize(sc, list, length(list))
Member

I'm guessing people could get confused about when to call this vs. when to call the newly proposed dapply (#12493). Perhaps we need to explain this more and check class(list) in the event someone passes a Spark DataFrame to this function.

Contributor

dapply and spark.lapply have different semantics. No need to check class(list) here, as a DataFrame can be treated as a list of columns. parallelize() will issue a warning for a DataFrame here: https://github.com/apache/spark/blob/master/R/pkg/R/context.R#L110

Member

@felixcheung felixcheung Apr 20, 2016

It actually fails here instead: https://github.com/apache/spark/blob/master/R/pkg/R/context.R#L116 (a Spark DataFrame does not satisfy is.data.frame).

@SparkQA

SparkQA commented Apr 25, 2016

Test build #56903 has finished for PR 12426 at commit 2ad7b89.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

FWIW the error from Jenkins is Error in namespaceExport(ns, exports) : undefined exports: sparkLapply

#' @examples
#' Here is a trivial example that doubles the values in a list
#' \dontrun{
#' doubled <- sparkLapply(1:10, function(x) { 2 * x })
Member

Here, too.

@felixcheung
Member

@thunterdb re: roxygen2 doc - please add:
@rdname spark.lapply
@return for return type and value
@export

@SparkQA

SparkQA commented Apr 27, 2016

Test build #57178 has finished for PR 12426 at commit 2433f25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' doubled <- spark.lapply(1:10, function(x) { 2 * x })
#'}
spark.lapply <- function(list, func) {
  sc <- get(".sparkRjsc", envir = .sparkREnv)
Contributor

One minor thing: all the existing functions like parallelize take a Spark context as the first argument. We've discussed removing this in the past (see #9192) but we didn't reach a resolution on it.

So, to be consistent, it'd be better to take sc as the first argument here?

Contributor Author

Sure, I thought it was part of the design but I am happy to do that as it simplifies that piece of code.
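With sc taken as the first argument, the body becomes a short sketch like the following (a hypothetical form assuming SparkR's internal parallelize, map, and collect RDD helpers; the merged code may differ in details):

```r
# Hypothetical revision taking the Spark context explicitly, for
# consistency with parallelize() and the other context functions.
spark.lapply <- function(sc, list, func) {
  rdd <- parallelize(sc, list, length(list))  # distribute the list elements
  results <- map(rdd, func)                   # apply func on the workers
  collect(results)                            # gather the results to the driver
}
```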

@SparkQA

SparkQA commented Apr 29, 2016

Test build #57286 has finished for PR 12426 at commit 378b437.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

This looks pretty good to me. @mengxr @felixcheung any other comments?

@mengxr
Contributor

mengxr commented Apr 29, 2016

LGTM2. Merged into master. Thanks!

@asfgit asfgit closed this in 769a909 Apr 29, 2016
@shivaram
Contributor

@mengxr - We should add details about this in the SparkR programming guide. Can you add this to the QA/docs JIRA we have for 2.0?

@thunterdb
Contributor Author

@shivaram @felixcheung @dongjoon-hyun thank you for your comments on my first R pull request!

Also, I put a note in the ticket about updating the documentation.
