[SPARK-7264][ML] Parallel lapply for sparkR #12426
Conversation
Some other changes got merged, removing them

Test build #55960 has finished for PR 12426 at commit
R/pkg/R/context.R
Outdated
#'
#' @param list the list of elements
#' @param func a function that takes one argument.
#' @noRd
if this is an "exported" function then it should not have @noRd - please see something like this
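As an illustration (my own sketch of standard roxygen2 conventions, not the exact header from the linked example), an exported function would carry @export instead of @noRd:

#' Run a function over a list of elements, distributing the computations with Spark
#'
#' @param list the list of elements
#' @param func a function that takes one argument
#' @export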
Sorry, I missed this comment
could we have some tests for this?
@felixcheung do you have an example I could follow for testing in R?
Forget my last comment, I found the other tests.

Test build #56111 has finished for PR 12426 at commit
test_that("sparkLapply should perform simple transforms", {
  doubled <- sparkLapply(1:10, function(x) { 2 * x })
  expect_equal(doubled, as.list(2 * 1:10))
})
new line
Done
@thunterdb What is the story about function serialization? If there are limitations, we should document them.

@thunterdb please check out my earlier comment on code doc format, thanks

@mengxr Regarding function serialization, there is a subsection in https://docs.google.com/document/d/1oegI3OjmK_a-ME4m7sdL4ZlzY7wkXzfaX69GqQqK0VI/edit#heading=h.ei763k8tkz8o that discusses what we assume at a high level. I think that might be useful to add to the documentation (see also the notes about some known issues / bugs).
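As a hedged illustration of that assumption (my own sketch, not taken from the design doc, written with the spark.lapply(list, func) form used in this PR): variables referenced by the supplied function are captured with its closure and serialized to the workers, while side effects on the workers do not propagate back to the driver.

base <- 10
shifted <- spark.lapply(1:3, function(x) {
  x + base   # works: `base` is captured with the closure and shipped to each worker
})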
#'}
sparkLapply <- function(list, func) {
  sc <- get(".sparkRjsc", envir = .sparkREnv)
  rdd <- parallelize(sc, list, length(list))
I'm guessing people could get confused about when to call this vs. when to call the newly proposed dapply (#12493). Perhaps we need to explain this more, and check class(list) in case someone passes a Spark DataFrame to this function.
dapply and spark.lapply have different semantics. No need to check class(list) here, as a DataFrame can be treated as a list of columns. parallelize() will issue a warning for a DataFrame here: https://github.com/apache/spark/blob/master/R/pkg/R/context.R#L110
It actually fails here instead: https://github.com/apache/spark/blob/master/R/pkg/R/context.R#L116
A Spark DataFrame does not satisfy is.data.frame, so that check rejects it.
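To make the distinction concrete, a rough sketch (my own, assuming the Spark 2.x SparkR API): spark.lapply maps a function over an ordinary R list and returns a local list, while dapply maps a function over the partitions of a SparkDataFrame and needs an output schema.

# spark.lapply: local list in, local list out
doubled <- spark.lapply(as.list(1:10), function(x) { 2 * x })

# dapply: SparkDataFrame in, SparkDataFrame out; func sees a local data.frame per partition
df <- createDataFrame(data.frame(x = 1:10))
schema <- structType(structField("x2", "integer"))
doubledDF <- dapply(df, function(part) { data.frame(x2 = part$x * 2L) }, schema)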
Test build #56903 has finished for PR 12426 at commit

FWIW the error from Jenkins is
R/pkg/R/context.R
Outdated
#' @examples
#' Here is a trivial example that double the values in a list
#'\dontrun{
#' doubled <- sparkLapply(1:10, function(x){2 * x})
Here, too.
@thunterdb re: roxygen2 doc - please add:

Test build #57178 has finished for PR 12426 at commit
R/pkg/R/context.R
Outdated
#' doubled <- spark.lapply(1:10, function(x){2 * x})
#'}
spark.lapply <- function(list, func) {
  sc <- get(".sparkRjsc", envir = .sparkREnv)
One minor thing: all the existing functions like parallelize take a Spark context as the first argument. We've discussed removing this in the past (see #9192) but we didn't reach a resolution on it.
So, to be consistent, it would be better to take sc as the first argument here?
Sure, I thought it was part of the design but I am happy to do that as it simplifies that piece of code.
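For illustration, a call site with sc passed first would look roughly like this (a sketch of the suggestion, not necessarily the form that was merged):

sc <- sparkR.init()
doubled <- spark.lapply(sc, 1:10, function(x) { 2 * x })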
Test build #57286 has finished for PR 12426 at commit
This looks pretty good to me. @mengxr @felixcheung any other comments?

LGTM2. Merged into master. Thanks!

@mengxr - We should add details about this in the SparkR programming guide. Can you add this to the QA/docs JIRA we have for 2.0?

@shivaram @felixcheung @dongjoon-hyun thank you for your comments on my first R pull request! Also, I put a note in the ticket about updating the documentation.
What changes were proposed in this pull request?
This PR adds a new function in SparkR called sparkLapply(list, function). This function implements a distributed version of lapply using Spark as a backend.

TODO:
Trivial example in SparkR:
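A minimal sketch of such a call, based on the doubling example used in this PR's docs and tests (written here with the final spark.lapply name):

doubled <- spark.lapply(1:10, function(x) { 2 * x })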
Output:
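Roughly, a plain R list equal to as.list(2 * 1:10), i.e. the values 2, 4, ..., 20, as asserted in the unit test.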
Here is a slightly more complex example to perform distributed training of multiple models. Under the hood, Spark broadcasts the dataset.
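A hedged sketch of that pattern (my own illustration, not the exact example from the PR description): fit one model per hyperparameter value, letting Spark ship the closure, and with it the local dataset, to the workers.

cars <- mtcars        # small local dataset, captured by the closure and shipped to workers
degrees <- 1:4        # hypothetical hyperparameter grid
models <- spark.lapply(degrees, function(d) {
  lm(mpg ~ poly(hp, d), data = cars)   # one model per polynomial degree, fit on a worker
})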
How was this patch tested?
This PR was tested in SparkR. I am unfamiliar with R and SparkR, so any feedback on style, testing, etc. will be much appreciated.
cc @falaki @davies