[SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR. #12493
Conversation
|
Test build #56206 has finished for PR 12493 at commit
|
|
Test build #56209 has finished for PR 12493 at commit
|
| #' @family DataFrame functions | ||
| #' @rdname dapply | ||
| #' @name dapply | ||
| #' @export |
pls add doc example
done
|
Test build #56320 has finished for PR 12493 at commit
|
|
@shivaram, there is already a test case for the case where the schema is not specified. Do you mean adding more? |
|
@davies should take a detailed look at this. This looks pretty good based on my very very quick glance. |
| * A function wrapper that applies the given R function to each partition. | ||
| */ | ||
| private[sql] case class MapPartitionsRWrapper( | ||
| func: Array[Byte], |
indent
done
|
BTW, one observation: I think the dapplyCollect method will be a lot more useful, because that's the one that can be used for training models, etc. |
|
Test build #56322 has finished for PR 12493 at commit
|
|
The test failure is weird: the unit tests passed on my machine. Does anyone have an idea? |
|
@rxin, I will implement dapplyCollect() and collect() on a DataFrame of serialized R data in a follow-up PR. |
|
@sun-rui Regarding the unit tests could it be related to the R version or the version of testthat we are using on Jenkins ? |
|
When I rebased this PR onto master, I found a bug in the Catalyst optimizer and submitted a PR for it: #12575. I have to wait for that to be fixed. |
| #' @examples | ||
| #' \dontrun{ | ||
| #' df <- createDataFrame (sqlContext, mtcars) | ||
| #' df1 <- dapply(df, function(x) { x }, schema(df)) |
Could we have a more elaborate example to explain what func should expect and how it should handle "each partition of the DataFrame"?
added
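For reference, a more elaborate example along the lines requested might look like the sketch below (this is an illustration, not the exact example added to the docs; the column names come from mtcars):

```r
# Sketch: func receives each partition of the DataFrame as a local R
# data.frame and must return a local data.frame matching the schema.
df <- createDataFrame(sqlContext, mtcars)
# Schema describing func's output: two input columns plus a derived one.
schema <- structType(structField("mpg", "double"),
                     structField("disp", "double"),
                     structField("dispByMpg", "double"))
df1 <- dapply(df, function(x) {
  # x holds all rows of one partition; ordinary data.frame operations apply
  cbind(x[, c("mpg", "disp")], x$disp / x$mpg)
}, schema)
head(collect(df1))
```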
|
Test build #56674 has finished for PR 12493 at commit
|
|
Test build #56684 has finished for PR 12493 at commit
|
|
Test build #56663 has finished for PR 12493 at commit
|
|
Jenkins, retest this please |
|
Test build #57110 has finished for PR 12493 at commit
|
|
@sun-rui I poked around this a little bit more today. It looks like what is happening is that somehow we are creating |
|
@shivaram, is the R version on Jenkins 3.1.1? It seems I need to test with it. |
|
Yeah the version on Jenkins is |
|
I am using R 3.2.4. I just re-ran the test again with success. OK, let me try some older versions. |
|
Aha - I think the option didn't exist before. From https://cran.r-project.org/src/base/NEWS |
|
I think the best workaround is to set the global option before calling rbind and then reset it to the previous value |
|
I am not sure if it is necessary to set "stringsAsFactors" to FALSE; I just added it for safety. Remove it for now? |
|
and add a comment for a future revisit |
|
Yeah, adding a comment to revisit this in the future sounds good. |
|
FWIW I tried the 4 lines I wrote above and it works on my machine. The code in worker.R looks something like |
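The workaround discussed above might look roughly like this in base R (a sketch, not the exact worker.R code; `listOfRows` is a hypothetical name for the list being combined):

```r
# Temporarily force stringsAsFactors = FALSE via the global option, since
# older R versions (e.g. 3.1.1 on Jenkins) do not support passing it in the
# relevant calls, and SparkR's SerDe cannot handle factor columns.
oldOpt <- getOption("stringsAsFactors")   # remember the previous value
options(stringsAsFactors = FALSE)
combined <- do.call(rbind, listOfRows)    # rbind without creating factors
options(stringsAsFactors = oldOpt)        # restore the previous value
```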
|
Yes, I tried it: "stringsAsFactors" must be FALSE, as our SerDe does not support factors for now |
|
@shivaram, I changed the code. Let's wait for the test result :) |
|
Test build #57201 has finished for PR 12493 at commit
|
| * Returns a new [[DataFrame]] that contains the result of applying a serialized R function | ||
| * `func` to each partition. | ||
| * | ||
| * @group func |
Maybe we can add a @since attribute in the comment?
Spark 2.0 is a good chance to add "since" to SparkR API methods. But I think we should do it consistently for all methods at once. I will submit a new JIRA issue for it.
|
LGTM overall. There are still a few changes that are not needed by this PR (for example, SERIALIZED_R_DATA_SCHEMA); are these kept for the future? |
|
@davies, yes, those changes are deliberately kept for future PRs, like dapplyCollect() |
|
Test build #57296 has finished for PR 12493 at commit
|
|
Jenkins, retest this please |
|
Test build #57312 has finished for PR 12493 at commit
|
|
Merging this to master |
What changes were proposed in this pull request?
dapply() applies an R function to each partition of a DataFrame and returns a new DataFrame.
The function signature is:
R function input: a local data.frame corresponding to the partition on the local node
R function output: a local data.frame
The schema specifies the row format of the resulting DataFrame. It must match the output of the R function.
If the schema is not specified, each partition of the resulting DataFrame is serialized in R into a single byte array. Such a DataFrame can be processed by subsequent calls to dapply().
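A sketch of the behavior described above (illustrative only; column names are made up, and the exact form of the schema-less call is assumed):

```r
# With a schema: func's output is converted back into a typed DataFrame.
df <- createDataFrame(sqlContext, data.frame(a = 1:10, b = 11:20))
schema <- structType(structField("a", "integer"),
                     structField("b", "integer"),
                     structField("s", "integer"))
typed <- dapply(df, function(part) {
  cbind(part, part$a + part$b)   # per partition: append a derived column
}, schema)
head(collect(typed))

# Without a schema, each result partition would instead be kept as a
# single serialized byte array, to be consumed by a later dapply() call.
```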
How was this patch tested?
SparkR unit tests.