Conversation


@sun-rui sun-rui commented Apr 19, 2016

What changes were proposed in this pull request?

dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.

The function signature is:

dapply(df, function(localDF) {}, schema = NULL)

R function input: local data.frame from the partition on local node
R function output: local data.frame

The schema specifies the row format of the resulting DataFrame. It must match the output of the R function.
If the schema is not specified, each partition of the resulting DataFrame is serialized in R into a single byte array. Such a DataFrame can only be processed by successive calls to dapply().
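
For illustration, a minimal sketch of the API described above (assumes the Spark 2.0-era SparkR API with an existing sqlContext; the functions shown just pass data through):

```r
# Identity function: the result has the same schema as the input.
df <- createDataFrame(sqlContext, mtcars)
df1 <- dapply(df, function(localDF) { localDF }, schema(df))

# Without a schema, each result partition stays as a serialized byte
# array; such a DataFrame can only be consumed by further dapply() calls.
df2 <- dapply(df, function(localDF) { localDF })
```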

How was this patch tested?

SparkR unit tests.


SparkQA commented Apr 19, 2016

Test build #56206 has finished for PR 12493 at commit 00a8c1c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


sun-rui commented Apr 19, 2016

@rxin, @davies, @NarineK, @shivaram, please help review this so that it can catch Spark 2.0.


SparkQA commented Apr 19, 2016

Test build #56209 has finished for PR 12493 at commit e6b67b0.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' @family DataFrame functions
#' @rdname dapply
#' @name dapply
#' @export
Member

pls add doc example

Contributor Author

done

@shivaram
Contributor

Thanks @sun-rui for the change. I did a first pass over it. It would be good to add some more test cases for where the schema is not specified as well.

Also I think we need somebody from the SQL side to look at this (cc @rxin @davies)


SparkQA commented Apr 20, 2016

Test build #56320 has finished for PR 12493 at commit 480dec9.

  • This patch fails some tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.


sun-rui commented Apr 20, 2016

@shivaram, there is already a test case for the case where the schema is not specified. Do you mean adding more?


rxin commented Apr 20, 2016

@davies should take a detailed look at this.

This looks pretty good based on my very very quick glance.

* A function wrapper that applies the given R function to each partition.
*/
private[sql] case class MapPartitionsRWrapper(
func: Array[Byte],
Contributor

indent

Contributor Author

done


rxin commented Apr 20, 2016

BTW one observation: FWIW, I think the dapplyCollect method will be a lot more useful, because that's the one that can be used for training models, etc.


SparkQA commented Apr 20, 2016

Test build #56322 has finished for PR 12493 at commit 80da663.

  • This patch fails SparkR unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.


sun-rui commented Apr 20, 2016

The test failure is weird:

1. Failure (at test_sparkSQL.R#1973): dapply() on a DataFrame ------------------
expected is not identical to result. Differences: 
Attributes: < Component "row.names": Numeric: lengths (33, 32) differ >
Component "mpg": Numeric: lengths (33, 32) differ

The unit tests passed on my machine. Does anyone have an idea?


sun-rui commented Apr 20, 2016

@rxin, I will implement dapplyCollect() and collect() on a DataFrame of serialized R data in a follow-up PR.

@shivaram
Contributor

@sun-rui Regarding the unit tests, could it be related to the R version or the version of testthat we are using on Jenkins?


sun-rui commented Apr 21, 2016

When I rebased this PR to master, I found a bug in Catalyst optimizer. I submitted a PR for it #12575. I have to wait for it to be fixed.

#' @examples
#' \dontrun{
#' df <- createDataFrame (sqlContext, mtcars)
#' df1 <- dapply(df, function(x) { x }, schema(df))
Member

Could we have a more elaborate example to explain how func should expect or handle "each partition of the DataFrame"?

Contributor Author

added
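
For reference, an example in the spirit of what was requested here -- not necessarily the one that was added to the docs; the column names and schema are made up:

```r
# The function is called once per partition and receives that partition
# as an ordinary local data.frame, so plain R code applies to it.
df <- createDataFrame(sqlContext, data.frame(a = 1:6, b = rnorm(6)))
newSchema <- structType(structField("a", "integer"),
                        structField("b", "double"),
                        structField("sum", "double"))
df1 <- dapply(df, function(x) { cbind(x, x$a + x$b) }, newSchema)
```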


SparkQA commented Apr 22, 2016

Test build #56674 has finished for PR 12493 at commit 76a6fd7.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 22, 2016

Test build #56684 has finished for PR 12493 at commit 481df69.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 22, 2016

Test build #56663 has finished for PR 12493 at commit 75dae85.

  • This patch fails SparkR unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.


sun-rui commented Apr 27, 2016

Jenkins, retest this please


SparkQA commented Apr 27, 2016

Test build #57110 has finished for PR 12493 at commit b39466c.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

@sun-rui I poked around this a little bit more today. It looks like what is happening is that somehow we are creating factor-type objects when we have strings in our data frame. I think the problem is in the line
data <- do.call(rbind.data.frame, c(data, stringsAsFactors = FALSE)) in worker.R -- I am not sure the stringsAsFactors = FALSE is being passed correctly.


sun-rui commented Apr 28, 2016

@shivaram, is the R version on Jenkins 3.1.1? It seems I need to test with it.

@shivaram
Contributor

Yeah the version on Jenkins is R version 3.1.1 (2014-07-10) and on my laptop is R version 3.2.1 (2015-06-18). I can see the error on my laptop as well


sun-rui commented Apr 28, 2016

I am using R 3.2.4. I just re-ran the test with success. OK, let me try some older versions.

@shivaram
Contributor

Aha - I think the option didn't exist before. From https://cran.r-project.org/src/base/NEWS

CHANGES IN R 3.2.4:
....
    The data.frame method of rbind() gains an optional argument
      stringsAsFactors (instead of only depending on
      getOption("stringsAsFactors")).
....
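
A minimal sketch of the version difference (list contents made up; the pre-3.2.4 behavior described in the comments is inferred, not verified):

```r
rows <- list(list(name = "a"), list(name = "b"))

# R >= 3.2.4: the named argument is recognized by the data.frame method
# of rbind(), so character columns stay character.
df <- do.call(rbind.data.frame, c(rows, stringsAsFactors = FALSE))

# R < 3.2.4: rbind.data.frame has no stringsAsFactors formal, so the
# named value presumably falls through into `...` and is row-bound as
# one more element -- which would also explain the 33-vs-32 row mismatch
# in the test failure reported earlier -- while string-to-factor
# conversion still follows getOption("stringsAsFactors").
```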

@shivaram
Contributor

I think the best workaround is to do something like

oldOpt <- getOption("stringsAsFactors")
options(stringsAsFactors=FALSE)
data <- do.call(rbind.data.frame, data)
options(stringsAsFactors=oldOpt)

i.e. set the global option before calling rbind and then reset it to the previous value


sun-rui commented Apr 28, 2016

I am not sure whether it is necessary to set "stringsAsFactors" to FALSE; I just added it for safety. Remove it for now?


sun-rui commented Apr 28, 2016

and add a comment for a future revisit

@shivaram
Contributor

Yeah adding a comment to revisit in future sounds good.

@shivaram
Contributor

FWIW I tried the 4 lines I wrote above and it works on my machine. The code in worker.R looks something like

...
+    if (isDataFrame) {
+      if (deserializer == "row") {
+        # Transform the list of rows into a data.frame
+        oldOpt <- getOption("stringsAsFactors")
+        options(stringsAsFactors = FALSE)
+        data <- do.call(rbind.data.frame, data)
+        options(stringsAsFactors = oldOpt)
+        names(data) <- colNames
+      } else {
...


sun-rui commented Apr 28, 2016

Yes, I tried; "stringsAsFactors" must be FALSE, as our SerDe does not support factors for now,
so I am changing the code per your proposal.


sun-rui commented Apr 28, 2016

@shivaram, I changed the code. Let's wait for the test result :)

@shivaram
Contributor

Cool, the R code LGTM. @davies / @rxin, if one of you can take a final pass at the SQL changes, this should be good to go.


SparkQA commented Apr 28, 2016

Test build #57201 has finished for PR 12493 at commit 2264b57.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Returns a new [[DataFrame]] that contains the result of applying a serialized R function
* `func` to each partition.
*
* @group func
Contributor

Maybe we can add a @since attribute in the comment?

Contributor Author

@sun-rui sun-rui Apr 29, 2016

Spark 2.0 is a good chance to add "since" for the SparkR API methods. But I think we should do it consistently for all methods at once. I will submit a new JIRA issue for it.



davies commented Apr 28, 2016

LGTM overall. There are still a few changes that are not needed by this PR (for example, SERIALIZED_R_DATA_SCHEMA); are these kept for the future?


sun-rui commented Apr 29, 2016

@davies, yes, those changes are deliberately kept for future PRs, like dapplyCollect().


SparkQA commented Apr 29, 2016

Test build #57296 has finished for PR 12493 at commit 3efe9f5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

Jenkins, retest this please


SparkQA commented Apr 29, 2016

Test build #57312 has finished for PR 12493 at commit 3efe9f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

Merging this to master

@asfgit asfgit closed this in 4ae9fe0 Apr 29, 2016