
Conversation

@clarkfitzg
Contributor

What changes were proposed in this pull request?

Fixed bug in dapplyCollect by changing the compute function of worker.R to explicitly handle raw (binary) vectors.
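For reference, the helper at the heart of this change (fragments of it are quoted in the review comments below) boils down to roughly the following sketch; the merged version may differ in details:

# Bind a list of rows (each row a list of column values) into a data.frame,
# keeping raw (binary) columns intact rather than letting rbind coerce them.
rbindRaws <- function(inputData) {
  row1 <- inputData[[1]]
  rawcolumns <- ("raw" == sapply(row1, class))

  # rbind over lists yields a matrix of mode "list", so raw vectors survive
  listmatrix <- do.call(rbind, inputData)

  # A data.frame whose columns are all lists
  out <- as.data.frame(listmatrix)

  # Flatten only the non-raw columns back to atomic vectors
  out[!rawcolumns] <- lapply(out[!rawcolumns], unlist)
  out
}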

cc @shivaram

How was this patch tested?

Unit tests

@shivaram
Contributor

Jenkins, ok to test

@shivaram
Contributor

Thanks @clarkfitzg -- I'll take a look at this tomorrow

@clarkfitzg
Contributor Author

My pleasure. Let me know if / when I should squash these commits or rebase.

Working on some before and after benchmarks now.

createDataFrame.default <- function(data, schema = NULL, samplingRatio = 1.0) {
  sparkSession <- getSparkSession()

  # Convert dataframes into a list of rows. Each row is a list
Contributor

@sun-rui commented Aug 24, 2016

how about " If the data is a dataframe, convert it into ..."?

@SparkQA

SparkQA commented Aug 24, 2016

Test build #64335 has finished for PR 14783 at commit 5871257.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 24, 2016

Test build #64337 has finished for PR 14783 at commit 84ef4cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@clarkfitzg
Contributor Author

This change doesn't appear to make any difference in speed.

# Wed Aug 24 14:12:12 KST 2016
# Benchmarking performance before and after dapplyCollect patch

# Downloaded data here:
# https://s3-us-west-2.amazonaws.com/sparkr-data/nycflights13.csv

library(SparkR)
library(microbenchmark)

sparkR.session()

df <- read.csv("~/data/nycflights13.csv")

sdf <- createDataFrame(df)

# BEFORE: 7.27 seconds
# AFTER: 7.20 seconds
# The patch shouldn't change this at all
microbenchmark({sdf <- createDataFrame(df)}, times=1)

# BEFORE: 502 seconds
# AFTER: 508 seconds
microbenchmark({
    df2 <- dapplyCollect(sdf, function(x) x)
}, times=1)

@clarkfitzg
Contributor Author

Not sure why these timings are so bad. I found out today that by using bytes and calling directly into Java's org.apache.spark.api.r.RRDD these can be improved by two orders of magnitude.

@clarkfitzg
Contributor Author

Not completely sure though. I'll look into these timings a little further on Saturday to make sure I'm making a fair comparison.

@clarkfitzg
Contributor Author

Tried some more benchmarks today. Didn't see any difference in speed before/after the patch. Observing the processes as they run, I see the vast majority of the time spent in the local R process, with just a couple of seconds in the actual parallel evaluation of the functions.


@clarkfitzg
Contributor Author

@shivaram what do you think?

@sun-rui
Contributor

sun-rui commented Aug 31, 2016

@clarkfitzg, your patch is a bug fix, not a performance improvement, right? If so, since there is no performance regression according to your benchmark, let's focus on the functionality. We can address performance issues in separate JIRA issues.

@clarkfitzg
Contributor Author

Yes, this is only a bug fix. @shivaram mentioned in a previous email exchange that it would be good to see some performance benchmarks as well.

@felixcheung
Member

felixcheung commented Aug 31, 2016

Should we have a test against a DataFrame with a binary column?
Or should this test_that("dapplyCollect() on dataframe with list columns") say bytes column or binary column?
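For illustration, here is a sketch of the kind of test being suggested. The testthat and SparkR calls are standard, but the data setup is hypothetical and assumes createDataFrame maps a list-of-raw column to a Spark binary column:

# Hypothetical test: round-trip a binary (raw) column through dapplyCollect
test_that("dapplyCollect() on a DataFrame with a binary column", {
  b <- serialize(1:10, connection = NULL)  # a raw vector
  ldf <- data.frame(key = 1:2)
  ldf$bytes <- list(b, b)                  # list column holding raw vectors
  sdf <- createDataFrame(ldf)
  result <- dapplyCollect(sdf, function(x) x)
  expect_equal(result$bytes, ldf$bytes)
})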

@SparkQA

SparkQA commented Aug 31, 2016

Test build #64712 has finished for PR 14783 at commit 0c2a215.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 31, 2016

Test build #64737 has finished for PR 14783 at commit 77fa9b4.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

shivaram commented Sep 1, 2016

Sorry, I think this was a breakage that I just fixed in #14904.

Jenkins, retest this please.

@shivaram
Contributor

shivaram commented Sep 1, 2016

Jenkins, retest this please

@SparkQA

SparkQA commented Sep 1, 2016

Test build #64756 has finished for PR 14783 at commit 77fa9b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

LGTM

@shivaram
Contributor

shivaram commented Sep 1, 2016

@sun-rui Any other comments?

@clarkfitzg
Contributor Author

I'm presenting something related to this on Thursday -- it would be nice to tell the audience this patch made it in. Can I do anything to help this along?

# Determine which columns of the first row are raw (binary) vectors
row1 <- inputData[[1]]
rawcolumns <- ("raw" == sapply(row1, class))

# rbind over a list of lists produces a matrix of mode "list",
# which preserves the raw vectors instead of coercing them
listmatrix <- do.call(rbind, inputData)
Contributor

Do you know what happens if we have a mixed set of columns here? i.e. say one column with "raw", one with "integer", and one with "character" -- from reading some docs it looks like everything is converted to a character matrix when we use rbind.

I think we have two choices if that's the case:
(a) we apply the type conversions after rbind
(b) we only call this method when all columns are raw

Contributor Author

> b = serialize(1:10, NULL)
> inputData = list(list(1L, b, 'a'), list(2L, b, 'b'))  # Mixed data types
> listmatrix <- do.call(rbind, inputData)
> listmatrix
     [,1] [,2]   [,3]
[1,] 1    Raw,62 "a"
[2,] 2    Raw,62 "b"
> class(listmatrix)
[1] "matrix"
> typeof(listmatrix)
[1] "list"
> is.character(listmatrix)
[1] FALSE

A little unusual -- it's a list matrix, hence the name. Which docs are you referring to?

The test that's in here now does test for mixed columns, but it doesn't test for a single column of raws. I'll add that now.

Contributor

I was looking at https://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html, specifically the section "Value", which says:

The type of a matrix result determined from the highest type of any of the inputs in the hierarchy raw < logical < integer < double < complex < character < list.

Member

I think the correct class is maintained:

> sapply(listmatrix, class)
[1] "integer"   "integer"   "raw"       "raw"       "character" "character"
> sapply(listmatrix, typeof)
[1] "integer"   "integer"   "raw"       "raw"       "character" "character"

Contributor

Ah I see -- the types are inside the listmatrix. Thanks @clarkfitzg for clarifying. Let us know once you have added the test for a single column of raws as well.

Contributor Author

Since everything in inputData is a list, this goes straight to the top of the hierarchy -- same as if you called rbind(list1, list2, ...).
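A quick way to see this outside Spark:

# Every argument to rbind() here is a list, so the result is a matrix of
# mode "list" and the individual elements keep their original types.
m <- rbind(list(1L, "a"), list(2L, "b"))
typeof(m)         # "list"
sapply(m, class)  # "integer" "integer" "character" "character"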

@shivaram
Contributor

shivaram commented Sep 7, 2016

Sorry for the delay @clarkfitzg -- the code change looks pretty good to me. I just had one question about mixed-type columns.

# Single binary column (r1, r2, r3 and expected are defined earlier in this
# test file)
input <- list(list(r1), list(r2), list(r3))
expected <- subset(expected, select = "V2")
result <- setNames(rbindRaws(input), "V2")
Contributor Author

@shivaram Here's the new test. I made the other ones a bit more general also.

@SparkQA

SparkQA commented Sep 7, 2016

Test build #65027 has finished for PR 14783 at commit 91d69be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

shivaram commented Sep 7, 2016

Thanks for the update. LGTM. Merging this to master and branch-2.0

asfgit closed this in 9fccde4 on Sep 7, 2016
asfgit pushed a commit that referenced this pull request Sep 7, 2016
Fixed bug in `dapplyCollect` by changing the `compute` function of `worker.R` to explicitly handle raw (binary) vectors.

cc shivaram

Unit tests

Author: Clark Fitzgerald <[email protected]>

Closes #14783 from clarkfitzg/SPARK-16785.

(cherry picked from commit 9fccde4)
Signed-off-by: Shivaram Venkataraman <[email protected]>
@clarkfitzg
Contributor Author

Thanks!

@catlain

catlain commented Jun 2, 2017

Still have this issue when the input data is an array column whose vectors don't all have the same length, like:

head(test1)

               key              value
1 4dda7d68a202e9e3              1595297780
2  4e08f349deb7392              641991337
3 4e105531747ee00b              374773009
4 4f1d5ef7fdb4620a              2570136926
5 4f63a71e6dde04cd              2117602722
6 4fa2f96b689624fc              3489692062, 1344510747, 1095592237, 424510360, 3211239587

sparkR.stop()
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
spark_df = createDataFrame(sqlContext, test1)

# Fails
dapplyCollect(spark_df, function(x) x)

Caused by: org.apache.spark.SparkException: R computation failed with
 Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = default.stringsAsFactors())  : 
  invalid list argument: all variables should have the same length
	at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
	at org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:59)
	at org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:29)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:186)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:183)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more

# Works fine
spark_df <- selectExpr(spark_df, "key", "explode(value) value") 
dapplyCollect(spark_df, function(x) x)

                key         value
1  4dda7d68a202e9e3 1595297780
2   4e08f349deb7392  641991337
3  4e105531747ee00b  374773009
4  4f1d5ef7fdb4620a 2570136926
5  4f63a71e6dde04cd 2117602722
6  4fa2f96b689624fc 3489692062
7  4fa2f96b689624fc 1344510747
8  4fa2f96b689624fc 1095592237
9  4fa2f96b689624fc  424510360
10 4fa2f96b689624fc 3211239587

@felixcheung
Member

@catlain could you please open a JIRA?
Set the component to SparkR, like this one: https://issues.apache.org/jira/browse/SPARK-21068?filter=12333531

@catlain

catlain commented Jun 13, 2017

done
jira

@clarkfitzg
Contributor Author

This patch only handled raw columns, not the vector/array value columns. So maybe the original JIRA should stay open, or another one specific to this should be created.
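For reference, the failure above can be reproduced locally without Spark. The stack trace shows the worker funneling rows through rbind.data.frame, which requires every element of a list row to have the same length; the data setup below is illustrative:

# Rows whose list elements differ in length trigger the same error seen in
# the stack trace: "invalid list argument: all variables should have the
# same length".
rows <- list(list(key = "a", value = 1L),
             list(key = "b", value = c(2L, 3L)))
do.call(rbind.data.frame, rows)  # errors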
