[SPARK-26830][SQL][R] Vectorized R dapply() implementation #23787

HyukjinKwon · 2019-02-14T09:43:59Z

What changes were proposed in this pull request?

This PR adds a vectorized dapply() in R, using Arrow optimization.

This can be tested as below:

$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true

df <- createDataFrame(mtcars)
collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double")))
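
Before running this on a cluster, it can help to check what the function passed to dapply() computes on a plain local data.frame. The sketch below assumes each partition reaches the function as an ordinary R data.frame, and `add_one_to_gear` is a hypothetical name introduced only for illustration. It needs base R only, not Spark.

```r
# The same function body as in the dapply() call, given a name
# (add_one_to_gear is a hypothetical name used only for this sketch).
add_one_to_gear <- function(rdf) {
  data.frame(rdf$gear + 1)
}

# mtcars ships with base R; here it stands in for a single partition.
local_result <- add_one_to_gear(mtcars)

# The result should have one column of doubles and one row per input row,
# matching the declared schema structType("gear double").
print(ncol(local_result))
print(nrow(local_result) == nrow(mtcars))
print(is.double(local_result[[1]]))
```

If the local result does not match the declared schema (column count or types), the Spark call is likely to fail or return unexpected values as well.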

Requirements

R 3.5.x

Arrow package 0.12+

Rscript -e 'remotes::install_github("apache/arrow@apache-arrow-0.12.0", subdir = "r")'

Note: the Arrow R package is not currently on CRAN. See ARROW-3204.
Note: the Arrow R package does not currently seem to support Windows. See ARROW-3204.

Benchmarks

Shell

sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g

sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g

R code

rdf <- read.csv("500000.csv")
df <- cache(createDataFrame(rdf))
count(df)


test <- function() {
  options(digits.secs = 6) # print fractional seconds with 6 digits (microseconds)
  start.time <- Sys.time()
  count(cache(dapply(df, function(rdf) { rdf }, schema(df))))
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()

Data (350 MB):

object.size(read.csv("500000.csv"))
350379504 bytes

"500000 Records" http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

Results

Time difference of 13.42037 mins

Time difference of 30.64156 secs

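The Arrow-enabled run was about 26 times faster: 13.42 minutes without Arrow versus 30.64 seconds with it, which is where the figure of around 2627% comes from.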
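
The arithmetic behind that figure can be reproduced from the two timings. The sketch below is an assumption about how the number was obtained, since the PR does not show the calculation. It treats the first timing as the run with Arrow disabled and the second as the run with Arrow enabled, following the order of the shell commands.

```r
# Timings copied from the results above.
secs_without_arrow <- 13.42037 * 60   # 13.42037 mins expressed in seconds
secs_with_arrow    <- 30.64156

# Ratio of the two wall-clock times: how many times faster the Arrow run was.
speedup <- secs_without_arrow / secs_with_arrow

# The same ratio expressed as a percentage; truncating it gives 2627
# (assumed to be the origin of the reported figure).
speedup_percent <- speedup * 100

print(speedup)
print(speedup_percent)
```

Read this way, the percentage is the ratio of the two times rather than the relative reduction in time, so "about 26 times faster" is the more direct description.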

Limitations

For now, Arrow optimization in R does not support raw data, or a schema in which the user explicitly specifies the float type. Both cases produce corrupt values.
Because of ARROW-4512, data cannot be sent and received batch by batch; all batches have to be sent at once in Arrow stream format. This needs to be improved later.
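
Given the raw limitation, one option is to check a local data.frame for raw columns before turning the Arrow path on. The helper below, `raw_columns`, is hypothetical and not part of SparkR or this PR. It is a minimal base R sketch and does not inspect the schema, so it says nothing about the float limitation.

```r
# Hypothetical helper (not part of SparkR): return the names of the columns
# in a local data.frame whose storage type is raw.
raw_columns <- function(rdf) {
  is_raw <- vapply(rdf, is.raw, logical(1))
  names(rdf)[is_raw]
}

# Small example data.frame with one raw column added after construction.
example <- data.frame(id = 1:3, value = c(0.5, 1.5, 2.5))
example$payload <- as.raw(c(1, 2, 3))

# Lists the columns that would hit the raw limitation described above.
print(raw_columns(example))
```

An empty result means no raw columns were found in the local data.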

How was this patch tested?

Unit tests were added, and the change was tested manually.

R/pkg/R/DataFrame.R

HyukjinKwon · 2019-02-14T09:45:01Z

cc @BryanCutler, @viirya, @felixcheung, @icexelloss, @rxin, @gatorsmile, @shivaram, @falaki, @yanboliang

HyukjinKwon · 2019-02-14T09:47:54Z

FWIW, my todos on my current plate after this and #23760 are:

Set socket timeout consistently
Deduplicate schema checking
Timestamp / date support in regular gapply (without Arrow)
Refactor RRunner (and maybe with PythonRunner)
Refactor Auth
Arrow R CRAN fix (if Arrow is on CRAN).
SQLConf documentation
SparkR guide documentation
Write a blog post for the Arrow community

These are currently blocked by both this and #23760, mostly to avoid conflict hell.

I filed JIRA under https://issues.apache.org/jira/browse/SPARK-26759

HyukjinKwon · 2019-02-14T10:06:25Z

cc @cloud-fan too.

sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/r/ArrowRRunner.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala

R/pkg/tests/fulltests/test_sparkSQL.R

SparkQA · 2019-02-18T18:06:05Z

Test build #102477 has finished for PR 23787 at commit 5a5831d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-02-20T03:39:11Z

I am sure this is ready for review. Much of the code is similar to both the scalar vectorized UDF and vectorized gapply(), so it should be quite safe to go.

HyukjinKwon · 2019-02-22T02:50:04Z

gentle ping

felixcheung

I reviewed the R side

R/pkg/R/DataFrame.R

BryanCutler

Thanks for doing this @HyukjinKwon ! Looks pretty good from what I can tell. Possibly adding a few more tests here or in a followup might be good for some edge cases.

R/pkg/tests/fulltests/test_sparkSQL.R

sql/core/src/main/scala/org/apache/spark/sql/execution/r/ArrowRRunner.scala

SparkQA · 2019-02-24T08:05:02Z

Test build #102719 has finished for PR 23787 at commit 530e26d.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-02-24T09:04:56Z

retest this please

SparkQA · 2019-02-24T13:26:40Z

Test build #102721 has finished for PR 23787 at commit 530e26d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala

SparkQA · 2019-02-25T06:12:24Z

Test build #102730 has finished for PR 23787 at commit 5a124cb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler

LGTM from my end

HyukjinKwon · 2019-02-27T01:00:55Z

retest this please

SparkQA · 2019-02-27T05:28:14Z

Test build #102808 has finished for PR 23787 at commit 5a124cb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-02-27T05:29:25Z

Merged to master.

I am going to resolve the followups one by one.

HyukjinKwon · 2019-02-27T05:30:14Z

Thanks, @felixcheung and @BryanCutler



HyukjinKwon mentioned this pull request Feb 15, 2019

[SPARK-26762][SQL][R] Arrow optimization for conversion from Spark DataFrame to R DataFrame #23760

Closed