-
Notifications
You must be signed in to change notification settings - Fork 28.9k
SPARK-16785 R dapply doesn't return array or raw columns #14783
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
f8e1920
1336605
a0e13ff
5015233
d044054
311b554
2d2654d
ff1a0d0
1e27ef3
25d0ec1
77a9822
70b0d44
b21a21d
ba87b06
528fa6e
e0a3894
5871257
84ef4cc
0c2a215
77fa9b4
91d69be
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -183,4 +183,28 @@ test_that("overrideEnvs", { | |
| expect_equal(config[["config_only"]], "ok") | ||
| }) | ||
|
|
||
| test_that("rbindRaws", { | ||
|
|
||
| # Mixed Column types | ||
| r <- serialize(1:5, connection = NULL) | ||
| r1 <- serialize(1, connection = NULL) | ||
| r2 <- serialize(letters, connection = NULL) | ||
| r3 <- serialize(1:10, connection = NULL) | ||
| inputData <- list(list(1L, r1, "a", r), list(2L, r2, "b", r), | ||
| list(3L, r3, "c", r)) | ||
| expected <- data.frame(V1 = 1:3) | ||
| expected$V2 <- list(r1, r2, r3) | ||
| expected$V3 <- c("a", "b", "c") | ||
| expected$V4 <- list(r, r, r) | ||
| result <- rbindRaws(inputData) | ||
| expect_equal(expected, result) | ||
|
|
||
| # Single binary column | ||
| input <- list(list(r1), list(r2), list(r3)) | ||
| expected <- subset(expected, select = "V2") | ||
| result <- setNames(rbindRaws(input), "V2") | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @shivaram Here's the new test. I made the other ones a bit more general also. |
||
| expect_equal(expected, result) | ||
|
|
||
| }) | ||
|
|
||
| sparkR.session.stop() | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -36,7 +36,14 @@ compute <- function(mode, partition, serializer, deserializer, key, | |
| # available since R 3.2.4. So we set the global option here. | ||
| oldOpt <- getOption("stringsAsFactors") | ||
| options(stringsAsFactors = FALSE) | ||
| inputData <- do.call(rbind.data.frame, inputData) | ||
|
|
||
| # Handle binary data types | ||
| if ("raw" %in% sapply(inputData[[1]], class)) { | ||
| inputData <- SparkR:::rbindRaws(inputData) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same as above. SparkR::: is not needed
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. True, but looking through the rest of
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Running tests locally without it- appears it is necessary here.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No it is not a preferred style in worker.R. It seems that they were some changes left slip under some previous code review. should be moved to the front of work.R, and thus SparkR::: can be removed. A lot of "SparkR:::" is annoying.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. When I remove So it looks like the
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. And
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sure. "SparkR:::" is needed for private functions. |
||
| } else { | ||
| inputData <- do.call(rbind.data.frame, inputData) | ||
| } | ||
|
|
||
| options(stringsAsFactors = oldOpt) | ||
|
|
||
| names(inputData) <- colNames | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know what happens if we have a mixed set of columns here ? i.e. say one column with "raw", one with "integer" and one with "character" -- From reading some docs it looks like everything is converted to create a
charactermatrix when we userbind.I think we have two choices if thats the case
(a) we apply the type conversions after
rbind(b) we only call this method when all columns are
rawThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little unusual- it's a list matrix. Hence the name. Which docs are you referring to?
The test that's in here now does test for mixed columns, but it doesn't test for a single column of raws. I'll add that now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at https://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html specifically the section
Valuewhich saysThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the correct class is maintained:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see - the types are inside the
listmatrix. Thanks @clarkfitzg for clarifying. Let us know once you have added the test for a single column of raw as well.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since everything in in
inputDatais a list this goes straight to the top of hierarchy- same as if you calledrbind(list1, list2, ...).