-
Notifications
You must be signed in to change notification settings - Fork 28.9k
SPARK-16785 R dapply doesn't return array or raw columns #14783
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Jenkins, ok to test |
|
Thanks @clarkfitzg -- I'll take a look at this tomorrow |
|
My pleasure. Let me know if / when I should squash these commits or rebase. Working on some before and after benchmarks now. |
R/pkg/R/SQLContext.R
Outdated
| createDataFrame.default <- function(data, schema = NULL, samplingRatio = 1.0) { | ||
| sparkSession <- getSparkSession() | ||
|
|
||
| # Convert dataframes into a list of rows. Each row is a list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about " If the data is a dataframe, convert it into ..."?
|
Test build #64335 has finished for PR 14783 at commit
|
|
Test build #64337 has finished for PR 14783 at commit
|
|
This change doesn't appear to make any difference in speed. |
|
Not sure why these timings are so bad. Found out today that by using bytes and calling directly into Java's |
|
Not completely sure though. I'll look into these timings a little further on Saturday to make sure I'm making a fair comparison. |
|
@shivaram what do you think? |
|
@clarkfitzg, your patch is for bug fix but not for performance improvement, right? If so, since there is no performance regression according to your benchmark, let's focus on the functionality. We can address performance issue in other JIRA issues. |
|
Yes, this is only for a bug fix. @shivaram mentioned in a previous email exchange it would be good to see some performance benchmarks as well. |
|
should we have a test against DataFrame with binary column? |
|
Test build #64712 has finished for PR 14783 at commit
|
|
Test build #64737 has finished for PR 14783 at commit
|
|
Sorry I think this was a break that I just fixed in #14904 Jenkins, retest this please. |
|
Jenkins, retest this please |
|
Test build #64756 has finished for PR 14783 at commit
|
|
LGTM |
|
@sun-rui Any other comments ? |
|
I'm presenting something related to this on Thursday- it would be nice to tell the audience this patch made it in. Can I do anything to help this along? |
| row1 <- inputData[[1]] | ||
| rawcolumns <- ("raw" == sapply(row1, class)) | ||
|
|
||
| listmatrix <- do.call(rbind, inputData) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know what happens if we have a mixed set of columns here ? i.e. say one column with "raw", one with "integer" and one with "character" -- From reading some docs it looks like everything is converted to create a character matrix when we use rbind.
I think we have two choices if thats the case
(a) we apply the type conversions after rbind
(b) we only call this method when all columns are raw
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> b = serialize(1:10, NULL)
> inputData = list(list(1L, b, 'a'), list(2L, b, 'b')) # Mixed data types
> listmatrix <- do.call(rbind, inputData)
> listmatrix
[,1] [,2] [,3]
[1,] 1 Raw,62 "a"
[2,] 2 Raw,62 "b"
> class(listmatrix)
[1] "matrix"
> typeof(listmatrix)
[1] "list"
> is.character(listmatrix)
[1] FALSE
A little unusual- it's a list matrix. Hence the name. Which docs are you referring to?
The test that's in here now does test for mixed columns, but it doesn't test for a single column of raws. I'll add that now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at https://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html specifically the section Value which says
The type of a matrix result determined from the highest type of any of the inputs in the hierarchy raw < logical < integer < double < complex < character < list .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the correct class is maintained:
> sapply(listmatrix, class)
[1] "integer" "integer" "raw" "raw" "character" "character"
> sapply(listmatrix, typeof)
[1] "integer" "integer" "raw" "raw" "character" "character"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see - the types are inside the listmatrix. Thanks @clarkfitzg for clarifying. Let us know once you have added the test for a single column of raw as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since everything in in inputData is a list this goes straight to the top of hierarchy- same as if you called rbind(list1, list2, ...).
|
Sorry for the delay @clarkfitzg - The code change looks pretty good to me. I just had one question about mixed type columns. |
| # Single binary column | ||
| input <- list(list(r1), list(r2), list(r3)) | ||
| expected <- subset(expected, select = "V2") | ||
| result <- setNames(rbindRaws(input), "V2") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shivaram Here's the new test. I made the other ones a bit more general also.
|
Test build #65027 has finished for PR 14783 at commit
|
|
Thanks for the update. LGTM. Merging this to master and branch-2.0 |
Fixed bug in `dapplyCollect` by changing the `compute` function of `worker.R` to explicitly handle raw (binary) vectors. cc shivaram Unit tests Author: Clark Fitzgerald <[email protected]> Closes #14783 from clarkfitzg/SPARK-16785. (cherry picked from commit 9fccde4) Signed-off-by: Shivaram Venkataraman <[email protected]>
|
Thanks! |
|
still have this issue when input data is an array column not having the same length on each vector, like: |
|
@catlain could you please open a JIRA. |
|
done |
|
This patch only handled the raw columns, not the vector / array value columns. So maybe that original JIRA should still be open, or create another one specific to this. |

What changes were proposed in this pull request?
Fixed bug in
dapplyCollectby changing thecomputefunction ofworker.Rto explicitly handle raw (binary) vectors.cc @shivaram
How was this patch tested?
Unit tests