-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR #19551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #82958 has finished for PR 19551 at commit
|
ffe751b to
145b60f
Compare
|
Test build #82959 has finished for PR 19551 at commit
|
falaki
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting this PR.
| expect_equal(collect(df), data.frame(l, stringsAsFactors = FALSE)) | ||
| }) | ||
|
|
||
| test_that("SPARK-17902: collect() with stringsAsFactors enabled", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you please verify that factor orders are identical. I wonder if expect_equal really verifies order of values in a factor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> # Ordered vs unordered
> or <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Lo", "Med", "Hi"), ordered=TRUE)
> or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Lo", "Med", "Hi"), ordered=FALSE)
> expect_equal(or, or1)
error: `or` not equal to `or1`.
Attributes: < Component “class”: Lengths (2, 1) differ (string compare on first 1) >
Attributes: < Component “class”: 1 string mismatch >> # level order mismatch
> or <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
> or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Lo", "Med", "Hi"))
> expect_equal(or, or1)
error: `or` not equal to `or1`.
Attributes: < Component “levels”: 3 string mismatches ># Data order mismatch
> or <- factor(c("Lo", "Hi", "Med", "Med", "Hi"), levels=c("Hi", "Lo", "Med"))
> or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
> expect_equal(or, or1)
error: `or` not equal to `or1`.
4 string mismatches> or <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
> or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
> expect_equal(or, or1)Would this test address your concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW: I think iris data frame all Species values are clustered together. That is why the test is passing (the new factor order ends up being identical to the existing order).
R/pkg/R/DataFrame.R
Outdated
| vec <- do.call(c, col) | ||
| stopifnot(class(vec) != "list") | ||
| class(vec) <- PRIMITIVE_TYPES[[colType]] | ||
| if (stringsAsFactors && is.character(vec)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For performance maybe it is better to reverse the order of checks: is.character(vec) && stringsAsFactors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thanks.
|
Test build #82982 has finished for PR 19551 at commit
|
|
LGTM |
felixcheung
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, added this to SPARK-21616
## What changes were proposed in this pull request? This PR proposes to revive `stringsAsFactors` option in collect API, which was mistakenly removed in 71a138c. Simply, it casts `charactor` to `factor` if it meets the condition, `stringsAsFactors && is.character(vec)` in primitive type conversion. ## How was this patch tested? Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`. Author: hyukjinkwon <[email protected]> Closes #19551 from HyukjinKwon/SPARK-17902. (cherry picked from commit a83d8d5) Signed-off-by: hyukjinkwon <[email protected]>
## What changes were proposed in this pull request? This PR proposes to revive `stringsAsFactors` option in collect API, which was mistakenly removed in 71a138c. Simply, it casts `charactor` to `factor` if it meets the condition, `stringsAsFactors && is.character(vec)` in primitive type conversion. ## How was this patch tested? Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`. Author: hyukjinkwon <[email protected]> Closes #19551 from HyukjinKwon/SPARK-17902. (cherry picked from commit a83d8d5) Signed-off-by: hyukjinkwon <[email protected]>
|
Merged to master, branch-2.2 and branch-2.1. |
|
Thank you for review @falaki and @felixcheung. |
## What changes were proposed in this pull request? This PR proposes to revive `stringsAsFactors` option in collect API, which was mistakenly removed in apache@71a138c. Simply, it casts `charactor` to `factor` if it meets the condition, `stringsAsFactors && is.character(vec)` in primitive type conversion. ## How was this patch tested? Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`. Author: hyukjinkwon <[email protected]> Closes apache#19551 from HyukjinKwon/SPARK-17902. (cherry picked from commit a83d8d5) Signed-off-by: hyukjinkwon <[email protected]>
What changes were proposed in this pull request?
This PR proposes to revive
stringsAsFactorsoption in collect API, which was mistakenly removed in 71a138c.Simply, it casts
charactortofactorif it meets the condition,stringsAsFactors && is.character(vec)in primitive type conversion.How was this patch tested?
Unit test in
R/pkg/tests/fulltests/test_sparkSQL.R.