-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-20438][R] SparkR wrappers for split and repeat #17729
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #76071 has finished for PR 17729 at commit
|
|
cc @felixcheung |
felixcheung
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks!
| "rank", | ||
| "regexp_extract", | ||
| "regexp_replace", | ||
| "repeat_string", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call on these names!
R/pkg/R/functions.R
Outdated
| #' head(select(split_string(df$value, "\\s+"))) | ||
| #' } | ||
| #' @note split_string 2.3.0 | ||
| #' @note equivalent to \code{split} SQL function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note is somewhat hard to discover on the generated doc page, if you want this, you could put it as 2nd content paragraph like below and it will show up as the details section like here http://spark.apache.org/docs/latest/api/R/read.jdbc.html
#' split_string
#'
#' Splits string on regular expression.
#'
#' This is equivalent to \code{split} SQL function
(yes, through the magic of roxygen2)
Also, instead of \code{split} you might want to link to Spark Scala doc too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats cool :) I am not convince about the linking though. Scala docs are not very useful.
I considered adding expr or selectExpr version to examples:
selectExpr(df, "split(value, '@')")There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's good to have but want to caution that we might forget to update it if it changes
R/pkg/R/functions.R
Outdated
| #' head(select(repeat_string(df$text, 3))) | ||
| #' } | ||
| #' @note repeat_string 2.3.0 | ||
| #' @note equivalent to \code{repeat} SQL function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto above
R/pkg/R/functions.R
Outdated
| #' @examples \dontrun{ | ||
| #' df <- createDataFame(data.frame( | ||
| #' text = c("foo", "bar") | ||
| #' )) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with this though would it be better with the read.text example than a fake 1 row like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this but it is hard to find a good source at hand. We could use data/streaming/AFINN-111.txt which has nice and short lines, or README.md and just take head(., 1) (the rest is empty or longish.
| "abcabcabc" | ||
| ) | ||
| expect_equal( | ||
| collect(select(df5, repeat_string(df5$a, -1)))[1, 1], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:) ahh, -1 works?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right? I think we should keep it this way to avoid any confusion when users switch between SQL and DSL. If anything changes it will cause test failure and then we can add R side checks.
R/pkg/R/functions.R
Outdated
| setMethod("repeat_string", | ||
| signature(x = "Column", n = "numeric"), | ||
| function(x, n) { | ||
| jc <- callJStatic("org.apache.spark.sql.functions", "repeat", x@jc, as.integer(n)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is good actually, may I introduce you to numToInt, an internal util
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's useful.
|
Test build #76109 has finished for PR 17729 at commit
|
felixcheung
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
merged to master |
|
Thanks @felixcheung |
## What changes were proposed in this pull request? Add wrappers for `o.a.s.sql.functions`: - `split` as `split_string` - `repeat` as `repeat_string` ## How was this patch tested? Existing tests, additional unit tests, `check-cran.sh` Author: zero323 <[email protected]> Closes apache#17729 from zero323/SPARK-20438.
What changes were proposed in this pull request?
Add wrappers for
o.a.s.sql.functions:splitassplit_stringrepeatasrepeat_stringHow was this patch tested?
Existing tests, additional unit tests,
check-cran.sh