
Conversation

@zero323 (Member) commented Apr 22, 2017

What changes were proposed in this pull request?

Add wrappers for `o.a.s.sql.functions`:

  • `split` as `split_string`
  • `repeat` as `repeat_string`

How was this patch tested?

Existing tests, additional unit tests, `check-cran.sh`
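
For quick context, the new wrappers would be used along these lines (a sketch only, assuming a running SparkR session; the data frame and column names are illustrative):

```r
library(SparkR)
sparkR.session()

# Illustrative data; any string column works the same way.
df <- createDataFrame(data.frame(text = c("foo bar", "baz qux")))

# split_string wraps o.a.s.sql.functions.split:
# split the column on a Java regular expression.
head(select(df, split_string(df$text, "\\s+")))

# repeat_string wraps o.a.s.sql.functions.repeat:
# repeat each value of the column n times.
head(select(df, repeat_string(df$text, 3)))
```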

@SparkQA commented Apr 22, 2017

Test build #76071 has finished for PR 17729 at commit 255863a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323 (Member, Author) commented Apr 22, 2017

cc @felixcheung

@felixcheung (Member) left a comment

Cool, thanks!

"rank",
"regexp_extract",
"regexp_replace",
"repeat_string",
@felixcheung (Member):

good call on these names!

#' head(select(df, split_string(df$value, "\\s+")))
#' }
#' @note split_string 2.3.0
#' @note equivalent to \code{split} SQL function
@felixcheung (Member):

The note is somewhat hard to discover on the generated doc page. If you want this, you could put it as the second content paragraph, like below, and it will show up as the details section, like here: http://spark.apache.org/docs/latest/api/R/read.jdbc.html

#' split_string
#'
#' Splits string on regular expression.
#'
#' This is equivalent to \code{split} SQL function

(yes, through the magic of roxygen2)

Also, instead of \code{split} you might want to link to the Spark Scala doc too

@zero323 (Member, Author):

That's cool :) I'm not convinced about the linking, though. Scala docs are not very useful.

I considered adding an expr or selectExpr version to the examples:

selectExpr(df, "split(value, '@')")

@felixcheung (Member):

I think that's good to have, but I want to caution that we might forget to update it if it changes.

#' head(select(df, repeat_string(df$text, 3)))
#' }
#' @note repeat_string 2.3.0
#' @note equivalent to \code{repeat} SQL function
@felixcheung (Member):

ditto above

#' @examples \dontrun{
#' df <- createDataFrame(data.frame(
#' text = c("foo", "bar")
#' ))
@felixcheung (Member):

I'm ok with this, though would it be better with a read.text example than a fake one-row frame like this?

@zero323 (Member, Author):

I thought about this, but it is hard to find a good source at hand. We could use data/streaming/AFINN-111.txt, which has nice and short lines, or README.md and just take head(., 1) (the rest is empty or longish).

"abcabcabc"
)
expect_equal(
collect(select(df5, repeat_string(df5$a, -1)))[1, 1],
@felixcheung (Member):

:) ahh, -1 works?!

@zero323 (Member, Author):

Right? I think we should keep it this way to avoid any confusion when users switch between SQL and the DSL. If anything changes, it will cause a test failure, and then we can add R-side checks.
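
For the record, the JVM-side repeat returns an empty string for a non-positive count rather than raising an error, so the SQL and DSL forms stay in sync. A sketch, assuming a SparkR session and an illustrative one-row frame:

```r
# Sketch: non-positive repeat counts yield an empty string on the JVM side,
# so the R wrapper inherits that behavior unchanged.
df <- createDataFrame(data.frame(a = "abc"))
collect(select(df, repeat_string(df$a, -1)))[1, 1]  # expected: ""
```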

setMethod("repeat_string",
          signature(x = "Column", n = "numeric"),
          function(x, n) {
            jc <- callJStatic("org.apache.spark.sql.functions", "repeat", x@jc, as.integer(n))
@felixcheung (Member):

this is good actually; may I introduce you to numToInt, an internal util?
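
For illustration, the method body with the suggested helper could read roughly as below (a sketch: numToInt is SparkR's internal numeric-to-integer coercion, and the closing column(jc) line follows the usual SparkR wrapper pattern, since the quoted snippet above is truncated before it):

```r
setMethod("repeat_string",
          signature(x = "Column", n = "numeric"),
          function(x, n) {
            # numToInt coerces the numeric to an integer, warning on lossy
            # input, instead of silently truncating with as.integer.
            jc <- callJStatic("org.apache.spark.sql.functions", "repeat",
                              x@jc, numToInt(n))
            column(jc)
          })
```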

@zero323 (Member, Author):

That's useful.

@SparkQA commented Apr 24, 2017

Test build #76109 has finished for PR 17729 at commit ce0c4b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) left a comment

LGTM

@felixcheung (Member)

merged to master

@asfgit asfgit closed this in 8a272dd Apr 24, 2017
@zero323 (Member, Author) commented Apr 24, 2017

Thanks @felixcheung

@zero323 zero323 deleted the SPARK-20438 branch April 26, 2017 13:17
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
## What changes were proposed in this pull request?

Add wrappers for `o.a.s.sql.functions`:

- `split` as `split_string`
- `repeat` as `repeat_string`

## How was this patch tested?

Existing tests, additional unit tests, `check-cran.sh`

Author: zero323 <[email protected]>

Closes apache#17729 from zero323/SPARK-20438.

3 participants