[SPARK-19819][SparkR] Use concrete data in SparkR DataFrame examples #17161
Conversation
#' sparkR.session()
#' path <- "path/to/file.json"
#' df <- read.json(path)
#' df <- createDataFrame(mtcars)
Should we define mtcars to run this example?
mtcars is built into R, like iris
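For readers less familiar with R, a quick base-R check (no Spark required) confirms these datasets ship with every R installation via the datasets package:

```r
# mtcars and iris are lazily loaded from the built-in "datasets"
# package, so documentation examples can use them with no setup.
dim(mtcars)        # 32 rows, 11 columns
class(mtcars)      # "data.frame"
head(iris, n = 3)  # first three rows of the iris data
```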
This rings a bell for me. I opened PR #16824 to make the examples in the Python API doc self-contained, but closed it as it did not seem to attract much interest from other committers. I would like to see whether other R committers support this, and then reopen it if they like the idea.
Test build #73896 has finished for PR 17161 at commit
srowen left a comment
I get it, but these aren't meant to be runnable examples; they're just documentation showing how a user might call something. Users won't be starting with a built-in data set. I personally am neutral, but what about @felixcheung?
In the current code, some examples are runnable but some are not, which isn't very consistent. I think most examples in R packages are (supposed to be) runnable. From a user's perspective, I find it useful to be able to run the examples directly and see what the function does in action. Since we already have the pseudo-code here, wouldn't it be better to change it to real data? Indeed, by making the examples runnable, I have found and fixed several issues with the pseudo examples. For example, the original example in one of the methods was broken, and this is very hard to find without running real examples.
#' write.df(df, "myfile", "parquet", "overwrite")
#' saveDF(df, parquetPath2, "parquet", mode = saveMode, mergeSchema = mergeSchema)
#' df <- createDataFrame(mtcars)
#' write.df(df, tempfile(), "parquet", "overwrite")
I think we should avoid having tempfile() as the output path in an example, as that might point users in the wrong direction: anything saved under tempfile() will disappear as soon as the R session ends.
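The concern can be illustrated with plain base R (no Spark needed): tempfile() points into the per-session temporary directory, which R discards when the session exits.

```r
# tempfile() only generates a path name under the session's
# temporary directory; anything written there is removed when
# the R session ends, so it is a poor example of an output path.
p <- tempfile(fileext = ".parquet")
normalizePath(dirname(p)) == normalizePath(tempdir())  # TRUE
file.exists(p)  # FALSE: tempfile() does not create the file
```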
#' df <- createDataFrame(mtcars)
#' collected <- collect(df)
#' firstName <- collected[[1]]$name
#' collected[[1]]
right, that seems rather unnecessary. any other idea on how to show it is a data.frame?
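One hedged option (a sketch only, assuming a running SparkR session; class() and str() are base R, and this is merely a suggestion, not what the PR settled on):

```r
# Sketch: demonstrate that collect() materializes the
# SparkDataFrame as a base-R data.frame, without printing rows.
df <- createDataFrame(mtcars)
collected <- collect(df)
class(collected)  # "data.frame"
str(collected)    # compact column-by-column summary
```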
#' path <- "path/to/file.json"
#' df <- read.json(path)
#' df <- createDataFrame(mtcars)
#' newDF <- coalesce(df, 1L)
should probably not have coalesce in the example blob for repartition
#' newDF <- repartition(df, numPartitions = 2L)
#' newDF <- repartition(df, col = df$"col1", df$"col2")
#' newDF <- repartition(df, 3L, col = df$"col1", df$"col2")
#' newDF <- repartition(df, col = df[[1]], df[[2]])
showing a column reference with $name as an example is important too
#' df2 <- read.df(path2, "parquet")
#' createOrReplaceTempView(df, "table1")
#' insertInto(df2, "table1", overwrite = TRUE)
#' df <- limit(createDataFrame(faithful), 5)
why limit?
Firstly, I see this as slightly different from Python, in that in R it is common to have built-in datasets; users are likely used to having them and to seeing examples that use them. And as of now, many of our examples are not currently meant to be runnable, and they are clearly indicated as such. I have done a pass on the changes in this PR and I'm happy with changing the non-existent JSON file to a built-in dataset.

That said, I wonder about the verbosity of additions to examples like this, similarly to the Python discussions, and since we have more than 300 pages of API doc, changing them all is not a simple task. But I do agree that not having broken or incorrect examples is very important. My concerns are:
Couple of random thoughts (would be interested to see how they look first!):
I suspect we would likely need a combination or subset of these techniques. Thoughts?
@felixcheung, please let me leave my humble opinion. In general, I like this change. Maybe it should not necessarily always be runnable (at least for now), but I believe making the examples runnable is still an improvement; for example, it allows users to copy and paste an example directly and easily test it.

In the case of Python examples, to my knowledge, it is also common to document self-runnable examples (e.g., pandas and NumPy). Moreover, they are tested as doctests, so there is no overhead in checking whether they are runnable every time. Maybe let's talk about this in the PR or JIRA (I am sorry for dragging the Python one into this).

Maybe we could take out the examples that you are concerned about for now. For the four concerns you described, I agree in general. If the downsides outweighed the improvements, I would disagree, but to me the improvement seems to outweigh the downsides of the change itself.

I guess the first one is about the reviewing cost/backporting/conflict concern. In terms of reviewing cost, I am willing to help verify the examples by manually running them, even if there are many, if the change is big.

For the second concern, if it made maintenance harder, that would be an important point, but I think we are being conservative about API changes from Spark 2.x, so I guess we are less likely to need to sweep these. I think it is okay even if things change in the near future, as long as the changes are not too big.

For the third and fourth concerns, I think we could improve this by adding more examples and describing the details. For example, we could leave comments such as "this is an example, but in practice ...". Maybe we could do this in another PR. Otherwise, we could leave the examples as they are, with proper comments added. As for other thoughts, I think it would be great if the examples were at least run in the tests.
What changes were proposed in this pull request?
Many examples in SparkDataFrame methods use pseudo-code like:

#' path <- "path/to/file.json"
#' df <- read.json(path)

This is not directly runnable. Replace it with real numerical examples so that users can execute the examples directly.
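As a rough illustration of the proposed style (a sketch only; it assumes SparkR is installed and sparkR.session() can start a local Spark session):

```r
library(SparkR)
sparkR.session()

# Build the SparkDataFrame from a built-in dataset so the
# example is runnable as written, with no external file needed.
df <- createDataFrame(mtcars)
head(collect(df))  # first rows back as a base-R data.frame
count(df)          # number of rows (32 for mtcars)

sparkR.session.stop()
```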