[SPARK-15319][SPARKR][DOCS] Fix SparkR doc layout for corr and other DataFrame stats functions #13109

felixcheung · 2016-05-13T23:11:16Z

What changes were proposed in this pull request?

Doc only changes. Please see screenshots.

Before:
http://spark.apache.org/docs/latest/api/R/statfunctions.html

After

(please ignore the style differences - this is due to not having the css in my local copy)

This is still a bit weird. As discussed in SPARK-15237, I think the better approach is to separate out the DataFrame stats functions instead of putting everything on one page. At least now it is clearer which description belongs to which function.

How was this patch tested?

Build doc

felixcheung · 2016-05-13T23:12:20Z

@shivaram @sun-rui

shivaram · 2016-05-13T23:19:15Z

LGTM. Thanks @felixcheung

SparkQA · 2016-05-13T23:27:07Z

Test build #58596 has finished for PR 13109 at commit 756262c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sun-rui · 2016-05-14T14:20:50Z

This looks better, but the roxygen style has drifted a little. The previous style was:
#' function name
#' description

The current style is:
#' function name - description

We may need a consistent roxygen documentation style, covering at least two cases:
one function per RD file
multiple functions per RD file

Also, if you type '?corr' in R, only the corr() for Column functions is displayed. Since R is function-oriented, I think the two corr() descriptions would be better displayed together on one page?
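To make the two cases above concrete, here is a minimal roxygen2 sketch. The functions demoMean, demoMin and demoMax and the page name demoRange are hypothetical and are not part of SparkR; the exact rendering also depends on the roxygen2 version, so treat this as an illustration rather than the style the project settled on.

```r
# Case 1: one function per Rd file.
# Explicit @title and @description keep the two parts unambiguous.
#' @title demoMean
#' @description Computes the arithmetic mean of a numeric vector.
#' @param x a numeric vector.
#' @return a single numeric value.
#' @rdname demoMean
#' @export
demoMean <- function(x) sum(x) / length(x)

# Case 2: multiple functions in one Rd file (hypothetical page "demoRange").
# The first block carries the page title; each block contributes a
# "name - description" line, and the shared @param is written only once.
#' @title Hypothetical range functions
#' @description demoMin - Returns the smallest element of a numeric vector.
#' @param x a numeric vector.
#' @rdname demoRange
#' @export
demoMin <- function(x) min(x)

#' @description demoMax - Returns the largest element of a numeric vector.
#' @rdname demoRange
#' @export
demoMax <- function(x) max(x)
```

With a layout like this, ?demoMin and ?demoMax would open the same page, while ?demoMean has a page of its own.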

felixcheung · 2016-05-15T18:02:22Z

Right - that's why my comment on SPARK-15237 is that we should have a separate rd for each of the "statfunctions" instead of having all of them on one rd. To clarify, currently, we have

column function corr in its rd file
DataFrame function corr, crosstab, cov, freqItems in one rd file

What I think we should have instead is

DataFrame function corr in one rd file
DataFrame function crosstab in one rd file etc..
column function corr should group with other column functions (or at least column stats functions) in one rd file with appropriate "multiple functions in one rd" formatting as you have suggested

I think it is rather confusing to put DataFrame corr on the same page as column function corr?
Thoughts?

sun-rui · 2016-05-16T01:24:42Z

My opinion is that since R supports generic functions, and a generic function can have multiple methods, it is natural to put both corr() methods on the same page. Is there a mechanism by which descriptions can be aggregated even if methods of the same name are distributed across different RD files?

felixcheung · 2016-05-16T23:28:09Z

That's fine, though I don't know the history of putting the stats column functions into one rd page.
I agree it is fine to group functions by name, with corr DataFrame & corr column going to the same rd page. Are we ok with that approach?
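As a rough sketch of what grouping by function name could look like, the snippet below documents one generic and two methods under a single @rdname. The classes DemoColumn and DemoFrame are hypothetical stand-ins for SparkR's Column and SparkDataFrame, and the method bodies simply call stats::cor on local data; none of this is SparkR's actual implementation.

```r
library(methods)

# Hypothetical stand-ins for SparkR's Column and SparkDataFrame classes.
setClass("DemoColumn", slots = c(values = "numeric"))
setClass("DemoFrame", slots = c(data = "data.frame"))

#' corr
#'
#' Computes the Pearson correlation coefficient.
#'
#' @param x a DemoColumn or a DemoFrame.
#' @param ... further arguments; their meaning depends on the method.
#' @rdname corr
#' @export
setGeneric("corr", function(x, ...) standardGeneric("corr"))

# Column method: documented on the same page through @rdname corr.
#' @param col2 a second DemoColumn.
#' @rdname corr
setMethod("corr", signature(x = "DemoColumn"),
          function(x, col2) stats::cor(x@values, col2@values))

# DataFrame method: also lands on the "corr" page.
#' @param colName1 name of the first column.
#' @param colName2 name of the second column.
#' @param method correlation method, passed on to stats::cor.
#' @rdname corr
setMethod("corr", signature(x = "DemoFrame"),
          function(x, colName1, colName2, method = "pearson") {
            stats::cor(x@data[[colName1]], x@data[[colName2]], method = method)
          })

# Usage: both calls dispatch on the class of the first argument.
a <- new("DemoColumn", values = c(1, 2, 3, 4))
b <- new("DemoColumn", values = c(2, 4, 6, 8))
df <- new("DemoFrame", data = data.frame(u = c(1, 2, 3), v = c(3, 2, 1)))
corr(a, b)
corr(df, "u", "v")
```

Because every block shares @rdname corr, typing ?corr would show both descriptions together, which is the behaviour asked for earlier in the thread.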

shivaram · 2016-05-24T00:52:10Z

Chiming in a little late here -- from my R usage, I've definitely seen two patterns commonly used

sharing the same function for multiple generics and just documenting how they differ for each input (for example something like https://stat.ethz.ch/R-manual/R-devel/library/base/html/rowsum.html)
collating multiple related functions in the same rd file (example https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cor.html)
So I am not sure that having a separate rd file for each (function, input type) pair is the right approach.

Right now my take is the following:

For more complex functions like corr (i.e. where we have DataFrame and column versions), let's group by function name and exclude them from the generic stats column rd file
For simple functions like mean or sum, let's keep them in a stats functions rd.

Let me know what you think of this proposal.

sun-rui · 2016-05-24T01:50:55Z

@shivaram, I checked your two examples. It seems that the rule is that:
For a generic function, methods for it are documented together
For non-generic functions (isGeneric() returns FALSE), functions having related functionalities can be documented together.

I have no strong opinion on this. so +1

shivaram · 2016-06-20T19:03:46Z

@felixcheung Is this PR still relevant ?

felixcheung · 2016-06-20T19:07:35Z

I think it is. I will try to update this today.

felixcheung · 2016-06-21T02:12:22Z

Moved cov and corr into their own rd page and kept the others where they were.

SparkQA · 2016-06-21T02:38:53Z

Test build #60897 has finished for PR 13109 at commit df0851c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

shivaram · 2016-06-21T03:16:58Z

cc @dongjoon-hyun

dongjoon-hyun · 2016-06-21T03:39:55Z

R/pkg/R/stats.R

 setOldClass("jobj")

-#' crosstab
+#' @title SparkDataFrame statistic functions


Hi, @felixcheung .
When I use ./create-docs.sh, this breaks the page.
Should I do something differently to generate the html page like yours?

What's the error you see? This works for me, and Jenkins runs create-docs too.

For me, the generated html file, file:///Users/dongjoon/spark/R/pkg/html/statfunctions.html has the following title.

SparkDataFrame statistic functions crosstab - Computes a pair-wise frequency table of the given columns Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned.

Also, index.html shows the above long string for all the stat functions like approxQuantile.

It's a little bit different from your screenshot (After).

I don't see the build error - but the page title doesn't come up right. A screenshot is at https://www.dropbox.com/s/sc1mrd7upr6t7mp/Screenshot%202016-06-20%2021.25.57.png?dl=0

Also we seem to have some functions like covar_samp, covar_pop that don't have a description ?

Fixed, sorry about that. roxygen2 is a bit.. stubborn.

cov, corr shouldn't be there - they are referenced in generics.R. These bugs are also fixed in my other PR #13798 - there are quite a lot of them.

dongjoon-hyun · 2016-06-21T05:04:16Z

Oh, the root cause exists in generics.R. Nice catch!

felixcheung · 2016-06-21T05:04:59Z

It's nasty! 😄

dongjoon-hyun · 2016-06-21T05:12:02Z

In line 333 of functions.R, @rdname covar_pop -> @rdname cov?

#' covar_pop
#'
#' Compute the population covariance between two expressions.
#'
#' @rdname covar_pop
#' @name covar_pop

felixcheung · 2016-06-21T05:20:52Z

That's intentional - covar_pop has a separate page.
cov == covar_samp
covar_samp != covar_pop
putting covar_pop on the same rd would have 2 different descriptions there. (I think we really should avoid putting multiple things in one rd..)
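The split described above can be sketched with two plain R functions. The names demo_covar_samp and demo_covar_pop are hypothetical and work on local numeric vectors rather than Spark columns; the point is only that the sample covariance shares the cov page while the population covariance gets its own page.

```r
# Sample covariance: documented on the "cov" page, since cov == covar_samp.
#' cov
#'
#' Compute the sample covariance between two numeric vectors.
#'
#' @param x a numeric vector.
#' @param y a numeric vector of the same length as x.
#' @rdname cov
demo_covar_samp <- function(x, y) stats::cov(x, y)

# Population covariance: a different quantity, so it keeps a separate page.
#' covar_pop
#'
#' Compute the population covariance between two numeric vectors.
#'
#' @param x a numeric vector.
#' @param y a numeric vector of the same length as x.
#' @rdname covar_pop
demo_covar_pop <- function(x, y) {
  n <- length(x)
  # stats::cov divides by (n - 1); rescale so the divisor is n instead.
  stats::cov(x, y) * (n - 1) / n
}
```

The two functions return different values for the same input, which is why a single shared description would be misleading.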

dongjoon-hyun · 2016-06-21T05:26:12Z

R/pkg/R/generics.R

 setGeneric("covar_samp", function(col1, col2) {standardGeneric("covar_samp") })

-#' @rdname statfunctions
+#' @rdname cov


If so, then here, cov -> covar_pop?

good catch, thx!

dongjoon-hyun · 2016-06-21T05:28:03Z

Yes. Indeed, in general we had better keep each function on its own RD.

SparkQA · 2016-06-21T05:37:06Z

Test build #60905 has finished for PR 13109 at commit 9104bbc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

shivaram · 2016-06-21T05:53:30Z

R/pkg/R/stats.R

-#'
-#' Calculates the approximate quantiles of a numerical column of a SparkDataFrame.
+#' approxQuantile - Calculates the approximate quantiles of a numerical column of a SparkDataFrame.
 #'


I think we need to delete this line in between the approxQuantile line and the "The result of this algorithm" line. Otherwise the first line doesn't seem to show up in the rendered doc?

SparkQA · 2016-06-21T06:06:37Z

Test build #60908 has finished for PR 13109 at commit 10a4dba.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2016-06-21T06:24:08Z

I just checked the generated R doc and I felt that we shouldn't group many methods together. For example, in this PR, the DESCRIPTION section looks okay because we used crosstab - ... and freqItems - .... But the Arguments section becomes very messy and the Value section is even worse. The example code is chained together. See attached screenshots. summary is another example here. It might be better to only group methods that are closely related to each other and share the same arguments.

shivaram · 2016-06-21T06:25:19Z

LGTM. This version looks good to me. Thanks for iterating on this. Will wait for Jenkins and then merge.

shivaram · 2016-06-21T06:28:38Z

@mengxr Yes - this is true and in #13798 we are making a few more of the methods into individual Rd files. At a high level there is a tradition in R to group together similar methods (https://stat.ethz.ch/R-manual/R-devel/library/base/html/colSums.html for example) but with roxygen2 that leads to some issues.

I think we can even split the statfunctions ones as they are named differently -- for summary I think it's more tricky as the function name is the same?

mengxr · 2016-06-21T06:33:14Z

Methods documented in colSums share the same parameters and each was only documented once. Roxygen2 supports that if each param doc only appears once in the comment. That grouping looks okay to me, which is different from statfunctions or summary.
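A small sketch of the kind of grouping that works well: two functions with identical parameters share one page, and each @param is written exactly once. The names demoColStats, demoColSums and demoColMeans are hypothetical wrappers around base R, used here only for illustration.

```r
# Shared page: title, description and parameters are documented once,
# attached to NULL so they do not belong to any single function.
#' @title Hypothetical column summaries
#' @description Column sums and column means of a numeric matrix.
#' @param x a numeric matrix.
#' @param na.rm logical; should missing values be dropped first?
#' @return a numeric vector with one entry per column.
#' @name demoColStats
NULL

# Each function only points at the shared page; no @param is repeated.
#' @rdname demoColStats
#' @export
demoColSums <- function(x, na.rm = FALSE) colSums(x, na.rm = na.rm)

#' @rdname demoColStats
#' @export
demoColMeans <- function(x, na.rm = FALSE) colMeans(x, na.rm = na.rm)
```

Grouping like this stays readable because the Arguments and Value sections apply equally to every function on the page, unlike statfunctions or summary.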

For summary, I'm thinking about documenting the corresponding summary methods in the fit methods such as spark.glm, spark.naiveBayes. And then put see alsos in the generic summary doc page.

SparkQA · 2016-06-21T06:39:35Z

Test build #60910 has finished for PR 13109 at commit 3760d03.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

shivaram · 2016-06-21T06:43:03Z

@mengxr @felixcheung Can we open a new issue of the form "Separate out rd files for SparkR functions"? We can then make a list there of everything that's sharing an rd file right now and see what the way around things is.

@mengxr - also let me know if you think this is good to merge. I think for 2.0 RC1 having this PR is better than not having it ?

shivaram · 2016-06-21T06:43:23Z

Jenkins, retest this please

felixcheung · 2016-06-21T06:44:01Z

Right @shivaram @mengxr I agree it would be nicer to put predict, summary or even write.ml to their corresponding mllib methods.

mengxr · 2016-06-21T06:56:40Z

Yes, we should merge this PR first and discuss the grouping later.

mengxr · 2016-06-21T06:57:53Z

Created https://issues.apache.org/jira/browse/SPARK-16090 to follow up.

SparkQA · 2016-06-21T07:18:13Z

Test build #60912 has finished for PR 13109 at commit 3760d03.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

shivaram · 2016-06-21T07:18:36Z

Cool. Thanks - LGTM. Merging this to master, branch-2.0

…DataFrame stats functions

## What changes were proposed in this pull request?

Doc only changes. Please see screenshots.

Before: http://spark.apache.org/docs/latest/api/R/statfunctions.html
![image](https://cloud.githubusercontent.com/assets/8969467/15264110/cd458826-1924-11e6-85bd-8ee2e2e1a85f.png)

After
![image](https://cloud.githubusercontent.com/assets/8969467/16218452/b9e89f08-3732-11e6-969d-a3a1796e7ad0.png)

(please ignore the style differences - this is due to not having the css in my local copy)

This is still a bit weird. As discussed in SPARK-15237, I think the better approach is to separate out the DataFrame stats function instead of putting everything on one page. At least now it is clearer which description is on which function.

## How was this patch tested?

Build doc

Author: Felix Cheung <[email protected]>
Author: felixcheung <[email protected]>

Closes #13109 from felixcheung/rstatdoc.

(cherry picked from commit 843a1eb)
Signed-off-by: Shivaram Venkataraman <[email protected]>

felixcheung mentioned this pull request May 31, 2016

[SPARK-15490][R][DOC] SparkR 2.0 QA: New R APIs and API docs for non-MLib changes #13394

Closed

vectorijk added a commit to vectorijk/spark that referenced this pull request Jun 14, 2016

revert changes in R/pkg/R/stats.R

8ab88d7

- this changes might happen in apache#13109

felixcheung added 2 commits June 20, 2016 18:59

fix doc layout

13f96ee

update as per feedback

df0851c

felixcheung force-pushed the rstatdoc branch from 756262c to df0851c Compare June 21, 2016 02:11

dongjoon-hyun reviewed Jun 21, 2016

felixcheung added 2 commits June 20, 2016 21:57

fix bug

0ac89db

more fix

9104bbc

dongjoon-hyun reviewed Jun 21, 2016

fix one missed

10a4dba

shivaram reviewed Jun 21, 2016

more bug fix

3760d03

asfgit closed this in 843a1eb Jun 21, 2016
