@felixcheung
Member

@felixcheung felixcheung commented May 13, 2016

What changes were proposed in this pull request?

Doc only changes. Please see screenshots.

Before:
http://spark.apache.org/docs/latest/api/R/statfunctions.html
image

After
image
(please ignore the style differences - this is due to not having the css in my local copy)

This is still a bit weird. As discussed in SPARK-15237, I think the better approach is to separate out the DataFrame stats functions instead of putting everything on one page. At least now it is clearer which description belongs to which function.

How was this patch tested?

Build doc

@felixcheung
Member Author

@shivaram @sun-rui

@shivaram
Contributor

LGTM. Thanks @felixcheung

@SparkQA

SparkQA commented May 13, 2016

Test build #58596 has finished for PR 13109 at commit 756262c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui
Contributor

sun-rui commented May 14, 2016

This looks better, but the roxygen style has deviated a little. The previous style was:
#' function name
#' description

The current style is:
#' function name - description

We may need a consistent roxygen documentation style, at least for two cases:

  • one function per RD
  • multiple functions per RD

Also, if you type '?corr' in R, only corr() for Column functions is displayed. Since R is function oriented, I think the two corr() descriptions would be better displayed together on one page?
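
To make the two cases concrete, here is a minimal sketch of how each roxygen layout could look. The function names `demo_single`, `demo_first` and `demo_second` and the rd name `demo_group` are hypothetical and not taken from the SparkR sources. How roxygen2 renders merged titles can vary, so treat this as an illustration rather than the layout this PR settles on.

```r
# Case 1: one function per Rd file.
# The first roxygen paragraph is the title, the next one the description.

#' demo_single
#'
#' Returns its input unchanged.
#'
#' @param x any R object.
#' @return \code{x}, unchanged.
#' @rdname demo_single
#' @name demo_single
#' @export
demo_single <- function(x) {
  x
}

# Case 2: multiple functions sharing one Rd file through a common @rdname.
# Each description is prefixed with the function name so that the merged
# page still shows which text belongs to which function.

#' @title Hypothetical grouped helpers
#' @description \code{demo_first} - Returns the first element of a vector.
#' @param x a vector.
#' @rdname demo_group
#' @name demo_first
#' @export
demo_first <- function(x) {
  x[[1]]
}

#' @description \code{demo_second} - Returns the second element of a vector.
#' @rdname demo_group
#' @name demo_second
#' @export
demo_second <- function(x) {
  x[[2]]
}
```

In the grouped case the title is stated once with an explicit `@title` tag, and each block contributes only its own `@description` paragraph.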

@felixcheung
Member Author

right - that's why my comment on SPARK-15237 is that we should have a different rd for each of "statsfunctions" instead of having all of them on one rd. To clarify, currently, we have

  1. column function corr in its rd file
  2. DataFrame function corr, crosstab, cov, freqItems in one rd file

What I think we should have instead is

  1. DataFrame function corr in one rd file
  2. DataFrame function crosstab in one rd file etc..
  3. column function corr should group with other column functions (or at least column stats functions) in one rd file with appropriate "multiple functions in one rd" formatting as you have suggested

I think it is rather confusing to put DataFrame corr on the same page as column function corr?
Thoughts?

@sun-rui
Contributor

sun-rui commented May 16, 2016

My opinion is that since R supports generic functions and a generic function can have multiple methods, it is natural to put both corr() on the same page. Is there a mechanism by which descriptions can be aggregated even if methods of the same name are distributed across different RD files?

@felixcheung
Member Author

felixcheung commented May 16, 2016

that's fine, I don't know the history of putting stats column functions into one rd page though.
I agree it is fine to group functions by name, with DataFrame corr and column corr going on the same rd page. Are we ok with that approach?

@shivaram
Contributor

Chiming in a little late here -- from my R usage, I've definitely seen two patterns commonly used

Right now my take is the following:

  • For more complex functions like corr (i.e. where we have DataFrame and column versions), let's group by function name and exclude them from the generic stats column rd file.
  • For simple functions like mean or sum, let's keep them in a stats functions rd.

Let me know what you think of this proposal.

@sun-rui
Contributor

sun-rui commented May 24, 2016

@shivaram, I checked your two examples. It seems that the rule is:

  • For a generic function, the methods for it are documented together.
  • For non-generic functions (isGeneric() returns FALSE), functions with related functionality can be documented together.

I have no strong opinion on this. so +1
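
As a rough illustration of the distinction drawn above, the sketch below defines one S4 generic with two methods and contrasts it with a plain function. The classes `DemoFrame` and `DemoColumn`, the generic `corr_demo` and the function `plain_helper` are hypothetical stand-ins, not SparkR's actual definitions.

```r
library(methods)

# Hypothetical stand-ins for a DataFrame-like and a Column-like class.
setClass("DemoFrame", representation(data = "data.frame"))
setClass("DemoColumn", representation(values = "numeric"))

# One generic, corr_demo, with one method per class.
setGeneric("corr_demo", function(x, ...) {
  standardGeneric("corr_demo")
})

# Frame method: correlate two named columns of the wrapped data.frame.
setMethod("corr_demo", signature(x = "DemoFrame"),
          function(x, col1, col2) {
            cor(x@data[[col1]], x@data[[col2]])
          })

# Column method: correlate two column objects directly.
setMethod("corr_demo", signature(x = "DemoColumn"),
          function(x, col2) {
            cor(x@values, col2@values)
          })

# A plain function is not a generic, so it has no methods to group.
plain_helper <- function(x) x

isGeneric("corr_demo")     # TRUE
isGeneric("plain_helper")  # FALSE
```

Under the rule described above, both `corr_demo` methods would share one documentation page, while `plain_helper` could be grouped with other related plain functions.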

vectorijk added a commit to vectorijk/spark that referenced this pull request Jun 14, 2016
 - this changes might happen in apache#13109
@shivaram
Contributor

@felixcheung Is this PR still relevant?

@felixcheung
Member Author

I think it is. I will try to update this today.

@felixcheung
Member Author

Moved cov and corr into their own rd pages and kept the others.

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60897 has finished for PR 13109 at commit df0851c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

cc @dongjoon-hyun

setOldClass("jobj")

#' crosstab
#' @title SparkDataFrame statistic functions
Member

Hi, @felixcheung .
When I use ./create-docs.sh, this breaks the page.
Should I do something differently to generate the html page like yours?

Member Author

what's the error you see? this works for me, and Jenkins runs create-docs too

Member

For me, the generated html file, file:///Users/dongjoon/spark/R/pkg/html/statfunctions.html has the following title.

SparkDataFrame statistic functions crosstab - Computes a pair-wise frequency table of the given columns Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned.

Also, index.html shows the above long string for all the stat functions like approxQuantile.

Member

It's a little bit different from your screenshot (After).

Contributor

I don't see the build error - but the page title doesn't come up right. A screenshot is at https://www.dropbox.com/s/sc1mrd7upr6t7mp/Screenshot%202016-06-20%2021.25.57.png?dl=0

Also, we seem to have some functions like covar_samp and covar_pop that don't have a description?

Member Author

@felixcheung felixcheung Jun 21, 2016

Fixed, sorry about that. roxygen2 is a bit.. stubborn.

image

cov, corr shouldn't be there - they are referenced in generics.R. These bugs are also fixed in my other PR #13798 - there are quite a lot of them.

@dongjoon-hyun
Member

Oh, the root cause exists in generics.R. Nice catch!

@felixcheung
Member Author

It's nasty! 😄

@dongjoon-hyun
Member

In line 333 of functions.R, @rdname covar_pop -> @rdname cov?

#' covar_pop
#'
#' Compute the population covariance between two expressions.
#'
#' @rdname covar_pop
#' @name covar_pop

@felixcheung
Member Author

felixcheung commented Jun 21, 2016

That's intentional - covar_pop has a separate page.
cov == covar_samp
covar_samp != covar_pop
Putting covar_pop on the same rd would result in 2 different descriptions there. (I think we really should avoid putting multiple things in one rd.)
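
For readers unfamiliar with the distinction, the sketch below computes both quantities by hand on made-up data. It uses base R's `cov()`, which computes the sample covariance, as a stand-in; it does not call the SparkR functions discussed here.

```r
# Hypothetical data, chosen only for illustration.
x <- c(1, 2, 3, 4)
y <- c(2, 4, 7, 9)
n <- length(x)

# Sum of the products of deviations from the means.
cross <- sum((x - mean(x)) * (y - mean(y)))

# Sample covariance divides by n - 1; base R's cov() does the same.
samp <- cross / (n - 1)

# Population covariance divides by n instead.
pop <- cross / n

isTRUE(all.equal(samp, cov(x, y)))  # TRUE
isTRUE(all.equal(samp, pop))        # FALSE
```

Because the two divisors differ, the two functions need different descriptions, which is the reason given above for keeping covar_pop on a separate page.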

setGeneric("covar_samp", function(col1, col2) {standardGeneric("covar_samp") })

#' @rdname statfunctions
#' @rdname cov
Member

If so, then here, cov -> covar_pop?

Member Author

good catch, thx!

@dongjoon-hyun
Member

Yes. Indeed, we had better keep each function on its own RD generally.

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60905 has finished for PR 13109 at commit 9104bbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

R/pkg/R/stats.R Outdated
#'
#' Calculates the approximate quantiles of a numerical column of a SparkDataFrame.
#' approxQuantile - Calculates the approximate quantiles of a numerical column of a SparkDataFrame.
#'
Contributor

I think we need to delete this line in between the "approxQuantile" line and the "The result of this algorithm" line. Otherwise the first line doesn't seem to show up in the rendered doc?

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60908 has finished for PR 13109 at commit 10a4dba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Jun 21, 2016

I just checked the generated R doc and I feel that we shouldn't group many methods together. For example, in this PR, the DESCRIPTION section looks okay because we used crosstab - ... and freqItems - .... But the Arguments section becomes very messy and the Value section is even worse. The example code is chained together. See attached screenshots. summary is another example here. It might be better to only group methods that are closely related to each other and share the same arguments.

screen shot 2016-06-20 at 11 21 48 pm

@shivaram
Contributor

LGTM. This version looks good to me. Thanks for iterating on this. Will wait for Jenkins and then merge.

@shivaram
Contributor

@mengxr Yes - this is true and in #13798 we are making a few more of the methods into individual Rd files. At a high level there is a tradition in R to group together similar methods (https://stat.ethz.ch/R-manual/R-devel/library/base/html/colSums.html for example) but with roxygen2 that leads to some issues.

I think we can even split the statfunctions ones as they are named differently. For summary I think it's more tricky as the function name is the same?

@mengxr
Contributor

mengxr commented Jun 21, 2016

Methods documented in colSums share the same parameters, and each parameter is documented only once. Roxygen2 supports that if each param doc appears only once in the comment. That grouping looks okay to me, and it is different from statfunctions or summary.

For summary, I'm thinking about documenting the corresponding summary methods in the fit methods such as spark.glm, spark.naiveBayes. And then put see alsos in the generic summary doc page.
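
A minimal sketch of the colSums-style grouping described above, assuming two functions that genuinely share their arguments: each `@param` is written once and the second block only points at the shared page. The names `demo_col_totals`, `demo_row_totals` and `demo_totals` are hypothetical.

```r
#' Hypothetical column and row totals
#'
#' \code{demo_col_totals} and \code{demo_row_totals} sum a numeric matrix
#' by column and by row.
#'
#' @param m a numeric matrix.
#' @param na.rm logical; should missing values be dropped before summing?
#' @return a numeric vector of totals.
#' @rdname demo_totals
#' @export
demo_col_totals <- function(m, na.rm = FALSE) {
  apply(m, 2, sum, na.rm = na.rm)
}

# The second function adds no @param tags of its own, so the merged
# page lists each argument exactly once.

#' @rdname demo_totals
#' @export
demo_row_totals <- function(m, na.rm = FALSE) {
  apply(m, 1, sum, na.rm = na.rm)
}
```

This only stays tidy because both functions take the same arguments and return the same kind of value, which is the condition stated above for grouping.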

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60910 has finished for PR 13109 at commit 3760d03.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

@mengxr @felixcheung Can we open a new issue of the form "Separate out rd files for SparkR functions"? We can then make a list there of everything that's sharing an rd file right now and see what the way around things is.

@mengxr - also let me know if you think this is good to merge. I think for 2.0 RC1 having this PR is better than not having it?

@shivaram
Contributor

Jenkins, retest this please

@felixcheung
Member Author

Right @shivaram @mengxr, I agree it would be nicer to put predict, summary or even write.ml with their corresponding mllib methods.

@mengxr
Contributor

mengxr commented Jun 21, 2016

Yes, we should merge this PR first and discuss the grouping later.

@mengxr
Contributor

mengxr commented Jun 21, 2016

Created https://issues.apache.org/jira/browse/SPARK-16090 to follow up.

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60912 has finished for PR 13109 at commit 3760d03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

Cool. Thanks - LGTM. Merging this to master, branch-2.0

@asfgit asfgit closed this in 843a1eb Jun 21, 2016
asfgit pushed a commit that referenced this pull request Jun 21, 2016
…DataFrame stats functions

## What changes were proposed in this pull request?

Doc only changes. Please see screenshots.

Before:
http://spark.apache.org/docs/latest/api/R/statfunctions.html
![image](https://cloud.githubusercontent.com/assets/8969467/15264110/cd458826-1924-11e6-85bd-8ee2e2e1a85f.png)

After
![image](https://cloud.githubusercontent.com/assets/8969467/16218452/b9e89f08-3732-11e6-969d-a3a1796e7ad0.png)
(please ignore the style differences - this is due to not having the css in my local copy)

This is still a bit weird. As discussed in SPARK-15237, I think the better approach is to separate out the DataFrame stats function instead of putting everything on one page. At least now it is clearer which description is on which function.

## How was this patch tested?

Build doc

Author: Felix Cheung <[email protected]>
Author: felixcheung <[email protected]>

Closes #13109 from felixcheung/rstatdoc.

(cherry picked from commit 843a1eb)
Signed-off-by: Shivaram Venkataraman <[email protected]>