8 changes: 4 additions & 4 deletions R/pkg/R/generics.R
@@ -430,19 +430,19 @@ setGeneric("coltypes<-", function(x, value) { standardGeneric("coltypes<-") })
#' @export
setGeneric("columns", function(x) {standardGeneric("columns") })

-#' @rdname statfunctions
+#' @rdname cov
#' @export
setGeneric("cov", function(x, ...) {standardGeneric("cov") })

-#' @rdname statfunctions
+#' @rdname corr
#' @export
setGeneric("corr", function(x, ...) {standardGeneric("corr") })

-#' @rdname statfunctions
+#' @rdname cov
#' @export
setGeneric("covar_samp", function(col1, col2) {standardGeneric("covar_samp") })

-#' @rdname statfunctions
+#' @rdname covar_pop
#' @export
setGeneric("covar_pop", function(col1, col2) {standardGeneric("covar_pop") })
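To see why the `@rdname` changes above fix the generated pages: roxygen2 merges every documentation block that shares an `@rdname` into a single `.Rd` topic and concatenates their titles and descriptions, which is how all of the stat functions ended up on one `statfunctions` page with a run-on title. A minimal sketch of the two behaviors (hypothetical blocks, not taken from this diff):

```r
# Shared topic: this block is merged into statfunctions.Rd together with
# every other block tagged @rdname statfunctions, and the generated page
# carries all of their titles concatenated into one string.
#' foo
#' @rdname statfunctions
#' @export
setGeneric("foo", function(x) { standardGeneric("foo") })

# Dedicated topic: this block renders into its own cov.Rd, so the page
# has exactly one title and the index entry stays short.
#' cov
#' @rdname cov
#' @export
setGeneric("cov", function(x, ...) { standardGeneric("cov") })
```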

32 changes: 13 additions & 19 deletions R/pkg/R/stats.R
@@ -19,9 +19,10 @@

setOldClass("jobj")

-#' crosstab
-#'
-#' Computes a pair-wise frequency table of the given columns. Also known as a contingency
+#' @title SparkDataFrame statistic functions
Member:
Hi, @felixcheung. When I use ./create-docs.sh, this breaks the page. Should I do something differently to generate the HTML page like yours?

Member Author:
What's the error you see? This works for me, and the Jenkins run creates the docs too.

Member:
For me, the generated HTML file, file:///Users/dongjoon/spark/R/pkg/html/statfunctions.html, has the following title:

    SparkDataFrame statistic functions crosstab - Computes a pair-wise frequency table of the given columns Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned.

Also, index.html shows the above long string for all the stat functions, like approxQuantile.

Member:
It's a little bit different from your screenshot (After).

Contributor:
I don't see the build error, but the page title doesn't come up right. A screenshot is at https://www.dropbox.com/s/sc1mrd7upr6t7mp/Screenshot%202016-06-20%2021.25.57.png?dl=0

Also, we seem to have some functions, like covar_samp and covar_pop, that don't have a description?

Member Author (@felixcheung, Jun 21, 2016):
Fixed, sorry about that. roxygen2 is a bit... stubborn.

[image]

cov and corr shouldn't be there; they are referenced in generics.R. These bugs are also fixed in my other PR #13798; there are quite a lot of them.


+#' @description
+#' crosstab - Computes a pair-wise frequency table of the given columns. Also known as a contingency
#' table. The number of distinct values for each column should be less than 1e4. At most 1e6
#' non-zero pair frequencies will be returned.
#'
@@ -49,16 +50,14 @@ setMethod("crosstab",
collect(dataFrame(sct))
})

-#' cov
-#'
#' Calculate the sample covariance of two numerical columns of a SparkDataFrame.
#'
#' @param x A SparkDataFrame
#' @param col1 the name of the first column
#' @param col2 the name of the second column
#' @return the covariance of the two columns.
#'
-#' @rdname statfunctions
+#' @rdname cov
#' @name cov
#' @export
#' @examples
@@ -75,8 +74,6 @@ setMethod("cov",
callJMethod(statFunctions, "cov", col1, col2)
})

-#' corr
-#'
#' Calculates the correlation of two columns of a SparkDataFrame.
#' Currently only supports the Pearson Correlation Coefficient.
#' For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.
@@ -88,7 +85,7 @@ setMethod("cov",
#' only "pearson" is allowed now.
#' @return The Pearson Correlation Coefficient as a Double.
#'
-#' @rdname statfunctions
+#' @rdname corr
#' @name corr
#' @export
#' @examples
@@ -106,9 +103,8 @@ setMethod("corr",
callJMethod(statFunctions, "corr", col1, col2, method)
})

-#' freqItems
-#'
-#' Finding frequent items for columns, possibly with false positives.
+#' @description
+#' freqItems - Finding frequent items for columns, possibly with false positives.
#' Using the frequent element count algorithm described in
#' \url{http://dx.doi.org/10.1145/762471.762473}, proposed by Karp, Schenker, and Papadimitriou.
#'
@@ -134,10 +130,8 @@ setMethod("freqItems", signature(x = "SparkDataFrame", cols = "character"),
collect(dataFrame(sct))
})

-#' approxQuantile
-#'
-#' Calculates the approximate quantiles of a numerical column of a SparkDataFrame.
-#'
+#' @description
+#' approxQuantile - Calculates the approximate quantiles of a numerical column of a SparkDataFrame.
Member:
Unfortunately, this line is ignored. We need @description here, too:

    #' @description
    #' approxQuantile - Calculates the approximate quantiles of a numerical column of a SparkDataFrame.

After adding that, the description depth will look different. I mean, only approxQuantile has a detailed description, like the following:

    crosstab - Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned.

    freqItems - Finding frequent items for columns, possibly with false positives. Using the frequent element count algorithm described in http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou.

    approxQuantile - Calculates the approximate quantiles of a numerical column of a SparkDataFrame.
    The result of this algorithm has the following deterministic bound: If the SparkDataFrame has N elements and if we request the quantile at probability 'p' up to error 'err', then the algorithm will return a sample 'x' from the SparkDataFrame so that the *exact* rank of 'x' is close to (p * N). More precisely, floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
    This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first present in [[http://dx.doi.org/10.1145/375663.375670 Space-efficient Online Computation of Quantile Summaries]] by Greenwald and Khanna.

    sampleBy - Returns a stratified sample without replacement based on the fraction given on each stratum.

I'm not sure about balancing them by removing "The result of this algorithm ~~~ Khanna." If you think that is okay, we can keep it.

Member:
Oh, @shivaram already mentioned this. +1 for @shivaram's opinion. If we delete the details, @description is not needed.

Member:
Except the above, LGTM!

Member:
@felixcheung, I found that @description and @details work for me:

    #' @description
    #' approxQuantile - Calculates the approximate quantiles of a numerical column of a SparkDataFrame.
    #'
    #' @details
    #' The result o~~

If we keep the description, please try the above pair.

Member Author:
Fixed, sorry I missed one approxQuantile. Thanks for testing it out.
I thought a completely arbitrary do-not-have-a-newline convention would be hard for someone new to follow, so I added @description instead, to make it clear in case someone copies and pastes the block and adds a new empty line.
#' The result of this algorithm has the following deterministic bound:
#' If the SparkDataFrame has N elements and if we request the quantile at probability `p` up to
#' error `err`, then the algorithm will return a sample `x` from the SparkDataFrame so that the
@@ -174,9 +168,9 @@ setMethod("approxQuantile",
as.list(probabilities), relativeError)
})
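Written out in full, the @description/@details pairing the reviewers settle on would look roughly like this for approxQuantile (a sketch assembled from the hunks and comments above; the tag placement is the point, and the details text is abridged):

```r
#' @description
#' approxQuantile - Calculates the approximate quantiles of a numerical
#' column of a SparkDataFrame.
#'
#' @details
#' The result of this algorithm has the following deterministic bound:
#' if the SparkDataFrame has N elements and we request the quantile at
#' probability `p` up to error `err`, then the algorithm will return a
#' sample `x` from the SparkDataFrame so that the *exact* rank of `x`
#' is close to (p * N).
```

With this split, roxygen2 keeps the one-line summary in the Rd description section and moves the bound into a separate details section, so the concatenated descriptions on the shared page stay balanced in length.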

-#' sampleBy
-#'
-#' Returns a stratified sample without replacement based on the fraction given on each stratum.
+#' @description
+#' sampleBy - Returns a stratified sample without replacement based on the fraction given on each
+#' stratum.
#'
#' @param x A SparkDataFrame
#' @param col column that defines strata