
Commit f56e443: merge with master
2 parents: 166a6ff + 4896411

1,204 files changed: 30,649 additions, 12,449 deletions


.github/PULL_REQUEST_TEMPLATE

Lines changed: 12 additions & 0 deletions

@@ -0,0 +1,12 @@
+## What changes were proposed in this pull request?
+
+(Please fill in changes proposed in this fix)
+
+
+## How was this patch tested?
+
+(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
+
+
+(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
+

LICENSE

Lines changed: 9 additions & 9 deletions

@@ -249,14 +249,14 @@ The text of each license is also included at licenses/LICENSE-[project].txt.
 (Interpreter classes (all .scala files in repl/src/main/scala
 except for Main.Scala, SparkHelper.scala and ExecutorClassLoader.scala),
 and for SerializableMapWrapper in JavaUtils.scala)
-(BSD-like) Scala Actors library (org.scala-lang:scala-actors:2.10.5 - http://www.scala-lang.org/)
-(BSD-like) Scala Compiler (org.scala-lang:scala-compiler:2.10.5 - http://www.scala-lang.org/)
-(BSD-like) Scala Compiler (org.scala-lang:scala-reflect:2.10.5 - http://www.scala-lang.org/)
-(BSD-like) Scala Library (org.scala-lang:scala-library:2.10.5 - http://www.scala-lang.org/)
-(BSD-like) Scalap (org.scala-lang:scalap:2.10.5 - http://www.scala-lang.org/)
-(BSD-style) scalacheck (org.scalacheck:scalacheck_2.10:1.10.0 - http://www.scalacheck.org)
-(BSD-style) spire (org.spire-math:spire_2.10:0.7.1 - http://spire-math.org)
-(BSD-style) spire-macros (org.spire-math:spire-macros_2.10:0.7.1 - http://spire-math.org)
+(BSD-like) Scala Actors library (org.scala-lang:scala-actors:2.11.7 - http://www.scala-lang.org/)
+(BSD-like) Scala Compiler (org.scala-lang:scala-compiler:2.11.7 - http://www.scala-lang.org/)
+(BSD-like) Scala Compiler (org.scala-lang:scala-reflect:2.11.7 - http://www.scala-lang.org/)
+(BSD-like) Scala Library (org.scala-lang:scala-library:2.11.7 - http://www.scala-lang.org/)
+(BSD-like) Scalap (org.scala-lang:scalap:2.11.7 - http://www.scala-lang.org/)
+(BSD-style) scalacheck (org.scalacheck:scalacheck_2.11:1.10.0 - http://www.scalacheck.org)
+(BSD-style) spire (org.spire-math:spire_2.11:0.7.1 - http://spire-math.org)
+(BSD-style) spire-macros (org.spire-math:spire-macros_2.11:0.7.1 - http://spire-math.org)
 (New BSD License) Kryo (com.esotericsoftware.kryo:kryo:2.21 - http://code.google.com/p/kryo/)
 (New BSD License) MinLog (com.esotericsoftware.minlog:minlog:1.2 - http://code.google.com/p/minlog/)
 (New BSD License) ReflectASM (com.esotericsoftware.reflectasm:reflectasm:1.07 - http://code.google.com/p/reflectasm/)
@@ -283,7 +283,7 @@ The text of each license is also included at licenses/LICENSE-[project].txt.
 (MIT License) SLF4J API Module (org.slf4j:slf4j-api:1.7.5 - http://www.slf4j.org)
 (MIT License) SLF4J LOG4J-12 Binding (org.slf4j:slf4j-log4j12:1.7.5 - http://www.slf4j.org)
 (MIT License) pyrolite (org.spark-project:pyrolite:2.0.1 - http://pythonhosted.org/Pyro4/)
-(MIT License) scopt (com.github.scopt:scopt_2.10:3.2.0 - https://github.com/scopt/scopt)
+(MIT License) scopt (com.github.scopt:scopt_2.11:3.2.0 - https://github.com/scopt/scopt)
 (The MIT License) Mockito (org.mockito:mockito-core:1.9.5 - http://www.mockito.org)
 (MIT License) jquery (https://jquery.org/license/)
 (MIT License) AnchorJS (https://github.com/bryanbraun/anchorjs)

R/pkg/NAMESPACE

Lines changed: 4 additions & 1 deletion

@@ -13,7 +13,9 @@ export("print.jobj")
 # MLlib integration
 exportMethods("glm",
               "predict",
-              "summary")
+              "summary",
+              "kmeans",
+              "fitted")

 # Job group lifecycle management methods
 export("setJobGroup",
@@ -109,6 +111,7 @@ exportMethods("%in%",
               "add_months",
               "alias",
               "approxCountDistinct",
+              "approxQuantile",
               "array_contains",
               "asc",
               "ascii",

R/pkg/R/functions.R

Lines changed: 3 additions & 3 deletions

@@ -1962,7 +1962,7 @@ setMethod("sha2", signature(y = "Column", x = "numeric"),

 #' shiftLeft
 #'
-#' Shift the the given value numBits left. If the given value is a long value, this function
+#' Shift the given value numBits left. If the given value is a long value, this function
 #' will return a long value else it will return an integer value.
 #'
 #' @family math_funcs
@@ -1980,7 +1980,7 @@ setMethod("shiftLeft", signature(y = "Column", x = "numeric"),

 #' shiftRight
 #'
-#' Shift the the given value numBits right. If the given value is a long value, it will return
+#' Shift the given value numBits right. If the given value is a long value, it will return
 #' a long value else it will return an integer value.
 #'
 #' @family math_funcs
@@ -1998,7 +1998,7 @@ setMethod("shiftRight", signature(y = "Column", x = "numeric"),

 #' shiftRightUnsigned
 #'
-#' Unsigned shift the the given value numBits right. If the given value is a long value,
+#' Unsigned shift the given value numBits right. If the given value is a long value,
 #' it will return a long value else it will return an integer value.
 #'
 #' @family math_funcs
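The int-versus-long distinction in these roxygen docs mirrors JVM shift semantics: the shift is performed at the column's native width, and only the result type follows the input type. As a rough illustration in plain Python (our own helpers, not SparkR code), here is what a 32-bit left shift and an unsigned right shift do, including the sign-bit behavior the shiftRightUnsigned doc alludes to:

```python
def shift_left_32(value, num_bits):
    """32-bit signed left shift, like a JVM `int << numBits`."""
    result = (value << num_bits) & 0xFFFFFFFF
    # Reinterpret the low 32 bits as a signed two's-complement integer.
    return result - 0x100000000 if result >= 0x80000000 else result

def shift_right_unsigned_32(value, num_bits):
    """32-bit unsigned (logical) right shift, like a JVM `int >>> numBits`."""
    return (value & 0xFFFFFFFF) >> num_bits

print(shift_left_32(1, 31))            # shifts into the sign bit: -2147483648
print(-8 >> 1)                         # arithmetic shift keeps the sign: -4
print(shift_right_unsigned_32(-8, 1))  # logical shift discards it: 2147483644
```

The same arithmetic with a 64-bit mask would model the long-valued case, which is why the R wrappers only need to forward `numBits` and let the JVM pick the width.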

R/pkg/R/generics.R

Lines changed: 15 additions & 0 deletions

@@ -67,6 +67,13 @@ setGeneric("crosstab", function(x, col1, col2) { standardGeneric("crosstab") })
 # @export
 setGeneric("freqItems", function(x, cols, support = 0.01) { standardGeneric("freqItems") })

+# @rdname statfunctions
+# @export
+setGeneric("approxQuantile",
+           function(x, col, probabilities, relativeError) {
+             standardGeneric("approxQuantile")
+           })
+
 # @rdname distinct
 # @export
 setGeneric("distinct", function(x, numPartitions = 1) { standardGeneric("distinct") })
@@ -1160,3 +1167,11 @@ setGeneric("predict", function(object, ...) { standardGeneric("predict") })
 #' @rdname rbind
 #' @export
 setGeneric("rbind", signature = "...")
+
+#' @rdname kmeans
+#' @export
+setGeneric("kmeans")
+
+#' @rdname fitted
+#' @export
+setGeneric("fitted")

R/pkg/R/mllib.R

Lines changed: 70 additions & 4 deletions

@@ -104,11 +104,11 @@ setMethod("predict", signature(object = "PipelineModel"),
 setMethod("summary", signature(object = "PipelineModel"),
           function(object, ...) {
             modelName <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers",
-                                    "getModelName", object@model)
+                                     "getModelName", object@model)
             features <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers",
-                                   "getModelFeatures", object@model)
+                                    "getModelFeatures", object@model)
             coefficients <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers",
-                                       "getModelCoefficients", object@model)
+                                        "getModelCoefficients", object@model)
             if (modelName == "LinearRegressionModel") {
               devianceResiduals <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers",
                                                "getModelDevianceResiduals", object@model)
@@ -119,10 +119,76 @@ setMethod("summary", signature(object = "PipelineModel"),
               colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
               rownames(coefficients) <- unlist(features)
               return(list(devianceResiduals = devianceResiduals, coefficients = coefficients))
-            } else {
+            } else if (modelName == "LogisticRegressionModel") {
               coefficients <- as.matrix(unlist(coefficients))
               colnames(coefficients) <- c("Estimate")
               rownames(coefficients) <- unlist(features)
               return(list(coefficients = coefficients))
+            } else if (modelName == "KMeansModel") {
+              modelSize <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers",
+                                       "getKMeansModelSize", object@model)
+              cluster <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers",
+                                     "getKMeansCluster", object@model, "classes")
+              k <- unlist(modelSize)[1]
+              size <- unlist(modelSize)[-1]
+              coefficients <- t(matrix(coefficients, ncol = k))
+              colnames(coefficients) <- unlist(features)
+              rownames(coefficients) <- 1:k
+              return(list(coefficients = coefficients, size = size, cluster = dataFrame(cluster)))
+            } else {
+              stop(paste("Unsupported model", modelName, sep = " "))
+            }
+          })
+
+#' Fit a k-means model
+#'
+#' Fit a k-means model, similarly to R's kmeans().
+#'
+#' @param x DataFrame for training
+#' @param centers Number of centers
+#' @param iter.max Maximum iteration number
+#' @param algorithm Algorithm chosen to fit the model
+#' @return A fitted k-means model
+#' @rdname kmeans
+#' @export
+#' @examples
+#'\dontrun{
+#' model <- kmeans(x, centers = 2, algorithm = "random")
+#'}
+setMethod("kmeans", signature(x = "DataFrame"),
+          function(x, centers, iter.max = 10, algorithm = c("random", "k-means||")) {
+            columnNames <- as.array(colnames(x))
+            algorithm <- match.arg(algorithm)
+            model <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers", "fitKMeans", x@sdf,
+                                 algorithm, iter.max, centers, columnNames)
+            return(new("PipelineModel", model = model))
+          })
+
+#' Get fitted result from a model
+#'
+#' Get fitted result from a model, similarly to R's fitted().
+#'
+#' @param object A fitted MLlib model
+#' @return DataFrame containing fitted values
+#' @rdname fitted
+#' @export
+#' @examples
+#'\dontrun{
+#' model <- kmeans(trainingData, 2)
+#' fitted.model <- fitted(model)
+#' showDF(fitted.model)
+#'}
+setMethod("fitted", signature(object = "PipelineModel"),
+          function(object, method = c("centers", "classes"), ...) {
+            modelName <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers",
+                                     "getModelName", object@model)
+
+            if (modelName == "KMeansModel") {
+              method <- match.arg(method)
+              fittedResult <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers",
+                                          "getKMeansCluster", object@model, method)
+              return(dataFrame(fittedResult))
+            } else {
+              stop(paste("Unsupported model", modelName, sep = " "))
             }
           })
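To make the "classes" versus "centers" distinction in the new kmeans()/fitted() pair concrete, here is a toy sketch of Lloyd's algorithm in plain Python. This is our own illustration on 1-D data, not the MLlib implementation that fitKMeans dispatches to; "labels" plays the role of fitted(model, "classes"), and mapping each point to its center value would correspond to "centers":

```python
def kmeans_1d(points, centers, iter_max=10):
    """Plain Lloyd's algorithm on 1-D data; returns (centers, labels)."""
    for _ in range(iter_max):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda j: abs(p - centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lbl in zip(points, labels) if lbl == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, labels

points = [1.0, 2.0, 0.0, 9.0, 10.0, 8.0]
centers, labels = kmeans_1d(points, centers=[0.0, 10.0])
print(centers)  # [1.0, 9.0]
print(labels)   # [0, 0, 0, 1, 1, 1]
```

The summary() branch above returns the analogous pieces for the Spark model: the k x features matrix of centers as `coefficients`, per-cluster sizes, and the per-row cluster assignments as a DataFrame.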

R/pkg/R/pairRDD.R

Lines changed: 5 additions & 5 deletions

@@ -305,11 +305,11 @@ setMethod("groupByKey",
 #' Merge values by key
 #'
 #' This function operates on RDDs where every element is of the form list(K, V) or c(K, V).
-#' and merges the values for each key using an associative reduce function.
+#' and merges the values for each key using an associative and commutative reduce function.
 #'
 #' @param x The RDD to reduce by key. Should be an RDD where each element is
 #'          list(K, V) or c(K, V).
-#' @param combineFunc The associative reduce function to use.
+#' @param combineFunc The associative and commutative reduce function to use.
 #' @param numPartitions Number of partitions to create.
 #' @return An RDD where each element is list(K, V') where V' is the merged
 #'         value
@@ -347,12 +347,12 @@ setMethod("reduceByKey",
 #' Merge values by key locally
 #'
 #' This function operates on RDDs where every element is of the form list(K, V) or c(K, V).
-#' and merges the values for each key using an associative reduce function, but return the
-#' results immediately to the driver as an R list.
+#' and merges the values for each key using an associative and commutative reduce function, but
+#' return the results immediately to the driver as an R list.
 #'
 #' @param x The RDD to reduce by key. Should be an RDD where each element is
 #'          list(K, V) or c(K, V).
-#' @param combineFunc The associative reduce function to use.
+#' @param combineFunc The associative and commutative reduce function to use.
 #' @return A list of elements of type list(K, V') where V' is the merged value for each key
 #' @seealso reduceByKey
 #' @examples
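The doc change above tightens "associative" to "associative and commutative", and the reason is visible in a local sketch of what reduceByKey does: values are first combined within each partition, then the partial results from different partitions are merged in no guaranteed order. This is our own Python illustration of that two-phase shape, not SparkR's implementation:

```python
def reduce_by_key_local(pairs, combine_func, num_partitions=2):
    """Local sketch of reduceByKey: combine within each partition,
    then merge partials across partitions.  Because partials meet in no
    guaranteed order, combine_func must be both associative and
    commutative for the result to be deterministic."""
    partitions = [{} for _ in range(num_partitions)]
    for k, v in pairs:
        part = partitions[hash(k) % num_partitions]  # shuffle by key hash
        part[k] = combine_func(part[k], v) if k in part else v
    merged = {}
    for part in partitions:
        for k, v in part.items():
            merged[k] = combine_func(merged[k], v) if k in merged else v
    return merged

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
print(reduce_by_key_local(pairs, lambda x, y: x + y))  # {'a': 9, 'b': 6} (key order may vary)
```

A function like subtraction is associative over a fixed left-to-right fold but not commutative, so it can give different answers depending on how keys were partitioned, which is exactly what the corrected wording warns about.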

R/pkg/R/serialize.R

Lines changed: 1 addition & 1 deletion

@@ -54,7 +54,7 @@ writeObject <- function(con, object, writeType = TRUE) {
   # passing in vectors as arrays and instead require arrays to be passed
   # as lists.
   type <- class(object)[[1]] # class of POSIXlt is c("POSIXlt", "POSIXt")
-  # Checking types is needed here, since is.na only handles atomic vectors,
+  # Checking types is needed here, since 'is.na' only handles atomic vectors,
   # lists and pairlists
   if (type %in% c("integer", "character", "logical", "double", "numeric")) {
     if (is.na(object)) {

R/pkg/R/sparkR.R

Lines changed: 1 addition & 1 deletion

@@ -299,7 +299,7 @@ sparkRHive.init <- function(jsc = NULL) {
 #'
 #' @param sc existing spark context
 #' @param groupid the ID to be assigned to job groups
-#' @param description description for the the job group ID
+#' @param description description for the job group ID
 #' @param interruptOnCancel flag to indicate if the job is interrupted on job cancellation
 #' @examples
 #'\dontrun{

R/pkg/R/stats.R

Lines changed: 39 additions & 0 deletions

@@ -130,6 +130,45 @@ setMethod("freqItems", signature(x = "DataFrame", cols = "character"),
             collect(dataFrame(sct))
           })

+#' approxQuantile
+#'
+#' Calculates the approximate quantiles of a numerical column of a DataFrame.
+#'
+#' The result of this algorithm has the following deterministic bound:
+#' If the DataFrame has N elements and if we request the quantile at probability `p` up to error
+#' `err`, then the algorithm will return a sample `x` from the DataFrame so that the *exact* rank
+#' of `x` is close to (p * N). More precisely,
+#'   floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
+#' This method implements a variation of the Greenwald-Khanna algorithm (with some speed
+#' optimizations). The algorithm was first present in [[http://dx.doi.org/10.1145/375663.375670
+#' Space-efficient Online Computation of Quantile Summaries]] by Greenwald and Khanna.
+#'
+#' @param x A SparkSQL DataFrame.
+#' @param col The name of the numerical column.
+#' @param probabilities A list of quantile probabilities. Each number must belong to [0, 1].
+#'                      For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
+#' @param relativeError The relative target precision to achieve (>= 0). If set to zero,
+#'                      the exact quantiles are computed, which could be very expensive.
+#'                      Note that values greater than 1 are accepted but give the same result as 1.
+#' @return The approximate quantiles at the given probabilities.
+#'
+#' @rdname statfunctions
+#' @name approxQuantile
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- jsonFile(sqlContext, "/path/to/file.json")
+#' quantiles <- approxQuantile(df, "key", c(0.5, 0.8), 0.0)
+#' }
setMethod("approxQuantile",
+          signature(x = "DataFrame", col = "character",
+                    probabilities = "numeric", relativeError = "numeric"),
+          function(x, col, probabilities, relativeError) {
+            statFunctions <- callJMethod(x@sdf, "stat")
+            callJMethod(statFunctions, "approxQuantile", col,
+                        as.list(probabilities), relativeError)
+          })
+
 #' sampleBy
 #'
 #' Returns a stratified sample without replacement based on the fraction given on each stratum.
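The deterministic bound quoted in the approxQuantile roxygen block can be checked mechanically: given the returned sample `x`, its exact rank in the sorted data must fall within the stated window. Here is a small Python property check of that guarantee (our own illustration, not the Greenwald-Khanna summary structure itself):

```python
import math

def check_quantile_bound(data, x, p, err):
    """Check the approxQuantile guarantee: the returned sample x must have
    an exact rank within [floor((p - err) * N), ceil((p + err) * N)]."""
    n = len(data)
    rank = sorted(data).index(x) + 1  # 1-based rank of x in the sorted data
    return math.floor((p - err) * n) <= rank <= math.ceil((p + err) * n)

data = list(range(1, 101))  # 1..100, so the exact median is 50 (or 51)
print(check_quantile_bound(data, 50, p=0.5, err=0.01))  # True
print(check_quantile_bound(data, 80, p=0.5, err=0.01))  # False
```

With `relativeError = 0.0`, as in the `\dontrun` example above, the window collapses to the exact rank, which is why the docs warn that exact quantiles can be very expensive to compute.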
