[SPARK-15767][R][ML] Decision Tree Regression wrapper in SparkR #13690
Conversation
Test build #60600 has finished for PR 13690 at commit
7ea9544 to 378607f
Test build #61050 has finished for PR 13690 at commit
Hi @vectorijk, would you be interested in continuing this work?
Yes, sure. But I'm on vacation this week. I will keep working on this.
Great! Based on earlier discussions we might want to call this
ping @vectorijk Have you started working on the random forest wrapper? If not, or if you're busy, I can also work on that :)
Also, if you need any help with this PR, just let me know and we can work on it together.
@junyangq I have started working on the random forest wrapper and will open a PR as soon as possible. I'll also update this PR very soon. Thanks.
Sounds great. Thank you @vectorijk
378607f to f8b3484
Test build #64777 has finished for PR 13690 at commit
Test build #64776 has finished for PR 13690 at commit
@vectorijk Is this ready for another round of review?
@vectorijk hi - would you have time to update this?
hi @vectorijk - would you have time to update this? If not, I will try to follow up based on your work.
@felixcheung I'll update the changes in the next two days.
R/pkg/R/mllib.R (outdated)
I think this has been updated to use an internal function - could you check?
R/pkg/R/mllib.R (outdated)
@note since 2.1.0 like others?
R/pkg/R/mllib.R (outdated)
please add #' doc block
Thanks! Aside from having to rebase, there are some leftover references to "spark.rpart", a few other changes to make, and it would also be great to add tests for this.
2835a7a to b18b718
Test build #66438 has finished for PR 13690 at commit
Test build #66442 has finished for PR 13690 at commit
Test build #66448 has finished for PR 13690 at commit
@felixcheung @shivaram @junyangq It's ready for review.
could you fix the test failure? |
R/pkg/R/mllib.R (outdated)
#' @seealso \link{spark.als}, \link{spark.gaussianMixture}, \link{spark.isoreg}, \link{spark.kmeans},
#' @seealso \link{spark.lda}, \link{spark.mlp}, \link{spark.naiveBayes}, \link{spark.survreg}
#' @seealso \link{spark.lda}, \link{spark.mlp}, \link{spark.naiveBayes}, \link{spark.survreg},
#' @seealso \link{spark.decisionTree},
let's keep this list sorted?
R/pkg/R/mllib.R (outdated)
#' @seealso \link{spark.glm}, \link{glm},
#' @seealso \link{spark.als}, \link{spark.gaussianMixture}, \link{spark.isoreg}, \link{spark.kmeans},
#' @seealso \link{spark.mlp}, \link{spark.naiveBayes}, \link{spark.survreg}
#' @seealso \link{spark.mlp}, \link{spark.naiveBayes}, \link{spark.survreg}, \link{spark.decisionTree}
same here
setMethod("spark.decisionTree", signature(data = "SparkDataFrame", formula = "formula"),
          function(data, formula, type = c("regression", "classification"),
                   maxDepth = 5, maxBins = 32 ) {
            formula <- paste(deparse(formula), collapse = "")
use match.arg to check type?
https://stat.ethz.ch/R-manual/R-devel/library/base/html/match.arg.html
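For illustration, a small standalone sketch of the match.arg pattern being suggested (not the actual SparkR method; the choices simply mirror the PR's type argument):

```r
# Hedged sketch: shows how match.arg() would validate the "type" argument.
fitType <- function(type = c("regression", "classification")) {
  # match.arg() returns the first choice by default and raises an
  # informative error for values outside the allowed set.
  type <- match.arg(type)
  type
}

fitType()                   # "regression" (default)
fitType("classification")   # "classification"
# fitType("clustering")     # error: 'arg' should be one of "regression", "classification"
```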
test_that("spark.decisionTree Regression", {
  data <- suppressWarnings(createDataFrame(longley))
  model <- spark.decisionTree(data, Employed~., "regression", maxDepth = 5, maxBins = 16)
could be more readable as Employed ~ . (with spaces)
addressed comments above
Test build #66567 has finished for PR 13690 at commit
})

test_that("spark.decisionTree Regression", {
  data <- suppressWarnings(createDataFrame(longley))
please add a test for print (see spark.glm)
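A possible shape for such a test, loosely modeled on the existing spark.glm tests; the assertions below are illustrative, and the exact printed text would depend on the wrapper's summary output:

```r
test_that("spark.decisionTree summary prints", {
  df <- suppressWarnings(createDataFrame(longley))
  model <- spark.decisionTree(df, Employed ~ ., "regression",
                              maxDepth = 5, maxBins = 16)
  # Capture whatever print/summary emits and check it is non-empty and
  # mentions the response column; the exact format is up to the wrapper.
  out <- capture.output(print(summary(model)))
  expect_true(length(out) > 0)
  expect_true(any(grepl("Employed", out)))
})
```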
#' a SparkDataFrame. Users can call \code{summary} to get a summary of the fitted Decision Tree
#' model, \code{predict} to make predictions on new data, and \code{write.ml}/\code{read.ml} to
#' save/load fitted models.
#' For more details, see \href{https://en.wikipedia.org/wiki/Decision_tree_learning}{Decision Tree}
could you point this url to the Spark programming guide, like http://spark.apache.org/docs/latest/ml-classification-regression.html
#' @param data a SparkDataFrame for training.
#' @param formula a symbolic description of the model to be fitted. Currently only a few formula
#' operators are supported, including '~', ':', '+', and '-'.
#' @param type type of model to fit
please add the types supported, e.g. one of "regression" or "classification" as the type of model
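For example, the parameter documentation could be worded along these lines (suggested wording only):

```r
#' @param type type of model to fit; supported values are "regression" and "classification".
```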
#'
#' # fit a Decision Tree Regression Model
#' model <- spark.decisionTree(data, Employed ~ ., type = "regression", maxDepth = 5, maxBins = 16)
#'
Could we add an example for "classification" too?
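Something along these lines could work as a classification example (a hedged sketch; it assumes the iris dataset, whose column names SparkR rewrites with underscores, and the same argument names as the regression example):

```r
#' # fit a Decision Tree Classification Model
#' df <- createDataFrame(iris)
#' model <- spark.decisionTree(df, Species ~ Petal_Length + Petal_Width,
#'                             type = "classification", maxDepth = 5, maxBins = 32)
#' summary(model)
```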
#' @note spark.decisionTree since 2.1.0
setMethod("spark.decisionTree", signature(data = "SparkDataFrame", formula = "formula"),
          function(data, formula, type = c("regression", "classification"),
                   maxDepth = 5, maxBins = 32 ) {
nit: extra space after 32 )
#' Save the Decision Tree Classification model to the input path.
#'
#' @param object A fitted Decision tree classification model
could you check the output doc by running create-doc.sh - I think this will duplicate the object when the @rdname is changed - in that case, just have one instance of this and say "regression or classification model"
#' which means throw exception if the output path exists.
#'
#' @aliases write.ml,DecisionTreeClassificationModel,character-method
#' @rdname spark.decisionTreeClassification
change to @rdname spark.decisionTree
#' @export
#' @note summary(DecisionTreeRegressionModel) since 2.1.0
setMethod("summary", signature(object = "DecisionTreeRegressionModel"),
          function(object, ...) {
do not put ... in signature here
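As a point of reference, a hedged sketch of the method with `...` dropped from the signature, following the pattern of other SparkR summary methods; the body is a placeholder, not the PR's implementation:

```r
setMethod("summary", signature(object = "DecisionTreeRegressionModel"),
          function(object) {
            # Placeholder body: the real method would pull fields such as the
            # formula, depth, and number of nodes from the JVM wrapper object.
            list()
          })
```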
val rFormula = new RFormula()
  .setFormula(formula)
  .setFeaturesCol("features")
could you take a look at another model wrapper (like NaiveBayesWrapper) and RWrapperUtils on how to handle DataFrame column name - this shouldn't be hardcoded here?
val rFormula = new RFormula()
  .setFormula(formula)
  .setFeaturesCol("features")
ditto
#' @export
#' @note summary(DecisionTreeClassificationModel) since 2.1.0
setMethod("summary", signature(object = "DecisionTreeClassificationModel"),
          function(object, ...) {
ditto
@felixcheung @vectorijk Should we close this PR?
@shivaram I will update this today.
gentle ping @vectorijk |
What changes were proposed in this pull request?
Implement a wrapper in SparkR to support decision tree regression. R's native decision tree regression implementation comes from the rpart package, with the signature rpart(formula, dataframe, method = "anova"). I propose we implement an API like spark.rpart(dataframe, formula, ...). After decision tree classification is implemented, we could refactor the two into an API more like rpart().
How was this patch tested?
Tested with unit tests in SparkR.
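For illustration only, a minimal sketch of how the proposed SparkR wrapper compares to base R's rpart on the longley dataset used in the PR's tests; the wrapper call uses the spark.decisionTree() signature quoted in the review above, and the output path is hypothetical:

```r
library(SparkR)
sparkR.session()

# Base R: decision tree regression with rpart (method = "anova")
library(rpart)
r_model <- rpart(Employed ~ ., data = longley, method = "anova")

# SparkR wrapper proposed in this PR
df <- suppressWarnings(createDataFrame(longley))
model <- spark.decisionTree(df, Employed ~ ., type = "regression",
                            maxDepth = 5, maxBins = 16)
summary(model)                              # summary of the fitted tree
predictions <- predict(model, df)           # predictions as a SparkDataFrame
write.ml(model, "/tmp/decisionTreeModel")   # hypothetical save path
```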