Skip to content

Conversation

@vectorijk
Copy link
Contributor

What changes were proposed in this pull request?

Implement a wrapper in SparkR to support decision tree regression. R's naive Decision Tree Regression implementation is from package rpart with signature rpart(formula, dataframe, method="anova"). I propose we could implement API like spark.rpart(dataframe, formula, ...) . After having implemented decision tree classification, we could refactor this two into an API more like rpart().

How was this patch tested?

Test with unit test in SparkR

@SparkQA
Copy link

SparkQA commented Jun 15, 2016

Test build #60600 has finished for PR 13690 at commit 7ea9544.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DecisionTreeRegressorWrapperWriter(instance: DecisionTreeRegressorWrapper)
    • class DecisionTreeRegressorWrapperReader extends MLReader[DecisionTreeRegressorWrapper]

@SparkQA
Copy link

SparkQA commented Jun 22, 2016

Test build #61050 has finished for PR 13690 at commit 378607f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DecisionTreeRegressorWrapperWriter(instance: DecisionTreeRegressorWrapper)
    • class DecisionTreeRegressorWrapperReader extends MLReader[DecisionTreeRegressorWrapper]

@vectorijk vectorijk changed the title [SPARK-15767][R][ML][WIP] Decision Tree Regression wrapper in SparkR [SPARK-15767][R][ML] Decision Tree Regression wrapper in SparkR Jun 22, 2016
@felixcheung
Copy link
Member

Hi @vectorijk would you be interested in continuing this work?

@vectorijk
Copy link
Contributor Author

Yes, sure. But I'm in a vacation this week. I will keep working on this and
update as soon as possible when I get back next week.

On Thu, Aug 11, 2016, 19:46 Felix Cheung [email protected] wrote:

Hi @vectorijk https://github.com/vectorijk would you be interested in
continuing this work?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#13690 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ADQu6Z5doEmjTpTXYESSYVlyiIM0c2sJks5qe7RFgaJpZM4I2xmp
.

@felixcheung
Copy link
Member

Great! based on earlier discussions we might want to call this spark.decisionTreeRegression or similar? As you say, if we are very compatible we could have (another) alias like rpart (ie. with spark., like glm)

@junyangq
Copy link
Contributor

junyangq commented Aug 22, 2016

ping @vectorijk Have you started working on the random forest wrapper. If not and you feel busy, I can also work on that :)

@junyangq
Copy link
Contributor

Also, if you need any help with this PR, just let me know and we may work together to make it.

@vectorijk
Copy link
Contributor Author

@junyangq I have started working on random forest wrapper. I will open PR as soon as possible. Also, I'll update this PR very soon. Thanks.

@junyangq
Copy link
Contributor

Sounds great. Thank you @vectorijk

@SparkQA
Copy link

SparkQA commented Sep 1, 2016

Test build #64777 has finished for PR 13690 at commit 2835a7a.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 1, 2016

Test build #64776 has finished for PR 13690 at commit f8b3484.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DecisionTreeRegressorWrapperWriter(instance: DecisionTreeRegressorWrapper)
    • class DecisionTreeRegressorWrapperReader extends MLReader[DecisionTreeRegressorWrapper]

@shivaram
Copy link
Contributor

shivaram commented Sep 3, 2016

@vectorijk Is this ready for another round of review ?

@felixcheung
Copy link
Member

@vectorijk hi - would you have time to update this?

@felixcheung
Copy link
Member

hi @vectorijk - would you have time to update this? If not, I will try to follow up basing on your work.

@vectorijk
Copy link
Contributor Author

@felixcheung I'll update the changes in this two days.

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this has been updated to use an internal function - could you check?

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@note since 2.1.0 like others?

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add #' doc block

@felixcheung
Copy link
Member

Thanks! Aside from having to rebase, there are some left over of "spark.rpart", a few some changes and also would be great to add tests for this.

@SparkQA
Copy link

SparkQA commented Oct 6, 2016

Test build #66438 has finished for PR 13690 at commit b18b718.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 6, 2016

Test build #66442 has finished for PR 13690 at commit 9787219.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 6, 2016

Test build #66448 has finished for PR 13690 at commit d034735.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vectorijk
Copy link
Contributor Author

@felixcheung @shivaram @junyangq It's ready for the review.

@felixcheung
Copy link
Member

could you fix the test failure?

Duplicated \argument entries in documentation object 'spark.decisionTree':
  'newData' '...' 'object' '...' 'x'

R/pkg/R/mllib.R Outdated
#' @seealso \link{spark.als}, \link{spark.gaussianMixture}, \link{spark.isoreg}, \link{spark.kmeans},
#' @seealso \link{spark.lda}, \link{spark.mlp}, \link{spark.naiveBayes}, \link{spark.survreg}
#' @seealso \link{spark.lda}, \link{spark.mlp}, \link{spark.naiveBayes}, \link{spark.survreg},
#' @seealso \link{spark.decisionTree},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep this list sorted?

R/pkg/R/mllib.R Outdated
#' @seealso \link{spark.glm}, \link{glm},
#' @seealso \link{spark.als}, \link{spark.gaussianMixture}, \link{spark.isoreg}, \link{spark.kmeans},
#' @seealso \link{spark.mlp}, \link{spark.naiveBayes}, \link{spark.survreg}
#' @seealso \link{spark.mlp}, \link{spark.naiveBayes}, \link{spark.survreg}, \link{spark.decisionTree}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

setMethod("spark.decisionTree", signature(data = "SparkDataFrame", formula = "formula"),
function(data, formula, type = c("regression", "classification"),
maxDepth = 5, maxBins = 32 ) {
formula <- paste(deparse(formula), collapse = "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


test_that("spark.decisionTree Regression", {
data <- suppressWarnings(createDataFrame(longley))
model <- spark.decisionTree(data, Employed~., "regression", maxDepth = 5, maxBins = 16)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be more readable as Employed ~ . (with spaces)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed comments above

@SparkQA
Copy link

SparkQA commented Oct 8, 2016

Test build #66567 has finished for PR 13690 at commit 0694f84.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

})

test_that("spark.decisionTree Regression", {
data <- suppressWarnings(createDataFrame(longley))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a test for print (see spark.glm)

#' a SparkDataFrame. Users can call \code{summary} to get a summary of the fitted Decision Tree
#' model, \code{predict} to make predictions on new data, and \code{write.ml}/\code{read.ml} to
#' save/load fitted models.
#' For more details, see \href{https://en.wikipedia.org/wiki/Decision_tree_learning}{Decision Tree}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you point this url to the Spark programming guide, like http://spark.apache.org/docs/latest/ml-classification-regression.html

#' @param data a SparkDataFrame for training.
#' @param formula a symbolic description of the model to be fitted. Currently only a few formula
#' operators are supported, including '~', ':', '+', and '-'.
#' @param type type of model to fit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add the types supported, eg. one of "regression" or "classification" as the type of model

#'
#' # fit a Decision Tree Regression Model
#' model <- spark.decisionTree(data, Employed ~ ., type = "regression", maxDepth = 5, maxBins = 16)
#'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add an example for "classification" too?

#' @note spark.decisionTree since 2.1.0
setMethod("spark.decisionTree", signature(data = "SparkDataFrame", formula = "formula"),
function(data, formula, type = c("regression", "classification"),
maxDepth = 5, maxBins = 32 ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra space after 32 )


#' Save the Decision Tree Classification model to the input path.
#'
#' @param object A fitted Decision tree classification model
Copy link
Member

@felixcheung felixcheung Oct 9, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you check the output doc by running create-doc.sh - I think this will duplicate the object when the @rdname is changed - in that case, just have one instance of this and say "regression or classification model"

#' which means throw exception if the output path exists.
#'
#' @aliases write.ml,DecisionTreeClassificationModel,character-method
#' @rdname spark.decisionTreeClassification
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to @rdname spark.decisionTree

#' @export
#' @note summary(DecisionTreeRegressionModel) since 2.1.0
setMethod("summary", signature(object = "DecisionTreeRegressionModel"),
function(object, ...) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not put ... in signature here


val rFormula = new RFormula()
.setFormula(formula)
.setFeaturesCol("features")
Copy link
Member

@felixcheung felixcheung Oct 9, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you take a look at another model wrapper (like NaiveBayesWrapper) and RWrapperUtils on how to handle DataFrame column name - this shouldn't be hardcoded here?


val rFormula = new RFormula()
.setFormula(formula)
.setFeaturesCol("features")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

#' @export
#' @note summary(DecisionTreeClassificationModel) since 2.1.0
setMethod("summary", signature(object = "DecisionTreeClassificationModel"),
function(object, ...) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@shivaram
Copy link
Contributor

@felixcheung @vectorijk Should we close this PR ?

@vectorijk
Copy link
Contributor Author

@shivaram I will update this today.

@HyukjinKwon
Copy link
Member

gentle ping @vectorijk

@srowen srowen mentioned this pull request Feb 2, 2017
@asfgit asfgit closed this in 20b4ca1 Feb 3, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants