[SPARK-3272][MLLib]Calculate prediction for nodes separately from calculating information gain for splits in decision tree #2180

chouqin · 2014-08-28T08:25:09Z

In the current implementation, the prediction for a node is calculated together with the information gain stats for each possible split. However, the value to predict for a node is determined by the node alone, regardless of which split is considered.
To save computation, we can calculate the prediction first and then calculate the information gain stats for each split.

This is also necessary if we want to support a minimum-instances-per-node parameter (SPARK-2207): when no split satisfies the minimum instances requirement, we don't use the information gain of any split, so there must be another way to get the prediction value.
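To make the idea concrete, here is a minimal, self-contained sketch of the separation described above. It is not the code in this PR and does not use the MLlib classes: `ClassCounts`, `Split`, `NodeResult`, `evaluate`, and the `minInstancesPerNode` argument are hypothetical names chosen for illustration, and Gini impurity with a majority-class prediction is assumed. The prediction is computed once from the node's own label counts, and the gain is computed only for splits that pass the minimum-instances filter.

```scala
// Illustrative sketch only: hypothetical, simplified types, not MLlib's classes.
object NodePredictionSketch {
  // Per-class label counts for a node, or for one side of a candidate split.
  final case class ClassCounts(counts: Vector[Long]) {
    def total: Long = counts.sum
  }

  // A candidate split, described by the label counts it sends left and right.
  final case class Split(id: Int, left: ClassCounts, right: ClassCounts)

  // Prediction plus the best valid split as (split id, gain), if any.
  final case class NodeResult(prediction: Int, bestSplit: Option[(Int, Double)])

  // Gini impurity of a set of label counts (0.0 for an empty set).
  def gini(c: ClassCounts): Double =
    if (c.total == 0L) 0.0
    else 1.0 - c.counts.map { n =>
      val p = n.toDouble / c.total
      p * p
    }.sum

  // The prediction depends only on the node's own counts: the majority class.
  // Assumes the node has at least one class.
  def predict(node: ClassCounts): Int = node.counts.indexOf(node.counts.max)

  // Information gain of one split. Assumes the node is non-empty.
  def gain(node: ClassCounts, s: Split): Double = {
    val n = node.total.toDouble
    gini(node) - (s.left.total / n) * gini(s.left) - (s.right.total / n) * gini(s.right)
  }

  def evaluate(node: ClassCounts, splits: Seq[Split], minInstancesPerNode: Long): NodeResult = {
    // Computed once per node, before any split is examined.
    val prediction = predict(node)
    // Only splits whose children are both large enough are scored.
    val scored = splits
      .filter(s => s.left.total >= minInstancesPerNode && s.right.total >= minInstancesPerNode)
      .map(s => (s.id, gain(node, s)))
    // None when no split is valid; the prediction is still available.
    val best = scored.reduceOption((a, b) => if (b._2 > a._2) b else a)
    NodeResult(prediction, best)
  }
}
```

When every split is filtered out, `bestSplit` is `None` but `prediction` is still set, which is the case the minimum-instances parameter needs.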

This PR also removes the unused function nodeIndexToLevel.

CC: @mengxr @manishamde @jkbradley, do you think this is really necessary?

AmplabJenkins · 2014-08-28T08:29:09Z

Can one of the admins verify this patch?

ScrapCodes · 2014-08-28T08:40:36Z

mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala

According to the Spark style guide, this change should be reverted.

chouqin: Thanks, this is my fault. I will fix the style soon.

ScrapCodes · 2014-08-28T08:43:08Z

I cannot say anything about the usefulness of the patch, but we follow the Spark style guide across our code base: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide

chouqin · 2014-08-28T11:17:59Z

@ScrapCodes thanks for your comments. I have just changed the indentation to meet the Spark style guide.

jkbradley · 2014-08-28T16:13:22Z

@chouqin Thanks for observing that we can sometimes avoid calculating the prediction and/or the info gain. I'm worried that this won't really change the scaling of the algorithm much since calculating the prediction is a low-cost operation. (This computation is done on the master node, and for any reasonable size dataset, the time spent on the master node is negligible compared to the time spent on the treeAggregate() call.)

I'm also worried about this PR clashing with the current DecisionTree PR: [https://github.com//pull/2125], which moves the calculation of predictions into separate Impurity* classes. Would it be possible to update this once [https://github.com//pull/2125] has gone through?

At that time, I think this PR could be simplified a bit by removing the Predict class. InformationGainStats.predict already holds the prediction, and InformationGainStats.gain can be computed or ignored as needed.
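One way to read the suggestion above, sketched under assumptions: a single stats object that always carries the prediction and evaluates the gain only when a caller asks for it. `NodeStats` and `computeGain` are hypothetical names and this is not MLlib's InformationGainStats; the sketch only illustrates the "computed or ignored as needed" idea using a Scala `lazy val`.

```scala
// Illustrative sketch only: a hypothetical stand-in, not MLlib's InformationGainStats.
object LazyGainSketch {
  // `predict` is always available; `computeGain` is a deferred gain calculation.
  final class NodeStats(val predict: Double, computeGain: () => Double) {
    // Evaluated at most once, and only if some caller actually reads it.
    lazy val gain: Double = computeGain()
  }
}
```

A caller that only needs the prediction never triggers the gain computation, so no separate prediction class is required in this reading.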

SparkQA · 2014-09-05T23:42:05Z

Can one of the admins verify this patch?

chouqin · 2014-09-09T09:22:29Z

Closing this PR and moving to #2332.

qiping.lqp added 3 commits August 28, 2014 16:03

separate calculation of predict of node from calculation of info gain…

0552c7e

… of splits

commit Predict.scala

c205eb8

fix decision tree suite

d92b3d4

ScrapCodes reviewed Aug 28, 2014

chouqin changed the title ~~Dt predict~~ [SPARK-3272][MLLib]Calculate prediction for nodes separately from calculating information gain for splits in decision tree Aug 28, 2014

fix indentation

e6af523

chouqin closed this Sep 9, 2014
