Skip to content

Conversation

@chouqin
Copy link
Contributor

@chouqin chouqin commented Aug 28, 2014

In current implementation, prediction for a node is calculated along with calculation of information gain stats for each possible splits. The value to predict for a specific node is determined, no matter what the splits are.
To save computation, we can first calculate prediction first and then calculate information gain stats for each split.

This is also necessary if we want to support minimum instances per node parameters(SPARK-2207) because when all splits don't satisfy minimum instances requirement , we don't use information gain of any splits. There should be a way to get the prediction value.

This PR also removes unused function nodeIndexToLevel.

CC: @mengxr @manishamde @jkbradley, do you think this is really necessary?

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to style guide for spark this change should be reverted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is my fault, I will fix style soon

@ScrapCodes
Copy link
Member

I can not say anything about the usefulness of the patch. But we follow the spark style guide across our code base. https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide

@chouqin chouqin changed the title Dt predict [SPARK-3272][MLLib]Calculate prediction for nodes separately from calculating information gain for splits in decision tree Aug 28, 2014
@chouqin
Copy link
Contributor Author

chouqin commented Aug 28, 2014

@ScrapCodes thanks for you comments, I have changed indentation to meet the spark style guide just now.

@jkbradley
Copy link
Member

@chouqin Thanks for observing that we can sometimes avoid calculating the prediction and/or the info gain. I'm worried that this won't really change the scaling of the algorithm much since calculating the prediction is a low-cost operation. (This computation is done on the master node, and for any reasonable size dataset, the time spent on the master node is negligible compared to the time spent on the treeAggregate() call.)

I'm also worried about this PR clashing with the current DecisionTree PR: [https://github.com//pull/2125], which moves the calculation of predictions into separate Impurity* classes. Would it be possible to update this once [https://github.com//pull/2125] has gone through?

At that time, I think this PR could be simplified a bit by removing the Predict class. InformationGainStats.predict already holds the prediction, and InformationGainStats.gain can be computed or ignored as needed.

@SparkQA
Copy link

SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@chouqin
Copy link
Contributor Author

chouqin commented Sep 9, 2014

Close this PR and move to #2332

@chouqin chouqin closed this Sep 9, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants