[SPARK-2152][MLlib] fix bin offset in DecisionTree node aggregations (also resolves SPARK-2160) #1316

jonsondag · 2014-07-07T14:16:34Z

Hi, this pull fixes (what I believe to be) a bug in DecisionTree.scala.

In the extractLeftRightNodeAggregates function, the first set of rightNodeAgg values for Regression are set in line 792 as follows:

rightNodeAgg(featureIndex)(2 * (numBins - 2))
= binData(shift + (2 * numBins - 1)))

Then there is a loop that sets the rest of the values, as in line 809:

rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) =
binData(shift + (2 *(numBins - 2 - splitIndex))) +
rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex))

But since splitIndex starts at 1, this ends up skipping a set of binData values.

The changes here address this issue, for both the Regression and Classification cases.

AmplabJenkins · 2014-07-07T14:21:04Z

Can one of the admins verify this patch?

srowen · 2014-07-07T17:15:45Z

This was reported as https://issues.apache.org/jira/browse/SPARK-2152 (and duplicated for some reason as https://issues.apache.org/jira/browse/SPARK-2160) but the author didn't seem to make a PR. You could note this is tied to SPARK-2152 and resolves both.

jonsondag · 2014-07-07T18:01:01Z

Thanks, I have edited the subject of this pull request and added comments in JIRA.

pwendell · 2014-07-08T07:57:41Z

@johnnywalleye mind adding [MLlib] to the title. Then it's more likely to catch @mengxr's eye :)

jonsondag · 2014-07-08T11:58:53Z

Sure, updated.

manishamde · 2014-07-08T21:55:55Z

@johnnywalleye Thanks! I will review this PR and get back soon.

jonsondag · 2014-07-08T22:11:55Z

Great, thanks!

On Tue, Jul 8, 2014 at 5:56 PM, manishamde [email protected] wrote:

@johnnywalleye https://github.com/johnnywalleye I will review this PR
and get back soon.

—
Reply to this email directly or view it on GitHub
#1316 (comment).

manishamde · 2014-07-08T22:47:29Z

LGTM! The classic off-by-one error gets you no matter how much attention you pay. Thanks for the rigorous review.

@pwendell Could we also add this to the 1.0 patch release?

@johnnywalleye If possible, could you also update this PR. #886 It's on my repo on branch 'multiclass'. It's easy for me to update this but I am hoping you could review it as well since it's a non-trivial change. :-)

manishamde · 2014-07-09T00:03:16Z

@johnnywalleye Nevermind. I fixed the same error on the multiclass PR.

…(also resolves SPARK-2160) Hi, this pull fixes (what I believe to be) a bug in DecisionTree.scala. In the extractLeftRightNodeAggregates function, the first set of rightNodeAgg values for Regression are set in line 792 as follows: rightNodeAgg(featureIndex)(2 * (numBins - 2)) = binData(shift + (2 * numBins - 1))) Then there is a loop that sets the rest of the values, as in line 809: rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) = binData(shift + (2 *(numBins - 2 - splitIndex))) + rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex)) But since splitIndex starts at 1, this ends up skipping a set of binData values. The changes here address this issue, for both the Regression and Classification cases. Author: johnnywalleye <[email protected]> Closes #1316 from johnnywalleye/master and squashes the following commits: 73809da [johnnywalleye] fix bin offset in DecisionTree node aggregations (cherry picked from commit 1114207) Signed-off-by: Xiangrui Meng <[email protected]>

mengxr · 2014-07-09T02:23:12Z

Thanks @johnnywalleye for the fix and @manishamde for reviewing the PR! I merged it into both master and branch-1.0, but I'm not sure whether it can make into v1.0.1 . I will update the JIRA if it doesn't.

Btw, I tested this patch locally but let's try calling Jenkins (sorry).

mengxr · 2014-07-09T02:23:19Z

Jenkins, test this please.

manishamde · 2014-07-09T17:25:41Z

Thanks @mengxr

Fixes test failures introduced by #1316. For both the regression and classification cases, val stats is the InformationGainStats for the best tree split. stats.predict is the predicted value for the data, before the split is made. Since 600 of the 1,000 values generated by DecisionTreeSuite.generateCategoricalDataPoints() are 1.0 and the rest 0.0, the regression tree and classification tree both correctly predict a value of 0.6 for this data now, and the assertions have been changed to reflect that. Author: johnnywalleye <[email protected]> Closes #1343 from johnnywalleye/decision-tree-tests and squashes the following commits: ef80603 [johnnywalleye] [SPARK-2417][MLlib] Fix DecisionTree tests

Fixes test failures introduced by #1316. For both the regression and classification cases, val stats is the InformationGainStats for the best tree split. stats.predict is the predicted value for the data, before the split is made. Since 600 of the 1,000 values generated by DecisionTreeSuite.generateCategoricalDataPoints() are 1.0 and the rest 0.0, the regression tree and classification tree both correctly predict a value of 0.6 for this data now, and the assertions have been changed to reflect that. Author: johnnywalleye <[email protected]> Closes #1343 from johnnywalleye/decision-tree-tests and squashes the following commits: ef80603 [johnnywalleye] [SPARK-2417][MLlib] Fix DecisionTree tests (cherry picked from commit d35e3db) Signed-off-by: Xiangrui Meng <[email protected]>

…(also resolves SPARK-2160) Hi, this pull fixes (what I believe to be) a bug in DecisionTree.scala. In the extractLeftRightNodeAggregates function, the first set of rightNodeAgg values for Regression are set in line 792 as follows: rightNodeAgg(featureIndex)(2 * (numBins - 2)) = binData(shift + (2 * numBins - 1))) Then there is a loop that sets the rest of the values, as in line 809: rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) = binData(shift + (2 *(numBins - 2 - splitIndex))) + rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex)) But since splitIndex starts at 1, this ends up skipping a set of binData values. The changes here address this issue, for both the Regression and Classification cases. Author: johnnywalleye <[email protected]> Closes apache#1316 from johnnywalleye/master and squashes the following commits: 73809da [johnnywalleye] fix bin offset in DecisionTree node aggregations (cherry picked from commit 1114207) Signed-off-by: Xiangrui Meng <[email protected]>

Fixes test failures introduced by apache#1316. For both the regression and classification cases, val stats is the InformationGainStats for the best tree split. stats.predict is the predicted value for the data, before the split is made. Since 600 of the 1,000 values generated by DecisionTreeSuite.generateCategoricalDataPoints() are 1.0 and the rest 0.0, the regression tree and classification tree both correctly predict a value of 0.6 for this data now, and the assertions have been changed to reflect that. Author: johnnywalleye <[email protected]> Closes apache#1343 from johnnywalleye/decision-tree-tests and squashes the following commits: ef80603 [johnnywalleye] [SPARK-2417][MLlib] Fix DecisionTree tests (cherry picked from commit d35e3db) Signed-off-by: Xiangrui Meng <[email protected]>

…(also resolves SPARK-2160) Hi, this pull fixes (what I believe to be) a bug in DecisionTree.scala. In the extractLeftRightNodeAggregates function, the first set of rightNodeAgg values for Regression are set in line 792 as follows: rightNodeAgg(featureIndex)(2 * (numBins - 2)) = binData(shift + (2 * numBins - 1))) Then there is a loop that sets the rest of the values, as in line 809: rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) = binData(shift + (2 *(numBins - 2 - splitIndex))) + rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex)) But since splitIndex starts at 1, this ends up skipping a set of binData values. The changes here address this issue, for both the Regression and Classification cases. Author: johnnywalleye <[email protected]> Closes apache#1316 from johnnywalleye/master and squashes the following commits: 73809da [johnnywalleye] fix bin offset in DecisionTree node aggregations

Fixes test failures introduced by apache#1316. For both the regression and classification cases, val stats is the InformationGainStats for the best tree split. stats.predict is the predicted value for the data, before the split is made. Since 600 of the 1,000 values generated by DecisionTreeSuite.generateCategoricalDataPoints() are 1.0 and the rest 0.0, the regression tree and classification tree both correctly predict a value of 0.6 for this data now, and the assertions have been changed to reflect that. Author: johnnywalleye <[email protected]> Closes apache#1343 from johnnywalleye/decision-tree-tests and squashes the following commits: ef80603 [johnnywalleye] [SPARK-2417][MLlib] Fix DecisionTree tests

fix bin offset in DecisionTree node aggregations

73809da

jonsondag changed the title ~~fix bin offset in DecisionTree node aggregations~~ [SPARK-2152] fix bin offset in DecisionTree node aggregations Jul 7, 2014

jonsondag changed the title ~~[SPARK-2152] fix bin offset in DecisionTree node aggregations~~ [SPARK-2152] fix bin offset in DecisionTree node aggregations (also resolves SPARK-2160) Jul 7, 2014

jonsondag changed the title ~~[SPARK-2152] fix bin offset in DecisionTree node aggregations (also resolves SPARK-2160)~~ [SPARK-2152][MLlib] fix bin offset in DecisionTree node aggregations (also resolves SPARK-2160) Jul 8, 2014

asfgit closed this in 1114207 Jul 9, 2014

jonsondag mentioned this pull request Jul 9, 2014

[SPARK-2417][MLlib] Fix DecisionTree tests #1343

Closed

[SPARK-2152][MLlib] fix bin offset in DecisionTree node aggregations (also resolves SPARK-2160) #1316

[SPARK-2152][MLlib] fix bin offset in DecisionTree node aggregations (also resolves SPARK-2160) #1316

Uh oh!

Conversation

jonsondag commented Jul 7, 2014

Uh oh!

AmplabJenkins commented Jul 7, 2014

Uh oh!

srowen commented Jul 7, 2014

Uh oh!

jonsondag commented Jul 7, 2014

Uh oh!

pwendell commented Jul 8, 2014

Uh oh!

jonsondag commented Jul 8, 2014

Uh oh!

manishamde commented Jul 8, 2014

Uh oh!

jonsondag commented Jul 8, 2014

Uh oh!

manishamde commented Jul 8, 2014

Uh oh!

manishamde commented Jul 9, 2014

Uh oh!

mengxr commented Jul 9, 2014

Uh oh!

mengxr commented Jul 9, 2014

Uh oh!

manishamde commented Jul 9, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants