Skip to content

Conversation

@jonsondag
Copy link
Contributor

Hi, this pull fixes (what I believe to be) a bug in DecisionTree.scala.

In the extractLeftRightNodeAggregates function, the first set of rightNodeAgg values for Regression are set in line 792 as follows:

rightNodeAgg(featureIndex)(2 * (numBins - 2))
= binData(shift + (2 * numBins - 1)))

Then there is a loop that sets the rest of the values, as in line 809:

rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) =
binData(shift + (2 *(numBins - 2 - splitIndex))) +
rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex))

But since splitIndex starts at 1, this ends up skipping a set of binData values.

The changes here address this issue, for both the Regression and Classification cases.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@srowen
Copy link
Member

srowen commented Jul 7, 2014

This was reported as https://issues.apache.org/jira/browse/SPARK-2152 (and duplicated for some reason as https://issues.apache.org/jira/browse/SPARK-2160) but the author didn't seem to make a PR. You could note this is tied to SPARK-2152 and resolves both.

@jonsondag jonsondag changed the title fix bin offset in DecisionTree node aggregations [SPARK-2152] fix bin offset in DecisionTree node aggregations Jul 7, 2014
@jonsondag jonsondag changed the title [SPARK-2152] fix bin offset in DecisionTree node aggregations [SPARK-2152] fix bin offset in DecisionTree node aggregations (also resolves SPARK-2160) Jul 7, 2014
@jonsondag
Copy link
Contributor Author

Thanks, I have edited the subject of this pull request and added comments in JIRA.

@pwendell
Copy link
Contributor

pwendell commented Jul 8, 2014

@johnnywalleye mind adding [MLlib] to the title. Then it's more likely to catch @mengxr's eye :)

@jonsondag jonsondag changed the title [SPARK-2152] fix bin offset in DecisionTree node aggregations (also resolves SPARK-2160) [SPARK-2152][MLlib] fix bin offset in DecisionTree node aggregations (also resolves SPARK-2160) Jul 8, 2014
@jonsondag
Copy link
Contributor Author

Sure, updated.

@manishamde
Copy link
Contributor

@johnnywalleye Thanks! I will review this PR and get back soon.

@jonsondag
Copy link
Contributor Author

Great, thanks!

On Tue, Jul 8, 2014 at 5:56 PM, manishamde [email protected] wrote:

@johnnywalleye https://github.com/johnnywalleye I will review this PR
and get back soon.


Reply to this email directly or view it on GitHub
#1316 (comment).

@manishamde
Copy link
Contributor

LGTM! The classic off-by-one error gets you no matter how much attention you pay. Thanks for the rigorous review.

@pwendell Could we also add this to the 1.0 patch release?

@johnnywalleye If possible, could you also update this PR. #886 It's on my repo on branch 'multiclass'. It's easy for me to update this but I am hoping you could review it as well since it's a non-trivial change. :-)

@manishamde
Copy link
Contributor

@johnnywalleye Nevermind. I fixed the same error on the multiclass PR.

@asfgit asfgit closed this in 1114207 Jul 9, 2014
asfgit pushed a commit that referenced this pull request Jul 9, 2014
…(also resolves SPARK-2160)

Hi, this pull fixes (what I believe to be) a bug in DecisionTree.scala.

In the extractLeftRightNodeAggregates function, the first set of rightNodeAgg values for Regression are set in line 792 as follows:

rightNodeAgg(featureIndex)(2 * (numBins - 2))
  = binData(shift + (2 * numBins - 1)))

Then there is a loop that sets the rest of the values, as in line 809:

rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) =
  binData(shift + (2 *(numBins - 2 - splitIndex))) +
  rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex))

But since splitIndex starts at 1, this ends up skipping a set of binData values.

The changes here address this issue, for both the Regression and Classification cases.

Author: johnnywalleye <[email protected]>

Closes #1316 from johnnywalleye/master and squashes the following commits:

73809da [johnnywalleye] fix bin offset in DecisionTree node aggregations

(cherry picked from commit 1114207)
Signed-off-by: Xiangrui Meng <[email protected]>
@mengxr
Copy link
Contributor

mengxr commented Jul 9, 2014

Thanks @johnnywalleye for the fix and @manishamde for reviewing the PR! I merged it into both master and branch-1.0, but I'm not sure whether it can make into v1.0.1 . I will update the JIRA if it doesn't.

Btw, I tested this patch locally but let's try calling Jenkins (sorry).

@mengxr
Copy link
Contributor

mengxr commented Jul 9, 2014

Jenkins, test this please.

@manishamde
Copy link
Contributor

Thanks @mengxr

asfgit pushed a commit that referenced this pull request Jul 9, 2014
Fixes test failures introduced by #1316.

For both the regression and classification cases,
val stats is the InformationGainStats for the best tree split.
stats.predict is the predicted value for the data, before the split is made.
Since 600 of the 1,000 values generated by DecisionTreeSuite.generateCategoricalDataPoints() are 1.0 and the rest 0.0, the regression tree and classification tree both correctly predict a value of 0.6 for this data now, and the assertions have been changed to reflect that.

Author: johnnywalleye <[email protected]>

Closes #1343 from johnnywalleye/decision-tree-tests and squashes the following commits:

ef80603 [johnnywalleye] [SPARK-2417][MLlib] Fix DecisionTree tests
asfgit pushed a commit that referenced this pull request Jul 9, 2014
Fixes test failures introduced by #1316.

For both the regression and classification cases,
val stats is the InformationGainStats for the best tree split.
stats.predict is the predicted value for the data, before the split is made.
Since 600 of the 1,000 values generated by DecisionTreeSuite.generateCategoricalDataPoints() are 1.0 and the rest 0.0, the regression tree and classification tree both correctly predict a value of 0.6 for this data now, and the assertions have been changed to reflect that.

Author: johnnywalleye <[email protected]>

Closes #1343 from johnnywalleye/decision-tree-tests and squashes the following commits:

ef80603 [johnnywalleye] [SPARK-2417][MLlib] Fix DecisionTree tests

(cherry picked from commit d35e3db)
Signed-off-by: Xiangrui Meng <[email protected]>
gzm55 pushed a commit to MediaV/spark that referenced this pull request Jul 18, 2014
…(also resolves SPARK-2160)

Hi, this pull fixes (what I believe to be) a bug in DecisionTree.scala.

In the extractLeftRightNodeAggregates function, the first set of rightNodeAgg values for Regression are set in line 792 as follows:

rightNodeAgg(featureIndex)(2 * (numBins - 2))
  = binData(shift + (2 * numBins - 1)))

Then there is a loop that sets the rest of the values, as in line 809:

rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) =
  binData(shift + (2 *(numBins - 2 - splitIndex))) +
  rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex))

But since splitIndex starts at 1, this ends up skipping a set of binData values.

The changes here address this issue, for both the Regression and Classification cases.

Author: johnnywalleye <[email protected]>

Closes apache#1316 from johnnywalleye/master and squashes the following commits:

73809da [johnnywalleye] fix bin offset in DecisionTree node aggregations

(cherry picked from commit 1114207)
Signed-off-by: Xiangrui Meng <[email protected]>
gzm55 pushed a commit to MediaV/spark that referenced this pull request Jul 18, 2014
Fixes test failures introduced by apache#1316.

For both the regression and classification cases,
val stats is the InformationGainStats for the best tree split.
stats.predict is the predicted value for the data, before the split is made.
Since 600 of the 1,000 values generated by DecisionTreeSuite.generateCategoricalDataPoints() are 1.0 and the rest 0.0, the regression tree and classification tree both correctly predict a value of 0.6 for this data now, and the assertions have been changed to reflect that.

Author: johnnywalleye <[email protected]>

Closes apache#1343 from johnnywalleye/decision-tree-tests and squashes the following commits:

ef80603 [johnnywalleye] [SPARK-2417][MLlib] Fix DecisionTree tests

(cherry picked from commit d35e3db)
Signed-off-by: Xiangrui Meng <[email protected]>
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…(also resolves SPARK-2160)

Hi, this pull fixes (what I believe to be) a bug in DecisionTree.scala.

In the extractLeftRightNodeAggregates function, the first set of rightNodeAgg values for Regression are set in line 792 as follows:

rightNodeAgg(featureIndex)(2 * (numBins - 2))
  = binData(shift + (2 * numBins - 1)))

Then there is a loop that sets the rest of the values, as in line 809:

rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) =
  binData(shift + (2 *(numBins - 2 - splitIndex))) +
  rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex))

But since splitIndex starts at 1, this ends up skipping a set of binData values.

The changes here address this issue, for both the Regression and Classification cases.

Author: johnnywalleye <[email protected]>

Closes apache#1316 from johnnywalleye/master and squashes the following commits:

73809da [johnnywalleye] fix bin offset in DecisionTree node aggregations
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Fixes test failures introduced by apache#1316.

For both the regression and classification cases,
val stats is the InformationGainStats for the best tree split.
stats.predict is the predicted value for the data, before the split is made.
Since 600 of the 1,000 values generated by DecisionTreeSuite.generateCategoricalDataPoints() are 1.0 and the rest 0.0, the regression tree and classification tree both correctly predict a value of 0.6 for this data now, and the assertions have been changed to reflect that.

Author: johnnywalleye <[email protected]>

Closes apache#1343 from johnnywalleye/decision-tree-tests and squashes the following commits:

ef80603 [johnnywalleye] [SPARK-2417][MLlib] Fix DecisionTree tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants