[SPARK-26721][ML] Avoid per-tree normalization in featureImportance for GBT #23773
```diff
@@ -363,7 +363,8 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
     val gbtWithFeatureSubset = gbt.setFeatureSubsetStrategy("1")
     val importanceFeatures = gbtWithFeatureSubset.fit(df).featureImportances
     val mostIF = importanceFeatures.argmax
-    assert(mostImportantFeature !== mostIF)
+    assert(mostIF === 1)
```
Member
Previously the two most important features were different. Why are they now both the same?
Contributor
Author
Not sure about the exact reason why they were different earlier (of course the behavior changed because of the fix, but this is expected). You can compare the importances vector with the one returned by sklearn. PS: please notice that a sufficiently recent sklearn version is required.
Member
Yeah, I should have commented on this; I actually don't know why the test previously asserted the answers must be different. That's actually the thing I'd least expect, though it's possible. Why does it still assert the importances are different? I suspect they won't match exactly, sure, but if there's an assertion here, isn't it that they're close? They may just not be that comparable, in which case there's nothing to assert.
Contributor
Author
The assertion is there to check that a different subset strategy actually produces different results. In particular, in the first case the importances vector is [1.0, 0.0, ...], while in the second case more features are used (because each tree can check one random feature at a time), so the vector is something like [0.7, ...]. Hence this assertion makes sense in order to check that the featureSubsetStrategy is properly taken into account.
Member
OK, I get it, we just expect something different to happen under the hood, even if we're largely expecting a similar or the same answer. Leave it in; if it failed because it exactly matched, we'd know it, and could easily figure out whether that's actually now expected or a bug.
Member
Doesn't the second case use just one feature while the first case uses all features? What do you mean by more features being used in the second case? Or did I misread the test code?
Contributor
Author
In the first case, every tree can choose among all features. Since feature 1 is basically the correct "label" (I mean they are the same), in the first case all the trees choose feature 1 in the first node and they get 100% accuracy. Hence the importance vector is [1.0, 0.0, ...]. In the second case, only one random feature at a time can be considered, so the trees are more "diverse" and they also consider other features; the importance vector is the one I mentioned above. You can maybe try to debug this UT if you want to understand it better (probably more effective than my poor English), or you can try running the same thing in sklearn.
Member
Thanks @mgaido91. I don't have a workable laptop these days, so it is hard for me to run the unit test; that is why I asked for more details. It sounds like this assertion depends on the seed, then?
Contributor
Author
Well, the seed is fixed, so the UT is actually deterministic and there is no flakiness. Even though with a different seed the result may be different, I'd consider it very unlikely that feature 1 would not be the most important one in any case, since it is really the ground truth here.
Member
Yeah, it is correct since there is a fixed seed. Anyway, OK for me to leave it as is.
```diff
+    assert(importances(mostImportantFeature) !== importanceFeatures(mostIF))
   }

   test("model evaluateEachIteration") {
```
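To make the review discussion above concrete, here is a minimal sketch (not code from this PR or the actual test suite) of the two cases being compared. The training DataFrame `df`, the estimator parameters, and the seed value are assumptions for illustration; as noted in the thread, feature 1 is effectively identical to the label in the real test data.

```scala
import org.apache.spark.ml.classification.GBTClassifier

// Baseline: by default every tree may split on any feature, so when one feature
// equals the label, all trees split on it first and essentially all the
// importance mass lands on that single feature.
val gbt = new GBTClassifier()
  .setMaxDepth(3)   // hypothetical parameters, just for the sketch
  .setMaxIter(5)
  .setSeed(42)      // a fixed seed keeps fit() deterministic, as noted above
val importances = gbt.fit(df).featureImportances
val mostImportantFeature = importances.argmax

// Restricting each split to 1 randomly sampled feature makes the trees more
// "diverse": other features get picked too, so importance is spread out.
val gbtWithFeatureSubset = gbt.setFeatureSubsetStrategy("1")
val importanceFeatures = gbtWithFeatureSubset.fit(df).featureImportances
val mostIF = importanceFeatures.argmax

// Feature 1 should win in both cases, but the importance values themselves
// should differ, showing the subset strategy is actually taken into account.
assert(mostImportantFeature == 1)
assert(mostIF == 1)
assert(importances(mostImportantFeature) != importanceFeatures(mostIF))
```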
This comment needs to be updated.
No, it is still valid. The final vector is still normalized to 1.
Don't you skip the normalization of the importance vector?
Oh, I see. The normalization mentioned here is for the total importance.
What is skipped is the normalization of the importance vector for each tree, but at the end the final vector is still normalized. To simplify in a diagram, before the PR it was:

tree importance -> normalization -> sum -> normalization

now it is:

tree importance -> sum -> normalization

So the final result is still normalized.
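As a minimal illustration of the difference between the two orderings (plain Scala with made-up per-tree importance values and equal tree weights; not the actual Spark implementation):

```scala
// Hypothetical raw (unnormalized) importance vectors for two trees.
val tree1 = Array(4.0, 0.0, 0.0)   // tree 1 splits only on feature 0
val tree2 = Array(1.0, 1.0, 0.0)   // tree 2 splits on features 0 and 1

// Scale a vector so its entries sum to 1.
def normalize(v: Array[Double]): Array[Double] = {
  val total = v.sum
  v.map(_ / total)
}

// Before the PR: normalize each tree, then sum, then normalize again.
// Every tree contributes the same total weight regardless of how much
// impurity it actually removed.
val before = normalize(
  normalize(tree1).zip(normalize(tree2)).map { case (a, b) => a + b })
// before == Array(0.75, 0.25, 0.0)

// After the PR: sum the raw importances and normalize only once at the end,
// so trees that reduce more impurity weigh more. The result still sums to 1.
val after = normalize(tree1.zip(tree2).map { case (a, b) => a + b })
// after == Array(0.833..., 0.166..., 0.0)
```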