[SPARK-26721][ML] Avoid per-tree normalization in featureImportance for GBT #23773
Conversation
* Estimate of the importance of each feature.
*
* Each feature's importance is the average of its importance across all trees in the ensemble
* The importance vector is normalized to sum to 1. This method is suggested by Hastie et al.
This comment needs to be updated.
no, it is still valid. The final vector is still normalized to 1.
don't you skip the normalization of the importance vector?
oh, I see. The normalization mentioned here is for the total importance.
What is skipped is the normalization of the importance vector for each tree; at the end, the aggregated vector is still normalized. To simplify in a diagram, before the PR it was:
tree importance -> normalization -> sum -> normalization
now it is
tree importance -> sum -> normalization
So the final result is still normalized.
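For concreteness, here is a minimal sketch of the two pipelines in the diagram, with hypothetical helper names rather than the actual Spark internals:

def normalize(v: Array[Double]): Array[Double] = {
  val total = v.sum
  if (total > 0) v.map(_ / total) else v
}

// Before the PR: tree importance -> normalization -> sum -> normalization
def importancesBefore(perTree: Seq[Array[Double]]): Array[Double] =
  normalize(perTree.map(normalize).transpose.map(_.sum).toArray)

// After the PR: tree importance -> sum -> normalization
def importancesAfter(perTree: Seq[Array[Double]]): Array[Double] =
  normalize(perTree.transpose.map(_.sum).toArray)

Since per-tree vectors are no longer rescaled to sum to 1 before summing, later trees (whose raw importances typically shrink as the residuals shrink) no longer get inflated weight, which is the overweighting the sklearn PR describes.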
Test build #102296 has finished for PR 23773 at commit
Test build #102297 has finished for PR 23773 at commit
val importanceFeatures = gbtWithFeatureSubset.fit(df).featureImportances
val mostIF = importanceFeatures.argmax
assert(mostImportantFeature !== mostIF)
assert(mostIF === 1)
Previously the two most important features were different. Why are they both 1 now?
Not sure about the exact reason why they were different earlier (of course the behavior changed because of the fix, but this is expected). You can compare the importances vector with the one returned by sklearn: as I mentioned in the PR description, they are very similar (so sklearn too says 1 is the most important feature in both scenarios).
PS: please note that the sklearn version must be >= 0.20.0.
Yeah, I should have commented on this; I actually don't know why the test previously asserted the answers must be different. That's actually the thing I'd least expect, though it's possible. Why does it still assert the importances are different? I suspect they won't match exactly, sure, but if there's an assertion here, shouldn't it be that they're close? They may just not be that comparable, in which case there's nothing to assert.
The assertion is there to check that a different subset strategy actually produces different results. In particular, in the first case the importances vector is [1.0, 0.0, ...], while in the second case more features are used (because each tree can consider only one random feature at a time), so the vector is something like [0.7, ...]. Hence this assertion makes sense in order to check that the featureSubset strategy is properly taken into account.
OK, I get it, we just expect something different to happen under the hood, even if we're largely expecting a similar or the same answer. Leave it in; if it failed because it exactly matched, we'd know it, and could easily figure out whether that's actually now expected or a bug.
> In particular, in the first case the importances vector is [1.0, 0.0, ...], while in the second case more features are used (because each tree can consider only one random feature at a time), so the vector is something like [0.7, ...].

Doesn't the second case use just one feature, and the first case use all features? What do you mean by more features being used in the second case? Or did I misread the test code?
In the first case, every tree can choose among all features. Since feature 1 basically is the correct "label" (I mean, they are the same), all the trees choose feature 1 in the first node and get 100% accuracy. Hence the importance vector is [1.0, 0.0, ...]. In the second case, only 1 random feature at a time can be considered, so the trees are more "diverse" and they also consider other features; the importance vector is then the one I mentioned above. You can try to debug this UT if you want to understand it better (it is probably more effective than my poor English), or you can run the same in sklearn.
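A hedged sketch of the two configurations being discussed (illustrative names and parameters, assuming a GBTRegressor plus a `df` and `seed` defined elsewhere; not the exact test code from the PR):

import org.apache.spark.ml.regression.GBTRegressor

// First case: every split may choose among all features, so each tree
// splits on the label-like feature first and importance concentrates there.
val gbtAllFeatures = new GBTRegressor()
  .setSeed(seed)
  .setFeatureSubsetStrategy("all")

// Second case: only 1 randomly chosen feature is considered per split,
// so the trees are more diverse and other features also gain importance.
val gbtWithFeatureSubset = new GBTRegressor()
  .setSeed(seed)
  .setFeatureSubsetStrategy("1")

val importances = gbtAllFeatures.fit(df).featureImportances              // e.g. [1.0, 0.0, ...]
val importanceFeatures = gbtWithFeatureSubset.fit(df).featureImportances // e.g. [0.7, ...]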
Thanks @mgaido91.
I don't have a workable laptop these days, so it is hard for me to run the unit test; that is why I am asking for more details.
It sounds like the assertion assert(importances(mostImportantFeature) !== importanceFeatures(mostIF)) makes sense. But for assert(mostIF === 1): because it picks one random feature at a time, are we sure that the most important feature is 1 in all cases? In an extreme case, this feature might not be chosen at all. It is potentially flaky. This assertion doesn't make much sense to me; maybe we don't need it.
Well, the seed is fixed, so the UT is actually deterministic and there is no flakiness. Although the result may differ with a different seed, I'd consider it very unlikely that 1 would not be the most important feature in any case, since it is really the ground truth here.
Yeah, it is correct since there is a fixed seed.
Anyway, assert(mostIF === 1) actually means assert(mostImportantFeature === mostIF). This assertion doesn't make as much sense as the previous one, assert(mostImportantFeature !== mostIF). It doesn't tell us much except that the most important feature happened to be the same...
OK for me to leave it as is.
Merged to master
Closes apache#23773 from mgaido91/SPARK-26721. Authored-by: Marco Gaido <[email protected]> Signed-off-by: Sean Owen <[email protected]>
What changes were proposed in this pull request?
Our feature importance calculation is taken from sklearn's, which was recently fixed (in scikit-learn/scikit-learn#11176). Citing the description of that PR:
> Because the feature importances are (currently, by default) normalized and then averaged, feature importances from later stages are overweighted.
This PR performs a fix similar to sklearn's: the per-tree normalization of the feature importance is skipped for GBT.
Credits to Daniel Jumper for clearly pointing out the issue and sklearn's PR.
How was this patch tested?
Modified UT; checked that the computed featureImportance in that test is similar to sklearn's (it can't be the same, because the trees may be slightly different).
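As a hedged sketch (not part of the PR itself), the Spark side of such a comparison could look like this, assuming a `df` with standard "features"/"label" columns; the sklearn counterpart would read feature_importances_ from a fitted GradientBoostingClassifier, with scikit-learn >= 0.20.0:

import org.apache.spark.ml.classification.GBTClassifier

val model = new GBTClassifier().setSeed(42L).fit(df)
// A vector that sums to 1; compare element-wise with sklearn's values
println(model.featureImportances)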