@@ -341,11 +341,12 @@ class GBTClassificationModel private[ml](
* The importance vector is normalized to sum to 1. This method is suggested by Hastie et al.
Member: This comment needs to be updated.

Contributor Author: No, it is still valid: the final vector is still normalized to sum to 1.

Member: Don't you skip the normalization of the importance vector?

Member: Oh, I see. The normalization mentioned here is for the total importance.

Contributor Author: We skip the normalization of the importance vector for each tree, but at the end the total vector is still normalized. To simplify with a diagram, before this PR it was:
tree importance -> normalization -> sum -> normalization
now it is:
tree importance -> sum -> normalization
So the final result is still normalized.
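
To make the difference concrete, here is a small self-contained Scala sketch of the two aggregation pipelines (the gain values and names are made up for illustration and are not taken from the PR). Both final vectors sum to 1; the values differ because, without per-tree normalization, each tree contributes in proportion to its total gain.

```scala
// Illustrative only: toy per-tree gains, not the actual Spark internals.
object ImportanceAggregationSketch {
  // Raw, unnormalized gain per feature index for two hypothetical trees.
  val tree1 = Map(0 -> 3.0, 1 -> 1.0)  // total gain  4.0
  val tree2 = Map(0 -> 1.0, 1 -> 9.0)  // total gain 10.0

  def normalize(m: Map[Int, Double]): Map[Int, Double] = {
    val total = m.values.sum
    m.map { case (k, v) => k -> v / total }
  }

  def aggregate(trees: Seq[Map[Int, Double]], perTreeNormalization: Boolean): Map[Int, Double] = {
    val perTree = trees.map(t => if (perTreeNormalization) normalize(t) else t)
    val summed = perTree.flatten.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
    normalize(summed)  // the final normalization happens in both pipelines
  }

  def main(args: Array[String]): Unit = {
    // Before the PR: normalize each tree, sum, then normalize the total.
    println(aggregate(Seq(tree1, tree2), perTreeNormalization = true))   // ≈ Map(0 -> 0.425, 1 -> 0.575)
    // After the PR, for GBTs: sum raw gains, then a single final normalization.
    println(aggregate(Seq(tree1, tree2), perTreeNormalization = false))  // ≈ Map(0 -> 0.286, 1 -> 0.714)
  }
}
```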

* (Hastie, Tibshirani, Friedman. "The Elements of Statistical Learning, 2nd Edition." 2001.)
* and follows the implementation from scikit-learn.

*
* See `DecisionTreeClassificationModel.featureImportances`
*/
@Since("2.0.0")
lazy val featureImportances: Vector = TreeEnsembleModel.featureImportances(trees, numFeatures)
lazy val featureImportances: Vector =
TreeEnsembleModel.featureImportances(trees, numFeatures, perTreeNormalization = false)

/** Raw prediction for the positive class. */
private def margin(features: Vector): Double = {
@@ -285,7 +285,8 @@ class GBTRegressionModel private[ml](
* @see `DecisionTreeRegressionModel.featureImportances`
*/
@Since("2.0.0")
lazy val featureImportances: Vector = TreeEnsembleModel.featureImportances(trees, numFeatures)
lazy val featureImportances: Vector =
TreeEnsembleModel.featureImportances(trees, numFeatures, perTreeNormalization = false)

/** (private[ml]) Convert to a model in the old API */
private[ml] def toOld: OldGBTModel = {
23 changes: 19 additions & 4 deletions mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala
@@ -135,7 +135,7 @@ private[ml] object TreeEnsembleModel {
* - Average over trees:
* - importance(feature j) = sum (over nodes which split on feature j) of the gain,
* where gain is scaled by the number of instances passing through node
* - Normalize importances for tree to sum to 1.
* - Normalize importances for tree to sum to 1 (only if `perTreeNormalization` is `true`).
* - Normalize feature importance vector to sum to 1.
*
* References:
@@ -145,20 +145,35 @@
* @param numFeatures Number of features in model (even if not all are explicitly used by
* the model).
* If -1, then numFeatures is set based on the max feature index in all trees.
* @param perTreeNormalization By default this is set to `true` and it means that the importances
* of each tree are normalized before being summed. If set to `false`,
* the normalization is skipped.
* @return Feature importance values, of length numFeatures.
*/
def featureImportances[M <: DecisionTreeModel](trees: Array[M], numFeatures: Int): Vector = {
def featureImportances[M <: DecisionTreeModel](
trees: Array[M],
numFeatures: Int,
perTreeNormalization: Boolean = true): Vector = {
val totalImportances = new OpenHashMap[Int, Double]()
trees.foreach { tree =>
// Aggregate feature importance vector for this tree
val importances = new OpenHashMap[Int, Double]()
computeFeatureImportance(tree.rootNode, importances)
// Normalize importance vector for this tree, and add it to total.
// TODO: In the future, also support normalizing by tree.rootNode.impurityStats.count?
val treeNorm = importances.map(_._2).sum
val treeNorm = if (perTreeNormalization) {
importances.map(_._2).sum
} else {
// We won't use it
Double.NaN
}
if (treeNorm != 0) {
importances.foreach { case (idx, impt) =>
val normImpt = impt / treeNorm
val normImpt = if (perTreeNormalization) {
impt / treeNorm
} else {
impt
}
totalImportances.changeValue(idx, normImpt, _ + normImpt)
}
}
@@ -363,7 +363,8 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
val gbtWithFeatureSubset = gbt.setFeatureSubsetStrategy("1")
val importanceFeatures = gbtWithFeatureSubset.fit(df).featureImportances
val mostIF = importanceFeatures.argmax
assert(mostImportantFeature !== mostIF)
assert(mostIF === 1)
Member: Previously the two most important features were different. Why are they now both 1?

Contributor Author (@mgaido91, Feb 15, 2019): Not sure about the exact reason why they were different earlier (of course the behavior changed because of the fix, but this is expected). You can compare the importances vector with the one returned by sklearn: as I mentioned in the PR description, they are very similar, so sklearn also says 1 is the most important feature in both scenarios.

PS: please note that the sklearn version must be >= 0.20.0.

Member: Yeah, I should have commented on this; I actually don't know why the test previously asserted the answers must be different. That's the thing I'd least expect, though it's possible. Why does it still assert the importances are different? I suspect they won't match exactly, sure, but if there's an assertion here, shouldn't it be that they're close? They may just not be that comparable, in which case there's nothing to assert.

Contributor Author: The assertion is there to check that a different subset strategy actually produces different results. In particular, in the first case the importances vector is [1.0, 0.0, ...], while in the second case more features are used (because the trees can only check one random feature at a time), so the vector is something like [0.7, ...]. Hence this assertion makes sense in order to check that the featureSubsetStrategy is properly taken into account.

Member: OK, I get it: we just expect something different to happen under the hood, even if we're largely expecting a similar or the same answer. Leave it in; if it failed because it exactly matched, we'd know, and could easily figure out whether that's actually now expected or a bug.

Member (quoting the reply above): "In particular, in the first case the importances vector is [1.0, 0.0, ...] while in the second case more features are used (because the trees can only check one random feature at a time), so the vector is something like [0.7, ...]."

Doesn't the second case use just one feature, and the first case all features? What do you mean by more features being used in the second case? Or did I misread the test code?

Contributor Author: In the first case, every tree can choose among all the features. Since feature 1 basically is the correct "label" (I mean, they are the same), all the trees choose feature 1 in the first node and get 100% accuracy, hence the importance vector is [1.0, 0.0, ...]. In the second case, only one random feature at a time can be considered, so the trees are more "diverse" and they also consider other features; the importance vector is then the one I mentioned above. You can try debugging this UT if you want to understand it better (probably more effective than my poor English), or you can run the same experiment in sklearn.
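
For readers who want to see the scenario outside the test suite, a rough sketch along these lines could be used. The synthetic data, parameter values, and variable names below are invented for illustration (this is not the suite's code), and an existing SparkSession named spark is assumed.

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// Synthetic data: feature index 1 is identical to the label, the rest is noise.
val rng = new scala.util.Random(42)
val df = (0 until 100).map { _ =>
  val label = rng.nextInt(2).toDouble
  (Vectors.dense(rng.nextDouble(), label, rng.nextDouble(), rng.nextDouble()), label)
}.toDF("features", "label")

val gbt = new GBTClassifier().setMaxIter(5).setSeed(123L)

// All features available at every split: every tree is expected to split on
// feature 1 immediately, so the importances should look roughly like [0.0, 1.0, 0.0, 0.0].
val allFeatures = gbt.fit(df).featureImportances

// Only one randomly chosen feature considered per split: the trees are more
// diverse, but feature 1 is still expected to dominate since it is the ground truth.
val oneFeature = gbt.setFeatureSubsetStrategy("1").fit(df).featureImportances

println(s"$allFeatures -> argmax ${allFeatures.argmax}")
println(s"$oneFeature -> argmax ${oneFeature.argmax}")
```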

Member (@viirya, Feb 16, 2019): Thanks @mgaido91.

I don't have a working laptop these days, so it is hard for me to run the unit test; that is why I am asking for details.

The assertion assert(importances(mostImportantFeature) !== importanceFeatures(mostIF)) sounds like it makes sense. But for assert(mostIF === 1): because only one random feature is picked at a time, are we sure that the most important feature is 1 in all cases? In an extreme case, that feature might not be chosen at all, so it is potentially flaky. This assertion doesn't make much sense to me; maybe we don't need it.

Contributor Author: Well, the seed is fixed, so the UT is actually deterministic and there is no flakiness. Although with a different seed the result may be different, I'd consider it very unlikely that 1 would not be the most important feature in any case, since it is really the ground truth here.

Member: Yeah, it is correct since there is a fixed seed.

Anyway, assert(mostIF === 1) effectively means assert(mostImportantFeature === mostIF). This assertion doesn't make as much sense as the previous one, assert(mostImportantFeature !== mostIF); it doesn't tell us much except that the two cases happen to have the same most important feature...

OK for me to leave it as is.

assert(importances(mostImportantFeature) !== importanceFeatures(mostIF))
}

test("model evaluateEachIteration") {
@@ -200,7 +200,8 @@ class GBTRegressorSuite extends MLTest with DefaultReadWriteTest {
val gbtWithFeatureSubset = gbt.setFeatureSubsetStrategy("1")
val importanceFeatures = gbtWithFeatureSubset.fit(df).featureImportances
val mostIF = importanceFeatures.argmax
assert(mostImportantFeature !== mostIF)
assert(mostIF === 1)
assert(importances(mostImportantFeature) !== importanceFeatures(mostIF))
}

test("model evaluateEachIteration") {