-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-25959][ML] GBTClassifier picks wrong impurity stats on loading #22986
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #98607 has finished for PR 22986 at commit
|
| } | ||
|
|
||
| private[ml] trait GBTClassifierParams extends GBTParams with TreeClassifierParams { | ||
| private[ml] trait GBTClassifierParams extends GBTParams with TreeRegressorParams { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does it need regressor params to fix this? it sounds like it should extend TreeClassifierParams, just given the name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, by the names seems so, but if you check their implementation, they both just add a impurity param with the only difference of the allowed values. In the case of TreeRegressorParams the only allowed value is variance (which is what GBTClassifier uses), in TreeClassifierParams the allowed values are gini (default) and entropy. So variance is not a valid value for the impurity param of TreeClassifierParams, so there is no way to set it properly for the GBTClassifier....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surely that's not the ideal way to fix it though, semantically. Can't GBTClassifier override the impurity internally (or can we let it do so)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To continue to extend TreeClassifierParams but be able to use correct impurity, seems we need to remove final from impurity in trait TreeClassifierParams...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try a small refactor creating a HasImpurity trait with the allowed values being overridden by its descendant. This way it should work. Let me know what you think about it. Thanks.
I tried but this is causing way too many MiMa issues.... Removing the final would be problematic as well, because of the setDefault...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because in GBTClassifier the only valid value is variance. It is true that the setImpurity method throws an exception if someone tries to set it, but we are not safe against bad values as it happens now. And, most importantly TreeClassifierParams is extended by other classes, so we would allow the variance value to be set for RandomForest and DecisionTree, which is bad because it is not a valid value for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, other classes have to restrict what's settable. I don't see the issue with bad values here? Is it serialized forms? At least this sounds like the normal OOP way to solve this. Other ways around this sound like hacks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean restricting in the setImpurity method as you are suggesting is probably feasible, but I am not sure it is the right thing to do. I haven't seen that pattern anywhere else in the codebase. We set the allowed parameters in the Param itself usually. One of the reason is to avoid problems when you are setting the default for instance or when set is used (this happens for instance when loading a saved model, so we can load an invalid model): so I think the solution is not really safe.
I am not sure the solution you are proposing is cleaner. Also, before this PR there were 2 interfaces: TreeClassifierParams and TreeRegressorParams only for this reason, ie. for the impurity param to allow/have different values. We could switch to a single HasImpurity with your approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked more into this and see that setImpurity is deprecated. I presume the point is to use set(impurity, ...) instead. Yeah, that's no longer possible to override in subclasses, scratch that; overriding setImpurity would have been just fine IMHO but that's not going to last anyway.
I do agree this is why there are different traits for classifiers and regressors, but I don't think that means a classifier should extend TreeRegressorParams because its parameters happen to match.
One option is to let the definition of impurity itself be overridden. That seems OK. Or we could make a new 'VarianceClassifier' or something that defines this variance-only impurity parameter and let TreeRegressionParams and GBTClassifierParams extend it. It is a little wacky, but quite reasonable from an OOP perspective. I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srowen yes, I'll try and create a VarianceClassifier trait. The only problem there is that it requires many MiMa exclusions, but I'll do that. Thanks.
|
Test build #98610 has finished for PR 22986 at commit
|
|
Test build #98609 has finished for PR 22986 at commit
|
|
Tried this patch. Seems this resolves the issue. |
| private[ml] trait GBTClassifierParams extends GBTParams with TreeRegressorParams { | ||
|
|
||
| /** | ||
| * Loss function which GBT tries to minimize. (case-insensitive) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another issue is that GBTClassifier also logs wrong value of impurity param.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's right. Currently we just saying that we are using gini, while we are using variance...
|
Test build #98780 has finished for PR 22986 at commit
|
|
Test build #98785 has finished for PR 22986 at commit
|
|
Test build #98819 has finished for PR 22986 at commit
|
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this type of approach looks good
| /** | ||
| * Parameters for Decision Tree-based regression algorithms. | ||
| */ | ||
| private[ml] trait TreeRegressorParams extends Params { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can actually keep this trait and extend HasVarianceImpurity. It would be a no-op like a few other traits here. Although 'redundant', would it avoid any MiMa warnings or very slightly improve compatibility, even if this is private[spark]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually this trait is still there (just some lines below). The MiMa warnings are there because the returned type of setImpurity returns this.type, which is now HasVarianceImpurity and the same for the setter...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just looking at this one more time ... hm, isn't this.type always referring to the current class's type? maybe it's some detail of the generated code and this is essentially a false positive. Or, honestly setImpurity can just be removed in this change too. It's deprecated for all of these classes. I don't mind doing that later too; would just avoid these MiMa exclusions too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it is AFAIK. I think it is a false positive.
I wouldn't remove it in this change because since this is a bug, this PR can be backported to 2.4.1 too. I'd do it in a separate change. We would need anyway to add the MiMa rules for the missing methods, so it doesn't change much on the exclusions either...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, but I don't know if we can back-port it because of the binary incompatibility, internal as it may be. I don't know. If it's not an issue then yes it can back port
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I am not sure how to verify that.
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right I wasn't reading carefully. OK.
|
Merged to master |
## What changes were proposed in this pull request? Our `GBTClassifier` supports only `variance` impurity. But unfortunately, its `impurity` param by default contains the value `gini`: it is not even modifiable by the user and it differs from the actual impurity used, which is `variance`. This issue does not limit to a wrong value returned for it if the user queries by `getImpurity`, but it also affect the load of a saved model, as its `impurityStats` are created as `gini` (since this is the value stored for the model impurity) which leads to wrong `featureImportances` in model loaded from saved ones. The PR changes the `impurity` param used to one which allows only the value `variance`. ## How was this patch tested? modified UT Closes apache#22986 from mgaido91/SPARK-25959. Authored-by: Marco Gaido <[email protected]> Signed-off-by: Sean Owen <[email protected]>
Our `GBTClassifier` supports only `variance` impurity. But unfortunately, its `impurity` param by default contains the value `gini`: it is not even modifiable by the user and it differs from the actual impurity used, which is `variance`. This issue does not limit to a wrong value returned for it if the user queries by `getImpurity`, but it also affect the load of a saved model, as its `impurityStats` are created as `gini` (since this is the value stored for the model impurity) which leads to wrong `featureImportances` in model loaded from saved ones. The PR changes the `impurity` param used to one which allows only the value `variance`. modified UT Closes apache#22986 from mgaido91/SPARK-25959. Authored-by: Marco Gaido <[email protected]> Signed-off-by: Sean Owen <[email protected]> Change-Id: Ib0e24dbeb4f3a622622173b3038ce3d69a136bf0 Back ported to 2.4 by Dagang Wei. Change-Id: I9258aff57799f1e03ffa31436d75057bdbc18bcd
What changes were proposed in this pull request?
Our
GBTClassifiersupports onlyvarianceimpurity. But unfortunately, itsimpurityparam by default contains the valuegini: it is not even modifiable by the user and it differs from the actual impurity used, which isvariance. This issue does not limit to a wrong value returned for it if the user queries bygetImpurity, but it also affect the load of a saved model, as itsimpurityStatsare created asgini(since this is the value stored for the model impurity) which leads to wrongfeatureImportancesin model loaded from saved ones.The PR changes the
impurityparam used to one which allows only the valuevariance.How was this patch tested?
modified UT