[SPARK-17017][Follow-up][ML] Refactor of ChiSqSelector and add ML Python API. #15214
Conversation
Oh I see. I trust your judgment on this, just wish we could have gotten your review on the original PR. @mpjlu what do you think?
Test build #65824 has finished for PR 15214 at commit
Hi @srowen and @yanboliang, thanks for your follow-up PR.
case ChiSqSelectorType.KBest =>
val selector = new feature.ChiSqSelector()
$(selectorType) match {
case OldChiSqSelector.KBest =>
Do you need to set SelectorType here?
Yes, updated.
case OldChiSqSelector.Percentile =>
selector.setPercentile($(percentile))
case ChiSqSelectorType.FPR =>
case OldChiSqSelector.FPR =>
ditto
}

@Since("1.6.0")
override def transformSchema(schema: StructType): StructType = {
== or != ?
Sorry, typo.
LabeledPoint(1.0, Vectors.dense(Array(4.0))),
LabeledPoint(2.0, Vectors.dense(Array(9.0))))
val model = new ChiSqSelector().setAlpha(0.1).fit(labeledDiscreteData)
val model = new ChiSqSelector().setSelectorType("fpr").setAlpha(0.1).fit(labeledDiscreteData)
you should also do the same thing for https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala
Added ML test case.
@mpjlu The main reason for this change is that the fitted model should not depend on the order in which users set params. In other words, users should get the same model whether they set A then B or B then A. Thanks!
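The order-independence point can be illustrated with a minimal Python sketch. `ToySelector`, its `set` method, and `fit_plan` are hypothetical names invented for this illustration and are not the Spark API; the default values are also only illustrative. The idea is that each setter writes to its own slot, and only the explicit selector type decides which value is used at fit time.

```python
class ToySelector:
    """Hypothetical stand-in for an estimator; NOT the Spark API."""

    def __init__(self):
        # Every parameter has its own slot; a setter never touches other slots.
        # Default values here are illustrative assumptions.
        self.params = {
            "selectorType": "kbest",
            "numTopFeatures": 50,
            "percentile": 0.1,
            "alpha": 0.05,
        }

    def set(self, name, value):
        if name not in self.params:
            raise KeyError(f"unknown param: {name}")
        self.params[name] = value
        return self  # allow chaining

    def fit_plan(self):
        """Return (selector type, the single value that type uses)."""
        used = {"kbest": "numTopFeatures", "percentile": "percentile", "fpr": "alpha"}
        selector_type = self.params["selectorType"]
        return selector_type, self.params[used[selector_type]]


# Same settings applied in two different orders give the same plan.
a = ToySelector().set("selectorType", "fpr").set("alpha", 0.1).set("percentile", 0.3)
b = ToySelector().set("percentile", 0.3).set("alpha", 0.1).set("selectorType", "fpr")
same_plan = a.fit_plan() == b.fit_plan()
```

With a design where `setPercentile` or `setAlpha` implicitly switched the selector type, the two chains above would have produced different models.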
Test build #65864 has finished for PR 15214 at commit
Hi @yanboliang, got it. Thanks.
I'm OK with it. @mpjlu, sounds like you approve?
Hi @srowen. In this PR, setFPR(0.05) is split into two calls: setSelectorType("fpr").setAlpha(0.05). This may be clearer to the user.
OK, I could also support either behavior. After all, for any component,
@srowen @mpjlu If users are not very familiar with
You can also refer to all the other Estimators in ML: even if you swap the order in which the arguments are set, you still get the same model. Thanks.
Thanks, this looks good to me.
Hi @srowen, sorry for forgetting to update the doc and python/ml/feature.py in the last PR.
Merged to master
What changes were proposed in this pull request?
#14597 modified ChiSqSelector to support the fpr selector type; however, it left some issues that need to be addressed:
- If users set both numTopFeatures and percentile, it will train a kbest or percentile model based on the order of setting (whichever was set last is the one trained). This confuses users, and we should allow users to set the selector type explicitly. We handle similar issues elsewhere in the ML code base, such as GeneralizedLinearRegression and LogisticRegression.
- [...] alpha can be set for the fpr model, we cannot handle it elegantly in the existing framework. The kbest and percentile models have similar issues. Setting the selector type explicitly also solves this.
- If selectorType = percentile and alpha = 0.1 are set, we should notify users that the parameter alpha will have no effect. We should handle complex parameter interaction checks in transformSchema. (FYI [SPARK-13761] [ML] Deprecate validateParams #11620)
How was this patch tested?
Unit test.
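The parameter-interaction check mentioned in the description (warning that alpha has no effect when selectorType = percentile) could look roughly like the following Python sketch. The `USED_BY` mapping and `check_param_interaction` are hypothetical names made up for this illustration; the actual check in the PR lives in the Scala transformSchema and may differ in detail.

```python
import warnings

# Hypothetical mapping: the one parameter each selector type actually uses.
USED_BY = {"kbest": "numTopFeatures", "percentile": "percentile", "fpr": "alpha"}


def check_param_interaction(selector_type, explicitly_set):
    """Warn about explicitly set params that the chosen selector type ignores.

    `explicitly_set` maps parameter names the user set to their values.
    Returns the sorted list of ignored parameter names.
    """
    if selector_type not in USED_BY:
        raise ValueError(f"unsupported selector type: {selector_type}")
    ignored = sorted(
        name
        for name in explicitly_set
        if name in USED_BY.values() and name != USED_BY[selector_type]
    )
    for name in ignored:
        warnings.warn(f"{name} has no effect when selectorType = {selector_type}")
    return ignored


# The case from the description: percentile selector with alpha also set.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ignored = check_param_interaction("percentile", {"percentile": 0.3, "alpha": 0.1})
```

Here the check only warns rather than failing, so an unused parameter never blocks fitting; whether to warn or raise is a design choice this sketch does not settle.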