-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-30776][ML] Support FValueSelector for continuous features and continuous labels #27679
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…continuous labels
|
There are lots of common code between this FValueSelector and ChiSqSelector. In next subtask, I will create a common Selector, and make FValueSelector and ChiSqSelector extend Selector. |
|
Test build #118849 has finished for PR 27679 at commit
|
| @@ -0,0 +1,448 @@ | |||
| /* | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section is pasted twice
|
|
||
| import org.apache.spark.annotation.Since | ||
| import org.apache.spark.ml._ | ||
| import org.apache.spark.ml.attribute.{AttributeGroup, _} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import org.apache.spark.ml.attribute._ ?
| * Object containing the test results for the ANOVA classification test. | ||
| */ | ||
| @Since("3.1.0") | ||
| class ANOVAClassificationTestResult private[stat] ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ANOVAClassificationTest is not yet implemented, Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove for now
|
|
||
| /** Used to construct output schema of tests */ | ||
| private case class FValueResult( | ||
| case class FValueResult( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make it public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will put back private
| instance: | ||
| FValueSelectorModel) extends MLWriter { | ||
|
|
||
| private case class Data(selectedFeatures: Seq[Int], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lint:
private case class Data(
selectedFeatures: Seq[Int],
pValue: Seq[Double],
statistics: Seq[Double])
| s"FValueSelectorModel: uid=$uid, numSelectedFeatures=${selectedFeatures.length}" | ||
| } | ||
|
|
||
| private[spark] def compressSparse( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not reusing compressSparse and compressDense defined in ChiSqSelectorModel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to have any dependency on mllib. Later on, I will also change ChiSelector to remove the dependency on mllib, so I can refactor all the common code between ChiSelector and FValueSelector to put in an abstract Selector.
| * Trait for selection test results. | ||
| */ | ||
| @Since("3.1.0") | ||
| trait SelectionTestResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to add such trait after all kinds of selectors are implemented
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little easier to implement the fit method if I add this trait here for now. Since this PR is more like an intermediate PR , I guess it might be OK to be a little messy?
|
Test build #119184 has finished for PR 27679 at commit
|
| private[feature] trait FValueSelectorParams extends Params | ||
| with HasFeaturesCol with HasOutputCol with HasLabelCol { | ||
|
|
||
| /** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: "If the number of features is less than numTopFeatures, then this will select all features."
I found that ChiSqSelector has similar logic, so I guess we can apply a small optimization in the future:
check the params before fit, if the number of features is less than numTopFeatures (or similar logic based on other params like fdr==0 ), then directly return model with all feature selected.
For now, I think we can just keep current logic.
| class FValueSelectorModel private[ml]( | ||
| override val uid: String, | ||
| val selectedFeatures: Array[Int], | ||
| val pValues: Array[Double], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice that ChiSqSelectorModel do not contain similar statistics. I am not sure whether we should keep them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am OK either way, but if having statistics info here, I will need to add statistics info in ChiSqSelector too, so I can have a common Selector later. It might be easier not to have these for now.
| override def load(path: String): FValueSelectorModel = super.load(path) | ||
|
|
||
| private[FValueSelectorModel] class FValueSelectorModelWriter( | ||
| instance: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
|
|
||
| private case class Data( | ||
| selectedFeatures: Seq[Int], | ||
| pValue: Seq[Double], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar question: do we need pValue and statistics in Model, they are not used in transformation,
| * Object containing the test results for the FValue regression test. | ||
| */ | ||
| @Since("3.1.0") | ||
| class FValueRegressionTestResult private[stat] ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
current class name is FValueResult?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
|
Test build #119247 has finished for PR 27679 at commit
|
|
Merged to master |
|
Thanks! |
…gorical labels ### What changes were proposed in this pull request? Add ANOVA Selector ### Why are the changes needed? Currently Spark only supports selection of categorical features, while there are many requirements for the selection of continuous distribution features. #27679 added FValueSelector for continuous features and continuous labels. This PR adds ANOVASelector for continuous features and categorical labels. ### Does this PR introduce any user-facing change? Yes, add a new Selector. ### How was this patch tested? add new test suites Closes #27895 from huaxingao/anova. Authored-by: Huaxin Gao <[email protected]> Signed-off-by: Sean Owen <[email protected]>
…continuous labels ### What changes were proposed in this pull request? Add FValueRegressionSelector for continuous features and continuous labels. ### Why are the changes needed? Currently Spark only supports selection of categorical features, while there are many requirements for the selection of continuous distribution features. This PR adds FValueSelector for continuous features and continuous labels. ANOVASelector for continuous features and categorical labels will be added later using a separate PR. ### Does this PR introduce any user-facing change? Yes. Add a new Selector ### How was this patch tested? Add new tests Closes apache#27679 from huaxingao/spark_30776. Authored-by: Huaxin Gao <[email protected]> Signed-off-by: zhengruifeng <[email protected]>
…gorical labels ### What changes were proposed in this pull request? Add ANOVA Selector ### Why are the changes needed? Currently Spark only supports selection of categorical features, while there are many requirements for the selection of continuous distribution features. apache#27679 added FValueSelector for continuous features and continuous labels. This PR adds ANOVASelector for continuous features and categorical labels. ### Does this PR introduce any user-facing change? Yes, add a new Selector. ### How was this patch tested? add new test suites Closes apache#27895 from huaxingao/anova. Authored-by: Huaxin Gao <[email protected]> Signed-off-by: Sean Owen <[email protected]>
What changes were proposed in this pull request?
Add FValueRegressionSelector for continuous features and continuous labels.
Why are the changes needed?
Currently Spark only supports selection of categorical features, while there are many requirements for the selection of continuous distribution features.
This PR adds FValueSelector for continuous features and continuous labels.
ANOVASelector for continuous features and categorical labels will be added later using a separate PR.
Does this PR introduce any user-facing change?
Yes.
Add a new Selector
How was this patch tested?
Add new tests