[SPARK-30776][ML] Support FValueSelector for continuous features and continuous labels #27679

huaxingao · 2020-02-24T05:54:11Z

What changes were proposed in this pull request?

Add FValueSelector for continuous features and continuous labels.

Why are the changes needed?

Currently, Spark supports feature selection only for categorical features, but many use cases require selecting continuous features.

This PR adds FValueSelector for continuous features and continuous labels.
ANOVASelector for continuous features and categorical labels will be added later in a separate PR.
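
For readers unfamiliar with the statistic, here is a minimal sketch of F-value based selection. It assumes the standard univariate linear-regression F-test, F = r^2 / (1 - r^2) * (n - 2), where r is the Pearson correlation between a feature and the label. The names `FValueSketch`, `pearson`, `fValue`, and `selectTopK` are hypothetical, plain arrays stand in for Spark vectors, and this is not the code in the PR.

```scala
// Illustrative sketch only: hypothetical names, plain Scala arrays instead of Spark vectors.
object FValueSketch {
  /** Pearson correlation between one feature column and the label. */
  def pearson(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length && x.length > 2, "need matching columns with n > 2")
    val n = x.length
    val meanX = x.sum / n
    val meanY = y.sum / n
    var sxy = 0.0
    var sxx = 0.0
    var syy = 0.0
    for (i <- 0 until n) {
      val dx = x(i) - meanX
      val dy = y(i) - meanY
      sxy += dx * dy
      sxx += dx * dx
      syy += dy * dy
    }
    sxy / math.sqrt(sxx * syy)
  }

  /** F statistic of a univariate linear regression: r^2 / (1 - r^2) * (n - 2). */
  def fValue(x: Array[Double], y: Array[Double]): Double = {
    val r = pearson(x, y)
    r * r / (1 - r * r) * (x.length - 2)
  }

  /** Indices of the k features with the largest F values, returned in ascending order. */
  def selectTopK(columns: Array[Array[Double]], label: Array[Double], k: Int): Array[Int] =
    columns.zipWithIndex
      .map { case (col, idx) => (fValue(col, label), idx) }
      .sortWith(_._1 > _._1) // largest F value first
      .take(k)
      .map(_._2)
      .sorted
}
```

A p-value based selection mode would additionally need the F distribution's CDF, which this sketch leaves out.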

Does this PR introduce any user-facing change?

Yes. This PR adds a new selector, FValueSelector.

How was this patch tested?

Added new tests.

huaxingao · 2020-02-24T05:57:33Z

There is a lot of common code between FValueSelector and ChiSqSelector. In the next subtask, I will create a common Selector and make both FValueSelector and ChiSqSelector extend it.

SparkQA · 2020-02-24T07:12:15Z

Test build #118849 has finished for PR 27679 at commit 832d8c4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2020-03-02T07:21:02Z

mllib/src/main/scala/org/apache/spark/ml/feature/FValueSelector.scala

@@ -0,0 +1,448 @@
+/*


This section is pasted twice

zhengruifeng · 2020-03-02T07:21:31Z

mllib/src/main/scala/org/apache/spark/ml/feature/FValueSelector.scala

+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml._
+import org.apache.spark.ml.attribute.{AttributeGroup, _}


import org.apache.spark.ml.attribute._ ?

zhengruifeng · 2020-03-02T07:25:45Z

mllib/src/main/scala/org/apache/spark/ml/stat/SelectionTestResult.scala

+ * Object containing the test results for the ANOVA classification test.
+ */
+@Since("3.1.0")
+class ANOVAClassificationTestResult private[stat] (


ANOVAClassificationTest is not yet implemented, right?

I will remove it for now.

zhengruifeng · 2020-03-02T07:26:43Z

mllib/src/main/scala/org/apache/spark/ml/stat/FValueTest.scala


  /** Used to construct output schema of tests */
-  private case class FValueResult(
+  case class FValueResult(


Should we make it public?

I will make it private again.

zhengruifeng · 2020-03-02T07:27:28Z

mllib/src/main/scala/org/apache/spark/ml/feature/FValueSelector.scala

+      instance:
+      FValueSelectorModel) extends MLWriter {
+
+    private case class Data(selectedFeatures: Seq[Int],


lint:

private case class Data(
    selectedFeatures: Seq[Int],
    pValue: Seq[Double],
    statistics: Seq[Double])

zhengruifeng · 2020-03-02T07:28:58Z

mllib/src/main/scala/org/apache/spark/ml/feature/FValueSelector.scala

+    s"FValueSelectorModel: uid=$uid, numSelectedFeatures=${selectedFeatures.length}"
+  }
+
+  private[spark] def compressSparse(


Why not reuse compressSparse and compressDense defined in ChiSqSelectorModel?

I don't want to have any dependency on mllib. Later on, I will also change ChiSqSelector to remove its dependency on mllib, so that I can refactor all the common code between ChiSqSelector and FValueSelector into an abstract Selector.
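
For context, a minimal sketch of what such compress helpers typically do, written against plain arrays so that it has no mllib dependency. It assumes that both the sparse vector's indices and the selected feature indices are sorted in ascending order. The object name `CompressSketch` and the signatures are hypothetical, not the ones in this PR.

```scala
import scala.collection.mutable.ArrayBuilder

// Illustrative sketch only: hypothetical object and signatures, plain arrays instead of ml vectors.
object CompressSketch {
  /**
   * Keep only the selected entries of a sparse vector.
   * Returned indices are positions within selectedFeatures, not the original indices.
   * Assumes `indices` and `selectedFeatures` are both sorted ascending.
   */
  def compressSparse(
      indices: Array[Int],
      values: Array[Double],
      selectedFeatures: Array[Int]): (Array[Int], Array[Double]) = {
    val newIndices = new ArrayBuilder.ofInt
    val newValues = new ArrayBuilder.ofDouble
    var i = 0 // cursor into the sparse indices
    var j = 0 // cursor into the selected features
    while (i < indices.length && j < selectedFeatures.length) {
      if (indices(i) == selectedFeatures(j)) {
        newIndices += j
        newValues += values(i)
        i += 1
        j += 1
      } else if (indices(i) > selectedFeatures(j)) {
        j += 1 // selected feature has no stored (non-zero) entry
      } else {
        i += 1 // stored entry belongs to a feature that was not selected
      }
    }
    (newIndices.result(), newValues.result())
  }

  /** Keep only the selected entries of a dense vector. */
  def compressDense(values: Array[Double], selectedFeatures: Array[Int]): Array[Double] =
    selectedFeatures.map(i => values(i))
}
```

Because both helpers depend only on index arrays, they are natural candidates for the shared abstract Selector mentioned above.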

zhengruifeng · 2020-03-02T07:29:58Z

mllib/src/main/scala/org/apache/spark/ml/stat/SelectionTestResult.scala

+ * Trait for selection test results.
+ */
+@Since("3.1.0")
+trait SelectionTestResult {


I'd prefer to add such a trait after all kinds of selectors are implemented.

It's a little easier to implement the fit method if I add this trait here for now. Since this is more of an intermediate PR, I guess it might be OK for it to be a little messy?

SparkQA · 2020-03-02T21:09:46Z

Test build #119184 has finished for PR 27679 at commit 53cc49a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2020-03-03T08:54:16Z

mllib/src/main/scala/org/apache/spark/ml/feature/FValueSelector.scala

+private[feature] trait FValueSelectorParams extends Params
+  with HasFeaturesCol with HasOutputCol with HasLabelCol {
+
+  /**


Nit: "If the number of features is less than numTopFeatures, then this will select all features."
I found that ChiSqSelector has similar logic, so I guess we can apply a small optimization in the future:
check the params before fit, and if the number of features is less than numTopFeatures (or a similar condition based on other params, like fdr==0), directly return a model with all features selected.

For now, I think we can just keep the current logic.
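
A minimal sketch of the suggested pre-fit check, covering only the numTopFeatures case. The name `selectAllIfPossible` is hypothetical, and the check uses "less than or equal", on the assumption that an equal count also selects every feature.

```scala
// Illustrative sketch only: hypothetical helper, not part of this PR.
object SelectorShortcut {
  /** Some(all feature indices) when the statistical test can be skipped, None otherwise. */
  def selectAllIfPossible(numFeatures: Int, numTopFeatures: Int): Option[Array[Int]] =
    if (numFeatures <= numTopFeatures) Some(Array.range(0, numFeatures)) else None
}
```

A fit method could call this first and fall back to the full test only when it returns None.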

zhengruifeng · 2020-03-03T08:58:52Z

mllib/src/main/scala/org/apache/spark/ml/feature/FValueSelector.scala

+class FValueSelectorModel private[ml](
+    override val uid: String,
+    val selectedFeatures: Array[Int],
+    val pValues: Array[Double],


I notice that ChiSqSelectorModel does not contain similar statistics. I am not sure whether we should keep them.

I am OK either way, but if we keep the statistics info here, I will need to add it to ChiSqSelector too, so that I can have a common Selector later. It might be easier not to have these for now.

zhengruifeng · 2020-03-03T09:03:09Z

mllib/src/main/scala/org/apache/spark/ml/feature/FValueSelector.scala

+  override def load(path: String): FValueSelectorModel = super.load(path)
+
+  private[FValueSelectorModel] class FValueSelectorModelWriter(
+      instance:


zhengruifeng · 2020-03-03T09:04:06Z

mllib/src/main/scala/org/apache/spark/ml/feature/FValueSelector.scala

+
+    private case class Data(
+        selectedFeatures: Seq[Int],
+        pValue: Seq[Double],


Similar question: do we need pValue and statistics in the model? They are not used in transformation.

zhengruifeng · 2020-03-03T09:07:16Z

mllib/src/main/scala/org/apache/spark/ml/stat/SelectionTestResult.scala

+ * Object containing the test results for the FValue regression test.
+ */
+@Since("3.1.0")
+class FValueRegressionTestResult private[stat] (


Isn't the current class name FValueResult?

SparkQA · 2020-03-03T19:54:49Z

Test build #119247 has finished for PR 27679 at commit 4584465.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2020-03-06T05:25:21Z

Merged to master

huaxingao · 2020-03-06T05:50:43Z

Thanks!

…gorical labels

### What changes were proposed in this pull request?
Add ANOVA Selector

### Why are the changes needed?
Currently Spark only supports selection of categorical features, while there are many requirements for the selection of continuous distribution features.

#27679 added FValueSelector for continuous features and continuous labels.
This PR adds ANOVASelector for continuous features and categorical labels.

### Does this PR introduce any user-facing change?
Yes, add a new Selector.

### How was this patch tested?
add new test suites

Closes #27895 from huaxingao/anova.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Sean Owen <[email protected]>

…continuous labels

### What changes were proposed in this pull request?
Add FValueRegressionSelector for continuous features and continuous labels.

### Why are the changes needed?
Currently Spark only supports selection of categorical features, while there are many requirements for the selection of continuous distribution features.

This PR adds FValueSelector for continuous features and continuous labels.
ANOVASelector for continuous features and categorical labels will be added later using a separate PR.

### Does this PR introduce any user-facing change?
Yes.
Add a new Selector

### How was this patch tested?
Add new tests

Closes apache#27679 from huaxingao/spark_30776.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: zhengruifeng <[email protected]>

huaxingao added 2 commits February 23, 2020 21:44

[SPARK-30776][ML] Support FValueSelector for continuous features and …

2ea3361

…continuous labels

nit

832d8c4

dongjoon-hyun added the ML label Feb 28, 2020

zhengruifeng reviewed Mar 2, 2020

address comments

53cc49a

zhengruifeng reviewed Mar 3, 2020

address comments

4584465

zhengruifeng approved these changes Mar 4, 2020

zhengruifeng closed this in 6468d6f Mar 6, 2020

huaxingao deleted the spark_30776 branch March 6, 2020 05:50

This was referenced Mar 7, 2020

[SPARK-31077][ML] Remove ChiSqSelector dependency on mllib.ChiSqSelectorModel #27841

Closed

[SPARK-31138][ML] Add ANOVA Selector for continuous features and categorical labels #27895

Closed

4 participants