@huaxingao
Contributor

What changes were proposed in this pull request?

Add FValueSelector for continuous features and continuous labels.

Why are the changes needed?

Currently, Spark only supports selection of categorical features, but many use cases require selecting features with continuous distributions.

This PR adds FValueSelector for continuous features and continuous labels.
ANOVASelector for continuous features and categorical labels will be added later using a separate PR.
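
For context, the F-value of a single continuous feature against a continuous label is commonly derived from their Pearson correlation r as F = r² / (1 − r²) · (n − 2), with (1, n − 2) degrees of freedom. The following is a minimal sketch of that standard formulation in plain Scala. The object and method names (`FValueSketch`, `pearson`, `fValue`) are hypothetical and are not taken from this patch, which may compute the statistic differently.

```scala
// Hypothetical helper, for illustration only; not the code added by this PR.
object FValueSketch {

  /** Pearson correlation between one feature column and the label column. */
  def pearson(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length && x.length > 2,
      "need two arrays of equal length with more than 2 rows")
    val n = x.length
    val meanX = x.sum / n
    val meanY = y.sum / n
    var sxy = 0.0
    var sxx = 0.0
    var syy = 0.0
    var i = 0
    while (i < n) {
      val dx = x(i) - meanX
      val dy = y(i) - meanY
      sxy += dx * dy
      sxx += dx * dx
      syy += dy * dy
      i += 1
    }
    sxy / math.sqrt(sxx * syy)
  }

  /** F statistic of a univariate linear regression, with (1, n - 2) degrees of freedom. */
  def fValue(x: Array[Double], y: Array[Double]): Double = {
    val r = pearson(x, y)
    val degreesOfFreedom = x.length - 2
    r * r / (1.0 - r * r) * degreesOfFreedom
  }
}
```

A larger F-value indicates a stronger linear relationship between the feature and the label; a selector would then rank features by this statistic or by the p-value derived from it.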

Does this PR introduce any user-facing change?

Yes.
Add a new Selector

How was this patch tested?

Add new tests

@huaxingao
Contributor Author

There is a lot of common code between FValueSelector and ChiSqSelector. In the next subtask, I will create a common Selector and make FValueSelector and ChiSqSelector extend it.

@SparkQA

SparkQA commented Feb 24, 2020

Test build #118849 has finished for PR 27679 at commit 832d8c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -0,0 +1,448 @@
/*
Contributor

This section is pasted twice


import org.apache.spark.annotation.Since
import org.apache.spark.ml._
import org.apache.spark.ml.attribute.{AttributeGroup, _}
Contributor

import org.apache.spark.ml.attribute._ ?

* Object containing the test results for the ANOVA classification test.
*/
@Since("3.1.0")
class ANOVAClassificationTestResult private[stat] (
Contributor

ANOVAClassificationTest is not yet implemented, right?

Contributor Author

I will remove it for now.


/** Used to construct output schema of tests */
private case class FValueResult(
case class FValueResult(
Contributor

Should we make it public?

Contributor Author

I will make it private again.

instance:
FValueSelectorModel) extends MLWriter {

private case class Data(selectedFeatures: Seq[Int],
Contributor

lint:

private case class Data(
    selectedFeatures: Seq[Int],
    pValue: Seq[Double],
    statistics: Seq[Double])

s"FValueSelectorModel: uid=$uid, numSelectedFeatures=${selectedFeatures.length}"
}

private[spark] def compressSparse(
Contributor

Why not reuse compressSparse and compressDense defined in ChiSqSelectorModel?

Contributor Author

I don't want to have any dependency on mllib. Later on, I will also change ChiSqSelector to remove its dependency on mllib, so that I can refactor the common code between ChiSqSelector and FValueSelector into an abstract Selector.
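
To make the discussion concrete, here is a rough sketch of what a pair of compress helpers generally does: keep only the selected feature indices of a vector and renumber them to their position in the selected list. It works on plain arrays rather than Spark vector types, assumes both index arrays are sorted in ascending order, and uses a hypothetical object name (`CompressSketch`) and hypothetical signatures; it is not the implementation from this PR or from ChiSqSelectorModel.

```scala
// Illustrative only: plain-array stand-ins for sparse and dense vectors.
object CompressSketch {

  /**
   * Keeps the entries of a sparse vector whose index is in `selected`
   * and renumbers each kept entry to its position within `selected`.
   * Both `indices` and `selected` are assumed to be sorted ascending.
   */
  def compressSparse(
      indices: Array[Int],
      values: Array[Double],
      selected: Array[Int]): (Array[Int], Array[Double]) = {
    val newIndices = Array.newBuilder[Int]
    val newValues = Array.newBuilder[Double]
    var i = 0 // position in the vector's active indices
    var j = 0 // position in the selected feature indices
    while (i < indices.length && j < selected.length) {
      if (indices(i) == selected(j)) {
        newIndices += j
        newValues += values(i)
        i += 1
        j += 1
      } else if (indices(i) < selected(j)) {
        i += 1 // active entry is not selected, skip it
      } else {
        j += 1 // selected feature is zero in this vector, skip it
      }
    }
    (newIndices.result(), newValues.result())
  }

  /** Picks the selected positions out of a dense value array. */
  def compressDense(values: Array[Double], selected: Array[Int]): Array[Double] =
    selected.map(i => values(i))
}
```

Because the logic depends only on index arrays, it is the kind of code that could live in a shared abstract Selector without pulling in mllib.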

* Trait for selection test results.
*/
@Since("3.1.0")
trait SelectionTestResult {
Contributor

I'd prefer to add such a trait after all kinds of selectors are implemented.

Contributor Author

It's a little easier to implement the fit method if I add this trait here for now. Since this PR is more of an intermediate PR, I guess it might be OK to be a little messy?

@SparkQA

SparkQA commented Mar 2, 2020

Test build #119184 has finished for PR 27679 at commit 53cc49a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private[feature] trait FValueSelectorParams extends Params
with HasFeaturesCol with HasOutputCol with HasLabelCol {

/**
Contributor

Nit: "If the number of features is less than numTopFeatures, then this will select all features."
I found that ChiSqSelector has similar logic, so I guess we can apply a small optimization in the future:
check the params before fit, and if the number of features is less than numTopFeatures (or a similar condition holds for other params, such as fdr == 0), directly return a model with all features selected.

For now, I think we can just keep the current logic.
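
The behavior described in that doc line can be sketched as follows, assuming that features are ranked by ascending p-value, as ChiSqSelector does for numTopFeatures. The object and method names (`TopFeaturesSketch`, `selectTopFeatures`) are hypothetical; this is only an illustration of the rule, not the fit method from this PR.

```scala
// Illustrative only: top-k selection over an array of per-feature p-values.
object TopFeaturesSketch {

  /**
   * Returns the indices of the `numTopFeatures` features with the smallest
   * p-values, in ascending index order. If there are fewer features than
   * `numTopFeatures`, `take` returns everything, so all features are selected.
   */
  def selectTopFeatures(pValues: Array[Double], numTopFeatures: Int): Array[Int] = {
    require(numTopFeatures >= 1, "numTopFeatures must be at least 1")
    pValues.zipWithIndex
      .sortBy { case (pValue, _) => pValue }
      .take(numTopFeatures)
      .map { case (_, index) => index }
      .sorted
  }
}
```

The optimization suggested above would amount to checking `pValues.length <= numTopFeatures` before running any statistical test and returning all indices directly.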

class FValueSelectorModel private[ml](
override val uid: String,
val selectedFeatures: Array[Int],
val pValues: Array[Double],
Contributor

I notice that ChiSqSelectorModel does not contain similar statistics. I am not sure whether we should keep them.

Contributor Author

I am OK either way, but if we keep the statistics info here, I will need to add it to ChiSqSelector too, so that I can have a common Selector later. It might be easier not to have these for now.

override def load(path: String): FValueSelectorModel = super.load(path)

private[FValueSelectorModel] class FValueSelectorModelWriter(
instance:
Contributor

lint

Contributor Author

Updated


private case class Data(
selectedFeatures: Seq[Int],
pValue: Seq[Double],
Contributor

Similar question: do we need pValue and statistics in the model? They are not used in transformation.

* Object containing the test results for the FValue regression test.
*/
@Since("3.1.0")
class FValueRegressionTestResult private[stat] (
Contributor

Isn't the current class name FValueResult?

Contributor Author

Updated.

@SparkQA

SparkQA commented Mar 3, 2020

Test build #119247 has finished for PR 27679 at commit 4584465.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor

Merged to master

@huaxingao
Contributor Author

Thanks!

@huaxingao deleted the spark_30776 branch March 6, 2020 05:50
srowen pushed a commit that referenced this pull request Mar 20, 2020
…gorical labels

### What changes were proposed in this pull request?
Add ANOVA Selector

### Why are the changes needed?
Currently Spark only supports selection of categorical features, while there are many requirements for the selection of continuous distribution features.

#27679 added FValueSelector for continuous features and continuous labels.
This PR adds ANOVASelector for continuous features and categorical labels.

### Does this PR introduce any user-facing change?
Yes, add a new Selector.

### How was this patch tested?
add new test suites

Closes #27895 from huaxingao/anova.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…continuous labels

### What changes were proposed in this pull request?
Add FValueRegressionSelector for continuous features and continuous labels.

### Why are the changes needed?
Currently Spark only supports selection of categorical features, while there are many requirements for the selection of continuous distribution features.

This PR adds FValueSelector for continuous features and continuous labels.
ANOVASelector for continuous features and categorical labels will be added later using a separate PR.

### Does this PR introduce any user-facing change?
Yes.
Add a new Selector

### How was this patch tested?
Add new tests

Closes apache#27679 from huaxingao/spark_30776.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: zhengruifeng <[email protected]>