[SPARK-17017][Follow-up][ML] Refactor of ChiSqSelector and add ML Python API. #15214

yanboliang · 2016-09-23T12:03:53Z

What changes were proposed in this pull request?

#14597 modified ChiSqSelector to support the fpr selector type; however, it left some issues that need to be addressed:

We should allow users to set the selector type explicitly rather than switching it implicitly through different setter functions, since the order of the setter calls causes unexpected behavior. For example, if users set both numTopFeatures and percentile, a kbest or percentile model is trained depending on the order of the calls (whichever was set last wins). This confuses users. We handle similar issues elsewhere in the ML code base, such as in GeneralizedLinearRegression and LogisticRegression.
Meanwhile, if more than one parameter besides alpha could be set for the fpr model, the existing framework cannot handle it elegantly. The same applies to the kbest and percentile models. Setting the selector type explicitly also solves this issue.
If users are allowed to set the selector type explicitly, we should handle parameter interactions: for example, if users set selectorType = percentile and alpha = 0.1, we should notify them that the parameter alpha will have no effect. Complex parameter interaction checks belong in transformSchema. (FYI [SPARK-13761] [ML] Deprecate validateParams #11620)
We should use lower-case selector type names to follow the MLlib convention.
Add ML Python API.
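
For illustration only, the parameter interaction check described above could look roughly like the following Python sketch. It is not the code in this PR: the `PARAM_FOR_TYPE` mapping and the `check_param_interaction` helper are hypothetical names, and the assumption is simply that each selector type reads exactly one threshold parameter.

```python
import warnings

# Hypothetical mapping (an assumption for this sketch, not Spark code):
# the single threshold parameter that each selector type actually reads.
PARAM_FOR_TYPE = {
    "kbest": "numTopFeatures",
    "percentile": "percentile",
    "fpr": "alpha",
}


def check_param_interaction(selector_type, explicitly_set):
    """Warn about, and return, explicitly set params that the chosen type ignores."""
    if selector_type not in PARAM_FOR_TYPE:
        raise ValueError("unsupported selector type: %r" % (selector_type,))
    used = PARAM_FOR_TYPE[selector_type]
    threshold_params = set(PARAM_FOR_TYPE.values())
    ignored = sorted(
        name for name in explicitly_set
        if name in threshold_params and name != used
    )
    for name in ignored:
        warnings.warn(
            "%s has no effect when selectorType = %s" % (name, selector_type)
        )
    return ignored
```

Calling `check_param_interaction("percentile", {"percentile", "alpha"})` in this sketch warns that alpha has no effect and returns the ignored names, which mirrors the selectorType = percentile, alpha = 0.1 case in the description.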

How was this patch tested?

Unit test.

srowen · 2016-09-23T12:18:14Z

Oh I see. I trust your judgment on this, just wish we could have gotten your review on the original PR. @mpjlu what do you think?

SparkQA · 2016-09-23T13:03:00Z

Test build #65824 has finished for PR 15214 at commit 8d1536a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mpjlu · 2016-09-23T16:15:10Z

Hi @srowen and @yanboliang, thanks for the follow-up PR.
I partly agree with your comments on 17017.
1. "if users both set numTopFeatures and percentile, it will train kbest or percentile model based on the order of setting (the latter setting one will be trained). This make users confused, and we should allow users to set selector type explicitly."
I think the main cause of the confusion you mention here is the function names. I have changed the function name setAlpha to setFPR in SPARK-17645, and setNumTopFeatures should be setKBest.
With this change, it becomes much clearer.
For example: setKBest(100), setPercentile(0.1), setFPR(0.05). A single function makes both the selection type and its parameter clear.
With your method, however, the user has to type "setSelectorType("KBest").setNumTopFeatures(100)" to do the same thing as "setKBest(100)".
2. "if there are more than one parameter except alpha can be set for fpr model, we can not handle it elegantly in the existing framework. And similar issues for kbest and percentile model. "
I cannot think of any other parameters for fpr, kbest, or percentile right now. But if there are any, I think they can be handled the same way as in your method, for example setKBest(100).setOther().
I agree with your other changes. Thanks very much.

mpjlu · 2016-09-23T16:20:38Z

mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala

-      case ChiSqSelectorType.KBest =>
+    val selector = new feature.ChiSqSelector()
+    $(selectorType) match {
+      case OldChiSqSelector.KBest =>


Do you need to set SelectorType here?

Yes, updated.

mpjlu · 2016-09-23T16:21:37Z

mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala

+      case OldChiSqSelector.Percentile =>
        selector.setPercentile($(percentile))
-      case ChiSqSelectorType.FPR =>
+      case OldChiSqSelector.FPR =>


mpjlu · 2016-09-23T16:23:41Z

mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala

  }

  @Since("1.6.0")
  override def transformSchema(schema: StructType): StructType = {


Sorry, typo.

mpjlu · 2016-09-23T16:30:29Z

mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala

        LabeledPoint(1.0, Vectors.dense(Array(4.0))),
        LabeledPoint(2.0, Vectors.dense(Array(9.0))))
-    val model = new ChiSqSelector().setAlpha(0.1).fit(labeledDiscreteData)
+    val model = new ChiSqSelector().setSelectorType("fpr").setAlpha(0.1).fit(labeledDiscreteData)


You should also do the same thing for https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala

Added ML test case.

yanboliang · 2016-09-24T07:59:32Z

@mpjlu The most important reason for this change is that the fitted/trained model should not depend on the order in which users set params. In other words, users should get the same model whether they set A and then B, or B and then A. Thanks!

SparkQA · 2016-09-24T08:45:52Z

Test build #65864 has finished for PR 15214 at commit 16347f4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mpjlu · 2016-09-24T16:41:39Z

Hi @yanboliang , got it. Thanks.

srowen · 2016-09-25T13:09:10Z

I'm OK with it. @mpjlu sounds like you approve?

mpjlu · 2016-09-25T14:59:40Z

Hi @srowen.
My understanding of yanbo's comments here is:
if a user uses ChiSqSelector like this:
model1 = new ChiSqSelector().setFPR(0.05).setKBest(100).fit(data)
model2 = new ChiSqSelector().setKBest(100).setFPR(0.05).fit(data)
model1 will be different from model2, so the model depends on the order in which the user sets params.
Actually, users should not use ChiSqSelector like this. Setting just one selector type/parameter is enough. But someone who doesn't know ChiSqSelector may do this, so yanbo thinks it is a problem.

In this PR, setFPR(0.05) is split into two functions: setSelectorType("fpr").setAlpha(0.05). This may be clearer to the user.
By the software development principle that one function does one thing, I am OK with this change.
But from a user experience point of view, I prefer the SPARK-17017 method.
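
The order dependence described in this comment can be reproduced with a small toy class. This is only a sketch in Python, not Spark code: `ImplicitSelector` is a hypothetical stand-in whose setters switch the selector type as a side effect, which is the behavior under discussion, and its defaults are chosen arbitrarily for the example.

```python
class ImplicitSelector:
    """Toy stand-in (not Spark code): each setter also switches the selector type."""

    def __init__(self):
        # Arbitrary defaults chosen for this sketch.
        self.selector_type = "kbest"
        self.num_top_features = 50
        self.alpha = 0.05

    def setKBest(self, n):
        # Side effect: the last setter called decides the selector type.
        self.selector_type = "kbest"
        self.num_top_features = n
        return self

    def setFPR(self, alpha):
        self.selector_type = "fpr"
        self.alpha = alpha
        return self


# Same two settings, different call order, different selector type.
model1_type = ImplicitSelector().setFPR(0.05).setKBest(100).selector_type
model2_type = ImplicitSelector().setKBest(100).setFPR(0.05).selector_type
print(model1_type, model2_type)
```

Both objects end up holding the same parameter values, yet they would train different kinds of model because only the last setter call determines the type.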

srowen · 2016-09-25T15:06:42Z

OK, I could also support either behavior. After all, for any component, .setFoo(x).setFoo(y) also creates a different model if the order is swapped, so I am not so clear that's a 'problem'.

yanboliang · 2016-09-25T15:30:16Z

@srowen @mpjlu
Another important reason for this change: the current behavior is error-prone for the Python ML API.

def __init__(self, numTopFeatures=50, featuresCol="features", outputCol=None, labelCol="label", selectorType="kbest", percentile=0.1, alpha=0.05):
......
def setParams(self, numTopFeatures=50, featuresCol="features", outputCol=None, labelCol="labels", selectorType="kbest", percentile=0.1, alpha=0.05):
......

If users are not very familiar with ChiSqSelector, they are likely to set all parameters following the API docs. The output model would then also depend on the order of the arguments, and it is very likely that users are not aware of the argument order in the Python API. Thanks.
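
A minimal Python sketch of why an explicit selectorType suits a keyword-argument API: when every parameter arrives as a keyword, the callee cannot observe any call order, so only an explicit type can say which threshold is active. The `resolve_selector` function is hypothetical and only borrows the parameter names and defaults from the signatures quoted above; it is not the PySpark implementation.

```python
def resolve_selector(numTopFeatures=50, selectorType="kbest",
                     percentile=0.1, alpha=0.05):
    """Pick the active threshold from the explicit selectorType (toy logic)."""
    active = {
        "kbest": ("numTopFeatures", numTopFeatures),
        "percentile": ("percentile", percentile),
        "fpr": ("alpha", alpha),
    }
    if selectorType not in active:
        raise ValueError("unsupported selector type: %r" % (selectorType,))
    name, value = active[selectorType]
    return (selectorType, name, value)


# All parameters set, in two different keyword orders: the result is identical.
a = resolve_selector(numTopFeatures=100, percentile=0.2, alpha=0.01,
                     selectorType="fpr")
b = resolve_selector(selectorType="fpr", alpha=0.01, percentile=0.2,
                     numTopFeatures=100)
print(a == b)
```

In this sketch the unused thresholds are simply ignored, so a user who fills in every argument from the API docs still gets a result determined by selectorType alone.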

yanboliang · 2016-09-25T15:40:31Z

You can also look at all the other Estimators in ML: even if you swap the order in which the params are set, you still get the same model. Thanks.

mpjlu · 2016-09-25T15:54:59Z

Thanks, this looks good to me.

mpjlu · 2016-09-25T16:29:33Z

Hi @srowen, sorry for forgetting to update the doc and python/ml/feature.py in the last PR.
This PR has added the ml/feature.py changes. It looks good to me.
Thanks

srowen · 2016-09-26T08:45:37Z

Merged to master

Refactor of ChiSqSelector and add ML Python API.

8d1536a

mpjlu reviewed Sep 23, 2016

Add more test cases.

16347f4

srowen mentioned this pull request Sep 25, 2016

[SPARK-17017][ML][MLLIB][ML][DOC] Updated the ml/mllib feature selection docs for ChiSqSelector #15236

Closed

asfgit closed this in ac65139 Sep 26, 2016

yanboliang deleted the spark-17017 branch September 26, 2016 10:28


4 participants
