Skip to content

Conversation

@yanboliang
Copy link
Contributor

@yanboliang yanboliang commented Sep 23, 2016

What changes were proposed in this pull request?

#14597 modified ChiSqSelector to support fpr type selector, however, it left some issue need to be addressed:

  • We should allow users to set selector type explicitly rather than switching them by using different setting function, since the setting order will involves some unexpected issue. For example, if users both set numTopFeatures and percentile, it will train kbest or percentile model based on the order of setting (the latter setting one will be trained). This make users confused, and we should allow users to set selector type explicitly. We handle similar issues at other place of ML code base such as GeneralizedLinearRegression and LogisticRegression.
  • Meanwhile, if there are more than one parameter except alpha can be set for fpr model, we can not handle it elegantly in the existing framework. And similar issues for kbest and percentile model. Setting selector type explicitly can solve this issue also.
  • If setting selector type explicitly by users is allowed, we should handle param interaction such as if users set selectorType = percentile and alpha = 0.1, we should notify users the parameter alpha will take no effect. We should handle complex parameter interaction checks at transformSchema. (FYI [SPARK-13761] [ML] Deprecate validateParams #11620)
  • We should use lower case of the selector type names to follow MLlib convention.
  • Add ML Python API.

How was this patch tested?

Unit test.

@srowen
Copy link
Member

srowen commented Sep 23, 2016

Oh I see. I trust your judgment on this, just wish we could have gotten your review on the original PR. @mpjlu what do you think?

@SparkQA
Copy link

SparkQA commented Sep 23, 2016

Test build #65824 has finished for PR 15214 at commit 8d1536a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Copy link

mpjlu commented Sep 23, 2016

Hi @srowen and @yanboliang ; Thanks for your following up PR.
I partly agree with your comments on 17017.
1. "if users both set numTopFeatures and percentile, it will train kbest or percentile model based on the order of setting (the latter setting one will be trained). This make users confused, and we should allow users to set selector type explicitly."
For the user confused you mentioned here, I think the main reason is function name. I have changed the function name of setAlpha to setFPR in SPARK-17645. setNumTopFeature should be setKBest.
By this change, it can be much clear.
For example, setKBest(100), setPercentile(0.1), setFPR(0.05). The selection type and parameters is very clear by one function.
But for your method, user have to strike "setSelectorType("KBest").setNumTopFeatures(100)" to do the same thing as "setKBest(100)"
2. "if there are more than one parameter except alpha can be set for fpr model, we can not handle it elegantly in the existing framework. And similar issues for kbest and percentile model. "
I cannot think out any other parameters for fpr, kbest, percentile now. But if there is, I think it is just the same thing as your method. for example, setKBest(100).setOther(),,
I agree with you for other change. Thanks very much.

case ChiSqSelectorType.KBest =>
val selector = new feature.ChiSqSelector()
$(selectorType) match {
case OldChiSqSelector.KBest =>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to set SelectorType here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, updated.

case OldChiSqSelector.Percentile =>
selector.setPercentile($(percentile))
case ChiSqSelectorType.FPR =>
case OldChiSqSelector.FPR =>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

}

@Since("1.6.0")
override def transformSchema(schema: StructType): StructType = {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

== or != ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, typo.

LabeledPoint(1.0, Vectors.dense(Array(4.0))),
LabeledPoint(2.0, Vectors.dense(Array(9.0))))
val model = new ChiSqSelector().setAlpha(0.1).fit(labeledDiscreteData)
val model = new ChiSqSelector().setSelectorType("fpr").setAlpha(0.1).fit(labeledDiscreteData)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added ML test case.

@yanboliang
Copy link
Contributor Author

@mpjlu The most important cause of this change is that the fit/train model should not dependent on the order of users setting params. In other words, users should get the same model whether set A following B or B following A. Thanks!

@SparkQA
Copy link

SparkQA commented Sep 24, 2016

Test build #65864 has finished for PR 15214 at commit 16347f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Copy link

mpjlu commented Sep 24, 2016

Hi @yanboliang , got it. Thanks.

@srowen
Copy link
Member

srowen commented Sep 25, 2016

I'm OK with it. @mpjlu sounds like you approve?

@mpjlu
Copy link

mpjlu commented Sep 25, 2016

hi @srowen .
My understand of yanbo's comments here is,
if user use chSqSelector like this:
model1 = new ChiSqSelector().setFPR(0.05).setKBest(100).fit(data)
model2 = new ChiSqSelector().setKBest(100).setFPR(0.05).fit(data)
model1 will be different with model2. so the model is dependent on the order of users setting params.
Actually, user should not use ChiSqSelector like this. One just need to set one SelectorType/Parameter is ok. But if one don't know ChiSqSelector, he may do like this. So yanbo think this is a problem.

In this PR, setFPR(0.05) is split to two functions: setSelectorType("fpr").setAlpha(0.05). This maybe clear to the user.
By the principle of software development: one function do one thing, I am ok with this change.
But from user experience, I like the spark-17017 method.

@srowen
Copy link
Member

srowen commented Sep 25, 2016

OK, I could also support either behavior. After all, for any component, .setFoo(x).setFoo(y) also creates a different model if the order is swapped, so I am not so clear that's a 'problem'.

@yanboliang
Copy link
Contributor Author

yanboliang commented Sep 25, 2016

@srowen @mpjlu
Another important reason for this change: it's error prone for Python ML API.

def __init__(self, numTopFeatures=50, featuresCol="features", outputCol=None, labelCol="label", selectorType="kbest", percentile=0.1, alpha=0.05):
......
def setParams(self, numTopFeatures=50, featuresCol="features", outputCol=None, labelCol="labels", selectorType="kbest", percentile=0.1, alpha=0.05):
......

If users are not very familiar with ChiSqSelector, they are likely to set all parameters following the API docs. The output model is also relevant with the arguments order. Users not aware of the order of arguments for the Python API is a very strong possibility. Thanks.

@yanboliang
Copy link
Contributor Author

And you can also refer all other Estimator in ML, even you swap the arguments setting order, you still get the same model. Thanks.

@mpjlu
Copy link

mpjlu commented Sep 25, 2016

Thanks, this looks good to me.

@mpjlu
Copy link

mpjlu commented Sep 25, 2016

Hi @srowen , sorry for forgetting update the doc and python/ml/feature.py in last PR.
This pr has added ml/feature.py. It looks good to me.
Thanks

@srowen
Copy link
Member

srowen commented Sep 26, 2016

Merged to master

@asfgit asfgit closed this in ac65139 Sep 26, 2016
@yanboliang yanboliang deleted the spark-17017 branch September 26, 2016 10:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants