
Conversation

mpjlu commented Aug 2, 2016

What changes were proposed in this pull request?

Currently, there is only a numTopFeatures Param in ChiSquareSelector. In practice, it is often more convenient to specify the selection as a percentage of the features.

This PR adds a percentage Param to ChiSquareSelector.
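
For illustration, a minimal sketch of how the proposed Param might be used from the ml API. The `setPercentile` setter below is hypothetical (it names what this PR proposes, not an existing method), and `df` is assumed to be a DataFrame with "features" and "label" columns:

```scala
import org.apache.spark.ml.feature.ChiSqSelector

// Hypothetical setter for the proposed percentage Param: keep the top 10% of
// features ranked by the chi-squared test, rather than an absolute count.
val selector = new ChiSqSelector()
  .setPercentile(0.1)            // hypothetical; not in the current API
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val model = selector.fit(df)     // df: DataFrame with "features" and "label" columns
```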

How was this patch tested?

Added a Scala unit test.

@AmplabJenkins

Can one of the admins verify this patch?

srowen commented Aug 2, 2016

I'm not sure it's worth a whole other API method for this. If you want to select 10% of features, you can trivially ask for 0.1 * numFeatures features from the selector.
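
For example, with the existing ml API the 10% case is just (a minimal sketch; `df` is assumed to be a DataFrame with "features" and "label" columns, and `numFeatures` the dimensionality of the feature vectors):

```scala
import org.apache.spark.ml.feature.ChiSqSelector

// Select the top 10% of features by converting the fraction to an absolute count.
val numToSelect = math.max(1, (0.1 * numFeatures).toInt)

val selector = new ChiSqSelector()
  .setNumTopFeatures(numToSelect)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val selected = selector.fit(df).transform(df)
```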

mpjlu commented Aug 3, 2016

Hi @srowen, thanks for your comment.
I agree that a user can compute the number of features without a percentage method, but from a user-experience standpoint the percentage form is sometimes nicer.
In scikit-learn, there are both SelectKBest and SelectPercentile APIs: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection

mpjlu commented Aug 4, 2016

Hi @srowen, I also plan to submit some PRs for feature selection methods based on univariate statistical tests, like those in scikit-learn: SelectFpr (false positive rate), SelectFdr (false discovery rate), and SelectFwe (family-wise error); a rough sketch of the FPR idea follows below.
http://scikit-learn.org/dev/modules/feature_selection.html
Currently, ChiSqSelector in Spark can only select the top numTopFeatures features. Do you think Spark should support other feature selection methods?
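
To illustrate the FPR idea on top of what Spark already exposes, here is a rough sketch using the mllib chi-squared test; the `selectByFpr` helper and the `alpha` threshold are hypothetical, not part of Spark:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Hypothetical helper: keep features whose chi-squared p-value against the label
// is below alpha, i.e. SelectFpr-style selection by false positive rate.
def selectByFpr(data: RDD[LabeledPoint], alpha: Double = 0.05): Array[Int] = {
  // Statistics.chiSqTest returns one ChiSqTestResult per feature, in feature order.
  val results = Statistics.chiSqTest(data)
  results.zipWithIndex
    .collect { case (result, featureIndex) if result.pValue < alpha => featureIndex }
}
```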

@hqzizania

A percentage would be a useful addition to ChiSquareSelector; it is a common and intuitive parameter for data scientists and statisticians, as scikit-learn shows, but it may indeed not be worth a whole separate API in MLlib. @srowen, I suppose it could be implemented by adding a new Param to ML.ChiSqSelector?

MLnick commented Aug 5, 2016

If anything it could be a Param in the ml version (that can perhaps be discussed on the JIRA). But I'm pretty ambivalent about even that since, as Sean says, it's easy enough to just specify the count directly. Overall I just don't think it's worth it.

MLnick commented Aug 5, 2016

As for other feature selection methods, feel free to create a JIRA to discuss. Some work has been done outside of Spark in packages - e.g. https://github.com/sramirez/spark-infotheoretic-feature-selection. Generally I think that is a good place for this kind of work to start - it doesn't necessarily have to be in Spark itself. If usage and performance are high, it can always be considered for inclusion later on.

srowen commented Aug 14, 2016

I'd like to close this in favor of the changes in #14597 because I think it would actually lead towards making this functionality trivial to expose from the model class.

srowen added a commit to srowen/spark that referenced this pull request on Aug 27, 2016
asfgit closed this in 1a48c00 on Aug 29, 2016