
Conversation

mpjlu commented Aug 2, 2016

What changes were proposed in this pull request?

Currently, there is only a numTopFeatures Param in ChiSquareSelector. In practice, it is often more convenient to specify the selection as a percentage of the features.

This PR adds a percentage Param to ChiSquareSelector.
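
For illustration, a minimal sketch of how the proposed Param might be used from the ml API. The `setPercentile` setter below is hypothetical (it names what this PR proposes, not an existing method), and `df` is assumed to be a DataFrame with "features" and "label" columns:

```scala
import org.apache.spark.ml.feature.ChiSqSelector

// Hypothetical setter for the proposed percentage Param: keep the top 10% of
// features ranked by the chi-squared test, rather than an absolute count.
val selector = new ChiSqSelector()
  .setPercentile(0.1)            // hypothetical; not in the current API
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val model = selector.fit(df)     // df: DataFrame with "features" and "label" columns
```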

How was this patch tested?

Added a Scala unit test.

@AmplabJenkins

Can one of the admins verify this patch?

srowen commented Aug 2, 2016

I'm not sure it's worth a whole other API method for this. If you want to select 10% of features, you can trivially ask for 0.1 * numFeatures features from the selector.
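
For example, with the existing ml API the 10% case is just (a minimal sketch; `df` is assumed to be a DataFrame with "features" and "label" columns, and `numFeatures` the dimensionality of the feature vectors):

```scala
import org.apache.spark.ml.feature.ChiSqSelector

// Select the top 10% of features by converting the fraction to an absolute count.
val numToSelect = math.max(1, (0.1 * numFeatures).toInt)

val selector = new ChiSqSelector()
  .setNumTopFeatures(numToSelect)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val selected = selector.fit(df).transform(df)
```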

mpjlu commented Aug 3, 2016

Hi @srowen, thanks for your comment.
I agree that a user can compute the number of features without a percentage method, but from a user-experience standpoint the percentage form is sometimes nicer.
In scikit-learn, there are both SelectKBest and SelectPercentile APIs: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection

mpjlu commented Aug 4, 2016

Hi @srowen, I also plan to submit some PRs for feature selection methods based on univariate statistical tests, like those in scikit-learn: SelectFpr (false positive rate), SelectFdr (false discovery rate), and SelectFwe (family-wise error); a rough sketch of the FPR idea follows below.
http://scikit-learn.org/dev/modules/feature_selection.html
Currently, ChiSqSelector in Spark can only select the top numTopFeatures features. Do you think Spark should support other feature selection methods?
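
To illustrate the FPR idea on top of what Spark already exposes, here is a rough sketch using the mllib chi-squared test; the `selectByFpr` helper and the `alpha` threshold are hypothetical, not part of Spark:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Hypothetical helper: keep features whose chi-squared p-value against the label
// is below alpha, i.e. SelectFpr-style selection by false positive rate.
def selectByFpr(data: RDD[LabeledPoint], alpha: Double = 0.05): Array[Int] = {
  // Statistics.chiSqTest returns one ChiSqTestResult per feature, in feature order.
  val results = Statistics.chiSqTest(data)
  results.zipWithIndex
    .collect { case (result, featureIndex) if result.pValue < alpha => featureIndex }
}
```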

@hqzizania

A percentage would be a useful addition to ChiSquareSelector; it is a common and intuitive parameter for data scientists and statisticians, as scikit-learn shows, but it may indeed not be worth a whole separate API in MLlib. @srowen, I suppose it could be implemented by adding a new Param to ML.ChiSqSelector?

MLnick commented Aug 5, 2016

If anything it could be a Param in the ml version (that can perhaps be discussed on the JIRA). But I'm pretty ambivalent about even that since, as Sean says, it's easy enough to just specify the count directly. Overall I just don't think it's worth it.

MLnick commented Aug 5, 2016

As for other feature selection methods, feel free to create a JIRA to discuss. Some work has been done outside of Spark in packages - e.g. https://github.com/sramirez/spark-infotheoretic-feature-selection. Generally I think that is a good place for this kind of work to start - it doesn't necessarily have to be in Spark itself. If usage and performance are high, it can always be considered for inclusion later on.

srowen commented Aug 14, 2016

I'd like to close this in favor of the changes in #14597 because I think it would actually lead towards making this functionality trivial to expose from the model class.

srowen added a commit to srowen/spark that referenced this pull request on Aug 27, 2016
asfgit closed this in 1a48c00 on Aug 29, 2016