@@ -1331,10 +1331,16 @@ for more details on the API.
 ## ChiSqSelector
 
 `ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with
-categorical features. ChiSqSelector orders features based on a
-[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test)
-from the class, and then filters (selects) the top features which the class label depends on the
-most. This is akin to yielding the features with the most predictive power.
+categorical features. ChiSqSelector uses the
+[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
+* `FPR` chooses all features whose false positive rate meets some threshold.
+
+By default, the selection method is `KBest` and the number of top features is 50. Users can use
+`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
 
 **Examples**
 
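The three selection methods described in this hunk can be sketched in plain Python, independent of Spark. This is only an illustration under stated assumptions: each feature is taken to already have a p-value from its chi-squared test, features are ranked by ascending p-value, and `FPR` is read as "keep every feature whose p-value is below `alpha`". The function names are hypothetical and are not part of the Spark API.

```python
def select_k_best(p_values, k):
    """Indices of the k features with the smallest p-values, in index order."""
    ranked = sorted(range(len(p_values)), key=lambda i: p_values[i])
    return sorted(ranked[:k])


def select_percentile(p_values, fraction):
    """Like select_k_best, but k is a fraction of the total feature count."""
    k = int(len(p_values) * fraction)
    return select_k_best(p_values, k)


def select_fpr(p_values, alpha):
    """Indices of every feature whose p-value is below alpha (assumed reading of FPR)."""
    return [i for i, p in enumerate(p_values) if p < alpha]


# Hypothetical p-values for six features.
p_values = [0.20, 0.01, 0.04, 0.50, 0.03, 0.90]

print(select_k_best(p_values, 1))        # [1]
print(select_percentile(p_values, 0.5))  # [1, 2, 4]
print(select_fpr(p_values, 0.035))       # [1, 4]
```

Note that `KBest` and `Percentile` always return a fixed number of features, while `FPR` may return none or all of them depending on the threshold.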
@@ -225,10 +225,16 @@ features for use in model construction. It reduces the size of the feature space
 both speed and statistical learning behavior.
 
 [`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements
-Chi-Squared feature selection. It operates on labeled data with categorical features.
-`ChiSqSelector` orders features based on a Chi-Squared test of independence from the class,
-and then filters (selects) the top features which the class label depends on the most.
-This is akin to yielding the features with the most predictive power.
+Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the
+[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
+* `FPR` chooses all features whose false positive rate meets some threshold.
+
+By default, the selection method is `KBest` and the number of top features is 50. Users can use
+`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
 
 The number of features to select can be tuned using a held-out validation set.
 
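For readers unfamiliar with the test that the selector relies on, the following is a minimal sketch of the Pearson chi-squared statistic for one categorical feature against the class label, written in plain Python. It is not Spark's implementation; the function name and the toy data are hypothetical. A larger statistic indicates stronger dependence between feature and label.

```python
from collections import Counter


def chi_squared_statistic(feature, labels):
    """Pearson chi-squared statistic for the feature/label contingency table."""
    n = len(feature)
    observed = Counter(zip(feature, labels))  # joint counts per (value, class)
    feature_totals = Counter(feature)         # row totals
    label_totals = Counter(labels)            # column totals
    stat = 0.0
    for f, f_count in feature_totals.items():
        for c, c_count in label_totals.items():
            # Expected count if feature and label were independent.
            expected = f_count * c_count / n
            stat += (observed[(f, c)] - expected) ** 2 / expected
    return stat


labels = [0, 0, 1, 1]
dependent = ["a", "a", "b", "b"]    # perfectly tracks the label
independent = ["a", "b", "a", "b"]  # carries no information about the label

print(chi_squared_statistic(dependent, labels))    # 4.0
print(chi_squared_statistic(independent, labels))  # 0.0
```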