[SPARK-22521][ML] VectorIndexerModel support handle unseen categories via handleInvalid: Python API #19753

WeichenXu123 · 2017-11-15T06:06:23Z

What changes were proposed in this pull request?

Add a Python API so that VectorIndexerModel can handle unseen categories via handleInvalid.

How was this patch tested?

Added a doctest.

SparkQA · 2017-11-15T06:09:34Z

Test build #83881 has finished for PR 19753 at commit 108ce2b.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid, JavaMLReadable,

SparkQA · 2017-11-15T08:05:02Z

Test build #83884 has finished for PR 19753 at commit f684cd0.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

smurching · 2017-11-15T19:29:56Z

Looking at this now, thanks @WeichenXu123!

smurching

Just one question, do we want to add a Python setter/getter for handleInvalid? Otherwise this LGTM.

WeichenXu123 · 2017-11-15T23:03:52Z

@smurching The getter/setter are inherited from the superclass HasHandleInvalid. I can add a test for them.

SparkQA · 2017-11-16T00:39:47Z

Test build #83917 has finished for PR 19753 at commit 311da8a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

smurching

Got it, thanks for clarifying :)
Just had a few more thoughts, nice work!

smurching · 2017-11-16T01:11:27Z

python/pyspark/ml/feature.py

    @keyword_only
    @since("1.4.0")
-    def setParams(self, maxCategories=20, inputCol=None, outputCol=None):
+    def setParams(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"):


Another Q: I see there's a pattern of setParams using None as the default value for all/most of its arguments in other featurizers. Perhaps we should do the same here (i.e. have a default argument of handleInvalid=None)? IMO specifying the default parameter value in one place is preferable to duplicating it.

The same goes for the constructor (IMO we should default to handleInvalid=None there too), but open to hearing your thoughts.

Unfortunately, I don't think that is right. inputCol=None means that there is no default value: if the user does not specify inputCol, an exception will be thrown.
Duplicating default params is an issue, but it already exists in all the pyspark.ml estimators/models.
E.g., you can check StringIndexer in pyspark; it also has a handleInvalid param.

You can also check the Params._set method in pyspark; you will find that it skips input params whose value is None.

Thanks for the explanation, that makes sense!
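
The distinction drawn in this thread (None meaning "not supplied", versus a real default such as "error") can be sketched with a toy class. This is only an illustration of the behaviour as described in the reply above: `ToyParams` and its methods are hypothetical names, it is not PySpark's `Params` implementation, and the real `Params._set` should be checked in the PySpark source.

```python
class ToyParams:
    """Hypothetical stand-in for a params holder; NOT PySpark's Params class."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)  # values used when the user set nothing
        self._values = {}                # values explicitly supplied by the user

    def set(self, **kwargs):
        # Assumption taken from the reply above: None means "not supplied",
        # so such arguments are skipped instead of being stored.
        for name, value in kwargs.items():
            if value is not None:
                self._values[name] = value
        return self

    def get(self, name):
        if name in self._values:
            return self._values[name]
        if name in self._defaults:
            return self._defaults[name]
        # No user value and no default: the param is simply undefined.
        raise KeyError("param %r has no value and no default" % name)


# maxCategories and handleInvalid have defaults; inputCol/outputCol do not.
params = ToyParams(defaults={"maxCategories": 20, "handleInvalid": "error"})
params.set(inputCol=None, outputCol="indexed", handleInvalid="keep")
```

In this sketch `params.get("maxCategories")` falls back to the default, `params.get("handleInvalid")` returns the user-supplied value, and `params.get("inputCol")` raises because nothing was ever set for it.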

smurching · 2017-11-16T03:57:00Z

This LGTM, @jkbradley would you be able to give this a look?

jkbradley · 2017-11-21T02:37:59Z

I'll try to take a look but am pretty swamped currently. CC @yanboliang @MLnick @dbtsai @holdenk might you have time?

viirya · 2017-11-21T03:40:01Z

python/pyspark/ml/feature.py

+                    JavaMLWritable):
    """
    Class for indexing categorical feature columns in a dataset of `Vector`.



There is a TODO in the doc of VectorIndexer: "Add option for allowing unknown categories." I think we can remove it?

viirya · 2017-11-21T03:48:14Z

mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala

    "How to handle invalid data (unseen labels or NULL values). " +
    "Options are 'skip' (filter out rows with invalid data), 'error' (throw an error), " +
-    "or 'keep' (put invalid data in a special additional bucket, at index numLabels).",
+    "or 'keep' (put invalid data in a special additional bucket, at index numCategories).",


Could users confuse numCategories with a defined constant? How about a more verbose wording: "at the index equal to the number of categories of the feature"?
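
For readers of this thread, the three options in the param doc can be sketched in plain Python. This is a simplified illustration of the documented semantics for a single categorical feature, not the Scala implementation; `index_category` and its arguments are hypothetical names introduced here.

```python
def index_category(category_to_index, value, handle_invalid="error"):
    """Map a raw category value to its index for one feature.

    category_to_index: dict built at fit time, e.g. {0.0: 0, 1.0: 1, 5.0: 2}.
    Returns the index, or None when the row should be dropped ('skip').
    """
    if value in category_to_index:
        return category_to_index[value]
    if handle_invalid == "keep":
        # Unseen values share one extra bucket whose index equals the
        # number of categories seen for this feature at fit time.
        return len(category_to_index)
    if handle_invalid == "skip":
        # Signal to the caller that this row should be filtered out.
        return None
    if handle_invalid == "error":
        raise ValueError("unseen category value: %r" % (value,))
    raise ValueError("unsupported handle_invalid option: %r" % (handle_invalid,))


seen = {0.0: 0, 1.0: 1, 5.0: 2}
known_index = index_category(seen, 5.0)                      # seen at fit time
kept_index = index_category(seen, 9.0, handle_invalid="keep")
skipped = index_category(seen, 9.0, handle_invalid="skip")
```

With three categories seen at fit time, the 'keep' bucket in this sketch lands at index 3, which is what "at index numCategories" in the doc string refers to.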

viirya · 2017-11-21T03:58:37Z

LGTM with two minor comments.

holdenk · 2017-11-21T10:34:30Z

I can take a look tomorrow, been traveling but just got back.

WeichenXu123 · 2017-11-21T12:06:59Z

Thanks @holdenk

SparkQA · 2017-11-21T13:04:36Z

Test build #84067 has finished for PR 19753 at commit c302d59.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk

LGTM

init pr

108ce2b

fix_py_style

f684cd0

smurching reviewed Nov 15, 2017

add test for getter setter

311da8a

smurching reviewed Nov 16, 2017

viirya reviewed Nov 21, 2017

address minor issues

c302d59

holdenk approved these changes Nov 21, 2017

asfgit closed this in 2d868d9 Nov 21, 2017

WeichenXu123 deleted the vector_indexer_invalid_py branch April 24, 2019 21:18

6 participants
