[SPARK-22521][ML] VectorIndexerModel support handle unseen categories via handleInvalid: Python API #19753
Test build #83881 has finished for PR 19753 at commit
Test build #83884 has finished for PR 19753 at commit
Looking at this now, thanks @WeichenXu123!
smurching
left a comment
Just one question, do we want to add a Python setter/getter for handleInvalid? Otherwise this LGTM.
@smurching The getter/setter is included in the super class.
Test build #83917 has finished for PR 19753 at commit
smurching
left a comment
Got it, thanks for clarifying :)
Just had a few more thoughts, nice work!
  @keyword_only
  @since("1.4.0")
- def setParams(self, maxCategories=20, inputCol=None, outputCol=None):
+ def setParams(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"):
Another Q: I see there's a pattern of setParams using None as a default value for all/most of its arguments in other featurizers; perhaps we should do the same (i.e. have a default argument of handleInvalid=None here)? IMO specifying the default parameter value in one place is preferable to duplicating it.
The same goes for the constructor (IMO we should default to handleInvalid=None there too), but open to hearing your thoughts.
Ah, but unfortunately I think you're wrong. inputCol=None means that if the user does not specify inputCol, there is no default value and an exception will be thrown.
Duplicating default params is an issue, but it already exists in all the pyspark.ml estimators/models.
E.g., you can check StringIndexer in pyspark; it also has a handleInvalid param.
You can also check the Params._set method in pyspark; you will find that it skips input params whose value is None.
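To make the point about None defaults concrete, here is a minimal toy sketch of the behavior described in this thread. It is not pyspark's actual code: `ToyParams`, its `_defaults` dict, and its `get` method are hypothetical names invented for illustration, and the skip-on-None rule is taken from the comment above rather than checked against the pyspark source.

```python
class ToyParams:
    """Hypothetical stand-in for a params holder; NOT pyspark's Params class."""

    def __init__(self):
        # Defaults live in one place; inputCol deliberately has no default.
        self._defaults = {"maxCategories": 20, "handleInvalid": "error"}
        self._user_set = {}

    def _set(self, **kwargs):
        # Assumed behavior (per the comment above): a value of None means
        # "not specified", so it is skipped rather than stored.
        for name, value in kwargs.items():
            if value is None:
                continue
            self._user_set[name] = value
        return self

    def get(self, name):
        # A user-supplied value wins, then the default, otherwise an error.
        if name in self._user_set:
            return self._user_set[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError("param %r is not set and has no default" % name)


params = ToyParams()._set(inputCol=None, maxCategories=5, handleInvalid=None)
print(params.get("maxCategories"))   # user-supplied value
print(params.get("handleInvalid"))   # falls back to the default
```

Under that assumption, None in a signature such as inputCol=None acts as a "no value given" marker rather than a default, which is why asking for inputCol in this toy raises an error.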
Thanks for the explanation, that makes sense!
This LGTM, @jkbradley would you be able to give this a look?
I'll try to take a look but am pretty swamped currently. CC @yanboliang @MLnick @dbtsai @holdenk might you have time?
  JavaMLWritable):
      """
      Class for indexing categorical feature columns in a dataset of `Vector`.
There is a TODO in the doc of VectorIndexer: "Add option for allowing unknown categories." I think we can remove it?
  "How to handle invalid data (unseen labels or NULL values). " +
  "Options are 'skip' (filter out rows with invalid data), 'error' (throw an error), " +
- "or 'keep' (put invalid data in a special additional bucket, at index numLabels).",
+ "or 'keep' (put invalid data in a special additional bucket, at index numCategories).",
Could users confuse numCategories with a defined constant? How about a more verbose wording: "at index of the number of categories of the feature"?
LGTM with two minor comments.
I can take a look tomorrow, been traveling but just got back.
Thanks @holdenk
Test build #84067 has finished for PR 19753 at commit
holdenk
left a comment
LGTM
What changes were proposed in this pull request?
Add a Python API for VectorIndexerModel's support for handling unseen categories via handleInvalid.
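As a rough illustration of what the three handleInvalid options mean for a single categorical feature, the following is a plain-Python sketch based only on the param doc quoted in the review ('skip', 'error', 'keep' with the extra bucket at index numCategories). It does not use Spark; `index_feature` and `category_map` are hypothetical names, and real VectorIndexer behavior on whole rows and vectors is more involved.

```python
def index_feature(values, category_map, handle_invalid="error"):
    """Toy model of indexing one categorical feature.

    values: raw feature values, one per row.
    category_map: dict mapping each value seen during fitting to its index.
    handle_invalid: 'error', 'skip', or 'keep', as in the param doc.
    """
    if handle_invalid not in ("error", "skip", "keep"):
        raise ValueError("unknown handle_invalid option: %r" % handle_invalid)

    num_categories = len(category_map)
    indexed = []
    for value in values:
        if value in category_map:
            indexed.append(category_map[value])
        elif handle_invalid == "keep":
            # Unseen values share one extra bucket at index numCategories.
            indexed.append(num_categories)
        elif handle_invalid == "skip":
            # Drop the row that carries the unseen value.
            continue
        else:
            raise ValueError("unseen category: %r" % value)
    return indexed


fitted = {0.0: 0, 1.0: 1, 5.0: 2}   # categories seen during fitting
rows = [1.0, 7.0, 5.0]              # 7.0 was never seen

print(index_feature(rows, fitted, "keep"))
print(index_feature(rows, fitted, "skip"))
```

With "error", the same call raises as soon as it reaches the unseen value 7.0, which matches the default handleInvalid="error" in the setParams signature from the diff.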
How was this patch tested?
doctest added.