[SPARK-11569] [ML] Fix StringIndexer to handle null value properly #9920
|
@jkbradley @holdenk Could you please take a look at my latest fix and let me know if you have any further comments? Thanks! |
This no longer filters out values that are not present in labelToIndex
|
Sorry for my slow reply. Looking at this, it seems like you've changed the meaning of handleInvalid so that it no longer serves its original purpose (unless I've missed something). This is probably not the best path forward. Maybe add something like handleNulls and keep the old handleInvalid? I really like the thoroughness of the tests and I think the logic is pretty solid (it's just that changing the meaning of things in the API is to be avoided). |
|
@jliwork would you like to continue working on this? or else please close the PR |
|
Can one of the admins verify this patch? |
|
Yes, I'd like to continue working on this. But since this PR is obsolete, I will close it and open a new one instead. |
|
Great - can you ping me on your new PR when it's ready? :) |
|
@holdenk Definitely :-) Thanks! |
## What changes were proposed in this pull request?

This PR enhances StringIndexer with NULL value handling. Before this PR, StringIndexer threw an exception when it encountered NULL values. With this PR:

- handleInvalid=error: Throw an exception as before
- handleInvalid=skip: Skip null values as well as unseen labels
- handleInvalid=keep: Give null values an additional index as well as unseen labels

BTW, I noticed someone was trying to solve the same problem (apache#9920), but it seems to have had no progress or response for a long time. Would you mind giving me a chance to solve it? I'm eager to help. :-)

## How was this patch tested?

new unit tests

Author: Menglong TAN <[email protected]>
Author: Menglong TAN <[email protected]>

Closes apache#17233 from crackcell/11569_StringIndexer_NULL.
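
To make the three handleInvalid modes listed in the commit message above concrete, here is a minimal plain-Scala sketch of the described semantics. It is not Spark's implementation: `HandleInvalidSketch` and `indexLabels` are hypothetical names, and the sketch assumes that in keep mode nulls and unseen labels share a single extra index placed after the known labels.

```scala
object HandleInvalidSketch {
  // Hypothetical helper, not Spark code: maps labels to indices under one
  // of the three handleInvalid modes ("error", "skip", "keep").
  def indexLabels(
      labels: Seq[String],
      labelToIndex: Map[String, Double],
      handleInvalid: String): Seq[Double] = {
    // Assumption: in "keep" mode, nulls and unseen labels share one extra
    // index placed right after the known labels.
    val keepIndex = labelToIndex.size.toDouble
    labels.flatMap { label =>
      // Option(null) is None, so a null label takes the same path as an
      // unseen label.
      Option(label).flatMap(labelToIndex.get) match {
        case Some(index) => Some(index)
        case None =>
          handleInvalid match {
            case "skip" => None // drop the value
            case "keep" => Some(keepIndex) // map to the extra index
            case "error" =>
              throw new IllegalArgumentException(s"Invalid label: $label")
            case other =>
              throw new IllegalArgumentException(s"Unknown handleInvalid mode: $other")
          }
      }
    }
  }
}
```

For example, with the known labels `Map("a" -> 0.0, "b" -> 1.0)` and the input `Seq("a", null, "c", "b")`, skip mode drops the null and the unseen "c", keep mode maps both to the extra index, and error mode throws on the first invalid value.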
I was having some problems rebasing #9709, so I had to close that PR and create a new pull request with my latest fix.
Thanks to @jkbradley and @holdenk for your comments. I have updated my fix so that users can configure StringIndexer either to filter out null values, via StringIndexer.setHandleInvalid("skip"), or to throw an error. The default is StringIndexer.setHandleInvalid("error").
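
As a rough illustration of the configuration described above, the following sketch sets handleInvalid to "skip" on a StringIndexer. It assumes a Spark build in which StringIndexer handles null values as discussed in this thread, plus a local SparkSession; the object name, column names, and sample rows are made up for the example.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.{DataFrame, SparkSession}

object SkipNullsExample {
  // Indexes the (hypothetical) "category" column. With "skip", rows whose
  // label is invalid are filtered out instead of raising an error.
  def indexSkippingInvalid(df: DataFrame): DataFrame = {
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .setHandleInvalid("skip") // the default is "error"
    indexer.fit(df).transform(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("skip-nulls-example")
      .getOrCreate()
    import spark.implicits._

    // Sample data with one null label (illustrative only).
    val df = Seq[(Int, String)]((0, "a"), (1, null), (2, "b"), (3, "a"))
      .toDF("id", "category")

    indexSkippingInvalid(df).show()
    spark.stop()
  }
}
```

Leaving the setting at "error" is the safer default when null labels indicate a data-quality problem that should stop the pipeline rather than be dropped silently.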
Please let me know what you think. Thanks again!