[SPARK-8764][ML] string indexer should take option to handle unseen values #7266
|
Test build #36715 has finished for PR 7266 at commit
|
remove newline
I don't think the two different exceptions add any new information (the user already knows if the model is configured to skip invalid or not). Also, this branch should never execute because of L152. Perhaps we can collapse this into a single exception?
|
Test build #36848 has finished for PR 7266 at commit
|
|
Test build #36964 has finished for PR 7266 at commit
|
|
LGTM |
|
cc @jkbradley for review, since you created the JIRA. |
|
Test build #37544 has finished for PR 7266 at commit
|
|
jenkins, retest this please. |
|
@jkbradley, if you have a chance to look at this PR too, it's in the same class/file as the last one. |
|
Test build #186 has finished for PR 7266 at commit
|
|
Test build #39348 has finished for PR 7266 at commit
|
|
I was taking another look at it, and I like the setup. But one thing I had not thought of was a third option: creating a new label/index which all unseen values are mapped to. That would let users avoid filtering out rows, and instead replace the bad/unseen values with a default value. Rather than putting that in this PR, could you modify the Param to be a String, so that we can specify other options in the future? |
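To illustrate the suggestion above, here is a minimal plain-Python sketch of a string-valued param restricted to a fixed set of options. It is not Spark's Param API: the class `HandleInvalidParam`, its methods, and the third option name `"keep"` are hypothetical, chosen only to show why a String leaves room for later strategies where a Boolean would not.

```python
class HandleInvalidParam:
    """Hypothetical string-valued param limited to a fixed set of options."""

    def __init__(self, allowed=("skip", "error"), default="error"):
        self.allowed = tuple(allowed)
        self.value = None
        self.set(default)

    def set(self, value):
        # Reject anything outside the supported options up front.
        if value not in self.allowed:
            raise ValueError(
                f"handleInvalid must be one of {self.allowed}, got {value!r}"
            )
        self.value = value
        return self


param = HandleInvalidParam()
param.set("skip")

# A later strategy (the name "keep" is an assumption) only needs one more
# allowed string; the param's type and its callers stay unchanged.
extended = HandleInvalidParam(allowed=("skip", "error", "keep")).set("keep")
```

With a Boolean such as `skipInvalid`, a third behavior would need either a second flag or a breaking change to the param type.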
typo
|
Test build #39952 has finished for PR 7266 at commit
|
|
Test build #39957 has finished for PR 7266 at commit
|
Please make docs more explicit. It's very helpful to users. Here, it'd be nice to state the options and what they mean: "Options: skip (filter out rows with bad values) or error (throw an error). Note that skipping rows means the Transformer can output a smaller dataset."
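The two options described in that comment can be sketched in plain Python. This is an illustration of the semantics only, not the Spark implementation: `fit_label_index` and `transform` are hypothetical helper names, and the most-frequent-label-first ordering is assumed from the "so b gets 0 for sure" commit note in this PR.

```python
from collections import Counter


def fit_label_index(values):
    """Map each distinct string to an index, most frequent label first."""
    counts = Counter(values)
    return {label: i for i, (label, _) in enumerate(counts.most_common())}


def transform(values, label_index, handle_invalid="error"):
    """Index values; unseen ones are dropped ("skip") or raise ("error")."""
    if handle_invalid not in ("skip", "error"):
        raise ValueError(f"unsupported handle_invalid: {handle_invalid!r}")
    out = []
    for v in values:
        if v in label_index:
            out.append((v, label_index[v]))
        elif handle_invalid == "error":
            raise ValueError(f"Unseen label: {v}")
        # "skip": the row is filtered out, so the output can be shorter.
    return out


label_index = fit_label_index(["a", "b", "b", "c", "c", "c"])
print(transform(["a", "d", "c"], label_index, handle_invalid="skip"))
# prints [('a', 2), ('c', 0)]
```

Three input rows produce two output rows under "skip", which is the "smaller dataset" caveat the requested documentation should state.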
|
Those 2 small items are the only issues I see. |
|
Test build #40054 has finished for PR 7266 at commit
|
|
Test build #40059 has finished for PR 7266 at commit
|
|
Catalyst failure seems likely unrelated, jenkins retest this please. |
|
Test build #258 has finished for PR 7266 at commit
|
|
Test build #40068 has finished for PR 7266 at commit
|
|
LGTM, merging with master. |
…values

As a precursor to adding a public constructor add an option to handle unseen values by skipping rather than throwing an exception (default remains throwing an exception),

Author: Holden Karau <[email protected]>

Closes apache#7266 from holdenk/SPARK-8764-string-indexer-should-take-option-to-handle-unseen-values and squashes the following commits:

38a4de9 [Holden Karau] fix long line
045bf22 [Holden Karau] Add a second b entry so b gets 0 for sure
81dd312 [Holden Karau] Update the docs for handleInvalid param to be more descriptive
7f37f6e [Holden Karau] remove extra space (scala style)
414e249 [Holden Karau] And switch to using handleInvalid instead of skipInvalid
1e53f9b [Holden Karau] update the param (codegen side)
7a22215 [Holden Karau] fix typo
100a39b [Holden Karau] Merge in master
aa5b093 [Holden Karau] Since we filter we should never go down this code path if getSkipInvalid is true
75ffa69 [Holden Karau] Remove extra newline
d69ef5e [Holden Karau] Add a test
b5734be [Holden Karau] Add support for unseen labels
afecd4e [Holden Karau] Add a param to skip invalid entries.
|
For me it would be useful (when working with trees) if the string indexer mapped unseen labels to the maximum known label + 1. That way, a row would not be skipped if it contained some feature that is not used for classification. |
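The behavior requested here could look roughly like the following plain-Python sketch. It is not part of this PR: the function name `index_with_unseen_bucket` and the sample mapping are hypothetical.

```python
def index_with_unseen_bucket(values, label_index):
    """Index values, sending every unseen label to (max known index + 1)."""
    unseen_index = max(label_index.values(), default=-1) + 1
    return [label_index.get(v, unseen_index) for v in values]


# Hypothetical fitted mapping with known indices 0..2.
label_index = {"c": 0, "b": 1, "a": 2}
print(index_with_unseen_bucket(["a", "d", "e", "c"], label_index))
# prints [2, 3, 3, 0]
```

No row is dropped: both unseen values ("d" and "e") share the extra index 3.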
|
@miro-balaz: This probably isn't the best place for a new feature request, but if you head over to the ASF JIRA you can create a new ticket and cc the people who worked on this. |
|
Thank you for the directions.
On Monday, 12 September 2016, Holden Karau [email protected] wrote:
|
|
When "skip" is chosen as the way to handle unseen labels, is there a way to know which rows were skipped? |
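The thread does not answer this question. One possible approach, sketched in plain Python under the assumption that the fitted labels are available to the caller, is to partition the input by label membership before transforming. The function `split_seen_unseen` and the sample rows are hypothetical.

```python
def split_seen_unseen(rows, column, known_labels):
    """Partition rows into those "skip" would keep and those it would drop."""
    known = set(known_labels)
    kept, skipped = [], []
    for row in rows:
        (kept if row[column] in known else skipped).append(row)
    return kept, skipped


rows = [
    {"id": 0, "category": "a"},
    {"id": 1, "category": "d"},  # "d" was never seen during fitting
    {"id": 2, "category": "c"},
]
kept, skipped = split_seen_unseen(rows, "category", ["a", "b", "c"])
print([r["id"] for r in skipped])
# prints [1]
```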