-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-11569][ML] Fix StringIndexer to handle null value properly #17233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can one of the admins verify this patch? |
|
I'll take a look |
jkbradley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Just a few comments
| * Param for how to handle invalid data (unseen labels or NULL values). | ||
| * Options are 'skip' (filter out rows with invalid data), | ||
| * 'error' (throw an error), or 'keep' (put invalid data in a special additional | ||
| * bucket, at index numLabels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add ")" at end: "numLabels)."
| .setInputCol("label") | ||
| .setOutputCol("labelIndex") | ||
|
|
||
| withClue("StringIndexer should throw error when setHandleValid=error when given NULL values") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setHandleValid -> setHandleInvalid
| assert(attrSkip.values.get === Array("b", "a")) | ||
| assert(transformedSkip.select("labelIndex").rdd.map { r => | ||
| r.getDouble(0) | ||
| }.collect() === expectedSkip) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't assume that the order of Rows of a collected DataFrame/Dataset will be the same each time. Add an ID column so that you can collect things as Sets for comparison to make this robust; check out the unit test above for missing labels for an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
roger
| }.collect() === expectedKeep) | ||
| } | ||
|
|
||
| test("StringIndexer with a numeric input column with NULLs") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to test numerics separately? If there's a reason to, then can you please refactor these 2 tests to eliminate duplicated code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll remove the numeric test.
| labelToIndex(label) | ||
| } else if (keepInvalid) { | ||
| labels.length | ||
| val indexer = udf { row: Row => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to use a Row; just take a String:
val indexer = udf { label: String =>
if (label == null) {
if (keepInvalid) {
labels.length
} else {
throw new SparkException("StringIndexer encountered NULL value. To handle or skip " +
"NULLS, try setting StringIndexer.handleInvalid.")
}
} else {
if (labelToIndex.contains(label)) {
labelToIndex(label)
} else if (keepInvalid) {
labels.length
} else {
throw new SparkException(s"Unseen label: $label. To handle unseen labels, " +
s"set Param handleInvalid to ${StringIndexer.KEEP_INVALID}.")
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it
|
@jkbradley Hi, I have made some updates according to your comments, please review it again. :-) |
|
LGTM |
|
how did you solve this problem? about "getting no progress" |
What changes were proposed in this pull request?
This PR is to enhance StringIndexer with NULL values handling.
Before the PR, StringIndexer will throw an exception when encounters NULL values.
With this PR:
BTW, I noticed someone was trying to solve the same problem ( #9920 ) but seems getting no progress or response for a long time. Would you mind to give me a chance to solve it ? I'm eager to help. :-)
How was this patch tested?
new unit tests