[SPARK-17219][ML] enhanced NaN value handling in Bucketizer #15428
Conversation
Test build #66731 has finished for PR 15428 at commit
Force-pushed a3e4308 to cd8113c.
I'm neutral on the complexity this adds, but not against it. It's a little funny to say "keep invalid data", but I think we discussed that on the JIRA.
Yeah, usually we treat NaN as a type of invalid value, but we know there are cases where NaN values prove useful, so it's a bit of a dilemma here, as is the API naming.
Doesn't this need to try to handle "error"?
val filteredDataSet = getHandleInvalid match {
  case "skip" => dataset.na.drop
  case "keep" => dataset
  case "error" =>
    // Sketch: a concrete NaN check filling in the placeholders above
    if (dataset.filter(dataset($(inputCol)).isNaN).count() > 0) {
      throw new IllegalArgumentException(
        "Bucketizer encountered NaN values with handleInvalid = error")
    } else {
      dataset
    }
}
Nope. NaN will actually trigger an error later, in binarySearchForBuckets, as an invalid feature value if no special handling is done.
I don't see that the method handles NaN below. What binarySearch returns is undefined. One place or the other I think this has to be explicitly handled.
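To make the expected behaviour concrete, here is a minimal pure-Scala sketch of a bucket lookup with explicit NaN handling. The object name, the `keepNaN` flag, and the extra-bucket convention are illustrative, not Spark's actual implementation:

```scala
object BucketSearchSketch {
  // Map a feature value to a bucket index given sorted split points.
  // splits of length n define n - 1 regular buckets; when keepNaN is true,
  // NaN values go to one extra bucket at index n - 1.
  def binarySearchForBuckets(splits: Array[Double], feature: Double, keepNaN: Boolean): Double = {
    if (feature.isNaN) {
      if (keepNaN) (splits.length - 1).toDouble  // extra bucket reserved for NaN
      else throw new IllegalArgumentException("NaN feature value with handleInvalid = \"error\"")
    } else if (feature == splits.last) {
      (splits.length - 2).toDouble               // upper bound is inclusive
    } else {
      val idx = java.util.Arrays.binarySearch(splits, feature)
      (if (idx >= 0) idx else -idx - 2).toDouble // negative idx encodes the insertion point
    }
  }
}
```

With splits `Array(0.0, 0.5, 1.0)`, a value of 0.25 lands in bucket 0, and NaN (with `keepNaN = true`) in the extra bucket 2. Out-of-range values are omitted here for brevity.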
Nit: I think the convention is to leave the open paren on the previous line
Doesn't this need to handle "skip" and "error"? throw an exception on NaN if "error" or ignore it if "skip"?
The logic behind this is: if the user chooses "skip", we filter out all NaN values from the dataset up front, so no further special NaN handling is needed. If the user does not choose "skip", the original dataset is passed to the binary search that follows; there, if the user chose "keep", an extra bucket is reserved for NaN values, and otherwise an error is raised.
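That skip/keep/error flow can be sketched in plain Scala, with a `Seq` standing in for the Dataset (names and the extra-bucket index are illustrative, not Spark's internals):

```scala
// Sketch of the skip/keep/error flow; a Seq stands in for the Dataset.
def bucketizeAll(data: Seq[Double], splits: Array[Double], handleInvalid: String): Seq[Double] = {
  // "skip": filter NaN out up front so no special handling is needed later
  val filtered = if (handleInvalid == "skip") data.filterNot(_.isNaN) else data
  val nanBucket = splits.length - 1  // extra bucket reserved under "keep"
  filtered.map {
    case v if v.isNaN && handleInvalid == "keep" => nanBucket.toDouble
    case v if v.isNaN =>
      throw new IllegalArgumentException("NaN value with handleInvalid = \"error\"")
    case v if v == splits.last => (splits.length - 2).toDouble  // inclusive upper bound
    case v => splits.lastIndexWhere(_ <= v).toDouble            // linear scan for clarity
  }
}
```

Under "skip" the NaN rows simply disappear from the output; under "keep" they all land in the one extra bucket.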
Force-pushed cd8113c to 5cd58b7.
Test build #66733 has finished for PR 15428 at commit
Test build #66735 has finished for PR 15428 at commit
srowen left a comment:
I'll let @thunterdb or @jkbradley mostly review from here to see if this is what they intended
Thanks! I'll take a look. Could you please fix the typo in the title? "enchanced" -> "enhanced"
Typo corrected. Thank you all. @srowen @jkbradley
jkbradley left a comment:
Done with review. Thanks for the PR!
Other comments:
- Add unit test for DataFrameStatFunctions.approxQuantile to check for handling of NaN values.
- Create JIRA for adding handleInvalid to Python API.
fix indentation
Use Boolean flag internally to avoid the string comparison on each call.
Also, at this point, you already know the value for flag, so you can just use it here, rather than making the UDF take an extra argument.
Since this is used by StringIndexer, which will not support "keep," we should not modify HasHandleInvalid. I recommend copying out the shared param and just putting it in Bucketizer and QuantileDiscretizer, rather than trying to reuse HasHandleInvalid. That will let you specialize the documentation too so that you can be more specific.
HasHandleInvalid may not be a good shared Param yet, but perhaps it will be in the future.
docs/ml-features.md
Outdated
(same as below) Is "possible that the number of buckets used will be less than this value" true? It was true before this used Dataset.approxQuantiles, but I don't think it is true any longer.
In cases when the number of buckets requested by the user is greater than the number of distinct splits generated from Bucketizer, the returned number of buckets will be less than requested.
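For example (a sketch with made-up numbers): on heavily skewed data, several of the approximate quantiles coincide, and deduplicating them leaves fewer distinct splits, hence fewer buckets than requested:

```scala
// Skewed data: most of the mass sits at 1.0, so several quantiles coincide.
val requestedBuckets = 4
val approxQuantiles = Array(0.0, 1.0, 1.0, 1.0, 5.0) // requestedBuckets + 1 split candidates
val splits = approxQuantiles.distinct                 // duplicates collapse
val effectiveBuckets = splits.length - 1              // 2, less than the 4 requested
```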
Yep, you're right
I'd put dataset.toDF() here and not import existentials.
This documentation should go in the handleInvalid Param doc string.
Just set splits here; there's no need to set the other Params for this test.
I'd also recommend directly testing the values of the splits. The current test makes sure that handleInvalid is passed to the bucketizer correctly, which is important but separate.
Also, please use vals (not vars) for clarity. I'd recommend making a helper method for lines 87-93, which can be reused for the test of handleInvalid = "skip"
Same here; I'd also recommend directly testing the values of the splits.
python/pyspark/ml/feature.py
Outdated
Here too: no longer the case
same as the comment above.
Force-pushed 5cd58b7 to c350e9f.
Test build #67231 has finished for PR 15428 at commit
Force-pushed c350e9f to 0fb8d38.
Thanks for your valuable suggestions. @jkbradley @srowen
Force-pushed 0fb8d38 to d1dd840.
Test build #67232 has finished for PR 15428 at commit
Test build #67233 has finished for PR 15428 at commit
Ah, sorry, one more comment. I'm not quite sure how closure capture behaves here, but it would be good to define local vals for $(splits) and for getHandleNaN.isDefined && getHandleNaN.get. Since these reference methods on the Bucketizer class, I believe the UDF may capture the whole Bucketizer instance instead of just those vals.
After you define them in local vals here, you can use those vals in this UDF.
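A minimal illustration of the difference, in plain Scala closures (the class and method names are hypothetical, not Spark's):

```scala
class NaNMapper(handleNaN: String) extends Serializable {
  // Referencing the field inside the function body captures `this`,
  // so the whole NaNMapper instance rides along with the closure.
  def capturesInstance: Double => Double =
    x => if (x.isNaN && handleNaN == "keep") -1.0 else x

  // Copying into a local val first means the closure captures
  // only a Boolean, not the enclosing instance.
  def capturesLocalVal: Double => Double = {
    val keepNaN = handleNaN == "keep" // evaluated once, captured by value
    x => if (x.isNaN && keepNaN) -1.0 else x
  }
}
```

Both forms behave identically; the second just serializes a much smaller closure, which is what matters when Spark ships a UDF to executors.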
jkbradley left a comment:
Thanks for the updates! A few more comments.
Also, I see you renamed the param to handleNaN. Do you think we will ever want to handle null values too? It might be good to stick with "handleInvalid" in case we add null support.
This is kind of complex logic just to get the expected number of splits. I'd recommend just putting the expected bucket values in the original DataFrame:
val df = sc.parallelize(Seq(
(1.0, /*expected value for option "keep"*/, /*expected value for option "skip"*/),
...
)).toDF("input", "expectedKeep", "expectedSkip")
Then you can compare with the actual values. That'll be an easier test to understand IMO.
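Concretely, the suggested layout might look like this (the values are illustrative; with splits at 0.0/0.5/1.0, the NaN bucket under "keep" would be index 2):

```scala
// Each row carries its input plus the expected bucket for each option,
// so the test reduces to comparing an output column with an expected column.
val testRows: Seq[(Double, Double, Double)] = Seq(
  // (input, expectedKeep, expectedSkip)
  (0.25, 0.0, 0.0),
  (0.75, 1.0, 1.0),
  (Double.NaN, 2.0, Double.NaN) // the NaN row is dropped entirely under "skip"
)
// In the suite this would become
//   sc.parallelize(testRows).toDF("input", "expectedKeep", "expectedSkip")
// and the assertion a per-row equality check on the bucketizer's output.
```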
python/pyspark/ml/feature.py
Outdated
Actually no need to update Python API until it is updated to include handleNaN
This isn't available in Python yet, so can you please revert this change to feature.py?
Add getHandleNaN
State default
Remove "More options may be added later."
Also state default.
Remove "More options may be added later."
Force-pushed d1dd840 to 47aad24.
Test build #67386 has finished for PR 15428 at commit
Force-pushed 47aad24 to 70cee57.
This PR is an enhancement of the PR with commit ID 57dc326bd00cf0a49da971e9c573c48ae28acaa2.
We provide the user with 3 options for handling NaN values in the dataset: reserve an extra bucket for NaN values, remove the NaN values, or report an error, by setting handleInvalid to "keep", "skip", or "error" (default) respectively.
Before:
```scala
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
```
After:
```scala
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
  .setHandleNaN("skip")
```
Signed-off-by: VinceShieh <[email protected]>
Force-pushed 70cee57 to b14fbab.
Test build #67430 has finished for PR 15428 at commit
Test build #67429 has finished for PR 15428 at commit
jkbradley left a comment:
Thanks for the updates.
Also, when you push further updates, could you please just push a new commit? Rebasing makes it harder to review since it makes it impossible to connect your most recent changes with the recent reviewer comments. Thanks!
/**
 * Param for how to handle invalid entries. Options are skip (which will filter out rows with
 * invalid values), or error (which will throw an error), or keep (which will keep the invalid
 * values in certain way). Default behaviour is to report an error for invalid entries.
 */
I'd just write: default: "error"
Rewording as "report" instead of "throw" could confuse people.
/** @group getParam */
@Since("2.1.0")
def gethandleInvalid: Option[Boolean] = $(handleInvalid) match {
This should just return $(handleInvalid), just like any other Param getter method.
/**
 * Param for how to handle invalid entries. Options are skip (which will filter out rows with
 * invalid values), or error (which will throw an error), or keep (which will keep the invalid
 * values in certain way). Default behaviour is to report an error for invalid entries.
 */
I'd just write: default: "error"
Rewording as "report" instead of "throw" could confuse people.
python/pyspark/ml/feature.py
Outdated
This isn't available in Python yet, so can you please revert this change to feature.py?
Test build #67493 has finished for PR 15428 at commit
Thanks, you still need to remove the change in feature.py, but other than that, this should be ready.
Signed-off-by: VinceShieh <[email protected]>
Sorry, I must have forgotten to commit the changes.
Test build #67547 has finished for PR 15428 at commit
…naming of set/get for handleInvalid
Thanks for the update! On a last glance through, I spotted a few things to fix up. Since some had to do with wording, I thought it'd be easier just to send a PR to your PR. Can you please review this and merge it if it looks OK to you? Thanks! Here it is: https://github.com/VinceShieh/pull/2
Test build #67611 has finished for PR 15428 at commit
Thanks for merging that! This LGTM |
## What changes were proposed in this pull request?
This PR is an enhancement of the PR with commit ID 57dc326bd00cf0a49da971e9c573c48ae28acaa2.
NaN is a special type of value that is commonly treated as invalid. But we find that there are certain cases where NaN values are also valuable and thus need special handling. We provide the user with 3 options for handling NaN values: reserve an extra bucket for NaN values, remove the NaN values, or report an error, by setting handleNaN to "keep", "skip", or "error" (default) respectively.
Before:
```scala
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
```
After:
```scala
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
  .setHandleNaN("keep")
```
## How was this patch tested?
Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite
Signed-off-by: VinceShieh <vincent.xieintel.com>
Author: VinceShieh <[email protected]>
Author: Vincent Xie <[email protected]>
Author: Joseph K. Bradley <[email protected]>
Closes apache#15428 from VinceShieh/spark-17219_followup.
…r in pyspark
This PR is to document the change on QuantileDiscretizer in pyspark for PR: apache#15428
Signed-off-by: VinceShieh <[email protected]>
…r in pyspark
## What changes were proposed in this pull request?
This PR is to document the changes on QuantileDiscretizer in pyspark for PR: apache#15428
## How was this patch tested?
No test needed
Signed-off-by: VinceShieh <vincent.xieintel.com>
Author: VinceShieh <[email protected]>
Closes apache#16922 from VinceShieh/spark-19590.