[SPARK-17219][ML] Add NaN value handling in Bucketizer #14858
Index NaN values as ${splits.length - 1}.
All the other changes in this file, those without comments, are just code refactoring.
Test build #64557 has finished for PR 14858 at commit
bfb5b33 to e0f5912
Test build #64636 has finished for PR 14858 at commit
If we are to give the user the exact number of returned buckets, we need to go through the whole input dataset to check whether any NaN values exist. That computation is too expensive just for a log warning message, so I chose to give the user this message instead.
I think this is now a bit confusing, since it's reporting different things based on state that isn't logged for the user. If it's hard just say "bucketing to fewer buckets" as before at this stage.
I think we should just say "the returned number of buckets might differ from what was requested, depending on the data sample values", since the resulting number could be less than or equal to the requested number when identical quantiles are found. And if there are no identical quantiles in the splits but the dataset has NaN values, the actual number of buckets would be greater than requested.
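As a rough illustration of why the count can shrink, the sketch below collapses duplicate quantile candidates into strictly increasing splits. It is plain Python rather than Spark code; `distinct_splits` and the sample candidates are hypothetical names and values chosen for this example.

```python
import math


def distinct_splits(candidates):
    """Collapse duplicate quantile candidates into strictly increasing splits."""
    splits = []
    for value in sorted(candidates):
        # Keep a candidate only if it differs from the last split kept.
        if not splits or value != splits[-1]:
            splits.append(value)
    return splits


requested_buckets = 4
# Hypothetical candidates for heavily duplicated data: every inner quantile is 1.0.
candidates = [-math.inf, 1.0, 1.0, 1.0, math.inf]

splits = distinct_splits(candidates)
actual_buckets = len(splits) - 1  # fewer than requested_buckets here
```

With these candidates only two regular buckets remain out of the four requested. A NaN bucket, if the data contains NaN, would come on top of that count.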
e0f5912 to e970bed
Test build #64637 has finished for PR 14858 at commit
e970bed to c42fc5e
Test build #64688 has finished for PR 14858 at commit
OK, though now this is logged unconditionally. I think we'd only want to log (an info-level message) if the number of buckets didn't match the request, which is what you had previously?
Separately, I think the behavior of NaN has to be documented somewhere in this class too, to make people aware that it's always possible to get an extra bucket of data if there are NaNs.
Otherwise looking good to me.
c42fc5e to a16ea15
Updated tests and documents related to this change.
Test build #64753 has finished for PR 14858 at commit
docs/ml-features.md (outdated)
OK, but this doesn't specify the behavior. It should be explicit that while data will go into buckets 0 through numBuckets-1, NaN values will be counted in bucket numBuckets.
Consider the situation where a data sample contains a high proportion of duplicated data and/or NaN values: the exact number of buckets is hard to determine, and it could be less than, equal to, or more than 'numBuckets'. What we can be sure of is that NaN values, if any exist, will be grouped in the last bucket.
It's always possible to have less data than buckets. The problem here is that you might have enough non-NaN data, even, to properly determine distinct buckets, but fail to do so because of NaNs making some splits NaN. You'd end up with fewer splits than intended when you could have created all meaningful splits.
For this part of the documentation, how about putting it this way:
"The number of bins is set by the numBuckets parameter, but the returned 'actualNumBuckets' might differ from what was requested, depending on the data sample values; data will go into Buckets[0] to Buckets[actualNumBuckets - 1], and NaN values, if any exist, will go into Buckets[actualNumBuckets]."
Sound good?
I would suggest:
The number of bins is set by the numBuckets parameter. It's possible that the number of buckets used will be less than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles. Note also that NaN values are handled specially and placed into their own bucket. For example, if 4 buckets are used, then non-NaN data will be put into buckets 0-3, but NaNs will be counted in a special bucket 4.
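The suggested wording can be checked against a small sketch. The following is plain Python, not the Bucketizer implementation; `bucket_index` is a hypothetical helper, and it assumes each bucket is a half-open interval [x, y) except the last, which also includes its upper bound.

```python
import bisect
import math


def bucket_index(value, splits):
    """Map a value to a bucket index; NaN gets one extra bucket after the regular ones."""
    num_buckets = len(splits) - 1  # regular buckets are 0 .. num_buckets - 1
    if math.isnan(value):
        return num_buckets  # the special NaN bucket, i.e. len(splits) - 1
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the split range")
    if value == splits[-1]:
        return num_buckets - 1  # the last bucket is closed on the right
    # bisect_right counts the splits <= value, so subtract one for the bucket.
    return bisect.bisect_right(splits, value) - 1


# Five split points define four regular buckets.
splits = [-math.inf, 0.0, 0.5, 1.0, math.inf]
values = [-0.5, 0.0, 0.7, 2.0, float("nan")]
indices = [bucket_index(v, splits) for v in values]
```

Here the four non-NaN values land in buckets 0 through 3 and the NaN lands in bucket 4, matching the example in the proposed text.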
Nit. Thanks!
a16ea15 to cc5a1e7
Test build #64825 has finished for PR 14858 at commit
cc5a1e7 to 9229eeb
Test build #65081 has finished for PR 14858 at commit
9229eeb to 95466a5
Test build #65093 has finished for PR 14858 at commit
Great, I have one last request @VinceShieh, and that is to update the docs for QuantileDiscretizer in Scala and Python to reflect the additional comment about NaN that you put in the main docs. That would really complete it. I think the code and behavior look solid now.
4e54e27 to 085ae15
Test build #65294 has finished for PR 14858 at commit
Test build #65295 has finished for PR 14858 at commit
085ae15 to b1b8a7f
Test build #65303 has finished for PR 14858 at commit
@srowen Updated. Thanks.
@srowen I have one question here. Is the reason we don't move this NaN filter into approxQuantile or multipleApproxQuantiles that those APIs are shared with Spark SQL? Personally, I think it would look better to put this filter inside multipleApproxQuantiles, though that would introduce more changes, and we would need to make sure it doesn't impact components other than MLlib.
Yes, I agree that a similar argument applies for approxQuantile methods. I think the most reasonable semantics are to ignore NaN as well.
QuantileSummaries should probably reject insertion of NaN too.
I'd support making that change as well here, and expanding the scope accordingly.
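To illustrate the "ignore NaN" semantics discussed here, the sketch below drops NaN values before picking split points. It uses an exact nearest-rank quantile on a sorted list, which is a stand-in for Spark's approximate quantile computation and not a reproduction of it; `quantile_splits` is a hypothetical helper.

```python
import math


def quantile_splits(values, num_buckets):
    """Pick split points from the non-NaN values only (exact nearest-rank quantiles)."""
    # NaN compares false against everything, so it has no meaningful rank: drop it first.
    clean = sorted(v for v in values if not math.isnan(v))
    if not clean:
        return [-math.inf, math.inf]
    n = len(clean)
    inner = set()  # a set also collapses duplicate quantiles
    for i in range(1, num_buckets):
        rank = (i * n + num_buckets - 1) // num_buckets  # ceil(i * n / num_buckets)
        inner.add(clean[rank - 1])
    return [-math.inf] + sorted(inner) + [math.inf]


data = [1.0, float("nan"), 2.0, 3.0, float("nan"), 4.0]
splits = quantile_splits(data, 2)
```

Because the NaN values never reach the quantile step, none of the resulting splits can be NaN, so the non-NaN data alone determines the split points.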
CC @thunterdb for an opinion on that one, as he has touched most of this code.
@VinceShieh if you're interested in proceeding with the change you describe, go ahead. The new behavior should be documented explicitly, because I think it's the behavior one would already expect.
This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN value. Sometimes, null value might also be useful to users, so in these cases, Bucketizer should reserve one extra bucket for NaN values, instead of throwing an illegal exception.
Unit tests added in BucketizerSuite and QuantileDiscretizerSuite
Signed-off-by: VinceShieh <[email protected]>
b1b8a7f to edd4d68
Test build #65645 has finished for PR 14858 at commit
OK, I think it's also fair to just merge this as is. It's possible that later approxQuantile should be changed to ignore NaN.
Merged to master
What changes were proposed in this pull request?
This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN value.
Sometimes, null value might also be useful to users, so in these cases, Bucketizer should
reserve one extra bucket for NaN values, instead of throwing an illegal exception.
Before:
After:
How was this patch tested?
New test cases added in BucketizerSuite.
Signed-off-by: VinceShieh [email protected]