[SPARK-17219][ML] Add NaN value handling in Bucketizer #14858

VinceShieh · 2016-08-29T08:12:04Z

What changes were proposed in this pull request?

This PR fixes an issue that occurs when Bucketizer is called on a dataset containing NaN values.
Sometimes NaN values are also meaningful to users, so in these cases Bucketizer should
reserve one extra bucket for NaN values instead of throwing an exception.

Before:

Bucketizer.transform on a NaN value threw an exception.

After:

NaN values will be grouped in an extra bucket.

How was this patch tested?

New test cases added in BucketizerSuite.
Signed-off-by: VinceShieh [email protected]

VinceShieh · 2016-08-29T08:20:52Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

Index NaN values as ${splits.length - 1}.
All other changes in this file that carry no comment are just code refactoring.
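
For illustration, the indexing rule can be sketched in plain Python. This is a minimal sketch and not the Scala code in Bucketizer.scala: the function name `bucket_index` is hypothetical, and it assumes buckets are half-open intervals [splits[i], splits[i+1]) with the last regular bucket also including its upper bound.

```python
import bisect
import math


def bucket_index(splits, value):
    """Return the bucket index of `value` for sorted split points `splits`.

    Hypothetical helper for illustration only. With n split points there are
    n - 1 regular buckets (indices 0 .. n - 2); NaN gets the extra index n - 1.
    """
    if math.isnan(value):
        # Extra bucket reserved for NaN: splits.length - 1.
        return len(splits) - 1
    if value == splits[-1]:
        # The upper bound belongs to the last regular bucket.
        return len(splits) - 2
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"value {value} is outside [{splits[0]}, {splits[-1]}]")
    # Index of the rightmost split point that is <= value.
    return bisect.bisect_right(splits, value) - 1


splits = [-math.inf, 0.0, 1.0, 2.0, math.inf]  # 5 split points, 4 regular buckets
indices = [bucket_index(splits, v) for v in (-5.0, 0.0, 1.5, math.inf, math.nan)]
```

With five split points, regular values land in buckets 0 through 3 and NaN lands in bucket 4.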

SparkQA · 2016-08-29T17:53:01Z

Test build #64557 has finished for PR 14858 at commit bfb5b33.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-30T06:44:42Z

Test build #64636 has finished for PR 14858 at commit e0f5912.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

VinceShieh · 2016-08-30T06:48:12Z

mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala

If we want to give the user the exact number of returned buckets, we need to scan the whole input dataset to check whether any NaN values exist. That computation is too expensive just for a log warning message, so I chose to give the user this message instead.

I think this is now a bit confusing, since it's reporting different things based on state that isn't logged for the user. If it's hard, just say "bucketing to fewer buckets" as before at this stage.

I think we should just put it as "the returned number of buckets might differ from what was requested depending on the data sample values", since the resulting number could be less than or equal to the requested number when identical quantiles are found. And if there are no identical quantiles in the splits but the dataset has NaN values, the actual number of buckets would be greater than requested.
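
To make the counting argument concrete, here is a small plain-Python sketch. It is not the QuantileDiscretizer code; the function name `actual_num_buckets` and its arguments are made up for illustration, and it assumes duplicate split points are collapsed and the outer boundaries are -inf and +inf.

```python
import math


def actual_num_buckets(candidate_splits, has_nan):
    """Count the buckets produced from interior split points.

    Hypothetical helper: `candidate_splits` are the interior quantile
    values (possibly with duplicates); `has_nan` says whether the data
    contains NaN, which adds one extra bucket.
    """
    distinct = sorted(set(candidate_splits))       # identical quantiles collapse
    boundaries = [-math.inf] + distinct + [math.inf]
    regular = len(boundaries) - 1                  # buckets between boundaries
    return regular + (1 if has_nan else 0)


# Requesting 4 buckets means 3 interior split points.
fewer = actual_num_buckets([1.0, 1.0, 2.0], has_nan=False)   # duplicates: 3 buckets
equal = actual_num_buckets([1.0, 2.0, 3.0], has_nan=False)   # 4 buckets
more = actual_num_buckets([1.0, 2.0, 3.0], has_nan=True)     # 5 buckets
```

So the result can be fewer than, equal to, or more than the requested 4, which is why the message is worded loosely.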

SparkQA · 2016-08-30T07:54:43Z

Test build #64637 has finished for PR 14858 at commit e970bed.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-31T02:09:57Z

Test build #64688 has finished for PR 14858 at commit c42fc5e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-08-31T09:27:43Z

mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala

OK, though now this is logged unconditionally. I think we'd only want to log (an info-level message) if the number of buckets didn't match the request, which is what you had previously?

Separately, I think the behavior of NaN has to be documented somewhere in this class too, to make people aware that it's always possible to get an extra bucket of data if there are NaNs.

Otherwise looking good to me.

VinceShieh · 2016-09-01T02:59:22Z

Updated the tests and documentation related to this change.

SparkQA · 2016-09-01T03:52:16Z

Test build #64753 has finished for PR 14858 at commit a16ea15.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-09-01T08:26:34Z

docs/ml-features.md

OK, but this doesn't specify the behavior. It should be explicit that, while data will go into buckets 0 through numBuckets-1, NaN values will be counted in bucket numBuckets.

Consider the situation where a data sample contains a high proportion of duplicated data and/or NaN values: the exact number of buckets is hard to predict, and it could be less than, equal to, or more than 'numBuckets'. What we can be sure of is that NaN values, if present, will be grouped in the last bucket.

It's always possible to have less data than buckets. The problem here is that you might have enough non-NaN data, even, to properly determine distinct buckets, but fail to do so because of NaNs making some splits NaN. You'd end up with fewer splits than intended when you could have created all meaningful splits.

For this part of the documentation, how about putting it this way:
"The number of bins is set by the numBuckets parameter, but the returned 'actualNumBuckets' might differ from what was requested depending on the data sample values; data will go into Buckets[0] to Buckets[actualNumBuckets - 1], and NaN values, if any, will go into Buckets[actualNumBuckets]."
Sound good?

I would suggest:

The number of bins is set by the numBuckets parameter. It's possible that the number of buckets used will be less than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles. Note also that NaN values are handled specially and placed into their own bucket. For example, if 4 buckets are used, then non-NaN data will be put into buckets 0-3, but NaNs will be counted in a special bucket 4.

Nit. Thanks!

SparkQA · 2016-09-02T02:53:49Z

Test build #64825 has finished for PR 14858 at commit cc5a1e7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-08T06:18:16Z

Test build #65081 has finished for PR 14858 at commit 9229eeb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-08T08:57:19Z

Test build #65093 has finished for PR 14858 at commit 95466a5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-09-12T11:30:35Z

Great, I have one last request @VinceShieh and that is to update the docs for QuantileDiscretizer in Scala and Python to reflect the additional comment about NaN that you put in the main docs. That would really complete it. I think the code and behavior looks solid now.

SparkQA · 2016-09-13T02:33:39Z

Test build #65294 has finished for PR 14858 at commit 4e54e27.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-13T02:40:15Z

Test build #65295 has finished for PR 14858 at commit 085ae15.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-13T07:08:58Z

Test build #65303 has finished for PR 14858 at commit b1b8a7f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

VinceShieh · 2016-09-13T07:55:23Z

@srowen Updated. Thanks.

VinceShieh · 2016-09-13T08:04:58Z

mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala

@srowen I have one question here. Is the reason we don't move this NaN filter into approxQuantile or multipleApproxQuantiles that those APIs are shared with Spark SQL? Personally, I think it would look better to put this filter inside multipleApproxQuantiles, though that would introduce more changes, and we would need to make sure it doesn't impact components other than MLlib.

Yes, I agree that a similar argument applies for approxQuantile methods. I think the most reasonable semantics are to ignore NaN as well.

QuantileSummaries should probably reject insertion of NaN too.

I'd support making that change as well here, and expanding the scope accordingly.

CC @thunterdb for an opinion on that one, as he has touched most of this code.

@VinceShieh if you're interested in proceeding with the change you describe, go ahead. The new behavior should be documented explicitly, because I think it's the behavior one would already expect.
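
As a rough illustration of the "ignore NaN" semantics discussed here, the filter could look like the plain-Python sketch below. It computes exact nearest-rank quantiles rather than the approximate summaries used by approxQuantile, and the function name `quantile_splits` is hypothetical.

```python
import math


def quantile_splits(values, num_buckets):
    """Compute interior split points from the non-NaN values only.

    Hypothetical helper: uses exact sorting instead of an approximate
    quantile summary, so it only illustrates the NaN filtering step.
    """
    # Drop NaN first; NaN compares false with everything, so leaving it in
    # would make the sort order (and therefore the splits) unreliable.
    clean = sorted(v for v in values if not math.isnan(v))
    if not clean:
        return []
    splits = []
    for i in range(1, num_buckets):
        # Position of the i/num_buckets quantile, clamped to the last element.
        pos = min(len(clean) - 1, int(i * len(clean) / num_buckets))
        splits.append(clean[pos])
    return splits


data = [1.0, math.nan, 2.0, 3.0, math.nan, 4.0]
splits = quantile_splits(data, 2)
```

Filtering first means no split point can itself be NaN, so NaNs in the input no longer reduce the number of meaningful splits.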

This PR fixes an issue that occurs when Bucketizer is called on a dataset containing NaN values. Sometimes NaN values are also meaningful to users, so in these cases Bucketizer should reserve one extra bucket for NaN values instead of throwing an exception.
Unit tests added in BucketizerSuite and QuantileDiscretizerSuite.
Signed-off-by: VinceShieh <[email protected]>

SparkQA · 2016-09-20T10:28:41Z

Test build #65645 has finished for PR 14858 at commit edd4d68.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-09-20T13:01:17Z

OK, I think it's also fair to just merge this as is. It's possible that later approxQuantile should be changed to ignore NaN.

srowen · 2016-09-21T09:21:18Z

Merged to master

VinceShieh reviewed Aug 29, 2016

VinceShieh force-pushed the spark-17219 branch from bfb5b33 to e0f5912 Compare August 30, 2016 06:41

VinceShieh reviewed Aug 30, 2016

VinceShieh force-pushed the spark-17219 branch from e0f5912 to e970bed Compare August 30, 2016 07:01

VinceShieh force-pushed the spark-17219 branch from e970bed to c42fc5e Compare August 31, 2016 01:16

srowen reviewed Aug 31, 2016

VinceShieh force-pushed the spark-17219 branch from c42fc5e to a16ea15 Compare September 1, 2016 02:56

srowen reviewed Sep 1, 2016

VinceShieh force-pushed the spark-17219 branch from a16ea15 to cc5a1e7 Compare September 2, 2016 01:56

VinceShieh force-pushed the spark-17219 branch from cc5a1e7 to 9229eeb Compare September 8, 2016 05:40

VinceShieh force-pushed the spark-17219 branch from 9229eeb to 95466a5 Compare September 8, 2016 07:57

VinceShieh force-pushed the spark-17219 branch 2 times, most recently from 4e54e27 to 085ae15 Compare September 13, 2016 01:57

VinceShieh force-pushed the spark-17219 branch from 085ae15 to b1b8a7f Compare September 13, 2016 06:16

VinceShieh reviewed Sep 13, 2016

VinceShieh force-pushed the spark-17219 branch from b1b8a7f to edd4d68 Compare September 20, 2016 08:33

asfgit closed this in 57dc326 Sep 21, 2016

MLnick mentioned this pull request Nov 4, 2016

[SPARK-14352][SQL] approxQuantile should support multi columns #12135

Closed

MLnick mentioned this pull request Feb 8, 2017

[SPARK-19436][SQL] Add missing tests for approxQuantile #16776

Closed
