Skip to content

Conversation

@VinceShieh
Copy link

@VinceShieh VinceShieh commented Aug 29, 2016

What changes were proposed in this pull request?

This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN value.
Sometimes, null value might also be useful to users, so in these cases, Bucketizer should
reserve one extra bucket for NaN values, instead of throwing an illegal exception.
Before:

Bucketizer.transform on NaN value threw an illegal exception.

After:

NaN values will be grouped in an extra bucket.

How was this patch tested?

New test cases added in BucketizerSuite.
Signed-off-by: VinceShieh [email protected]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index NaN value as ${splits.length -1}
''' all the other changes within this files, without comments, are just for code refactoring

@SparkQA
Copy link

SparkQA commented Aug 29, 2016

Test build #64557 has finished for PR 14858 at commit bfb5b33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 30, 2016

Test build #64636 has finished for PR 14858 at commit e0f5912.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are to give user the exact number of returned buckets, we need to go through the whole input dataset to check whether NaN value exists, the computation incurred just for a log warning msg is too high, so I choose to give user such msg instead

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is now a bit confusing, since it's reporting different things based on state that isn't logged for the user. If it's hard just say "bucketing to fewer buckets" as before at this stage.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we just put it "the returned number of buckets might differ from what was requested depending on the data sample values. " , since the result number could be less than/equal to the requested number when same quantiles were spotted. And if no same quantiles exists in the splits, but dataset has NaN value, the actually number of buckets would then be greater than requested.

@SparkQA
Copy link

SparkQA commented Aug 30, 2016

Test build #64637 has finished for PR 14858 at commit e970bed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 31, 2016

Test build #64688 has finished for PR 14858 at commit c42fc5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, though now this is logged unconditionally. I think we'd only want to log (an info-level message) if the number of buckets didn't match the request, which is what you had previously?

Separately, I think the behavior of NaN has to be documented somewhere in this class too, to make people aware that it's always possible to get an extra bucket of data if there are NaNs.

Otherwise looking good to me.

@VinceShieh
Copy link
Author

updated tests and documents related to this change

@SparkQA
Copy link

SparkQA commented Sep 1, 2016

Test build #64753 has finished for PR 14858 at commit a16ea15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but this doesn't specify the behavior. It should be explicit that while data will go into buckets 0 through numBuckets-1, that NaN values will be counted in bucket numBuckets.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider situation, when a high proportion of duplicated data and/or NaN exist in a data sample, the exact number of buckets is hard to get, it could be less than/equal to/ more than 'numBuckets'. what we can be sure is that, the NaN value if existed will be grouped in the last bucket.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's always possible to have less data than buckets. The problem here is that you might have enough non-NaN data, even, to properly determine distinct buckets, but fail to do so because of NaNs making some splits NaN. You'd end up with fewer splits than intended when you could have created all meaningful splits.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for this part of documentation, how about put it:
"The number of bins is set by the numBuckets parameter, but the returned 'actualNumBuckets' might differ from what was request depending on the data sample value; while data will go into Buckets[0] to Buckets[actualNumBuckets - 1] and NaN value, if existed, will go into Buckets[actualNumBuckets]"
sounds good?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest:

The number of bins is set by the numBuckets parameter. It's possible that the number of buckets used will be less than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles. Note also that NaN values are handled specially and placed into their own bucket. For example, if 4 buckets are used, then non-NaN data will be put into buckets 0-3, but NaNs will be counted in a special bucket 4.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. Thanks!

@SparkQA
Copy link

SparkQA commented Sep 2, 2016

Test build #64825 has finished for PR 14858 at commit cc5a1e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 8, 2016

Test build #65081 has finished for PR 14858 at commit 9229eeb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 8, 2016

Test build #65093 has finished for PR 14858 at commit 95466a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Sep 12, 2016

Great, I have one last request @VinceShieh and that is to update the docs for QuantileDiscretizer in Scala and Python to reflect the additional comment about NaN that you put in the main docs. That would really complete it. I think the code and behavior looks solid now.

@VinceShieh VinceShieh force-pushed the spark-17219 branch 2 times, most recently from 4e54e27 to 085ae15 Compare September 13, 2016 01:57
@SparkQA
Copy link

SparkQA commented Sep 13, 2016

Test build #65294 has finished for PR 14858 at commit 4e54e27.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 13, 2016

Test build #65295 has finished for PR 14858 at commit 085ae15.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 13, 2016

Test build #65303 has finished for PR 14858 at commit b1b8a7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@VinceShieh
Copy link
Author

@srowen Updated. Thanks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen I have one question here. The reason why we dont move this NaN filter into approxiQuantile or multipleApproxQuantiles is that, those apis are shared with sparkSQL? Becoz, personally I think it would look better if we put this filter inside multipleApproxQuantiles, though it would introduce more changes and, should make sure it doesnt impact other components other than mllib.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that a similar argument applies for approxQuantile methods. I think the most reasonable semantics are to ignore NaN as well.

QuantileSummaries should probably reject insertion of NaN too.

I'd support making that change as well here, and expanding the scope accordingly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @thunterdb for an opinion on that one, as he has touched most of this code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VinceShieh if you're interested in proceeding with the change you describe, go ahead. The new behavior should be documented explicitly, because I think it's the behavior one would already expect.

This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN value.
Sometimes, null value might also be useful to users, so in these cases, Bucketizer should
reserve one extra bucket for NaN values, instead of throwing an illegal exception.

Unit tests added in BucketizerSuite and QuantileDiscretizerSuite

Signed-off-by: VinceShieh <[email protected]>
@SparkQA
Copy link

SparkQA commented Sep 20, 2016

Test build #65645 has finished for PR 14858 at commit edd4d68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Sep 20, 2016

OK, I think it's also fair to just merge this as is. It's possible that later approxQuantile should be changed to ignore NaN.

@srowen
Copy link
Member

srowen commented Sep 21, 2016

Merged to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants