[SPARK-17219][ML] enhanced NaN value handling in Bucketizer #15428

VinceShieh · 2016-10-11T07:42:55Z

What changes were proposed in this pull request?

This PR is an enhancement of PR with commit ID:57dc326bd00cf0a49da971e9c573c48ae28acaa2.
NaN is a special type of value which is commonly seen as invalid. But We find that there are certain cases where NaN are also valuable, thus need special handling. We provided user when dealing NaN values with 3 options, to either reserve an extra bucket for NaN values, or remove the NaN values, or report an error, by setting handleNaN "keep", "skip", or "error"(default) respectively.

'''Before:
val bucketizer: Bucketizer = new Bucketizer()
.setInputCol("feature")
.setOutputCol("result")
.setSplits(splits)
'''After:
val bucketizer: Bucketizer = new Bucketizer()
.setInputCol("feature")
.setOutputCol("result")
.setSplits(splits)
.setHandleNaN("keep")

How was this patch tested?

Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite

Signed-off-by: VinceShieh [email protected]

SparkQA · 2016-10-11T07:49:20Z

Test build #66731 has finished for PR 15428 at commit a3e4308.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-10-11T08:01:39Z

mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala

I'm neutral on the complexity that this adds, but not against it. It gets a little funny to say "keep invalid data" but I think we discussed that on the JIRA

yeah, usually we treat NaN as a type of invalid value, but we know there are cases they are proved useful, so, it's a bit dilemma here, also with the API naming.

srowen · 2016-10-11T08:03:44Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

Doesn't this need to try to handle "error"?

val filteredDataSet = getHandleInvalid match { case "skip" => dataset.na.drop case "keep" => dataset case "error" => if (...dataset contains NaN...) { throw new IllegalArgumentException(...) } else { dataset } }

Nope, actually, NaN will trigger an error later in binarySearchForBuckets as an invalid feature value if no special handling is made.

I don't see that the method handles NaN below. What binarySearch returns is undefined. One place or the other I think this has to be explicitly handled.

srowen · 2016-10-11T08:05:56Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

Nit: I think the convention is to leave the open paren on the previous line

Doesn't this need to handle "skip" and "error"? throw an exception on NaN if "error" or ignore it if "skip"?

the logic behind is that, we will filter out all NaN values in the dataset if user chooses 'skip', then no further special NaN handling is needed; if not choose to 'skip' NaN, the original dataset will be passed for binary search followed, in which, if user indicates "keep" handling for NaN, an extra bucket will be reserved, if not, an error will be raised.

SparkQA · 2016-10-11T08:56:53Z

Test build #66733 has finished for PR 15428 at commit cd8113c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-11T09:31:41Z

Test build #66735 has finished for PR 15428 at commit 5cd58b7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

I'll let @thunterdb or @jkbradley mostly review from here to see if this is what they intended

srowen · 2016-10-13T09:57:34Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

I don't see that the method handles NaN below. What binarySearch returns is undefined. One place or the other I think this has to be explicitly handled.

jkbradley · 2016-10-18T00:03:47Z

Thanks! I'll take a look.

Could you please fix the typo in the title? "enchanced" -> "enhanced"

VinceShieh · 2016-10-18T01:35:05Z

typo corrected. Thank you all. @srowen @jkbradley

jkbradley

Done with review. Thanks for the PR!

Other comments:

Add unit test for DataFrameStatFunctions.approxQuantile to check for handling of NaN values.
Create JIRA for adding handleInvalid to Python API.

jkbradley · 2016-10-18T00:03:55Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

fix indentation

jkbradley · 2016-10-18T00:05:32Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

Use Boolean flag internally to avoid the string comparison on each call.

Also, at this point, you already know the value for flag, so you can just use it here, rather than making the UDF take an extra argument.

jkbradley · 2016-10-18T00:11:11Z

mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala

Since this is used by StringIndexer, which will not support "keep," we should not modify HasHandleInvalid. I recommend copying out the shared param and just putting it in Bucketizer and QuantileDiscretizer, rather than trying to reuse HasHandleInvalid. That will let you specialize the documentation too so that you can be more specific.

HasHandleInvalid may not be a good shared Param yet, but perhaps in the future..

jkbradley · 2016-10-18T17:47:52Z

docs/ml-features.md

(same as below) Is "possible that the number of buckets used will be less than this value" true? It was true before this used Dataset.approxQuantiles, but I don't think it is true any longer.

In cases when the number of buckets requested by the user is greater than the number of distinct splits generated from Bucketizer, the returned number of buckets will be less than requested.

Yep, you're right

jkbradley · 2016-10-18T17:48:10Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

I'd put dataset.toDF() here and not import existentials.

jkbradley · 2016-10-18T17:48:42Z

mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala

This documentation should go in the handleInvalid Param doc string.

jkbradley · 2016-10-18T17:48:53Z

mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala

Just set splits here; there's no need to set the other Params for this test.

jkbradley · 2016-10-18T17:49:07Z

mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala

I'd also recommend directly testing the values of the splits. The current test makes sure that handleInvalid is passed to the bucketizer correctly, which is important but separate.

Also, please use vals (not vars) for clarity. I'd recommend making a helper method for lines 87-93, which can be reused for the test of handleInvalid = "skip"

jkbradley · 2016-10-18T17:49:08Z

mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala

Same here; I'd also recommend directly testing the values of the splits.

jkbradley · 2016-10-18T17:49:16Z

python/pyspark/ml/feature.py

Here too: no longer the case

same as the comment above.

SparkQA · 2016-10-20T02:44:17Z

Test build #67231 has finished for PR 15428 at commit c350e9f.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

VinceShieh · 2016-10-20T02:54:12Z

Thanks for your valuable suggestions. @jkbradley @srowen

SparkQA · 2016-10-20T03:57:11Z

Test build #67232 has finished for PR 15428 at commit 0fb8d38.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-20T05:29:29Z

Test build #67233 has finished for PR 15428 at commit d1dd840.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-10-21T23:43:22Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

Ah, sorry, one more comment. I'm not quite sure how closure capture behaves currently, but it might be good to define local vals for $(splits) and getHandleNaN.isDefined && getHandleNaN.get. Since these reference methods in the Bucketizer class, I believe the UDF may capture the whole Bucketizer class instead of just those vals.

After you define them in local vals here, you can use those vals in this UDF.

jkbradley · 2016-10-21T23:56:15Z

docs/ml-features.md

Yep, you're right

jkbradley

Thanks for the updates! A few more comments.

Also, I see you renamed the param to handleNaN. Do you think we will ever want to handle null values too? It might be good to stick with "handleInvalid" in case we add null support.

jkbradley · 2016-10-22T00:02:46Z

mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala

This is kind of complex logic just to get the expected number of splits. I'd recommend just putting the expected bucket values in the original DataFrame:

val df = sc.parallelize(Seq( (1.0, /*expected value for option "keep"*/, /*expected value for option "skip"*/), ... )).toDF("input", "expectedKeep", "expectedSkip")

Then you can compare with the actual values. That'll be an easier test to understand IMO.

jkbradley · 2016-10-22T00:04:23Z

python/pyspark/ml/feature.py

Actually no need to update Python API until it is updated to include handleNaN

This isn't available in Python yet, so can you please revert this change to feature.py?

jkbradley · 2016-10-22T00:06:19Z

mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala

Add getHandleNaN

jkbradley · 2016-10-22T00:06:27Z

mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala

State default

jkbradley · 2016-10-22T00:08:03Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

Remove "More options may be added later."

Also state default.

jkbradley · 2016-10-22T00:08:04Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

Remove "More options may be added later."

SparkQA · 2016-10-22T15:43:06Z

Test build #67386 has finished for PR 15428 at commit 47aad24.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

This PR is an enhancement of PR with commit ID:57dc326bd00cf0a49da971e9c573c48ae28acaa2. We provided user when dealing NaN value in the dataset with 3 options, to either reserve an extra bucket for NaN values, or remove the NaN values, or report an error, by setting "keep", "skip", or "error"(default) to handleInvalid. '''Before: val bucketizer: Bucketizer = new Bucketizer() .setInputCol("feature") .setOutputCol("result") .setSplits(splits) '''After: val bucketizer: Bucketizer = new Bucketizer() .setInputCol("feature") .setOutputCol("result") .setSplits(splits) .setHandleNaN("skip") Signed-off-by: VinceShieh <[email protected]>

SparkQA · 2016-10-24T06:35:22Z

Test build #67430 has finished for PR 15428 at commit b14fbab.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-24T06:35:57Z

Test build #67429 has finished for PR 15428 at commit 70cee57.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley

Thanks for the updates.
Also, when you push further updates, could you please just push a new commit? Rebasing makes it harder to review since it makes it impossible to connect your most recent changes with the recent reviewer comments. Thanks!

jkbradley · 2016-10-24T16:52:28Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

+  /**
+   * Param for how to handle invalid entries. Options are skip (which will filter out rows with
+   * invalid values), or error (which will throw an error), or keep (which will keep the invalid
+   * values in certain way). Default behaviour is to report an error for invalid entries.


I'd just write: default: "error"
Rewording as "report" instead of "throw" could confuse people.

jkbradley · 2016-10-24T16:52:31Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

+
+  /** @group getParam */
+  @Since("2.1.0")
+  def gethandleInvalid: Option[Boolean] = $(handleInvalid) match {


This should just return $(handleInvalid), just like any other Param getter method.

jkbradley · 2016-10-24T16:52:34Z

mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala

+  /**
+   * Param for how to handle invalid entries. Options are skip (which will filter out rows with
+   * invalid values), or error (which will throw an error), or keep (which will keep the invalid
+   * values in certain way). Default behaviour is to report an error for invalid entries.


I'd just write: default: "error"
Rewording as "report" instead of "throw" could confuse people.

jkbradley · 2016-10-24T16:52:41Z

python/pyspark/ml/feature.py

This isn't available in Python yet, so can you please revert this change to feature.py?

This PR is an enhancement of PR with commit ID:57dc326bd00cf0a49da971e9c573c48ae28acaa2. We provided user when dealing NaN value in the dataset with 3 options, to either reserve an extra bucket for NaN values, or remove the NaN values, or report an error, by setting "keep", "skip", or "error"(default) to handleInvalid. '''Before: val bucketizer: Bucketizer = new Bucketizer() .setInputCol("feature") .setOutputCol("result") .setSplits(splits) '''After: val bucketizer: Bucketizer = new Bucketizer() .setInputCol("feature") .setOutputCol("result") .setSplits(splits) .setHandleNaN("skip") Signed-off-by: VinceShieh <[email protected]>

SparkQA · 2016-10-25T08:32:58Z

Test build #67493 has finished for PR 15428 at commit 5274d4a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-10-25T19:09:45Z

Thanks, you still need to remove the change in feature.py, but other than that, this should be ready.

Signed-off-by: VinceShieh <[email protected]>

VinceShieh · 2016-10-26T03:16:44Z

sorry, I must have forgotten to commit the changes.
All done now. Thanks for reviewing.

SparkQA · 2016-10-26T05:15:56Z

Test build #67547 has finished for PR 15428 at commit 2f98d31.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…naming of set/get for handleInvalid

jkbradley · 2016-10-26T20:10:47Z

Thanks for the update! On a last glance through, I spotted a few things to fix up. Since some had to do with wording, I thought it'd be easier just to send a PR to your PR. Can you please review this and merge it if it looks Ok to you? Thanks! Here it is: [https://github.com/VinceShieh/pull/2]

Small cleanups

SparkQA · 2016-10-27T03:08:42Z

Test build #67611 has finished for PR 15428 at commit 2b1b81d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-10-27T18:51:43Z

Thanks for merging that! This LGTM
Merging with master
Thanks a lot @VinceShieh !

## What changes were proposed in this pull request? This PR is an enhancement of PR with commit ID:57dc326bd00cf0a49da971e9c573c48ae28acaa2. NaN is a special type of value which is commonly seen as invalid. But We find that there are certain cases where NaN are also valuable, thus need special handling. We provided user when dealing NaN values with 3 options, to either reserve an extra bucket for NaN values, or remove the NaN values, or report an error, by setting handleNaN "keep", "skip", or "error"(default) respectively. '''Before: val bucketizer: Bucketizer = new Bucketizer() .setInputCol("feature") .setOutputCol("result") .setSplits(splits) '''After: val bucketizer: Bucketizer = new Bucketizer() .setInputCol("feature") .setOutputCol("result") .setSplits(splits) .setHandleNaN("keep") ## How was this patch tested? Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite Signed-off-by: VinceShieh <vincent.xieintel.com> Author: VinceShieh <[email protected]> Author: Vincent Xie <[email protected]> Author: Joseph K. Bradley <[email protected]> Closes apache#15428 from VinceShieh/spark-17219_followup.

…r in pyspark This PR is to document the change on QuantileDiscretizer in pyspark for PR: apache#15428 Signed-off-by: VinceShieh <[email protected]>

…r in pyspark ## What changes were proposed in this pull request? This PR is to document the changes on QuantileDiscretizer in pyspark for PR: apache#15428 ## How was this patch tested? No test needed Signed-off-by: VinceShieh <vincent.xieintel.com> Author: VinceShieh <[email protected]> Closes apache#16922 from VinceShieh/spark-19590.

VinceShieh force-pushed the spark-17219_followup branch from a3e4308 to cd8113c Compare October 11, 2016 07:52

srowen requested changes Oct 11, 2016

View reviewed changes

VinceShieh force-pushed the spark-17219_followup branch from cd8113c to 5cd58b7 Compare October 11, 2016 08:31

srowen requested changes Oct 13, 2016

View reviewed changes

VinceShieh changed the title ~~[SPARK-17219][ML] enchanced NaN value handling in Bucketizer~~ [SPARK-17219][ML] enhanced NaN value handling in Bucketizer Oct 18, 2016

jkbradley reviewed Oct 18, 2016

View reviewed changes

VinceShieh force-pushed the spark-17219_followup branch from 5cd58b7 to c350e9f Compare October 20, 2016 02:41

VinceShieh force-pushed the spark-17219_followup branch from c350e9f to 0fb8d38 Compare October 20, 2016 02:48

VinceShieh force-pushed the spark-17219_followup branch from 0fb8d38 to d1dd840 Compare October 20, 2016 03:25

jkbradley reviewed Oct 21, 2016

View reviewed changes

jkbradley reviewed Oct 22, 2016

View reviewed changes

VinceShieh force-pushed the spark-17219_followup branch from d1dd840 to 47aad24 Compare October 22, 2016 13:41

VinceShieh force-pushed the spark-17219_followup branch from 47aad24 to 70cee57 Compare October 24, 2016 04:31

VinceShieh force-pushed the spark-17219_followup branch from 70cee57 to b14fbab Compare October 24, 2016 04:34

jkbradley reviewed Oct 24, 2016

View reviewed changes

revert changes in feature.py

2f98d31

Signed-off-by: VinceShieh <[email protected]>

Cleanups: docs cleanups, slightly improved unit test coverage, fixed …

2644235

…naming of set/get for handleInvalid

Merge pull request #2 from jkbradley/VinceShieh-spark-17219_followup

2b1b81d

Small cleanups

asfgit closed this in 0b076d4 Oct 27, 2016

zhengruifeng mentioned this pull request Nov 4, 2016

[SPARK-14352][SQL] approxQuantile should support multi columns #12135

Closed

VinceShieh mentioned this pull request Feb 14, 2017

[SPARK-19590][pyspark][ML] Update the document for QuantileDiscretizer in pyspark #16922

Closed

[SPARK-17219][ML] enhanced NaN value handling in Bucketizer #15428

[SPARK-17219][ML] enhanced NaN value handling in Bucketizer #15428

Uh oh!

Conversation

VinceShieh commented Oct 11, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

SparkQA commented Oct 11, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

srowen Oct 11, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

VinceShieh Oct 11, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Oct 11, 2016

Uh oh!

SparkQA commented Oct 11, 2016

Uh oh!

srowen left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jkbradley commented Oct 18, 2016

Uh oh!

VinceShieh commented Oct 18, 2016

Uh oh!

jkbradley left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Oct 20, 2016

Uh oh!

VinceShieh commented Oct 20, 2016

Uh oh!

SparkQA commented Oct 20, 2016

Uh oh!

SparkQA commented Oct 20, 2016

VinceShieh commented Oct 11, 2016 •

edited

Loading

srowen Oct 11, 2016 •

edited

Loading

VinceShieh Oct 11, 2016 •

edited

Loading