
Conversation

@actuaryzhang
Contributor

What changes were proposed in this pull request?

Imputer currently requires the input columns to be of Double or Float type, but the logic works for any numeric data type. Many practical problems involve integer data, and it gets tedious to manually cast such columns to Double before calling Imputer. This transformer can be extended to handle all numeric types.

How was this patch tested?

New tests.

@actuaryzhang
Contributor Author

actuaryzhang commented May 4, 2017

@yanboliang @srowen @MLnick @jkbradley @hhbyyh

The example below shows failure of Imputer on integer data.

    import org.apache.spark.ml.feature.Imputer
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.IntegerType

    val df = spark.createDataFrame(Seq(
      (0, 1.0, 1.0, 1.0),
      (1, 11.0, 11.0, 11.0),
      (2, 1.5, 1.5, 1.5),
      (3, Double.NaN, 4.5, 1.5)
    )).toDF("id", "value1", "expected_mean_value1", "expected_median_value1")
    val imputer = new Imputer()
      .setInputCols(Array("value1"))
      .setOutputCols(Array("out1"))
    // Casting value1 to IntegerType triggers the failure below.
    imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType)))

java.lang.IllegalArgumentException: requirement failed: Column value1 must be of type equal to one of the following types: [DoubleType, FloatType] but was actually of type IntegerType.

@SparkQA

SparkQA commented May 4, 2017

Test build #76468 has finished for PR 17864 at commit e9ab39c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Contributor

sethah commented May 4, 2017

The other PR #11601 is really long, so for reference I am picking out the discussions relevant to this PR (also, someone tell me if there's a better way to link to PR comments :)

@MLnick "what do you think about handling different numeric types in input/output columns? If the input is say IntType, then strategy mode and median is ok but mean is somewhat problematic - or are we ok with rounding to an Int? The alternative is that the Imputer always appends a Double output column.

I propose we either (a) do the cast back to input type, but if the user selected "mean" and the input type is not Float or Double, log a warning; or (b) only support Float and Double type for this initial version of the Imputer."

@jkbradley "Just catching up now... I like the idea of maintaining the input type. I'm imagining using an Imputer to fill in continuous features with the mean and categoricals with the mode. Later on, we could even check to see if a column is categorical (in the metadata) and throw an exception for mean.

I'd prefer your option (b) to be safe."

@sethah "For reference, I checked scikit-learn and the Imputer class returns floats regardless of inputs. I also checked R package "mlr" and it appears to do the same. One concern with a.) would be if the true median was something like 5.0, but approxQuantile returned 4.999999999. Then, we cast back to IntegerType and return 4. I wasn't able to produce this situation when I briefly experimented with it, and also the median is already approximate, so I'm not sure if this is really a problem."
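The truncation concern can be sketched in plain Scala (the 4.999999999 figure is hypothetical, mirroring the example in the comment above rather than an actual approxQuantile result):

```scala
// Suppose the true median is 5.0 but the approximate quantile
// computation returns a value just below it (hypothetical value).
val approxMedian = 4.999999999

// Casting a Double to Int truncates toward zero, so the
// imputed value would come back as 4 instead of 5.
val imputed = approxMedian.toInt
println(imputed) // prints 4
```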

@actuaryzhang
Contributor Author

@sethah Thanks for summarizing the previous discussions.
What are you suggesting for this PR? I think it makes sense to log a warning when imputing integer types with mean. In addition, perhaps we can set "median" as the default strategy.

@hhbyyh
Contributor

hhbyyh commented May 6, 2017

I imagine most Int features will need to be converted to Double for a Vector anyway, so returning Double regardless of the input type makes sense, and it also makes the implementation more straightforward.

@actuaryzhang
Contributor Author

@hhbyyh Thanks for the suggestion. I have made a new commit that always casts the input to double and outputs the imputed column as double.

@SparkQA

SparkQA commented May 6, 2017

Test build #76511 has finished for PR 17864 at commit 6479965.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 6, 2017

Test build #76513 has finished for PR 17864 at commit 86c8a10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Contributor Author

@hhbyyh @sethah @MLnick
Could you take a look at the new commit? Thanks.

@actuaryzhang
Contributor Author

Ping folks for comments/review. Many thanks.
@viirya @MLnick @jkbradley @hhbyyh @yanboliang @BenFradet

@MLnick
Contributor

MLnick commented May 25, 2017

Originally the idea behind only supporting double was as @sethah posted above - there could be some issues with handling of int casting etc. As mentioned originally, we did consider "always cast to double". The only issue with it is the potential for surprising users who may expect the type of the input column to be maintained in the imputation.

Having said that, I would be broadly OK with just appending a Double output column, provided we update the docstrings/guide to make things very clear.

@actuaryzhang
Contributor Author

@MLnick Thanks much for your comments. Yes, I think always returning Double is consistent with Python and R as well as other transformers in ML. Plus, as @hhbyyh mentioned, this makes the implementation easier. Would you mind taking a look at the code and letting me know if you have any suggestions for improvement? The doc is already updated to make it clear that it always returns Double regardless of the input type.

* Note that the mean/median value is computed after filtering out missing values.
* All Null values in the input columns are treated as missing, and so are also imputed. For
* computing median, DataFrameStatFunctions.approxQuantile is used with a relative error of 0.001.
* The output column is always of Double type regardless of the input column type.
Contributor Author

@MLnick Here is the note on always returning Double type.

@actuaryzhang
Contributor Author

Does any committer have a chance to take another look at this PR? Thanks.

@hhbyyh
Contributor

hhbyyh commented Jun 25, 2017

Shall we pay extra attention to the Int case? E.g. an input column containing:
Double.NaN,
1,
2.

The current implementation will return a surrogate of 1.5. I'm not sure if that's what some users expect.

It's fine by me, but just bringing up the issue in case it was missed.
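The arithmetic behind this example can be worked through in plain Scala (a sketch of the mean strategy by hand, not the Imputer code itself):

```scala
// The Int column from the example, after an internal cast to Double;
// the missing entry shows up as NaN.
val values = Seq(Double.NaN, 1.0, 2.0)

// Mean strategy: drop missing values, then average what remains.
val present   = values.filterNot(_.isNaN)
val surrogate = present.sum / present.size   // 1.5

// If the output column kept IntegerType, casting the surrogate
// back would truncate it to 1.
val truncated = surrogate.toInt              // 1
println((surrogate, truncated))
```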

@actuaryzhang
Contributor Author

We can log a warning or issue an error if the input column is Int and the imputation is by mean.
Would that be OK with you? @hhbyyh @MLnick

@felixcheung
Member

what's next on this one?

@actuaryzhang
Contributor Author

Thanks for following up on this, Felix.
Still waiting for agreement on the approach...
Would appreciate more direction here.

@SparkQA

SparkQA commented Apr 20, 2019

Test build #104769 has finished for PR 17864 at commit 86c8a10.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented May 10, 2019

It sounds like folks are generally OK with the approach being taken here if you wanted to update the PR?

@srowen
Member

srowen commented May 15, 2019

Reviewing this old one... I'd favor not changing the column type. When outputting the mean, it should be rounded back to integer types if needed.

@actuaryzhang
Contributor Author

Sorry, have not been active here due to other stuff...
Do folks still feel we should proceed with this PR?
If so, please suggest any remaining changes and I'll update it.
@srowen @holdenk @felixcheung @hhbyyh

@srowen
Member

srowen commented Jul 30, 2019

I personally would not favor any change that changes the type of the column. Either let's not do this or change the logic to round the results back to integer types in the case of integer columns.

@actuaryzhang
Contributor Author

@srowen If we keep the column type intact (returning integer for integer input), would you support this PR?

@srowen
Member

srowen commented Jul 30, 2019

I think that would be pretty reasonable and I'd review it.

@actuaryzhang
Contributor Author

@srowen @holdenk
Made an update to preserve the input data type.
The logic is to cast the input to Double internally, compute the mean/median, do the imputation, and then cast back to the original data type. Updated the doc and tests accordingly.
Please take a look. Thanks.
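That flow can be sketched in plain Scala for an integer column (illustrative only; the real implementation operates on DataFrame columns):

```scala
// An IntegerType column after the internal cast to Double;
// the missing entry is represented as NaN.
val column: Seq[Double] = Seq(1.0, 11.0, 2.0, Double.NaN)

// 1. Compute the surrogate in Double, ignoring missing values.
val present   = column.filterNot(_.isNaN)
val surrogate = present.sum / present.size          // (1 + 11 + 2) / 3 = 4.66...

// 2. Impute missing entries with the surrogate, still in Double.
// 3. Cast the result back to the original integer type.
val imputed: Seq[Int] =
  column.map(v => if (v.isNaN) surrogate else v).map(_.toInt)
println(imputed)   // List(1, 11, 2, 4) -- the mean truncates to 4
```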

@srowen srowen left a comment

Nice, that's tidy.


test("Imputer for Numeric with default missing Value NaN") {
  val df = spark.createDataFrame(Seq(
    (0, 1.0, 1.0, 1.0),
Member

I think you don't need the id column?

Contributor Author

Removed now. I was just copying examples from existing tests.

(0, 1.0, 1.0, 1.0),
(1, 11.0, 11.0, 11.0),
(2, 3.6, 3.6, 3.6),
(3, Double.NaN, 5.2, 3.6)
Member

It's not necessarily true that the mean of the values after casting equals the cast of the mean of the original values, but here it happens to be. That's OK I think. What about checking the case where a non-NaN value is set as the missing value?
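The distinction can be illustrated in plain Scala with the values from this test (a sketch of the arithmetic only):

```scala
// The Double values from the test, and their IntegerType casts.
val doubles = Seq(1.0, 11.0, 3.6)
val ints    = doubles.map(_.toInt)                    // Seq(1, 11, 3)

// Mean of the cast values vs. cast of the Double mean.
val meanOfCasts = ints.sum.toDouble / ints.size       // 5.0
val castOfMean  = (doubles.sum / doubles.size).toInt  // 5
// They coincide here (both 5), but in general they need not.
println((meanOfCasts, castOfMean))
```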

Contributor Author

Added another test to check non-NaN missing value.

.setInputCols(Array("value1"))
.setOutputCols(Array("out1"))

val types = Seq(ShortType, IntegerType, LongType, FloatType, DoubleType,
Member

I think this is fine, but I wouldn't necessarily bother with anything but Integer, Long, maybe Float. (Not sure how long the test takes to run.)

Contributor Author

Thanks. Just keeping IntegerType and LongType now.

* (SPARK-15041) and possibly creates incorrect values for a categorical feature.
*
* Note that the input columns are converted to Double data type internally to compute
* the mean/median value and impute the missing values, which are then cast back to
Member

I wouldn't put all of this implementation detail into the docs. I would however note that in the case of integer types and mean imputation, the mean will be cast (truncated) to an integer type. That is, your example is a good one.

Contributor Author

Great suggestion. Streamlined the doc.

@SparkQA

SparkQA commented Jul 31, 2019

Test build #108486 has finished for PR 17864 at commit ad04109.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 31, 2019

Test build #108491 has finished for PR 17864 at commit 13a9659.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Contributor Author

@srowen Thanks for the review. Made an update that addressed your comments.

@SparkQA

SparkQA commented Jul 31, 2019

Test build #108495 has finished for PR 17864 at commit 16a7348.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 2, 2019

Merged to master

@srowen srowen closed this in 6d7a675 Aug 2, 2019

9 participants