
Conversation

@actuaryzhang
Contributor

What changes were proposed in this pull request?

Imputer currently requires the input columns to be of Double or Float type, but the logic works for any numeric data type. Many practical problems involve integer data, and it gets tedious to manually cast such columns to Double before calling Imputer. This transformer can be extended to handle all numeric types.

How was this patch tested?

New tests.

@actuaryzhang
Contributor Author

actuaryzhang commented May 4, 2017

@yanboliang @srowen @MLnick @jkbradley @hhbyyh

The example below shows failure of Imputer on integer data.

    import org.apache.spark.ml.feature.Imputer
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.IntegerType

    val df = spark.createDataFrame(Seq(
      (0, 1.0, 1.0, 1.0),
      (1, 11.0, 11.0, 11.0),
      (2, 1.5, 1.5, 1.5),
      (3, Double.NaN, 4.5, 1.5)
    )).toDF("id", "value1", "expected_mean_value1", "expected_median_value1")
    val imputer = new Imputer()
      .setInputCols(Array("value1"))
      .setOutputCols(Array("out1"))
    // Casting value1 to IntegerType triggers the failure below.
    imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType)))

java.lang.IllegalArgumentException: requirement failed: Column value1 must be of type equal to one of the following types: [DoubleType, FloatType] but was actually of type IntegerType.

@SparkQA

SparkQA commented May 4, 2017

Test build #76468 has finished for PR 17864 at commit e9ab39c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Contributor

sethah commented May 4, 2017

The other PR #11601 is really long, so for reference I am picking out the discussions relevant to this PR (also, someone tell me if there's a better way to link to PR comments :)

@MLnick "what do you think about handling different numeric types in input/output columns? If the input is say IntType, then strategy mode and median is ok but mean is somewhat problematic - or are we ok with rounding to an Int? The alternative is that the Imputer always appends a Double output column.

I propose we either (a) do the cast back to input type, but if the user selected "mean" and the input type is not Float or Double, log a warning; or (b) only support Float and Double type for this initial version of the Imputer."

@jkbradley "Just catching up now... I like the idea of maintaining the input type. I'm imagining using an Imputer to fill in continuous features with the mean and categoricals with the mode. Later on, we could even check to see if a column is categorical (in the metadata) and throw an exception for mean.

I'd prefer your option (b) to be safe."

@sethah "For reference, I checked scikit-learn and the Imputer class returns floats regardless of inputs. I also checked R package "mlr" and it appears to do the same. One concern with a.) would be if the true median was something like 5.0, but approxQuantile returned 4.999999999. Then, we cast back to IntegerType and return 4. I wasn't able to produce this situation when I briefly experimented with it, and also the median is already approximate, so I'm not sure if this is really a problem."
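The truncation concern can be sketched in plain Scala (the 4.999999999 figure is hypothetical, mirroring the example in the comment above rather than an actual approxQuantile result):

```scala
// Suppose the true median is 5.0 but the approximate quantile
// computation returns a value just below it (hypothetical value).
val approxMedian = 4.999999999

// Casting a Double to Int truncates toward zero, so the
// imputed value would come back as 4 instead of 5.
val imputed = approxMedian.toInt
println(imputed) // prints 4
```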

@actuaryzhang
Contributor Author

@sethah Thanks for summarizing the previous discussions.
What are you suggesting for this PR? I think it makes sense to log a warning when imputing integer types with mean. In addition, perhaps we can set "median" as the default strategy.

@hhbyyh
Contributor

hhbyyh commented May 6, 2017

I imagine most Int features will need to be converted to Double for a Vector anyway, so returning Double regardless of the input type makes sense, and it also makes the implementation more straightforward.

@actuaryzhang
Contributor Author

@hhbyyh Thanks for the suggestion. I have made a new commit that always casts the input to double and outputs the imputed column as double.

@SparkQA

SparkQA commented May 6, 2017

Test build #76511 has finished for PR 17864 at commit 6479965.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 6, 2017

Test build #76513 has finished for PR 17864 at commit 86c8a10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Contributor Author

@hhbyyh @sethah @MLnick
Could you take a look at the new commit? Thanks.

@actuaryzhang
Contributor Author

Ping folks for comments/review. Many thanks.
@viirya @MLnick @jkbradley @hhbyyh @yanboliang @BenFradet

@MLnick
Contributor

MLnick commented May 25, 2017

Originally the idea behind only supporting double was as @sethah posted above - there could be some issues with handling of int casting etc. As mentioned originally, we did consider "always cast to double". The only issue with it is the potential for surprising users who may expect the type of the input column to be maintained in the imputation.

Having said that, I would be broadly OK with just appending a Double output column, provided we update the docstrings/guide to make things very clear.

@actuaryzhang
Contributor Author

@MLnick Thanks much for your comments. Yes, I think always returning Double is consistent with Python and R as well as other transformers in ML. Plus, as @hhbyyh mentioned, this makes the implementation easier. Would you mind taking a look at the code and letting me know if you have any suggestions for improvement? The doc is already updated to make it clear that it always returns Double regardless of the input type.

* Note that the mean/median value is computed after filtering out missing values.
* All Null values in the input columns are treated as missing, and so are also imputed. For
* computing median, DataFrameStatFunctions.approxQuantile is used with a relative error of 0.001.
* The output column is always of Double type regardless of the input column type.
Contributor Author

@MLnick Here is the note on always returning Double type.

@actuaryzhang
Contributor Author

Does any committer have a chance to take another look at this PR? Thanks.

@hhbyyh
Contributor

hhbyyh commented Jun 25, 2017

Shall we pay extra attention to the Int case? E.g. an input column containing:
Double.NaN,
1,
2.

The current implementation will return a surrogate of 1.5. I'm not sure if that's what some users expect.

It's fine by me, but just bringing up the issue in case it was missed.
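The arithmetic behind this example can be worked through in plain Scala (a sketch of the mean strategy by hand, not the Imputer code itself):

```scala
// The Int column from the example, after an internal cast to Double;
// the missing entry shows up as NaN.
val values = Seq(Double.NaN, 1.0, 2.0)

// Mean strategy: drop missing values, then average what remains.
val present   = values.filterNot(_.isNaN)
val surrogate = present.sum / present.size   // 1.5

// If the output column kept IntegerType, casting the surrogate
// back would truncate it to 1.
val truncated = surrogate.toInt              // 1
println((surrogate, truncated))
```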

@actuaryzhang
Contributor Author

We can log a warning or issue an error if the input column is Int and the imputation is by mean.
Would that be OK with you? @hhbyyh @MLnick

@felixcheung
Member

what's next on this one?

@actuaryzhang
Contributor Author

Thanks for following up on this, Felix.
Still waiting for agreement on the approach...
Would appreciate more direction here.

@SparkQA

SparkQA commented Apr 20, 2019

Test build #104769 has finished for PR 17864 at commit 86c8a10.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented May 10, 2019

It sounds like folks are generally OK with the approach being taken here if you wanted to update the PR?

@srowen
Member

srowen commented May 15, 2019

Reviewing this old one... I'd favor not changing the column type. When outputting the mean, it should be rounded back to integer types if needed.

@actuaryzhang
Contributor Author

Sorry, have not been active here due to other stuff...
Do folks still feel we should proceed with this PR?
If so, please suggest any remaining changes and I'll update it.
@srowen @holdenk @felixcheung @hhbyyh

@srowen
Member

srowen commented Jul 30, 2019

I personally would not favor any change that changes the type of the column. Either let's not do this or change the logic to round the results back to integer types in the case of integer columns.

@actuaryzhang
Contributor Author

@srowen If we keep the column type intact (returning integer for integer input), would you support this PR?

@srowen
Member

srowen commented Jul 30, 2019

I think that would be pretty reasonable and I'd review it.

@actuaryzhang
Contributor Author

@srowen @holdenk
Made an update to preserve the input data type.
The logic is to cast the input to Double internally, compute the mean/median, do the imputation, and then cast back to the original data type. Updated the doc and tests accordingly.
Please take a look. Thanks.
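That flow can be sketched in plain Scala for an integer column (illustrative only; the real implementation operates on DataFrame columns):

```scala
// An IntegerType column after the internal cast to Double;
// the missing entry is represented as NaN.
val column: Seq[Double] = Seq(1.0, 11.0, 2.0, Double.NaN)

// 1. Compute the surrogate in Double, ignoring missing values.
val present   = column.filterNot(_.isNaN)
val surrogate = present.sum / present.size          // (1 + 11 + 2) / 3 = 4.66...

// 2. Impute missing entries with the surrogate, still in Double.
// 3. Cast the result back to the original integer type.
val imputed: Seq[Int] =
  column.map(v => if (v.isNaN) surrogate else v).map(_.toInt)
println(imputed)   // List(1, 11, 2, 4) -- the mean truncates to 4
```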

@srowen srowen left a comment

Nice, that's tidy.


test("Imputer for Numeric with default missing Value NaN") {
  val df = spark.createDataFrame(Seq(
    (0, 1.0, 1.0, 1.0),
Member

I think you don't need the id column?

Contributor Author

Removed now. I was just copying examples from existing tests.

(0, 1.0, 1.0, 1.0),
(1, 11.0, 11.0, 11.0),
(2, 3.6, 3.6, 3.6),
(3, Double.NaN, 5.2, 3.6)
Member

It's not necessarily true that the mean of the values after casting equals the cast of the mean of the original values, but here it happens to be. That's OK I think. What about checking the case where a non-NaN value is set as the missing value?
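The distinction can be illustrated in plain Scala with the values from this test (a sketch of the arithmetic only):

```scala
// The Double values from the test, and their IntegerType casts.
val doubles = Seq(1.0, 11.0, 3.6)
val ints    = doubles.map(_.toInt)                    // Seq(1, 11, 3)

// Mean of the cast values vs. cast of the Double mean.
val meanOfCasts = ints.sum.toDouble / ints.size       // 5.0
val castOfMean  = (doubles.sum / doubles.size).toInt  // 5
// They coincide here (both 5), but in general they need not.
println((meanOfCasts, castOfMean))
```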

Contributor Author

Added another test to check non-NaN missing value.

.setInputCols(Array("value1"))
.setOutputCols(Array("out1"))

val types = Seq(ShortType, IntegerType, LongType, FloatType, DoubleType,
Member

I think this is fine, but I wouldn't necessarily bother with anything but Integer, Long, maybe Float. (Not sure how long the test takes to run.)

Contributor Author

Thanks. Just keeping IntegerType and LongType now.

* (SPARK-15041) and possibly creates incorrect values for a categorical feature.
*
* Note that the input columns are converted to Double data type internally to compute
* the mean/median value and impute the missing values, which are then cast back to
Member

I wouldn't put all of this implementation detail into the docs. I would however note that in the case of integer types and mean imputation, the mean will be cast (truncated) to an integer type. That is, your example is a good one.

Contributor Author

Great suggestion. Streamlined the doc.

@SparkQA

SparkQA commented Jul 31, 2019

Test build #108486 has finished for PR 17864 at commit ad04109.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 31, 2019

Test build #108491 has finished for PR 17864 at commit 13a9659.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Contributor Author

@srowen Thanks for the review. Made an update that addressed your comments.

@SparkQA

SparkQA commented Jul 31, 2019

Test build #108495 has finished for PR 17864 at commit 16a7348.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 2, 2019

Merged to master

@srowen srowen closed this in 6d7a675 Aug 2, 2019

9 participants