[SPARK-20604][ML] Allow imputer to handle numeric types #17864
@yanboliang @srowen @MLnick @jkbradley @hhbyyh The example below shows failure of Imputer on integer data. |
|
Test build #76468 has finished for PR 17864 at commit
|
|
The other PR, #11601, is really long, so for reference I am picking out the discussions relevant to this PR (also, someone tell me if there's a better way to link to PR comments :)
@MLnick: "what do you think about handling different numeric types in input/output columns? If the input is say IntType, then strategy mode and median is ok but mean is somewhat problematic - or are we ok with rounding to an Int? The alternative is the Imputer always appends a Double output column. I propose we either (a) do the cast back to input type, but if the user selected "mean" and the input type is not Float or Double, log a warning; or (b) only support Float and Double type for this initial version of the Imputer."
@jkbradley: "Just catching up now... I like the idea of maintaining the input type. I'm imagining using an Imputer to fill in continuous features with the mean and categoricals with the mode. Later on, we could even check to see if a column is categorical (in the metadata) and throw an exception for mean. I'd prefer your option (b) to be safe."
@sethah: "For reference, I checked scikit-learn and the Imputer class returns floats regardless of inputs. I also checked R package "mlr" and it appears to do the same. One concern with (a) would be if the true median was something like 5.0, but approxQuantile returned 4.999999999. Then, we cast back to IntegerType and return 4. I wasn't able to produce this situation when I briefly experimented with it, and also the median is already approximate, so I'm not sure if this is really a problem."
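To make the truncation concern in the quoted discussion concrete, here is a minimal plain-Scala sketch (no Spark involved). It assumes that casting a Double surrogate back to an integer type truncates the way Scala's `toInt` does; `castBack` and `roundBack` are hypothetical helper names used only for illustration, not part of the Imputer.

```scala
// Sketch only: illustrates truncation vs. rounding of a Double surrogate.
// `castBack` and `roundBack` are hypothetical helpers, not Spark APIs.
object CastBackSketch {
  // Truncates toward zero, like a plain Double-to-Int cast.
  def castBack(surrogate: Double): Int = surrogate.toInt

  // Rounds to the nearest integer before narrowing to Int.
  def roundBack(surrogate: Double): Int = math.round(surrogate).toInt

  def main(args: Array[String]): Unit = {
    val approxMedian = 4.999999999 // the value from the comment above
    println(castBack(approxMedian))  // prints 4
    println(roundBack(approxMedian)) // prints 5
  }
}
```

Whether an off-by-one like this matters in practice is exactly the open question in the comment, since the median is approximate to begin with.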
|
@sethah Thanks for summarizing the previous discussions. |
|
I imagine most Int features will need to be converted to Double for a Vector anyway, so returning Double regardless of the input type makes sense. It also makes the implementation more straightforward.
|
@hhbyyh Thanks for the suggestion. I have made a new commit that always casts the input to double and outputs the imputed column as double. |
|
Test build #76511 has finished for PR 17864 at commit
|
|
Test build #76513 has finished for PR 17864 at commit
|
|
Ping folks for comments/review. Many thanks. |
|
Originally the idea behind only supporting double was as @sethah posted above - there could be some issues with handling of int casting etc. As mentioned originally, we did consider "always cast to double". The only issue with it is the potential for surprising users who may expect the type of the input column to be maintained in the imputation. Having said that I would be broadly ok with just appending a double output column, provided we update the docstrings / guide to make things very clear. |
|
@MLnick Thanks very much for your comments. Yes, I think always returning Double is consistent with Python and R and also with other transformers in ML. Plus, as @hhbyyh mentioned, this makes the implementation easier. Would you mind taking a look at the code and letting me know if you have any suggestions for improvement? The doc is already updated to make it clear that it always returns Double regardless of the input type.
 * Note that the mean/median value is computed after filtering out missing values.
 * All Null values in the input columns are treated as missing, and so are also imputed. For
 * computing median, DataFrameStatFunctions.approxQuantile is used with a relative error of 0.001.
 * The output column is always of Double type regardless of the input column type.
@MLnick Here is the note on always returning Double type.
|
Could any committer take another look at this PR? Thanks.
|
Shall we pay extra attention to the Int case? E.g. input column contains The current implementation will return surrogate as 1.5. I'm not sure if it's the expectation for some users. It's fine by me but just bring up the issue in case it's missed. |
|
what's next on this one? |
|
Thanks for following up on this, Felix. |
|
Test build #104769 has finished for PR 17864 at commit
|
|
It sounds like folks are generally OK with the approach being taken here if you wanted to update the PR? |
|
Reviewing this old one .. I'd favor not changing the column type. When outputting the mean, it should round back to integer types if needed. |
|
Sorry, have not been active here due to other stuff... |
|
I personally would not favor any change that changes the type of the column. Either let's not do this or change the logic to round the results back to integer types in the case of integer columns. |
|
@srowen If we keep the column type intact (return integer for integer input), would you support this PR?
|
I think that would be pretty reasonable and I'd review it. |
srowen
left a comment
Nice, that's tidy.
|
test("Imputer for Numeric with default missing Value NaN") {
  val df = spark.createDataFrame(Seq(
    (0, 1.0, 1.0, 1.0),
I think you don't need the id column?
Removed now. I was just copying examples from existing tests.
    (0, 1.0, 1.0, 1.0),
    (1, 11.0, 11.0, 11.0),
    (2, 3.6, 3.6, 3.6),
    (3, Double.NaN, 5.2, 3.6)
It's not necessarily true that the mean of the values after casting is equal to the cast of the mean, but here it happens to be. That's OK I think. What about checking the case where a non-NaN value is set as the missing value?
Added another test to check non-NaN missing value.
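The reviewer's point about the order of casting and averaging can be checked with a small plain-Scala sketch. The first input reuses the non-missing values from the test data above; the second input is an assumed counterexample chosen for illustration. `meanOfCast` and `castOfMean` are hypothetical helper names, and `toInt` stands in for the cast to an integer type.

```scala
// Sketch only: compares "cast then average" with "average then cast".
// Helper names are hypothetical; `toInt` stands in for an integer cast.
object MeanCastOrderSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Cast every value to Int first, then average and cast the result.
  def meanOfCast(xs: Seq[Double]): Int = mean(xs.map(_.toInt.toDouble)).toInt

  // Average the original values, then cast the mean.
  def castOfMean(xs: Seq[Double]): Int = mean(xs).toInt

  def main(args: Array[String]): Unit = {
    val fromTest = Seq(1.0, 11.0, 3.6)   // non-missing values from the test data
    println((meanOfCast(fromTest), castOfMean(fromTest)))   // prints (5,5)

    val counterexample = Seq(0.5, 0.5, 2.0) // assumed values where the two differ
    println((meanOfCast(counterexample), castOfMean(counterexample))) // prints (0,1)
  }
}
```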
  .setInputCols(Array("value1"))
  .setOutputCols(Array("out1"))
|
val types = Seq(ShortType, IntegerType, LongType, FloatType, DoubleType,
I think this is fine, but I wouldn't necessarily bother with anything but Integer, Long, and maybe Float. (Not sure how long the test takes to run.)
Thanks. Just keeping IntegerType and LongType now.
 * (SPARK-15041) and possibly creates incorrect values for a categorical feature.
 *
 * Note that the input columns are converted to Double data type internally to compute
 * the mean/median value and impute the missing values, which are then casted back to
I wouldn't put all of this implementation detail into the docs. I would however note that in the case of integer types and mean imputation, the mean will be cast (truncated) to an integer type. That is, your example is a good one.
Great suggestion. Streamlined the doc
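The behavior the reviewer asks the docs to call out (mean imputation on an integer column, with the mean truncated back to an integer) can be sketched in plain Scala. This is not the Imputer's code: `imputeMean` is a hypothetical helper, `Option[Int]` stands in for a nullable integer column, and the input values (1, 2, 4, null) are an assumption chosen for illustration, since the example referred to in the review is not shown here.

```scala
// Sketch only: mean imputation on integer data with truncation.
// `imputeMean` is a hypothetical helper, not part of Spark.
object IntMeanImputeSketch {
  def imputeMean(values: Seq[Option[Int]]): Seq[Int] = {
    val present = values.flatten // drop missing entries before computing the mean
    require(present.nonEmpty, "need at least one non-missing value")
    // Compute the mean as a Double, then truncate it back to Int.
    val surrogate = (present.map(_.toDouble).sum / present.size).toInt
    values.map(_.getOrElse(surrogate))
  }

  def main(args: Array[String]): Unit = {
    // Mean of 1, 2, 4 is about 2.33, which truncates to 2.
    println(imputeMean(Seq(Some(1), Some(2), Some(4), None))) // prints List(1, 2, 4, 2)
  }
}
```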
|
Test build #108486 has finished for PR 17864 at commit
|
|
Test build #108491 has finished for PR 17864 at commit
|
|
@srowen Thanks for the review. Made an update that addressed your comments. |
|
Test build #108495 has finished for PR 17864 at commit
|
|
Merged to master |
What changes were proposed in this pull request?
Imputer currently requires the input column to be Double or Float, but the logic should work on any numeric data type. Many practical problems have integer data, and it can get very tedious to manually cast each column to Double before calling Imputer. This transformer could be extended to handle all numeric types.
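For context, the manual workaround described above might look like the following sketch. It assumes Spark SQL and MLlib are on the classpath and a local session is acceptable; the column names `value` and `value_imputed`, the app name, and the sample data are all made up for illustration.

```scala
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Sketch only: cast an integer column to Double by hand before imputing.
object ManualCastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("imputer-cast-sketch") // hypothetical app name
      .getOrCreate()
    import spark.implicits._

    // Nullable integer column with one missing entry (made-up data).
    val df = Seq[java.lang.Integer](1, 2, 4, null).toDF("value")

    // The tedious step: every integer column has to be cast explicitly.
    val asDouble = df.withColumn("value", col("value").cast(DoubleType))

    val imputer = new Imputer()
      .setInputCols(Array("value"))
      .setOutputCols(Array("value_imputed"))
      .setStrategy("mean")

    imputer.fit(asDouble).transform(asDouble).show()
    spark.stop()
  }
}
```

With many integer columns, the `withColumn(...cast...)` step has to be repeated per column, which is the tedium this change is meant to remove.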
How was this patch tested?
new test