[SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column #17840
Test build #76409 has started for PR 17840 at commit
@yinxusen @srowen @mengxr @jkbradley @VinceShieh @yanboliang The example below shows failure of Bucketizer on integer data.
Weird failure message, the log shows all tests passed...
Test build #76420 has finished for PR 17840 at commit
yanboliang
left a comment
LGTM except one minor issue. Thanks.
    val df = dataFrame.withColumn("feature", col("feature").cast(mType))
    bucketizer.transform(df).select("result", "expected").collect().foreach {
      case Row(x: Double, y: Double) =>
        assert(x === y, "The feature value is not correct after bucketing in type " +
feature value -> result
      .setOutputCol("result")
      .setSplits(splits)

    val types = Seq(ShortType, IntegerType, LongType, FloatType)
Other tests for supporting numeric types have included DecimalType - often DecimalType(10, 0), as well as ByteType. See the various Estimator tests which use MLTestingUtils.genClassifDFWithNumericLabelCol and MLTestingUtils.genRegressionDFWithNumericLabelCol
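A rough sketch of what a type loop extended with ByteType and DecimalType(10, 0) might look like, assuming a Spark build that includes this change. The object name `BucketizerTypeLoopSketch`, the helper `bucketsPerType`, and the splits and data values are hypothetical and chosen only for illustration; they are not the test that was committed.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Hypothetical sketch, not the committed test.
object BucketizerTypeLoopSketch {

  // Casts an integer-valued Double column to each numeric type and
  // returns the bucket indices that Bucketizer produces for that type.
  def bucketsPerType(spark: SparkSession): Map[DataType, Seq[Double]] = {
    import spark.implicits._

    // Illustrative values: whole numbers, so every cast below is lossless.
    val splits = Array(-3.0, 0.0, 3.0)
    val base = Seq(-2.0, -1.0, 0.0, 1.0, 2.0).toDF("feature")

    val bucketizer = new Bucketizer()
      .setInputCol("feature")
      .setOutputCol("result")
      .setSplits(splits)

    // The reviewer's suggestion: also cover ByteType and DecimalType(10, 0).
    val types: Seq[DataType] =
      Seq(ByteType, ShortType, IntegerType, LongType, FloatType, DecimalType(10, 0))

    types.map { mType =>
      val df = base.withColumn("feature", col("feature").cast(mType))
      val buckets = bucketizer.transform(df)
        .select("result")
        .collect()
        .map(_.getDouble(0))
        .toSeq
      mType -> buckets
    }.toMap
  }
}
```

Every entry in the returned map should hold the same bucket indices as the Double column would, since the casts do not change any value.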
Thanks @yanboliang and @MLnick.
Test build #76459 has finished for PR 17840 at commit
Merged into master and branch-2.0. Thanks.
## What changes were proposed in this pull request?

Bucketizer currently requires input column to be Double, but the logic should work on any numeric data types. Many practical problems have integer/float data types, and it could get very tedious to manually cast them into Double before calling bucketizer. This PR extends bucketizer to handle all numeric types.

## How was this patch tested?

New test.

Author: Wayne Zhang <[email protected]>

Closes #17840 from actuaryzhang/bucketizer.

(cherry picked from commit 0d16faa)
Signed-off-by: Yanbo Liang <[email protected]>
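For illustration, a minimal sketch of the behaviour this PR describes: calling Bucketizer directly on an integer column without a manual cast to Double. It assumes a Spark version that includes this change. The object name `BucketizerIntColumnSketch`, the helper `bucketizeInts`, and the split points are hypothetical and not taken from the PR.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

// Hypothetical sketch, assuming a Spark build that contains SPARK-20574.
object BucketizerIntColumnSketch {

  // Buckets an Int column directly; no cast to Double is done by the caller.
  def bucketizeInts(spark: SparkSession, values: Seq[Int]): Array[Double] = {
    import spark.implicits._

    val df = values.toDF("feature") // column type is IntegerType

    val bucketizer = new Bucketizer()
      .setInputCol("feature")
      .setOutputCol("result")
      // Illustrative splits: (-inf, 0), [0, 10), [10, +inf)
      .setSplits(Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity))

    // The output column holds the bucket index as a Double.
    bucketizer.transform(df).select("result").collect().map(_.getDouble(0))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("bucketizer-int-column-sketch")
      .getOrCreate()
    try println(bucketizeInts(spark, Seq(-5, 3, 42)).mkString(", "))
    finally spark.stop()
  }
}
```

On a Spark version without this change, the same call would be expected to fail schema validation because the input column is not Double, which is the tedium the PR aims to remove.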