Skip to content

Conversation

@actuaryzhang
Copy link
Contributor

What changes were proposed in this pull request?

Bucketizer currently requires input column to be Double, but the logic should work on any numeric data types. Many practical problems have integer/float data types, and it could get very tedious to manually cast them into Double before calling bucketizer. This PR extends bucketizer to handle all numeric types.

How was this patch tested?

New test.

@SparkQA
Copy link

SparkQA commented May 3, 2017

Test build #76409 has started for PR 17840 at commit a86fbde.

@actuaryzhang
Copy link
Contributor Author

@yinxusen @srowen @mengxr @jkbradley @VinceShieh @yanboliang

The example below shows failure of Bucketizer on integer data.

val splits = Array(-3.0, 0.0, 3.0)
val data: Array[Int] = Array(-2, -1, 0, 1, 2)
val expectedBuckets = Array(0.0, 0.0, 1.0, 1.0, 1.0)
val dataFrame = data.zip(expectedBuckets).toSeq.toDF("feature", "expected")
val bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
bucketizer.transform(dataFrame)  

java.lang.IllegalArgumentException: requirement failed: Column feature must be of type DoubleType but was actually IntegerType.

@actuaryzhang
Copy link
Contributor Author

Weird failure message, the log shows all tests passed...

@actuaryzhang actuaryzhang changed the title [SPARK-20574][ML] Allow Bucketizer to handle non-Double column [SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column May 3, 2017
@SparkQA
Copy link

SparkQA commented May 3, 2017

Test build #76420 has finished for PR 17840 at commit b4a5b6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@yanboliang yanboliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except one minor issue. Thanks.

val df = dataFrame.withColumn("feature", col("feature").cast(mType))
bucketizer.transform(df).select("result", "expected").collect().foreach {
case Row(x: Double, y: Double) =>
assert(x === y, "The feature value is not correct after bucketing in type " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feature value -> result

.setOutputCol("result")
.setSplits(splits)

val types = Seq(ShortType, IntegerType, LongType, FloatType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other tests for supporting numeric types have included DecimalType - often DecimalType(10, 0), as well as ByteType. See the various Estimator tests which use MLTestingUtils.genClassifDFWithNumericLabelCol and MLTestingUtils.genRegressionDFWithNumericLabelCol

@actuaryzhang
Copy link
Contributor Author

Thanks @yanboliang and @MLnick.
I have included the additional numeric types and updated the error msg.

@SparkQA
Copy link

SparkQA commented May 4, 2017

Test build #76459 has finished for PR 17840 at commit d149350.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor

Merged into master and branch-2.0. Thanks.

asfgit pushed a commit that referenced this pull request May 5, 2017
## What changes were proposed in this pull request?
Bucketizer currently requires input column to be Double, but the logic should work on any numeric data types. Many practical problems have integer/float data types, and it could get very tedious to manually cast them into Double before calling bucketizer. This PR extends bucketizer to handle all numeric types.

## How was this patch tested?
New test.

Author: Wayne Zhang <[email protected]>

Closes #17840 from actuaryzhang/bucketizer.

(cherry picked from commit 0d16faa)
Signed-off-by: Yanbo Liang <[email protected]>
@asfgit asfgit closed this in 0d16faa May 5, 2017
@actuaryzhang actuaryzhang deleted the bucketizer branch May 6, 2017 08:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants