[SPARK-20542][ML][SQL] Add an API to Bucketizer that can bin multiple columns #17819
Conversation
Test build #76349 has finished for PR 17819 at commit

cc @MLnick @jkbradley for review. Thanks.
     *
     * @group untypedrel
     * @since 2.3.0
     */
I am wondering whether I should make a separate PR for this SQL change. cc @cloud-fan
how about we make it private[spark]? I'm not sure if this API is good enough.
Sounds good to me.
@viirya can you post some performance comparisons for this?

@MLnick Ok. Let me prepare the comparisons.
Test build #76379 has finished for PR 17819 at commit

Test build #76406 has finished for PR 17819 at commit
@MLnick I've done a benchmark using the test dataset provided in JIRA SPARK-20392 (blockbuster.csv). The ML pipeline includes 2 `StringIndexer`s and 1 `MultipleBucketizer` or 137 `Bucketizer`s to bin 137 input columns with the same splits, then counts the time to transform the dataset. `MultipleBucketizer`: 3352 ms; `Bucketizer`: 51512 ms.
Thanks. Result does look good. So the improvement is really coming from the new `withColumns` API.
The bunch of projections will be collapsed in optimization, so it doesn't affect query execution. However, every added stage increases the cost of query analysis. It can benefit other transformers that could work on multiple cols. I even have an idea to revamp the transformer interface to support multiple columns. But the performance difference is obvious only when the number of transformation stages is large enough, like the example of many `Bucketizer`s.
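For context, a minimal sketch (hypothetical column names, plain Spark SQL API) of why batching matters: chaining `withColumn` re-analyzes the plan once per call, while a single projection analyzes it once.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("withColumns-sketch").getOrCreate()
val df = spark.range(10).toDF("x")

// Chaining withColumn: each call returns a new Dataset and re-runs the
// analyzer over the whole plan, so N derived columns cost N analysis passes.
val chained = (0 until 3).foldLeft(df) { (d, i) =>
  d.withColumn(s"x_$i", col("x") * i)
}

// A single projection adds all derived columns at once, so the plan is
// analyzed only once; the optimizer collapses the projections either way.
val allCols = col("x") +: (0 until 3).map(i => (col("x") * i).as(s"x_$i"))
val oneShot = df.select(allCols: _*)
```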
Note: since in
I don't see support for `withColumns` in Spark 2.1.1 or on the Spark master tip. In which version or branch does it first appear? This work seems related to https://issues.apache.org/jira/browse/SPARK-12225.
@barrybecker4 `withColumns` is added in this PR (since 2.3.0), so it does not appear in any released version yet.
ping @MLnick Do you have more comments on this? Thanks.
I will try to take a look soon. My main concern is whether we should really have a new class - it starts to make things really messy if we introduce a separate multi-column variant of each transformer.
@MLnick That's right. I have that concern too. However, keeping the original single-column Bucketizer and a multiple-column Bucketizer in one class also seems to produce messy code. I'll rethink it and see if there is a good way to incorporate both.
@MLnick I've updated the previous solution. The new API is implemented in an interface which `Bucketizer` extends. Please take a look if you have time. Thanks.
Test build #77924 has finished for PR 17819 at commit

Test build #77935 has finished for PR 17819 at commit

Test build #77937 has finished for PR 17819 at commit
ping @MLnick Will you have time to help review this soon? Thanks.
Yes, fair enough
…On Tue, 10 Oct 2017 at 14:09 Liang-Chi Hsieh wrote:

> In sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:
> I think, as in the withColumn case, we can re-implement it with withColumns
> for metadata too. So this test case can cover it.
MLnick left a comment:
The ML part looks pretty good. I left a few fairly minor comments.
The SQL part looks ok also - though will wait for others e.g. @gatorsmile to take a look too.
      /**
    -  * `Bucketizer` maps a column of continuous features to a column of feature buckets.
    +  * `Bucketizer` maps a column of continuous features to a column of feature buckets. Since 2.3.0,
    +  * `Bucketizer` can also map multiple columns at once. Whether it goes to map a column or multiple
Perhaps:

> Since 2.3.0, `Bucketizer` can map multiple columns at once by setting the `inputCols` parameter. Note that when both the `inputCol` and `inputCols` parameters are set, a log warning will be printed and only `inputCol` will take effect, while `inputCols` will be ignored.
Ok. Looks better.
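For illustration, a minimal usage sketch of the behavior that doc comment describes, with made-up data and column names:

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketizer-sketch").getOrCreate()
import spark.implicits._

val df = Seq((-0.5, 0.2), (0.1, 1.5)).toDF("f1", "f2")

// One Bucketizer binning two columns at once; splitsArray carries
// one array of split points per input column.
val bucketizer = new Bucketizer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("b1", "b2"))
  .setSplitsArray(Array(
    Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity),
    Array(Double.NegativeInfinity, 1.0, Double.PositiveInfinity)))

// If inputCol were also set, it would take effect and inputCols would be
// ignored, with a warning logged.
bucketizer.transform(df).show()
```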
     * @group param
     */
    @Since("2.3.0")
    final val outputCols: StringArrayParam = new StringArrayParam(this, "outputCols",
why are we making this final (and not others)? (also the getOutputCols?)
I guess it's similar to the shared params? I think it makes sense to add a shared param since this, `Imputer`, and others will use it.
Ah, I think the `final` was copied from the previous multiple-bucketizer trait. I'll remove it.
I will create HasOutputCols.
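Roughly what such a shared param could look like; this is a sketch only, not the generated `sharedParams` code:

```scala
import org.apache.spark.ml.param.{Params, StringArrayParam}

// Sketch of a shared HasOutputCols trait, to be mixed into transformers
// that write several output columns (Bucketizer, Imputer, ...).
private[ml] trait HasOutputCols extends Params {

  final val outputCols: StringArrayParam =
    new StringArrayParam(this, "outputCols", "output column names")

  final def getOutputCols: Array[String] = $(outputCols)
}
```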
    val newCols = inputColumns.zipWithIndex.map { case (inputCol, idx) =>
      bucketizers(idx)(filteredDataset(inputCol).cast(DoubleType))
    }
    val newFields = outputColumns.zipWithIndex.map { case (outputCol, idx) =>
Have we not done this already in transformSchema? Can we just re-use the result of that?
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;
    // $example off$
No Scala example?
Added a Scala example.
    val data = (0 until validData1.length).map { idx =>
      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    }
    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
toSeq not required here?
    val data = (0 until validData1.length).map { idx =>
      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    }
    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
Same here, toSeq unnecessary.
      }
    }

    test("multiple columns:: read/write") {
Nit: two : here
      testDefaultReadWrite(t)
    }

    test("Bucketizer in a pipeline") {
It may be overkill - but we would expect a Bucketizer with multiple columns set to behave in precisely the same way as multiple single-column Bucketizers. Perhaps a test comparing them?
Sure. Added a test for it.
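Something along these lines, as a sketch with made-up data and column names:

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketizer-equivalence").getOrCreate()
import spark.implicits._

val df = Seq((-0.5, 0.2), (0.1, 1.5), (2.0, -3.0)).toDF("f1", "f2")
val splits = Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)

// One multi-column Bucketizer binning both features at once.
val multi = new Bucketizer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("m1", "m2"))
  .setSplitsArray(Array(splits, splits))

// Two single-column Bucketizers applied one after the other.
val single1 = new Bucketizer().setInputCol("f1").setOutputCol("s1").setSplits(splits)
val single2 = new Bucketizer().setInputCol("f2").setOutputCol("s2").setSplits(splits)

val multiRows = multi.transform(df).select("m1", "m2").collect()
val singleRows = single2.transform(single1.transform(df)).select("s1", "s2").collect()

// On a small local DataFrame the row order is stable, so a pairwise
// comparison is enough for this sketch.
multiRows.zip(singleRows).foreach { case (m, s) => assert(m == s) }
```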
    pl.transform(df).select("result1", "expected1", "result2", "expected2")
      .collect().foreach {
        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
          assert(r1 === e1,
This logic is duplicated across a few test cases - perhaps we could factor it out into a utility method.
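For example, a hypothetical helper along these lines (name and signature are made up):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical utility: for each (result, expected) column pair, check
// row by row that the bucketized value matches the expected bucket.
def assertBucketResults(df: DataFrame, colPairs: Seq[(String, String)]): Unit = {
  val flat = colPairs.flatMap { case (r, e) => Seq(r, e) }
  df.select(flat.head, flat.tail: _*).collect().foreach { row =>
    colPairs.indices.foreach { i =>
      val (result, expected) = (row.getDouble(2 * i), row.getDouble(2 * i + 1))
      assert(result == expected,
        s"Column '${colPairs(i)._1}' was $result but expected $expected")
    }
  }
}

// Usage in a test:
// assertBucketResults(pl.transform(df),
//   Seq("result1" -> "expected1", "result2" -> "expected2"))
```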
     * `Bucketizer` maps a column of continuous features to a column of feature buckets. Since 2.3.0,
     * `Bucketizer` can also map multiple columns at once. Whether it goes to map a column or multiple
     * columns, it depends on which parameter of `inputCol` and `inputCols` is set. When both are set,
     * a log warning will be printed and by default it chooses `inputCol`.
We should probably also mention that `splits` is only used for single-column mode and `splitsArray` for multi-column mode.
Test build #82583 has finished for PR 17819 at commit
@MLnick Thanks for leaving the comments. I think I've addressed all of them. Please take a look when you are free. Thanks.

Test build #82632 has finished for PR 17819 at commit
@MLnick Any more comments or thoughts on this that I need to address?
Does this extension exist for QuantileDiscretizer as well?

@AFractalThought @viirya
@MLnick Is this ready to go?
I've created https://issues.apache.org/jira/browse/SPARK-22397 to track the changes in QuantileDiscretizer.
Thanks @MLnick

Thanks @huaxingao @MLnick @viirya this will be super helpful
@viirya could you resolve conflicts?

@MLnick Conflicts resolved. Thanks.

Test build #83519 has finished for PR 17819 at commit
@MarcKaminski by the way, you mentioned a vector bucketizer. I think in principle that might be useful. I'm not sure if it would make sense to add vector type support to the existing `Bucketizer`. Perhaps you can post a link to code / design doc on this JIRA for the vector type version?
MLnick left a comment:
LGTM. Will leave open for the rest of the day in case any other reviewer wants a final look.
Thanks @MLnick
About the vector bucketizer, it seems it might work similarly to the multi-column bucketizer, but some behaviors such as
Merged to master. Thanks @viirya and all the reviewers! |
        Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
      }.withName("bucketizer")
    val seqOfSplits = if (isBucketizeMultipleColumns()) {
      $(splitsArray).toSeq
I am interested in the difference between `.toSeq` and `Seq()`.
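For what it's worth, the two are quite different; a quick plain-Scala illustration:

```scala
val arr = Array(1.0, 2.0, 3.0)

// arr.toSeq views the array's elements as a Seq of three doubles.
val elems: Seq[Double] = arr.toSeq

// Seq(arr) builds a one-element Seq whose single entry is the whole array.
val wrapped: Seq[Array[Double]] = Seq(arr)

assert(elems.length == 3)
assert(wrapped.length == 1)
```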
Is there any Python example for this API?

@leliang65 The PySpark support is not added yet. Please refer to #19892.
What changes were proposed in this pull request?
Currently, ML's Bucketizer can only bin one column of continuous features. If a dataset has thousands of continuous columns that need to be binned, we end up with thousands of ML stages, which is inefficient for query planning and execution.
We should have a type of bucketizer that can bin many columns all at once. It would need to accept a list of arrays of split points corresponding to the columns to bin, and it can make things more efficient by replacing thousands of stages with just one.
The approach in this patch is to add a new `MultipleBucketizerInterface` for this purpose. `Bucketizer` now extends this new interface.
Performance
Benchmarking using the test dataset provided in JIRA SPARK-20392 (blockbuster.csv).
The ML pipeline includes 2 `StringIndexer`s and 1 `MultipleBucketizer` or 137 `Bucketizer`s to bin 137 input columns with the same splits. Then count the time to transform the dataset.
MultipleBucketizer: 3352 ms
Bucketizer: 51512 ms
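A rough sketch of the stage-count difference being measured (column names and splits are assumptions, not the actual benchmark code):

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.Bucketizer

val splits = Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity)
val inCols = (0 until 137).map(i => s"c$i").toArray

// One multi-column stage binning all 137 columns with the same splits.
val multiStage = new Bucketizer()
  .setInputCols(inCols)
  .setOutputCols(inCols.map(_ + "_bin"))
  .setSplitsArray(Array.fill(inCols.length)(splits))

// 137 single-column stages doing the same work one column at a time.
val singleStages: Array[PipelineStage] = inCols.map { c =>
  new Bucketizer().setInputCol(c).setOutputCol(c + "_bin").setSplits(splits): PipelineStage
}

// Timing transform() on each pipeline surfaces the per-stage
// planning/analysis overhead that dominates the single-column version.
val onePipeline = new Pipeline().setStages(Array[PipelineStage](multiStage))
val manyPipeline = new Pipeline().setStages(singleStages)
```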
How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.