[SPARK-20542][ML][SQL] Add an API to Bucketizer that can bin multiple columns #17819
Conversation
Test build #76349 has finished for PR 17819 at commit

cc @MLnick @jkbradley for review. Thanks.
     *
     * @group untypedrel
     * @since 2.3.0
     */
I am wondering whether I should make a separate PR for this SQL change. cc @cloud-fan
how about we make it private[spark]? I'm not sure if this API is good enough.
Sounds good to me.
@viirya can you post some performance comparisons for this?

@MLnick Ok. Let me prepare the comparisons.
Test build #76379 has finished for PR 17819 at commit

Test build #76406 has finished for PR 17819 at commit
@MLnick I've done a benchmark using the test dataset provided in JIRA SPARK-20392 (blockbuster.csv). The ML pipeline includes 2 `StringIndexer`s and 1 `MultipleBucketizer` or 137 `Bucketizer`s to bin 137 input columns with the same splits, then counts the time to transform the dataset. `MultipleBucketizer`: 3352 ms; `Bucketizer`: 51512 ms.
Thanks. Result does look good. So the improvement is really coming from the new `withColumns` API.
The bunch of projections will be collapsed in optimization, so it doesn't affect query execution. However, every added stage increases the cost of query analysis. It can benefit other transformers that could work on multiple cols. I even have an idea to revamp the transformer interface to support multiple columns. But the performance difference is obvious only when the number of transformation stages is large enough, like the example of many `Bucketizer`s.
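For context, a minimal sketch (hypothetical column names, plain Spark SQL API) of why batching matters: chaining `withColumn` re-analyzes the plan once per call, while a single projection analyzes it once.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("withColumns-sketch").getOrCreate()
val df = spark.range(10).toDF("x")

// Chaining withColumn: each call returns a new Dataset and re-runs the
// analyzer over the whole plan, so N derived columns cost N analysis passes.
val chained = (0 until 3).foldLeft(df) { (d, i) =>
  d.withColumn(s"x_$i", col("x") * i)
}

// A single projection adds all derived columns at once, so the plan is
// analyzed only once; the optimizer collapses the projections either way.
val allCols = col("x") +: (0 until 3).map(i => (col("x") * i).as(s"x_$i"))
val oneShot = df.select(allCols: _*)
```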
Note: since in
I don't see support for `withColumns` in Spark 2.1.1 or on the Spark master tip. In which version or branch does it first appear? This work seems related to https://issues.apache.org/jira/browse/SPARK-12225.
@barrybecker4 `withColumns` is added in this PR (since 2.3.0), so it does not appear in any released version yet.
ping @MLnick Do you have more comments on this? Thanks.
I will try to take a look soon. My main concern is whether we should really have a new class - it starts to make things really messy if we introduce a separate multi-column variant of each transformer.
@MLnick That's right. I have that concern too. However, keeping the original single-column Bucketizer and a multiple-column Bucketizer in one class also seems to produce messy code. I'll rethink it and see if there is a good way to incorporate both.
@MLnick I've updated the previous solution. The new API is implemented in an interface which `Bucketizer` extends. Please take a look if you have time. Thanks.
Test build #77924 has finished for PR 17819 at commit

Test build #77935 has finished for PR 17819 at commit

Test build #77937 has finished for PR 17819 at commit
ping @MLnick Will you have time to help review this soon? Thanks.
Yes, fair enough
…On Tue, 10 Oct 2017 at 14:09 Liang-Chi Hsieh wrote:

> In sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:
> I think, as in the withColumn case, we can re-implement it with withColumns
> for metadata too. So this test case can cover it.
MLnick left a comment:
The ML part looks pretty good. I left a few fairly minor comments.
The SQL part looks ok also - though will wait for others e.g. @gatorsmile to take a look too.
      /**
    -  * `Bucketizer` maps a column of continuous features to a column of feature buckets.
    +  * `Bucketizer` maps a column of continuous features to a column of feature buckets. Since 2.3.0,
    +  * `Bucketizer` can also map multiple columns at once. Whether it goes to map a column or multiple
Perhaps:

> Since 2.3.0, `Bucketizer` can map multiple columns at once by setting the `inputCols` parameter. Note that when both the `inputCol` and `inputCols` parameters are set, a log warning will be printed and only `inputCol` will take effect, while `inputCols` will be ignored.
Ok. Looks better.
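For illustration, a minimal usage sketch of the behavior that doc comment describes, with made-up data and column names:

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketizer-sketch").getOrCreate()
import spark.implicits._

val df = Seq((-0.5, 0.2), (0.1, 1.5)).toDF("f1", "f2")

// One Bucketizer binning two columns at once; splitsArray carries
// one array of split points per input column.
val bucketizer = new Bucketizer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("b1", "b2"))
  .setSplitsArray(Array(
    Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity),
    Array(Double.NegativeInfinity, 1.0, Double.PositiveInfinity)))

// If inputCol were also set, it would take effect and inputCols would be
// ignored, with a warning logged.
bucketizer.transform(df).show()
```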
     * @group param
     */
    @Since("2.3.0")
    final val outputCols: StringArrayParam = new StringArrayParam(this, "outputCols",
why are we making this final (and not others)? (also the getOutputCols?)
I guess it's similar to the shared params? I think it makes sense to add a shared param since this, `Imputer`, and others will use it.
Ah, I think the `final` was copied from the previous multiple-bucketizer trait. I'll remove it.
I will create HasOutputCols.
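Roughly what such a shared param could look like; this is a sketch only, not the generated `sharedParams` code:

```scala
import org.apache.spark.ml.param.{Params, StringArrayParam}

// Sketch of a shared HasOutputCols trait, to be mixed into transformers
// that write several output columns (Bucketizer, Imputer, ...).
private[ml] trait HasOutputCols extends Params {

  final val outputCols: StringArrayParam =
    new StringArrayParam(this, "outputCols", "output column names")

  final def getOutputCols: Array[String] = $(outputCols)
}
```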
    val newCols = inputColumns.zipWithIndex.map { case (inputCol, idx) =>
      bucketizers(idx)(filteredDataset(inputCol).cast(DoubleType))
    }
    val newFields = outputColumns.zipWithIndex.map { case (outputCol, idx) =>
Have we not done this already in transformSchema? Can we just re-use the result of that?
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;
    // $example off$
No Scala example?
Added a Scala example.
    val data = (0 until validData1.length).map { idx =>
      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    }
    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
toSeq not required here?
    val data = (0 until validData1.length).map { idx =>
      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    }
    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
Same here, toSeq unnecessary.
      }
    }

    test("multiple columns:: read/write") {
Nit: two : here
      testDefaultReadWrite(t)
    }

    test("Bucketizer in a pipeline") {
It may be overkill - but we would expect a Bucketizer with multiple columns set to behave in precisely the same way as multiple single-column Bucketizers. Perhaps a test comparing them?
Sure. Added a test for it.
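Something along these lines, as a sketch with made-up data and column names:

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketizer-equivalence").getOrCreate()
import spark.implicits._

val df = Seq((-0.5, 0.2), (0.1, 1.5), (2.0, -3.0)).toDF("f1", "f2")
val splits = Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)

// One multi-column Bucketizer binning both features at once.
val multi = new Bucketizer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("m1", "m2"))
  .setSplitsArray(Array(splits, splits))

// Two single-column Bucketizers applied one after the other.
val single1 = new Bucketizer().setInputCol("f1").setOutputCol("s1").setSplits(splits)
val single2 = new Bucketizer().setInputCol("f2").setOutputCol("s2").setSplits(splits)

val multiRows = multi.transform(df).select("m1", "m2").collect()
val singleRows = single2.transform(single1.transform(df)).select("s1", "s2").collect()

// On a small local DataFrame the row order is stable, so a pairwise
// comparison is enough for this sketch.
multiRows.zip(singleRows).foreach { case (m, s) => assert(m == s) }
```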
    pl.transform(df).select("result1", "expected1", "result2", "expected2")
      .collect().foreach {
        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
          assert(r1 === e1,
This logic is duplicated across a few test cases - perhaps we could factor it out into a utility method.
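For example, a hypothetical helper along these lines (name and signature are made up):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical utility: for each (result, expected) column pair, check
// row by row that the bucketized value matches the expected bucket.
def assertBucketResults(df: DataFrame, colPairs: Seq[(String, String)]): Unit = {
  val flat = colPairs.flatMap { case (r, e) => Seq(r, e) }
  df.select(flat.head, flat.tail: _*).collect().foreach { row =>
    colPairs.indices.foreach { i =>
      val (result, expected) = (row.getDouble(2 * i), row.getDouble(2 * i + 1))
      assert(result == expected,
        s"Column '${colPairs(i)._1}' was $result but expected $expected")
    }
  }
}

// Usage in a test:
// assertBucketResults(pl.transform(df),
//   Seq("result1" -> "expected1", "result2" -> "expected2"))
```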
     * `Bucketizer` maps a column of continuous features to a column of feature buckets. Since 2.3.0,
     * `Bucketizer` can also map multiple columns at once. Whether it goes to map a column or multiple
     * columns, it depends on which parameter of `inputCol` and `inputCols` is set. When both are set,
     * a log warning will be printed and by default it chooses `inputCol`.
We should probably also mention that `splits` is only used for single-column mode and `splitsArray` for multi-column mode.
Test build #82583 has finished for PR 17819 at commit
@MLnick Thanks for leaving the comments. I think I've addressed all of them. Please take a look when you are free. Thanks.

Test build #82632 has finished for PR 17819 at commit
@MLnick Any more comments or thoughts on this that I need to address?
Does this extension exist for QuantileDiscretizer as well?

@AFractalThought @viirya
@MLnick Is this ready to go?
I've created https://issues.apache.org/jira/browse/SPARK-22397 to track the changes in QuantileDiscretizer.
Thanks @MLnick

Thanks @huaxingao @MLnick @viirya this will be super helpful
@viirya could you resolve conflicts?

@MLnick Conflicts resolved. Thanks.

Test build #83519 has finished for PR 17819 at commit
@MarcKaminski by the way, you mentioned a vector bucketizer. I think in principle that might be useful. I'm not sure if it would make sense to add vector type support to the existing `Bucketizer`. Perhaps you can post a link to code / design doc on this JIRA for the vector type version?
MLnick left a comment:
LGTM. Will leave open for the rest of the day in case any other reviewer wants a final look.
Thanks @MLnick
About the vector bucketizer, it seems it might work similarly to the multi-column bucketizer, but some behaviors such as
Merged to master. Thanks @viirya and all the reviewers! |
        Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
      }.withName("bucketizer")
    val seqOfSplits = if (isBucketizeMultipleColumns()) {
      $(splitsArray).toSeq
I am interested in the difference between `.toSeq` and `Seq()`.
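For what it's worth, the two are quite different; a quick plain-Scala illustration:

```scala
val arr = Array(1.0, 2.0, 3.0)

// arr.toSeq views the array's elements as a Seq of three doubles.
val elems: Seq[Double] = arr.toSeq

// Seq(arr) builds a one-element Seq whose single entry is the whole array.
val wrapped: Seq[Array[Double]] = Seq(arr)

assert(elems.length == 3)
assert(wrapped.length == 1)
```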
Is there any Python example for this API?

@leliang65 The PySpark support is not added yet. Please refer to #19892.
What changes were proposed in this pull request?
Currently, ML's Bucketizer can only bin one column of continuous features. If a dataset has thousands of continuous columns that need to be binned, we end up with thousands of ML stages, which is inefficient for query planning and execution.
We should have a type of bucketizer that can bin many columns all at once. It would need to accept a list of arrays of split points corresponding to the columns to bin, and it can make things more efficient by replacing thousands of stages with just one.
The approach in this patch is to add a new `MultipleBucketizerInterface` for this purpose. `Bucketizer` now extends this new interface.
Performance
Benchmarking using the test dataset provided in JIRA SPARK-20392 (blockbuster.csv).
The ML pipeline includes 2 `StringIndexer`s and 1 `MultipleBucketizer` or 137 `Bucketizer`s to bin 137 input columns with the same splits. Then count the time to transform the dataset.
MultipleBucketizer: 3352 ms
Bucketizer: 51512 ms
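A rough sketch of the stage-count difference being measured (column names and splits are assumptions, not the actual benchmark code):

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.Bucketizer

val splits = Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity)
val inCols = (0 until 137).map(i => s"c$i").toArray

// One multi-column stage binning all 137 columns with the same splits.
val multiStage = new Bucketizer()
  .setInputCols(inCols)
  .setOutputCols(inCols.map(_ + "_bin"))
  .setSplitsArray(Array.fill(inCols.length)(splits))

// 137 single-column stages doing the same work one column at a time.
val singleStages: Array[PipelineStage] = inCols.map { c =>
  new Bucketizer().setInputCol(c).setOutputCol(c + "_bin").setSplits(splits): PipelineStage
}

// Timing transform() on each pipeline surfaces the per-stage
// planning/analysis overhead that dominates the single-column version.
val onePipeline = new Pipeline().setStages(Array[PipelineStage](multiStage))
val manyPipeline = new Pipeline().setStages(singleStages)
```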
How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.