[SPARK-28062][ML] Avoid unnecessary copy of coefficients vector in HuberAggregator #24880

Andrew-Crosby · 2019-06-15T19:55:40Z

What changes were proposed in this pull request?

Modifies the HuberAggregator class so that a copy of the coefficients vector isn't created every time an instance is added. Follows the approach of LeastSquaresAggregator and uses a transient lazy class variable to store the reused quantities. (See #14109 for an explanation of the use of transient lazy variables.)

On the test case in the linked JIRA, this change gives an order-of-magnitude performance improvement, reducing the time taken to fit the model from 540 to 47 seconds.
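For readers unfamiliar with the pattern, here is a minimal sketch of the idea in plain Scala, without Spark. It is not the HuberAggregator code: ParamsHolder, CopyPerInstance and CopyOnce are hypothetical names, and ParamsHolder only stands in for the broadcast parameter vector. The first class slices the parameters on every add call; the second computes the slice once in a transient lazy field and reuses it.

```scala
// Hypothetical stand-in for the broadcast parameter vector (not Spark's Broadcast).
class ParamsHolder(val value: Array[Double]) extends Serializable

// Before: every call to add slices the parameters, allocating a new array.
class CopyPerInstance(params: ParamsHolder, numFeatures: Int) extends Serializable {
  def add(features: Array[Double]): Double = {
    val coefficients = params.value.slice(0, numFeatures) // fresh copy on each call
    var sum = 0.0
    var i = 0
    while (i < numFeatures) { sum += coefficients(i) * features(i); i += 1 }
    sum
  }
}

// After: the slice is computed on first use and reused by later calls.
class CopyOnce(params: ParamsHolder, numFeatures: Int) extends Serializable {
  // Transient, so the cached array is not written out when the object is serialized.
  @transient private lazy val coefficients: Array[Double] =
    params.value.slice(0, numFeatures)

  def add(features: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < numFeatures) { sum += coefficients(i) * features(i); i += 1 }
    sum
  }
}
```

Both classes compute the same dot product; the difference is only in how many temporary arrays are allocated when add is called once per training instance.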

How was this patch tested?

Existing unit tests.
See https://issues.apache.org/jira/browse/SPARK-28062 for results from running a benchmark script.

Andrew-Crosby · 2019-06-17T19:34:39Z

@yanboliang @sethah does this look reasonable to you?

mgaido91 · 2019-06-18T07:06:05Z

mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala

    0.0
  }
+  // make transient so we do not serialize between aggregation stages
+  @transient private lazy val featuresStd = bcFeaturesStd.value


I don't think this change makes sense. It is just getting the broadcast, not a big overhead...

srowen · Jun 18, 2019

Yes, this one isn't necessary. coefficients looks good, but it doesn't need to be lazy.

Andrew-Crosby · Jun 18, 2019

Thanks for the feedback. I've removed the unnecessary change to featuresStd.

@srowen I tried removing the lazy modifier, but that causes both the unit tests and my test case to fail with the following NPE. I don't understand why.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 3.0 failed 1 times, most recent failure: Lost task 2.0 in stage 3.0 (TID 11, localhost, executor driver): java.lang.NullPointerException
    at org.apache.spark.ml.optim.aggregator.HuberAggregator.$anonfun$add$3(HuberAggregator.scala:109)
    at org.apache.spark.ml.linalg.SparseVector.foreachActive(Vectors.scala:613)
    at org.apache.spark.ml.optim.aggregator.HuberAggregator.add(HuberAggregator.scala:107)

srowen · Jun 18, 2019

Oh, I get it. You wouldn't want to evaluate the broadcast eagerly, as it might be evaluated on the driver. OK, I think this is a reasonable solution.
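One plausible reading of the NPE, assuming the field stayed @transient after lazy was removed: an eager transient field is filled in when the object is constructed (on the driver), skipped during serialization, and left null after deserialization (on the executor). A transient lazy field is instead recomputed on first access after deserialization. The sketch below reproduces this with plain Java serialization and no Spark; FakeBroadcast, EagerField, LazyField and SerializationDemo are hypothetical names, and the behavior shown assumes Scala 2.12 or 2.13 compiled as an ordinary program.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a broadcast variable: it only wraps a value.
class FakeBroadcast(val value: Array[Double]) extends Serializable

// Eager transient field: set in the constructor, not written during serialization.
class EagerField(bc: FakeBroadcast) extends Serializable {
  @transient private val coefficients: Array[Double] = bc.value
  def first: Double = coefficients(0) // coefficients is null after deserialization
}

// Lazy transient field: initialized on first access, including after deserialization.
class LazyField(bc: FakeBroadcast) extends Serializable {
  @transient private lazy val coefficients: Array[Double] = bc.value
  def first: Double = coefficients(0)
}

object SerializationDemo {
  // Serialize and deserialize an object in memory, roughly as when shipping it to an executor.
  def roundTrip[T](obj: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

Calling first on a round-tripped LazyField returns the wrapped value, while the same call on a round-tripped EagerField throws a NullPointerException, which is consistent with the stack trace above.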

mgaido91 · 2019-06-18T21:04:54Z

LGTM. @srowen, could you please start the CI? Thanks.

SparkQA · 2019-06-18T22:20:59Z

Test build #4800 has finished for PR 24880 at commit cdb49c0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-06-19T13:57:17Z

Merged to master

Avoid unnecessary copy of coefficients vector each time an instance i…

b835c53

…s added Follows approach used in LeastSquaresAggregator

dongjoon-hyun added the ML label Jun 15, 2019

mgaido91 reviewed Jun 18, 2019

Revert unnecessary change

cdb49c0

srowen approved these changes Jun 18, 2019

srowen closed this in 36b327d Jun 19, 2019
