[SPARK-16933] [ML] Fix AFTAggregator in AFTSurvivalRegression serializes unnecessary data. #14519

yanboliang · 2016-08-06T13:43:59Z

What changes were proposed in this pull request?

Similar to LeastSquaresAggregator in #14109, AFTAggregator used for AFTSurvivalRegression ends up serializing the parameters and featuresStd, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization. This PR is highly inspired by #14109.

How was this patch tested?

I tested this locally and verified the serialization reduction.

Before patch

After patch

SparkQA · 2016-08-06T14:33:37Z

Test build #63314 has finished for PR 14519 at commit d152a3a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-08-06T14:35:49Z

cc @sethah @dbtsai

srowen · 2016-08-06T15:57:12Z

Let's put this into #14109

dbtsai · 2016-08-08T07:29:27Z

mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala

  // sigma is the scale parameter of the AFT model
-  private val sigma = math.exp(parameters(0))
+  @transient private lazy val sigma = math.exp(parameters(0))



In line 506,

private val gradientSumArray = Array.ofDim[Double](parameters.length)

the code will evaluate the lazy parameters in the driver.
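The effect described above can be reproduced without Spark. The following is only a minimal sketch: `FakeBroadcast`, `EagerAggregator`, and `DeferredAggregator` are hypothetical names, not classes from the patch, and `FakeBroadcast` merely counts reads where a real broadcast variable would fetch data. It shows that an eager val which touches `parameters.length` forces the lazy val at construction time, while passing the length in separately does not.

```scala
object LazyForcingSketch {
  // Hypothetical stand-in for a broadcast variable; counts how often `value` is read.
  final class FakeBroadcast(data: Array[Double]) {
    var reads = 0
    def value: Array[Double] = { reads += 1; data }
  }

  // The eager val below touches `parameters`, so the lazy val is forced
  // while the object is being constructed (in Spark, that is on the driver).
  class EagerAggregator(bc: FakeBroadcast) {
    @transient private lazy val parameters = bc.value
    private val gradientSumArray = Array.ofDim[Double](parameters.length)
    def size: Int = gradientSumArray.length
  }

  // Passing the length in as its own argument leaves `parameters`
  // unevaluated until it is first used.
  class DeferredAggregator(bc: FakeBroadcast, length: Int) {
    @transient private lazy val parameters = bc.value
    private val gradientSumArray = Array.ofDim[Double](length)
    def size: Int = gradientSumArray.length
    def firstParameter: Double = parameters(0)
  }

  def main(args: Array[String]): Unit = {
    val bc1 = new FakeBroadcast(Array(0.1, 0.2, 0.3))
    new EagerAggregator(bc1)
    println(s"eager: reads after construction = ${bc1.reads}")

    val bc2 = new FakeBroadcast(Array(0.1, 0.2, 0.3))
    val agg = new DeferredAggregator(bc2, 3)
    println(s"deferred: reads after construction = ${bc2.reads}")
    agg.firstParameter
    println(s"deferred: reads after first use = ${bc2.reads}")
  }
}
```

Whether the patch passes the length as a constructor argument or stores it in some other way is not shown in this thread; the sketch only illustrates why reading `parameters.length` in an eager val defeats the lazy modifier.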

BTW, after thinking a bit, some of the lazy modifiers are not needed. lazy is for avoiding computation in the driver; however, @transient private val parameters = bcParameters.value should work without lazy. Also, sigma and intercept may not need lazy. Thanks.

@dbtsai I addressed the parameters.length issue. But I cannot remove lazy from @transient private lazy val parameters = bcParameters.value or from intercept/sigma; otherwise, it throws a NullPointerException. If I remove both @transient and lazy, it works well, but this does not meet our requirements. It's a little weird and I'm still working to figure out the root cause. Can you give me some suggestions? Thanks.

You are right. In Scala, a @transient private val without lazy is evaluated only once, when the object is constructed; it is not evaluated again after a serialization/deserialization cycle. As a result, after the AFTAggregator is broadcast to the executors, the variable will not be evaluated again and will default to null.
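This behaviour can be checked with plain Java serialization, independent of Spark. The sketch below is an illustration under stated assumptions: `EagerHolder`, `LazyHolder`, and `roundTrip` are hypothetical names, and the constructor argument `source` stands in for `bcParameters`. Unlike a real broadcast handle, `source` carries the data itself, so the sketch demonstrates only the null-versus-recomputed behaviour, not the size reduction.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object TransientLazySketch {
  // Eager @transient val: set once in the constructor and skipped by serialization,
  // so the deserialized copy sees the JVM default (null).
  class EagerHolder(source: Array[Double]) extends Serializable {
    @transient private val parameters: Array[Double] = source
    def parametersIsNull: Boolean = parameters == null
  }

  // @transient lazy val: also skipped by serialization, but recomputed on first
  // access in the deserialized copy from the captured `source` field.
  class LazyHolder(source: Array[Double]) extends Serializable {
    @transient private lazy val parameters: Array[Double] = source
    def first: Double = parameters(0)
  }

  // Serialize to a byte array and read the object back, as a stand-in for
  // shipping it to another JVM.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val eager = roundTrip(new EagerHolder(Array(0.5, 1.5)))
    println(s"eager val is null after round trip: ${eager.parametersIsNull}")

    val lazyCopy = roundTrip(new LazyHolder(Array(0.5, 1.5)))
    println(s"lazy val recomputed after round trip: ${lazyCopy.first}")
  }
}
```

Dereferencing the eager field in the deserialized copy would raise the NullPointerException reported earlier in this thread, which is why the lazy modifier has to stay alongside @transient.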

SparkQA · 2016-08-08T14:28:28Z

Test build #63362 has finished for PR 14519 at commit 287f153.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-08-08T15:47:56Z

@yanboliang Can you do a quick test to make sure the shuffle write size is the expected size? For example, in logistic regression only the gradient should be serialized which is an array of numFeatures doubles. The expected shuffle write size is then roughly numFeatures * 8 bytes for each task. It would be nice to check before/after.
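The estimate above is simple arithmetic and can be written down as a quick sanity check. This is only a rough sketch: `expectedGradientBytes` is a hypothetical helper, and it counts just the raw doubles in the gradient array, ignoring any serializer framing or other fields that travel with the aggregator.

```scala
object ShuffleSizeSketch {
  // A JVM double occupies 8 bytes.
  val BytesPerDouble: Long = 8L

  // Rough per-task size of a gradient array with one double per feature.
  def expectedGradientBytes(numFeatures: Int): Long =
    numFeatures.toLong * BytesPerDouble

  def main(args: Array[String]): Unit = {
    for (n <- Seq(1000, 100000, 1000000)) {
      println(s"numFeatures = $n -> about ${expectedGradientBytes(n)} bytes per task")
    }
  }
}
```

A measured shuffle write size far above this figure would suggest that something besides the gradient is still being serialized.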

dbtsai · 2016-08-08T17:58:51Z

LGTM. It will be nice to see the comparison of shuffle write size, and then this will be ready to merge. Thanks.

yanboliang · 2016-08-09T10:36:54Z

I tested this locally and verified the serialization reduction. I posted the shuffle size comparison diagram in the PR description. I will merge this into master. Thanks for your review! @dbtsai @sethah

Fix AFTAggregator in AFTSurvivalRegression serializes unnecessary data.

d152a3a

dbtsai reviewed Aug 8, 2016

Make parameters.length as a variable.

287f153

MLnick mentioned this pull request Aug 8, 2016

[SPARK-16934][ML][MLLib]Update LogisticCostAggregator serialization code to make it consistent with LinearRegression #14520

Closed

asfgit closed this in 182e119 Aug 9, 2016

yanboliang deleted the spark-16933 branch August 9, 2016 10:44
