[SPARK-16934][ML][MLLib]Update LogisticCostAggregator serialization code to make it consistent with LinearRegression #14520

WeichenXu123 · 2016-08-06T15:44:47Z

What changes were proposed in this pull request?

Update LogisticCostAggregator serialization code to make it consistent with #14109

How was this patch tested?

MLlib 2.0:

After this PR:

WeichenXu123 · 2016-08-06T15:45:27Z

cc @sethah

srowen · 2016-08-06T15:57:20Z

Let's put this into #14109

WeichenXu123 · 2016-08-06T16:07:56Z

Oh, it's a different algorithm and several details differ, so I created a separate PR to keep the discussion clear. Thanks!

SparkQA · 2016-08-06T16:33:06Z

Test build #63315 has finished for PR 14520 at commit 417aa1e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-08-08T13:49:03Z

@WeichenXu123 I believe #13729 already took care of the actual serialization issue. Out of interest have you tested this impl here for a difference in shuffle data read/write?

However, #14109 and now #14519 do take a slightly different approach with BC vars and transient, so it is probably worthwhile to make them all consistent as per #14109 (comment).

cc @sethah @yanboliang @dbtsai

sethah · 2016-08-08T15:03:12Z

Well, I suppose this won't go into another PR since the other one got merged. I think it's correct to make this match the approach taken in Linear Regression. The current code doesn't quite match though, so could you take a look at #14109 and make this line up? Also, I'd like to see a comparison of MLlib and this patch to verify that the shuffle write size is the same for the tasks, to make sure we haven't undone anything. It's a good sanity check.

WeichenXu123 · 2016-08-09T02:11:31Z

@MLnick
The main improvement here concerns localFeaturesStd.
In the previous code, each call to CostFun.calculate serialized and broadcast the vector again.
Marking localFeaturesStd as a Spark broadcast variable avoids this problem.
Thanks!

sethah · 2016-08-09T03:33:50Z

The change here does not really affect serialization. Previously, Spark already broadcast the coefficients automatically each time calculate was called, and marking them as a broadcast variable explicitly is unlikely to have much of a performance effect (based on my own testing and the description here). What we need to do here is change the structure of the aggregator to match the fix for LeastSquaresAggregator: pass the featuresStd and coefficients as constructor args, but mark them as @transient lazy val.

I'm in favor of explicitly broadcasting the coefficients too, as was done in LeastSquaresAggregator, but we should explicitly destroy them as well. Thanks for working on this!
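For readers following the thread, the structure described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR or from #14109: `FakeBroadcast` is a hypothetical stand-in for Spark's broadcast handle so the example runs without Spark, and `SketchAggregator`, `add`, `merge`, and `meanMargin` are assumed names that only mimic the shape of an aggregator.

```scala
// Hypothetical stand-in for a Spark broadcast handle (NOT the real Spark API).
final class FakeBroadcast[T](private var data: Option[T]) extends Serializable {
  def value: T =
    data.getOrElse(throw new IllegalStateException("broadcast was destroyed"))
  def destroy(): Unit = { data = None }
}

// The aggregator takes the broadcast handles as constructor args and
// dereferences them through @transient lazy vals, so the unwrapped arrays
// are not written out when the aggregator itself is serialized.
class SketchAggregator(
    bcCoefficients: FakeBroadcast[Array[Double]],
    bcFeaturesStd: FakeBroadcast[Array[Double]]) extends Serializable {

  @transient private lazy val coefficients: Array[Double] = bcCoefficients.value
  @transient private lazy val featuresStd: Array[Double] = bcFeaturesStd.value

  private var marginSum = 0.0
  private var count = 0L

  // Accumulate the standardized margin of one instance (illustrative statistic only).
  def add(features: Array[Double]): this.type = {
    var margin = 0.0
    var i = 0
    while (i < features.length) {
      // Skip features with zero standard deviation.
      if (featuresStd(i) != 0.0) {
        margin += coefficients(i) * features(i) / featuresStd(i)
      }
      i += 1
    }
    marginSum += margin
    count += 1
    this
  }

  // Combine two partial aggregators, as a tree aggregation would.
  def merge(other: SketchAggregator): this.type = {
    marginSum += other.marginSum
    count += other.count
    this
  }

  def meanMargin: Double = if (count == 0L) 0.0 else marginSum / count
}
```

In this sketch only the handles are constructor state; the arrays are re-read from the handle on first use after deserialization, which is the point of combining constructor args with `@transient lazy val`.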

WeichenXu123 · 2016-08-09T06:08:27Z

@sethah
Thanks for your careful review!
The PR already passes bcFeaturesStd and bcCoeffs as constructor args to the LogisticAggregator, as in your PR #14109.

Do you mean adding two more members to LogisticAggregator, like
@transient lazy val featureStd = bcFeatureStd.value
@transient lazy val coeffs = bcCoeff.value
?

I will also add the explicit broadcast destroy soon!
Thanks.

WeichenXu123 · 2016-08-13T07:59:52Z

cc @yanboliang Thanks!

SparkQA · 2016-08-13T08:08:17Z

Test build #63719 has finished for PR 14520 at commit 90b981f.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-13T08:53:42Z

Test build #63720 has finished for PR 14520 at commit 04baa3c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-13T08:58:23Z

Test build #63722 has finished for PR 14520 at commit 2b6c867.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-08-14T13:52:05Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

  override def calculate(coefficients: BDV[Double]): (Double, BDV[Double]) = {
    val numFeatures = featuresStd.length
    val coeffs = Vectors.fromBreeze(coefficients)
+    val bcCoeffs = instances.context.broadcast(coeffs)


We should explicitly destroy bcCoeffs at the end of calculate by bcCoeffs.destroy(blocking = false) for each iteration.
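A rough sketch of the per-iteration lifecycle requested above, under stated assumptions: `SketchBroadcast` is a hypothetical stand-in for a Spark broadcast handle (its `destroy()` takes no `blocking` argument), the squared-error loss is a placeholder rather than the logistic loss, and wrapping the work in `try`/`finally` is an illustrative choice that the review comment does not specify.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a Spark broadcast handle (NOT the real Spark API).
final class SketchBroadcast[T](v: T) {
  private var live = true
  def value: T = { require(live, "broadcast already destroyed"); v }
  def destroy(): Unit = { live = false }
  def isLive: Boolean = live
}

object CostFunSketch {
  // Records every handle created, only so the lifecycle can be inspected afterwards.
  val created: ArrayBuffer[SketchBroadcast[Array[Double]]] = ArrayBuffer.empty

  // One optimizer iteration: broadcast the coefficients, compute the loss,
  // then release the broadcast before returning.
  def calculate(coefficients: Array[Double], data: Seq[(Double, Array[Double])]): Double = {
    val bcCoeffs = new SketchBroadcast(coefficients.clone())
    created += bcCoeffs
    try {
      data.map { case (label, features) =>
        val w = bcCoeffs.value
        val prediction = features.zip(w).map { case (x, c) => x * c }.sum
        val residual = label - prediction
        residual * residual // placeholder loss, not the logistic loss
      }.sum
    } finally {
      // Released once per call, so handles do not accumulate across iterations.
      bcCoeffs.destroy()
    }
  }
}
```

Each call creates one handle and destroys it before returning, so repeated optimizer iterations leave no live broadcasts behind.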

WeichenXu123 · 2016-08-14T14:22:28Z

@sethah I attached the test results, and they look good.

WeichenXu123 · 2016-08-14T14:23:10Z

@yanboliang Thanks for the careful review!

SparkQA · 2016-08-14T15:08:57Z

Test build #63744 has finished for PR 14520 at commit 87f5417.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2016-08-15T06:56:36Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

 * Two LogisticAggregator can be merged together to have a summary of loss and gradient of
 * the corresponding joint dataset.
 *
+ * @param bcCoeffs The broadcast coefficients corresponding to the features.


Call it bcCoefficients for consistency with other changes.

MLnick · 2016-08-15T07:03:10Z

@WeichenXu123 would you mind updating the title of the JIRA and PR, as well as the description, to reflect the fact that this is not actually affecting serialization, but more to update the approach to be consistent with the other changes made in #14109?

This is just for record-keeping to avoid any confusion in future. Thanks!

SparkQA · 2016-08-15T07:48:53Z

Test build #63778 has finished for PR 14520 at commit e7ff240.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-15T08:53:21Z

Test build #63779 has finished for PR 14520 at commit efe3d38.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-08-15T09:08:19Z

@yanboliang are you OK with this?

yanboliang · 2016-08-15T13:38:25Z

LGTM, merged into master. Thanks!

update

417aa1e

WeichenXu123 changed the title ~~[SPARK-16934][ML][MLLib] Improve LogisticCostFun to avoid redundant serielization~~ [SPARK-16934][ML][MLLib] Improve LogisticCostFun to avoid redundant serialization Aug 6, 2016

explicitly release bcFeatureStd broadcast

2b6c867

WeichenXu123 force-pushed the improve_logistic_regression_costfun branch 2 times, most recently from 04baa3c to 2b6c867 Compare August 13, 2016 08:07

WeichenXu123 added 2 commits August 13, 2016 21:33

some minor update.

87f5417

minor update.

efe3d38

yanboliang reviewed Aug 14, 2016

dbtsai reviewed Aug 15, 2016

WeichenXu123 changed the title ~~[SPARK-16934][ML][MLLib] Improve LogisticCostFun to avoid redundant serialization~~ [SPARK-16934][ML][MLLib]Update LogisticCostAggregator serialization code to make it consistent with LinearRegression Aug 15, 2016

WeichenXu123 force-pushed the improve_logistic_regression_costfun branch from e7ff240 to efe3d38 Compare August 15, 2016 08:03

asfgit closed this in 3d8bfe7 Aug 15, 2016

sethah mentioned this pull request Aug 16, 2016

[SPARK-7159][ML] Add multiclass logistic regression to Spark ML #13796

Closed

WeichenXu123 deleted the improve_logistic_regression_costfun branch April 24, 2019 21:18
