@WeichenXu123
Contributor

@WeichenXu123 WeichenXu123 commented Aug 6, 2016

What changes were proposed in this pull request?

Update LogisticCostAggregator serialization code to make it consistent with #14109

How was this patch tested?

MLlib 2.0: [image]

After this PR: [image]

@WeichenXu123
Contributor Author

WeichenXu123 commented Aug 6, 2016

cc @sethah

@WeichenXu123 WeichenXu123 changed the title from "[SPARK-16934][ML][MLLib] Improve LogisticCostFun to avoid redundant serielization" to "[SPARK-16934][ML][MLLib] Improve LogisticCostFun to avoid redundant serialization" Aug 6, 2016
@srowen
Member

srowen commented Aug 6, 2016

Let's put this into #14109

@WeichenXu123
Contributor Author

Oh, it's a different algorithm and several details differ, so to keep things clear I created a separate PR to discuss it. Thanks!

@SparkQA

SparkQA commented Aug 6, 2016

Test build #63315 has finished for PR 14520 at commit 417aa1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Aug 8, 2016

@WeichenXu123 I believe #13729 already took care of the actual serialization issue. Out of interest have you tested this impl here for a difference in shuffle data read/write?

However, #14109 and now #14519 do take a slightly different approach with BC vars and transient, so it is probably worthwhile to make them all consistent as per #14109 (comment).

cc @sethah @yanboliang @dbtsai

@sethah
Contributor

sethah commented Aug 8, 2016

Well, I suppose this won't go into another PR since the other one got merged. I think it's correct to make this match the approach taken in Linear Regression. The current code doesn't quite match though, so could you take a look at #14109 and make this line up? Also, I'd like to see a comparison of MLlib and this patch to verify that the shuffle write size is the same for the tasks, to make sure we haven't undone anything. It's a good sanity check.

@WeichenXu123
Contributor Author

WeichenXu123 commented Aug 9, 2016

@MLnick
The main improvement here concerns localFeaturesStd.
In the previous code, each call to CostFun.calculate serializes and broadcasts the vector.
Marking localFeaturesStd as a Spark broadcast variable avoids this problem.
Thanks!

@sethah
Contributor

sethah commented Aug 9, 2016

The change here does not really affect serialization. Before this patch, Spark already broadcast the coefficients automatically each time calculate was called, so marking them explicitly as a broadcast variable is unlikely to have much of a performance effect (based on my own testing and the description here). What we need to do is change the structure of the aggregator to match the fix for LeastSquaresAggregator: pass featuresStd and coefficients as constructor args, but mark them as @transient lazy val.

I'm in favor of explicitly broadcasting the coefficients too, as was done in LeastSquaresAggregator, but we should explicitly destroy them as well. Thanks for working on this!
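
For readers following along, here is a minimal sketch of the constructor-arg plus @transient lazy val pattern described above. It deliberately avoids Spark so that it runs on its own: FakeBroadcast, FakeBroadcastRegistry, EagerAggregator, LazyAggregator and TransientLazyDemo are all hypothetical names made up for illustration, not the classes touched by this PR, and FakeBroadcast only imitates the idea that a broadcast handle is small while its value is fetched locally.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.collection.mutable

// Hypothetical stand-in for a Spark broadcast handle: only the id travels with
// the object; the value is fetched from a local registry when it is accessed.
object FakeBroadcastRegistry {
  private val values = mutable.Map.empty[Long, Array[Double]]
  def register(id: Long, value: Array[Double]): FakeBroadcast = {
    values(id) = value
    new FakeBroadcast(id)
  }
  def lookup(id: Long): Array[Double] = values(id)
}

class FakeBroadcast(val id: Long) extends Serializable {
  def value: Array[Double] = FakeBroadcastRegistry.lookup(id)
}

// Holds the vector as a plain field, so every serialized copy carries it.
class EagerAggregator(val coefficients: Array[Double]) extends Serializable

// Takes the handle as a constructor arg; the local view of the vector is
// @transient (skipped by serialization) and lazy (rebuilt on first use).
class LazyAggregator(bcCoefficients: FakeBroadcast) extends Serializable {
  @transient private lazy val coefficients: Array[Double] = bcCoefficients.value
  def dim: Int = coefficients.length
}

object TransientLazyDemo {
  // Size in bytes of the Java-serialized form of `obj`.
  def serializedSize(obj: AnyRef): Int = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()
    buffer.size()
  }

  def main(args: Array[String]): Unit = {
    val coefficients = Array.fill(10000)(1.0)
    val eager = new EagerAggregator(coefficients)
    val lazyAgg = new LazyAggregator(FakeBroadcastRegistry.register(0L, coefficients))
    lazyAgg.dim // force the lazy val; it is still skipped by serialization
    println(s"eager: ${serializedSize(eager)} bytes")
    println(s"lazy:  ${serializedSize(lazyAgg)} bytes")
  }
}
```

The point of the comparison is only that the aggregator holding a handle stays small even after the lazy value has been materialized, which is why the shuffle write size is a useful sanity check.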

@WeichenXu123
Contributor Author

@sethah
Thanks for your careful review!
This PR already passes bcFeaturesStd and bcCoeffs as constructor args to LogisticAggregator, as in your PR #14109.

Do you mean adding two more members to LogisticAggregator, like
@transient lazy val featuresStd = bcFeaturesStd.value
@transient lazy val coeffs = bcCoeffs.value
?

I will also add the explicit broadcast destroy soon!
Thanks.

@WeichenXu123
Contributor Author

cc @yanboliang Thanks!

@WeichenXu123 WeichenXu123 force-pushed the improve_logistic_regression_costfun branch 2 times, most recently from 04baa3c to 2b6c867 Compare August 13, 2016 08:07
@SparkQA

SparkQA commented Aug 13, 2016

Test build #63719 has finished for PR 14520 at commit 90b981f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 13, 2016

Test build #63720 has finished for PR 14520 at commit 04baa3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 13, 2016

Test build #63722 has finished for PR 14520 at commit 2b6c867.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

override def calculate(coefficients: BDV[Double]): (Double, BDV[Double]) = {
val numFeatures = featuresStd.length
val coeffs = Vectors.fromBreeze(coefficients)
val bcCoeffs = instances.context.broadcast(coeffs)
Contributor

We should explicitly destroy bcCoeffs at the end of calculate, in each iteration, by calling bcCoeffs.destroy(blocking = false).
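
A rough sketch of the broadcast-then-destroy shape suggested here, written without Spark so it can run standalone. DestroyableBroadcast and CostFunSketch are hypothetical stand-ins invented for illustration; only the destroy(blocking = false) call shape is taken from the comment above, and the loss is a plain logistic loss over an in-memory Seq rather than the PR's aggregator.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a broadcast handle that tracks destruction.
final class DestroyableBroadcast[T](payload: T) {
  private var destroyed = false
  def value: T = {
    require(!destroyed, "broadcast was already destroyed")
    payload
  }
  // `blocking` is ignored here; it only mirrors the call shape in the review.
  def destroy(blocking: Boolean): Unit = { destroyed = true }
  def isDestroyed: Boolean = destroyed
}

object CostFunSketch {
  // Handles created so far, kept only so the example can inspect them.
  val created = ArrayBuffer.empty[DestroyableBroadcast[Array[Double]]]

  // Mean logistic loss and gradient over (label, features) pairs, labels in {0, 1}.
  def calculate(
      coefficients: Array[Double],
      instances: Seq[(Double, Array[Double])]): (Double, Array[Double]) = {
    val bcCoefficients = new DestroyableBroadcast(coefficients) // one per call
    created += bcCoefficients
    try {
      val gradient = new Array[Double](coefficients.length)
      var loss = 0.0
      instances.foreach { case (label, features) =>
        val w = bcCoefficients.value
        val margin = features.indices.map(i => w(i) * features(i)).sum
        val p = 1.0 / (1.0 + math.exp(-margin))
        loss += -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))
        features.indices.foreach(i => gradient(i) += (p - label) * features(i))
      }
      (loss / instances.size, gradient.map(_ / instances.size))
    } finally {
      // Release the handle at the end of every iteration, even on failure.
      bcCoefficients.destroy(blocking = false)
    }
  }
}
```

Using try/finally is one possible design choice, not something the review asks for; it just guarantees that each call releases the handle it created.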

@WeichenXu123
Contributor Author

@sethah I attached the test results and they look good.

@WeichenXu123
Contributor Author

@yanboliang Thanks for the careful review!

@SparkQA

SparkQA commented Aug 14, 2016

Test build #63744 has finished for PR 14520 at commit 87f5417.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Two LogisticAggregator can be merged together to have a summary of loss and gradient of
* the corresponding joint dataset.
*
* @param bcCoeffs The broadcast coefficients corresponding to the features.
Member

Call it bcCoefficients for consistency with other changes.

@MLnick
Contributor

MLnick commented Aug 15, 2016

@WeichenXu123 would you mind updating the title of the JIRA and PR, as well as the description, to reflect the fact that this is not actually affecting serialization, but more to update the approach to be consistent with the other changes made in #14109?

This is just for record-keeping to avoid any confusion in future. Thanks!

@WeichenXu123 WeichenXu123 changed the title from "[SPARK-16934][ML][MLLib] Improve LogisticCostFun to avoid redundant serialization" to "[SPARK-16934][ML][MLLib]Update LogisticCostAggregator serialization code to make it consistent with LinearRegression" Aug 15, 2016
@SparkQA

SparkQA commented Aug 15, 2016

Test build #63778 has finished for PR 14520 at commit e7ff240.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 WeichenXu123 force-pushed the improve_logistic_regression_costfun branch from e7ff240 to efe3d38 Compare August 15, 2016 08:03
@SparkQA

SparkQA commented Aug 15, 2016

Test build #63779 has finished for PR 14520 at commit efe3d38.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 15, 2016

@yanboliang are you OK with this?

@yanboliang
Contributor

LGTM, merged into master. Thanks!

@asfgit asfgit closed this in 3d8bfe7 Aug 15, 2016
@WeichenXu123 WeichenXu123 deleted the improve_logistic_regression_costfun branch April 24, 2019 21:18