[SPARK-16934][ML][MLLib]Update LogisticCostAggregator serialization code to make it consistent with LinearRegression #14520
Conversation
cc @sethah
Let's put this into #14109
Oh, it's another algorithm and there are several different details, so to make it clear I created a separate PR to discuss it. Thanks!
Test build #63315 has finished for PR 14520 at commit
@WeichenXu123 I believe #13729 already took care of the actual serialization issue. Out of interest, have you tested this impl here for a difference in shuffle data read/write? However, #14109 and now #14519 do take a slightly different approach with BC vars.
Well, I suppose this won't go into another PR since the other one got merged. I think it's correct to make this match the approach taken in LinearRegression. The current code doesn't quite match though, so could you take a look at #14109 and make this line up? Also, I'd like to see a comparison of MLlib and this patch to verify that the shuffle write size is the same for the tasks, to make sure we haven't undone anything. It's a good sanity check.
@MLnick |
The change here does not really affect serialization: Spark automatically broadcasts the coefficients each time calculate is called, and marking them as an explicit broadcast variable won't likely have much of a performance effect (based on my own testing and the description here). What we need to do here is change the structure of the aggregator to match up with that fix. I'm in favor of explicitly broadcasting the coefficients too, as was done there.
@sethah You mean adding another two members into the aggregator, and explicitly destroying the broadcast? I will add it soon!
cc @yanboliang Thanks! |
(force-pushed from 04baa3c to 2b6c867)
Test build #63719 has finished for PR 14520 at commit
Test build #63720 has finished for PR 14520 at commit
Test build #63722 has finished for PR 14520 at commit
override def calculate(coefficients: BDV[Double]): (Double, BDV[Double]) = {
  val numFeatures = featuresStd.length
  val coeffs = Vectors.fromBreeze(coefficients)
  val bcCoeffs = instances.context.broadcast(coeffs)
We should explicitly destroy bcCoeffs at the end of calculate by bcCoeffs.destroy(blocking = false) for each iteration.
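The lifecycle being requested here can be sketched in plain Scala without a Spark cluster: broadcast the coefficients at the start of calculate, read them inside the aggregation, then destroy the broadcast at the end of each iteration. FakeBroadcast and calculateSketch below are hypothetical stand-ins, not Spark's API; in real Spark code the calls would be sc.broadcast(coeffs) and bcCoeffs.destroy(blocking = false), and the loss would be the logistic loss rather than the stand-in below.

```scala
// Spark-free sketch: FakeBroadcast stands in for org.apache.spark.broadcast.Broadcast[T].
class FakeBroadcast[T](v: T) {
  private var destroyed = false
  def value: T = {
    require(!destroyed, "attempted to read a destroyed broadcast")
    v
  }
  // Mirrors Broadcast.destroy(blocking): after this, the value must not be read again.
  def destroy(blocking: Boolean): Unit = { destroyed = true }
}

// One optimizer iteration: broadcast once, use the value inside the
// aggregation closure, then destroy so executors can free the memory
// before the next iteration.
def calculateSketch(coefficients: Array[Double],
                    instances: Seq[Array[Double]]): Double = {
  val bcCoeffs = new FakeBroadcast(coefficients)   // sc.broadcast(coeffs) in real Spark code
  val loss = instances.map { x =>
    val margin = x.zip(bcCoeffs.value).map { case (xi, wi) => xi * wi }.sum
    0.5 * margin * margin                          // stand-in loss term, not the logistic loss
  }.sum
  bcCoeffs.destroy(blocking = false)               // freed at the end of every calculate call
  loss
}
```

Destroying at the end of calculate keeps each iteration's broadcast from accumulating on the executors over many optimizer steps.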
@sethah I've attached the test result and it looks good.
@yanboliang Thanks for the careful review!
Test build #63744 has finished for PR 14520 at commit
 * Two LogisticAggregator can be merged together to have a summary of loss and gradient of
 * the corresponding joint dataset.
 *
 * @param bcCoeffs The broadcast coefficients corresponding to the features.
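The merge contract quoted in the scaladoc above (two aggregators over disjoint partitions combine into a summary of the joint dataset) can be sketched in plain Scala without Spark. SketchAggregator and its squared-error loss are illustrative stand-ins, not the actual LogisticAggregator; the point is only the add/merge algebra that treeAggregate relies on.

```scala
// Plain-Scala sketch of the add/merge contract described in the scaladoc above.
// SketchAggregator is hypothetical; the real LogisticAggregator uses the logistic
// loss, but the merge algebra (summing partial loss/gradient/weight) is the same shape.
class SketchAggregator(numFeatures: Int) extends Serializable {
  private var lossSum = 0.0
  private var weightSum = 0.0
  private val gradientSum = Array.fill(numFeatures)(0.0)

  // Fold one instance in (squared-error stand-in for the logistic loss).
  def add(label: Double, features: Array[Double], coefficients: Array[Double]): this.type = {
    val margin = features.zip(coefficients).map { case (x, w) => x * w }.sum
    val diff = margin - label
    lossSum += 0.5 * diff * diff
    weightSum += 1.0
    var i = 0
    while (i < numFeatures) { gradientSum(i) += diff * features(i); i += 1 }
    this
  }

  // Combine partial results from another partition of the data.
  def merge(other: SketchAggregator): this.type = {
    lossSum += other.lossSum
    weightSum += other.weightSum
    var i = 0
    while (i < numFeatures) { gradientSum(i) += other.gradientSum(i); i += 1 }
    this
  }

  def loss: Double = lossSum / weightSum
  def gradient: Array[Double] = gradientSum.map(_ / weightSum)
}
```

Merging two aggregators built over the two halves of a dataset yields the same loss and gradient as one aggregator over the whole dataset, which is exactly the property a distributed treeAggregate needs.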
Call it bcCoefficients for consistency with other changes.
@WeichenXu123 would you mind updating the title of the JIRA and PR, as well as the description, to reflect the fact that this is not actually affecting serialization, but rather updating the approach to be consistent with the other changes made in #14109? This is just for record-keeping to avoid any confusion in future. Thanks!
Test build #63778 has finished for PR 14520 at commit
(force-pushed from e7ff240 to efe3d38)
Test build #63779 has finished for PR 14520 at commit
@yanboliang are you OK with this?
LGTM, merged into master. Thanks!
What changes were proposed in this pull request?
Update LogisticCostAggregator serialization code to make it consistent with #14109
How was this patch tested?
MLlib 2.0:
[screenshot omitted: shuffle write size for MLlib 2.0]
After this PR:
[screenshot omitted: shuffle write size after this PR]