-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-16404][ML] LeastSquaresAggregators serializes unnecessary data #14109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before, these values were assigned simultaneously implicitly using a pattern match. It turns out that marking it as @transient lazy val doesn't work because the unapply method generates a Tuple2 which does not contain the transient tag. The individual vals are still transient, but the tuple is not and thus gets serialized. This obscure/hidden consequence of pattern matching is one good argument not to use the @transient approach. e.g. the following doesn't work
@transient private lazy val (x, y) = {
...
(x, y)
}|
ping @dbtsai I implemented this patch using |
|
Test build #62003 has finished for PR 14109 at commit
|
|
Jenkins, test this please. |
|
Test build #63123 has finished for PR 14109 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coefAndOffset should be effectiveCoefficientsAndOffset be better? It's effective coefficients rather than coefficients.
|
@sethah I left two inline comments. Otherwise, LGTM. Thanks! |
|
@yanboliang Do you have thoughts on my comments regarding the trade-offs with using I'll address your other comments shortly. Thanks! |
|
Test build #63175 has finished for PR 14109 at commit
|
| val coefficientsArray = coefficients.toArray.clone() | ||
| private val dim = bcCoefficients.value.size | ||
| @transient private lazy val featuresStd = bcFeaturesStd.value | ||
| @transient private lazy val effectiveCoefAndOffset = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about @transient private lazy val (effectiveCoefficientsVector: Vector, offset: Double) = ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sethah has explained this issue in comment which has been folded.
@transient private lazy val (effectiveCoefficientsVector: Vector, offset: Double) will generates a Tuple2 which does not contain the transient tag. The individual vals are still transient, but the tuple is not and thus gets serialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this is indeed obscure. I like the fact that using @transient will tell those fields are not being serialized. However, this can be difficulty to debug. How about have the documentation written in the code? Or we can do def initializeEffectiveCoefficientsVectorAndOffset, and call it in the add method for the first time? I don't have strong opinion about this.
|
@sethah In my opinion, I think using |
|
@dbtsai @yanboliang I went ahead and added a couple comments so someone will not mistakenly change this behavior in the future. Let me know if you see anything else, thanks! |
|
|
||
| val totalGradientArray = leastSquaresAggregator.gradient.toArray | ||
| bcCoeffs.destroy(blocking = false) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, why do we not explicitly destroy bcFeaturesStd and bcFeaturesMean here? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot destroy them here because they are used on every iteration. I just added a commit to fix this so that, after the algorithm is run, we destroy the broadcast variables.
|
Test build #63282 has finished for PR 14109 at commit
|
|
Test build #63288 has finished for PR 14109 at commit
|
|
The current fix for broadcast variable destroy is ok. LGTM. Thanks! |
|
LGTM. Merged into master. Great work! Thanks. |
|
Would be great to have LOR sharing similar style and destroy mean and variance after usage. Thanks. |
…es unnecessary data. ## What changes were proposed in this pull request? Similar to ```LeastSquaresAggregator``` in #14109, ```AFTAggregator``` used for ```AFTSurvivalRegression``` ends up serializing the ```parameters``` and ```featuresStd```, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization. This PR is highly inspired by #14109. ## How was this patch tested? I tested this locally and verified the serialization reduction. Before patch  After patch  Author: Yanbo Liang <[email protected]> Closes #14519 from yanboliang/spark-16933.
…code to make it consistent with LinearRegression ## What changes were proposed in this pull request? Update LogisticCostAggregator serialization code to make it consistent with #14109 ## How was this patch tested? MLlib 2.0:  After this PR:  Author: WeichenXu <[email protected]> Closes #14520 from WeichenXu123/improve_logistic_regression_costfun.
What changes were proposed in this pull request?
Similar to
LogisticAggregator,LeastSquaresAggregatorused for linear regression ends up serializing the coefficients and the features standard deviations, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization.In #13729 the approach was to pass these values directly to the add method. The approach used here, initially, is to mark these fields as transient instead which gives the benefit of keeping the signature of the add method simple and interpretable. The downside is that it requires the use of
@transient lazy vals which are difficult to reason about if one is not quite familiar with serialization in Scala/Spark.How was this patch tested?
MLlib

ML without patch

ML with patch
