[SPARK-16008][ML] Remove unnecessary serialization in logistic regression #13729
Conversation
Test build #60681 has finished for PR 13729 at commit
```scala
// before:
private val dim = if (fitIntercept) coefficientsArray.length - 1 else coefficientsArray.length
// after:
private val gradientSumArray = Array.ofDim[Double](coefficientsArray.length)
private val dim = numFeatures
```
Do you need `dim` here, or can you just reference `numFeatures` later in the class?
I had to look twice at the line below to make sure the logic wasn't reversed from before, but I see why it works out.
I left it because this logic will likely change when multiclass is added. `dim` is used to check that the overall coefficients array is the correct length, which won't be `numFeatures` for multiclass. Still, I can remove it here if that seems better.
On second thought, I like your suggestion. I updated it accordingly.
I think that makes sense.
@srowen Thanks for the review! I responded to your comments, let me know what you think.
Test build #60710 has finished for PR 13729 at commit
Nice catch and LGTM! Merging into master and branch-2.0. Thanks!
…sion

JIRA: [SPARK-16008](https://issues.apache.org/jira/browse/SPARK-16008)

## What changes were proposed in this pull request?

`LogisticAggregator` stores references to two arrays of dimension `numFeatures` which are serialized before the combine op, unnecessarily. This results in the shuffle write being ~3x larger than it should be (for multiclass logistic regression, this factor will grow; in MLlib, for comparison, the shuffle write is 3x smaller). This patch modifies `LogisticAggregator.add` to accept the two arrays as method parameters, which avoids the serialization.

## How was this patch tested?

I tested this locally and verified the serialization reduction.

Additionally, I ran some tests on a 4-node cluster (4x48 cores, 4x128 GB RAM). A data set of 2M rows and 10k features showed a >2x iteration speedup.

Author: sethah <[email protected]>

Closes #13729 from sethah/lr_improvement.

(cherry picked from commit 1f0a469)
Signed-off-by: Xiangrui Meng <[email protected]>
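The pattern the PR applies can be sketched as follows: instead of storing the coefficient array as a field of the aggregator (where it would be serialized with the aggregator during the combine op), the array is passed into `add` as a parameter. This is a minimal, hypothetical sketch of the technique — `SimpleAggregator` and its members are illustrative stand-ins, not Spark's actual `LogisticAggregator`:

```scala
// Hypothetical sketch of "pass broadcast state as a parameter instead of a field".
// Only gradientSumArray and the scalar accumulators are part of this object's
// serialized state; the coefficients never are.
class SimpleAggregator(numFeatures: Int) extends Serializable {
  private val gradientSumArray = Array.ofDim[Double](numFeatures)
  private var lossSum = 0.0

  // `coefficients` arrives as a method parameter, so it is not captured in
  // the aggregator and is not shipped back during the combine step.
  def add(features: Array[Double], label: Double, coefficients: Array[Double]): this.type = {
    var margin = 0.0
    var i = 0
    while (i < numFeatures) { margin += features(i) * coefficients(i); i += 1 }
    val diff = margin - label
    i = 0
    while (i < numFeatures) { gradientSumArray(i) += diff * features(i); i += 1 }
    lossSum += 0.5 * diff * diff  // squared loss used here purely for illustration
    this
  }

  def loss: Double = lossSum
  def gradient: Array[Double] = gradientSumArray.clone()
}
```

With this shape, a `treeAggregate` seq-op can close over a broadcast variable and hand its value to `add` per row, so only the gradient sums travel with the serialized aggregator.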
Test build #60712 has finished for PR 13729 at commit
@sethah Late comment. Great improvement for high dimensional problems. I didn't test it out myself, and I wonder whether …
Hi @dbtsai, I assisted @sethah with some serialization issues during this PR. I know we considered using `transient`, but I can't recall exactly why we ended up not using it.
Could you test in Linear Regression, if … Thanks.
@dbtsai I'll take a look later this week.
## What changes were proposed in this pull request?

Similar to `LogisticAggregator`, the `LeastSquaresAggregator` used for linear regression ends up serializing the coefficients and the feature standard deviations, which is not necessary and can cause performance issues for high dimensional data. This patch removes that serialization. In #13729 the approach was to pass these values directly to the `add` method. The approach used here, initially, is to mark these fields as transient instead, which keeps the signature of the `add` method simple and interpretable. The downside is that it requires `transient lazy val`s, which are difficult to reason about for anyone not familiar with serialization in Scala/Spark.

## How was this patch tested?

Benchmark screenshots (**MLlib**, **ML without patch**, **ML with patch**) were attached to the PR.

Author: sethah <[email protected]>

Closes #14109 from sethah/LIR_serialize.
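The `transient lazy val` alternative described above can be sketched in plain Scala, without Spark. This is a hypothetical illustration of the serialization behavior — `Agg` and `serializedSize` are made-up names, not Spark classes:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Minimal sketch of the `@transient lazy val` approach: the derived array is
// excluded from the serialized form and recomputed lazily on first access
// (e.g. on the executor side after deserialization).
class Agg(raw: Array[Double]) extends Serializable {
  // Not serialized; rebuilt on demand from `raw`.
  @transient private lazy val inverseStd: Array[Double] = raw.map(1.0 / _)

  def scale(i: Int, v: Double): Double = v * inverseStd(i)
}

// Helper to measure an object's Java-serialized size, to observe that
// materializing the transient lazy val does not grow the payload.
def serializedSize(obj: AnyRef): Int = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(obj)
  oos.close()
  bos.size
}
```

This keeps `add`-style signatures clean, at the cost the description mentions: the field silently disappears during serialization and is rebuilt on first access, which makes the object's wire behavior non-obvious to readers.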