
Conversation

@sethah (Contributor) commented Jun 17, 2016

JIRA: SPARK-16008

What changes were proposed in this pull request?

`LogisticAggregator` stores references to two arrays of dimension `numFeatures` which are unnecessarily serialized before the combine op. This makes the shuffle write ~3x larger than it needs to be (the MLlib implementation's shuffle write, for instance, is 3x smaller), and the factor will grow for multiclass logistic regression.

This patch modifies `LogisticAggregator.add` to accept the two arrays as method parameters, which avoids serializing them with the aggregator.
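
The pattern, in a minimal hypothetical sketch (class names and the gradient math here are illustrative, not the real Spark code):

```scala
// Before: coefficients and featuresStd become fields of the aggregator,
// so every serialized copy of it (merged during treeAggregate's combine
// op) carries 2 * numFeatures extra doubles through the shuffle.
class BloatedAggregator(
    coefficients: Array[Double],
    featuresStd: Array[Double]) extends Serializable {
  val gradientSumArray = Array.ofDim[Double](coefficients.length)

  def add(label: Double, features: Array[Double]): this.type = {
    var i = 0
    while (i < features.length) {
      // placeholder gradient update, not the real logistic loss
      gradientSumArray(i) += (label - coefficients(i)) * features(i) / featuresStd(i)
      i += 1
    }
    this
  }
}

// After: the arrays are method parameters. They still ship once per task
// inside the closure, but no longer travel with each aggregator instance.
class LeanAggregator(numFeatures: Int) extends Serializable {
  val gradientSumArray = Array.ofDim[Double](numFeatures)

  def add(label: Double, features: Array[Double],
      coefficients: Array[Double], featuresStd: Array[Double]): this.type = {
    var i = 0
    while (i < numFeatures) {
      // placeholder gradient update, not the real logistic loss
      gradientSumArray(i) += (label - coefficients(i)) * features(i) / featuresStd(i)
      i += 1
    }
    this
  }
}
```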

How was this patch tested?

I tested this locally and verified the serialization reduction.

![image](https://cloud.githubusercontent.com/assets/7275795/16140387/d2974bac-3404-11e6-94f9-268860c931a2.png)

Additionally, I ran some tests on a 4-node cluster (4x48 cores, 4x128 GB RAM). On a data set of 2M rows and 10k features, iterations ran >2x faster.

@sethah sethah changed the title [SPARK-16008] Remove unnecessary serialization in logistic regression [SPARK-16008][ML] Remove unnecessary serialization in logistic regression Jun 17, 2016
@sethah (Author) commented Jun 17, 2016

cc @jkbradley @dbtsai

@SparkQA commented Jun 17, 2016

Test build #60681 has finished for PR 13729 at commit ef8fdea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
- private val dim = if (fitIntercept) coefficientsArray.length - 1 else coefficientsArray.length
- private val gradientSumArray = Array.ofDim[Double](coefficientsArray.length)
+ private val dim = numFeatures
```
@srowen (Member) commented on the diff:
Do you need `dim` here, or can you just reference `numFeatures` later in the class?
I had to look twice at the line below to make sure the logic wasn't reversed from before, but I see why it works out.

@sethah (Author) replied:

I left it because this logic will likely change when multiclass is added. dim is used to check that the overall coefficients array is the correct length, which won't be numFeatures for multiclass. Still, I can remove it here if that seems better.
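
Concretely, the check that `dim` supports boils down to something like this sketch (assumed shape, not the exact patch; with an intercept, the coefficients array carries one extra slot):

```scala
// Sketch of the length invariant `dim` guards (illustrative helper).
def checkCoefficients(
    coefficients: Array[Double],
    numFeatures: Int,
    fitIntercept: Boolean): Unit = {
  val expectedSize = if (fitIntercept) numFeatures + 1 else numFeatures
  require(coefficients.length == expectedSize,
    s"Expected $expectedSize coefficients but got ${coefficients.length}.")
}
```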

@sethah (Author) replied:

On second thought, I like your suggestion. I updated it accordingly.

@srowen (Member) commented Jun 17, 2016

I think that makes sense.

@sethah (Author) commented Jun 17, 2016

@srowen Thanks for the review! I responded to your comments; let me know what you think.

@SparkQA commented Jun 17, 2016

Test build #60710 has finished for PR 13729 at commit 96b0a45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented Jun 17, 2016

Nice catch and LGTM! Merging into master and branch-2.0. Thanks!

asfgit pushed a commit that referenced this pull request Jun 17, 2016

[SPARK-16008][ML] Remove unnecessary serialization in logistic regression

JIRA: [SPARK-16008](https://issues.apache.org/jira/browse/SPARK-16008)

## What changes were proposed in this pull request?
`LogisticAggregator` stores references to two arrays of dimension `numFeatures` which are unnecessarily serialized before the combine op. This makes the shuffle write ~3x larger than it needs to be (the MLlib implementation's shuffle write, for instance, is 3x smaller), and the factor will grow for multiclass logistic regression.

This patch modifies `LogisticAggregator.add` to accept the two arrays as method parameters, which avoids serializing them with the aggregator.

## How was this patch tested?

I tested this locally and verified the serialization reduction.

![image](https://cloud.githubusercontent.com/assets/7275795/16140387/d2974bac-3404-11e6-94f9-268860c931a2.png)

Additionally, I ran some tests on a 4-node cluster (4x48 cores, 4x128 GB RAM). On a data set of 2M rows and 10k features, iterations ran >2x faster.

Author: sethah <[email protected]>

Closes #13729 from sethah/lr_improvement.

(cherry picked from commit 1f0a469)
Signed-off-by: Xiangrui Meng <[email protected]>
asfgit closed this in 1f0a469 on Jun 17, 2016
@SparkQA commented Jun 17, 2016

Test build #60712 has finished for PR 13729 at commit 5d668a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member) commented Jul 4, 2016

@sethah Late comment. Great improvement for high-dimensional problems. I didn't test it out myself, but I wonder whether a `@transient` annotation would work in the constructor of `LogisticAggregator`; the code would then be cleaner, using `c.add(instance)`. Thanks.

@jodersky (Member) commented Jul 5, 2016

Hi @dbtsai, I assisted @sethah with some serialization issues during this PR. I know we considered using `@transient` but I can't recall exactly why we decided against it.
My knowledge of the bigger picture of this PR is quite limited, but one explanation that comes to mind is that the `coefficients` and `featuresStd` parameters are only used within the `add` method, so the reasoning was to keep them as local as possible.

@dbtsai (Member) commented Jul 6, 2016

Hi @jodersky @sethah

Could you test whether `@transient` helps performance for the same serialization issue in `LinearRegression`?

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala

Thanks.

@sethah (Author) commented Jul 6, 2016

@dbtsai I'll take a look later this week.

asfgit pushed a commit that referenced this pull request Aug 8, 2016
## What changes were proposed in this pull request?
Similar to `LogisticAggregator`, the `LeastSquaresAggregator` used for linear regression ends up serializing the coefficients and the feature standard deviations, which is unnecessary and can cause performance issues for high-dimensional data. This patch removes this serialization.

In #13729 the approach was to pass these values directly to the `add` method. The approach used here is instead to mark these fields as transient, which keeps the signature of the `add` method simple and interpretable. The downside is that it requires `transient lazy val`s, which are difficult to reason about unless one is familiar with serialization in Scala/Spark.
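
A minimal sketch of that idea, assuming the arrays arrive through `Broadcast` variables (the class and names are illustrative, not the actual `LeastSquaresAggregator`):

```scala
import org.apache.spark.broadcast.Broadcast

// Illustrative sketch of the transient-lazy-val pattern.
class LazyAggregator(
    numFeatures: Int,
    bcCoefficients: Broadcast[Array[Double]]) extends Serializable {

  // The Broadcast handle is tiny, so serializing the aggregator stays cheap.
  // The array itself is @transient: it is dropped on serialization and
  // lazily re-fetched from the broadcast on first use after the aggregator
  // is deserialized on an executor.
  @transient private lazy val coefficients: Array[Double] = bcCoefficients.value

  private val gradientSumArray = Array.ofDim[Double](numFeatures)

  // The add signature stays simple: no array parameters needed.
  def add(label: Double, features: Array[Double]): this.type = {
    var i = 0
    while (i < numFeatures) {
      // placeholder gradient update, not the real least-squares loss
      gradientSumArray(i) += (label - coefficients(i)) * features(i)
      i += 1
    }
    this
  }
}
```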

## How was this patch tested?

**MLlib**
![image](https://cloud.githubusercontent.com/assets/7275795/16703660/436f79fa-4524-11e6-9022-ef00058ec718.png)

**ML without patch**
![image](https://cloud.githubusercontent.com/assets/7275795/16703831/c4d50b9e-4525-11e6-80cb-9b58c850cd41.png)

**ML with patch**
![image](https://cloud.githubusercontent.com/assets/7275795/16703675/63e0cf40-4524-11e6-9120-1f512a70e083.png)

Author: sethah <[email protected]>

Closes #14109 from sethah/LIR_serialize.