
Conversation

@sethah (Contributor) commented Jun 17, 2016

JIRA: SPARK-16008

What changes were proposed in this pull request?

`LogisticAggregator` stores references to two arrays of dimension `numFeatures` which are unnecessarily serialized before the combine op. This makes the shuffle write ~3x larger than it needs to be (the MLlib implementation's shuffle write, for instance, is 3x smaller), and the factor will grow for multiclass logistic regression.

This patch modifies `LogisticAggregator.add` to accept the two arrays as method parameters, which avoids serializing them with the aggregator.
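
The pattern, in a minimal hypothetical sketch (class names and the gradient math here are illustrative, not the real Spark code):

```scala
// Before: coefficients and featuresStd become fields of the aggregator,
// so every serialized copy of it (merged during treeAggregate's combine
// op) carries 2 * numFeatures extra doubles through the shuffle.
class BloatedAggregator(
    coefficients: Array[Double],
    featuresStd: Array[Double]) extends Serializable {
  val gradientSumArray = Array.ofDim[Double](coefficients.length)

  def add(label: Double, features: Array[Double]): this.type = {
    var i = 0
    while (i < features.length) {
      // placeholder gradient update, not the real logistic loss
      gradientSumArray(i) += (label - coefficients(i)) * features(i) / featuresStd(i)
      i += 1
    }
    this
  }
}

// After: the arrays are method parameters. They still ship once per task
// inside the closure, but no longer travel with each aggregator instance.
class LeanAggregator(numFeatures: Int) extends Serializable {
  val gradientSumArray = Array.ofDim[Double](numFeatures)

  def add(label: Double, features: Array[Double],
      coefficients: Array[Double], featuresStd: Array[Double]): this.type = {
    var i = 0
    while (i < numFeatures) {
      // placeholder gradient update, not the real logistic loss
      gradientSumArray(i) += (label - coefficients(i)) * features(i) / featuresStd(i)
      i += 1
    }
    this
  }
}
```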

How was this patch tested?

I tested this locally and verified the serialization reduction.

![image](https://cloud.githubusercontent.com/assets/7275795/16140387/d2974bac-3404-11e6-94f9-268860c931a2.png)

Additionally, I ran some tests on a 4-node cluster (4x48 cores, 4x128 GB RAM). On a data set of 2M rows and 10k features, iterations ran >2x faster.

@sethah sethah changed the title [SPARK-16008] Remove unnecessary serialization in logistic regression [SPARK-16008][ML] Remove unnecessary serialization in logistic regression Jun 17, 2016
@sethah (Author) commented Jun 17, 2016

cc @jkbradley @dbtsai

@SparkQA commented Jun 17, 2016

Test build #60681 has finished for PR 13729 at commit ef8fdea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
- private val dim = if (fitIntercept) coefficientsArray.length - 1 else coefficientsArray.length
- private val gradientSumArray = Array.ofDim[Double](coefficientsArray.length)
+ private val dim = numFeatures
```
@srowen (Member) commented on the diff:
Do you need `dim` here, or can you just reference `numFeatures` later in the class?
I had to look twice at the line below to make sure the logic wasn't reversed from before, but I see why it works out.

@sethah (Author) replied:

I left it because this logic will likely change when multiclass is added. dim is used to check that the overall coefficients array is the correct length, which won't be numFeatures for multiclass. Still, I can remove it here if that seems better.
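
Concretely, the check that `dim` supports boils down to something like this sketch (assumed shape, not the exact patch; with an intercept, the coefficients array carries one extra slot):

```scala
// Sketch of the length invariant `dim` guards (illustrative helper).
def checkCoefficients(
    coefficients: Array[Double],
    numFeatures: Int,
    fitIntercept: Boolean): Unit = {
  val expectedSize = if (fitIntercept) numFeatures + 1 else numFeatures
  require(coefficients.length == expectedSize,
    s"Expected $expectedSize coefficients but got ${coefficients.length}.")
}
```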

@sethah (Author) replied:

On second thought, I like your suggestion. I updated it accordingly.

@srowen (Member) commented Jun 17, 2016

I think that makes sense.

@sethah (Author) commented Jun 17, 2016

@srowen Thanks for the review! I responded to your comments; let me know what you think.

@SparkQA commented Jun 17, 2016

Test build #60710 has finished for PR 13729 at commit 96b0a45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented Jun 17, 2016

Nice catch and LGTM! Merging into master and branch-2.0. Thanks!

asfgit pushed a commit that referenced this pull request Jun 17, 2016

[SPARK-16008][ML] Remove unnecessary serialization in logistic regression

JIRA: [SPARK-16008](https://issues.apache.org/jira/browse/SPARK-16008)

## What changes were proposed in this pull request?
`LogisticAggregator` stores references to two arrays of dimension `numFeatures` which are unnecessarily serialized before the combine op. This makes the shuffle write ~3x larger than it needs to be (the MLlib implementation's shuffle write, for instance, is 3x smaller), and the factor will grow for multiclass logistic regression.

This patch modifies `LogisticAggregator.add` to accept the two arrays as method parameters, which avoids serializing them with the aggregator.

## How was this patch tested?

I tested this locally and verified the serialization reduction.

![image](https://cloud.githubusercontent.com/assets/7275795/16140387/d2974bac-3404-11e6-94f9-268860c931a2.png)

Additionally, I ran some tests on a 4-node cluster (4x48 cores, 4x128 GB RAM). On a data set of 2M rows and 10k features, iterations ran >2x faster.

Author: sethah <[email protected]>

Closes #13729 from sethah/lr_improvement.

(cherry picked from commit 1f0a469)
Signed-off-by: Xiangrui Meng <[email protected]>
asfgit closed this in 1f0a469 on Jun 17, 2016
@SparkQA commented Jun 17, 2016

Test build #60712 has finished for PR 13729 at commit 5d668a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member) commented Jul 4, 2016

@sethah Late comment. Great improvement for high-dimensional problems. I didn't test it out myself, but I wonder whether a `@transient` annotation would work in the constructor of `LogisticAggregator`; the code would then be cleaner, using `c.add(instance)`. Thanks.

@jodersky (Member) commented Jul 5, 2016

Hi @dbtsai, I assisted @sethah with some serialization issues during this PR. I know we considered using `@transient` but I can't recall exactly why we decided against it.
My knowledge of the bigger picture of this PR is quite limited, but one explanation that comes to mind is that the `coefficients` and `featuresStd` parameters are only used within the `add` method, so the reasoning was to keep them as local as possible.

@dbtsai (Member) commented Jul 6, 2016

Hi @jodersky @sethah

Could you test whether `@transient` helps performance for the same serialization issue in `LinearRegression`?

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala

Thanks.

@sethah (Author) commented Jul 6, 2016

@dbtsai I'll take a look later this week.

asfgit pushed a commit that referenced this pull request Aug 8, 2016
## What changes were proposed in this pull request?
Similar to `LogisticAggregator`, the `LeastSquaresAggregator` used for linear regression ends up serializing the coefficients and the feature standard deviations, which is unnecessary and can cause performance issues for high-dimensional data. This patch removes this serialization.

In #13729 the approach was to pass these values directly to the `add` method. The approach used here is instead to mark these fields as transient, which keeps the signature of the `add` method simple and interpretable. The downside is that it requires `transient lazy val`s, which are difficult to reason about unless one is familiar with serialization in Scala/Spark.
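
A minimal sketch of that idea, assuming the arrays arrive through `Broadcast` variables (the class and names are illustrative, not the actual `LeastSquaresAggregator`):

```scala
import org.apache.spark.broadcast.Broadcast

// Illustrative sketch of the transient-lazy-val pattern.
class LazyAggregator(
    numFeatures: Int,
    bcCoefficients: Broadcast[Array[Double]]) extends Serializable {

  // The Broadcast handle is tiny, so serializing the aggregator stays cheap.
  // The array itself is @transient: it is dropped on serialization and
  // lazily re-fetched from the broadcast on first use after the aggregator
  // is deserialized on an executor.
  @transient private lazy val coefficients: Array[Double] = bcCoefficients.value

  private val gradientSumArray = Array.ofDim[Double](numFeatures)

  // The add signature stays simple: no array parameters needed.
  def add(label: Double, features: Array[Double]): this.type = {
    var i = 0
    while (i < numFeatures) {
      // placeholder gradient update, not the real least-squares loss
      gradientSumArray(i) += (label - coefficients(i)) * features(i)
      i += 1
    }
    this
  }
}
```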

## How was this patch tested?

**MLlib**
![image](https://cloud.githubusercontent.com/assets/7275795/16703660/436f79fa-4524-11e6-9022-ef00058ec718.png)

**ML without patch**
![image](https://cloud.githubusercontent.com/assets/7275795/16703831/c4d50b9e-4525-11e6-80cb-9b58c850cd41.png)

**ML with patch**
![image](https://cloud.githubusercontent.com/assets/7275795/16703675/63e0cf40-4524-11e6-9120-1f512a70e083.png)

Author: sethah <[email protected]>

Closes #14109 from sethah/LIR_serialize.