[SPARK-4581][MLlib] Refactorize StandardScaler to improve the transformation performance #3435

dbtsai · 2014-11-25T00:00:40Z

The following optimizations improve the performance of the StandardScaler model
transformation:

1) Convert the Breeze dense vector to a primitive array to reduce overhead.
2) Since the mean can potentially be a sparse vector, explicitly convert it to a dense primitive array.
3) Keep local references to the shift and factor arrays so the JVM can locate each value with a single operation.
4) In the pattern-matching part, use the mllib SparseVector/DenseVector instead of Breeze's vectors to
make the codebase cleaner.
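
To illustrate the pattern-matching and primitive-array points, here is a minimal sketch rather than the code in this PR. `Vec`, `DenseVec`, `SparseVec`, and `ScaleSketch` are hypothetical stand-ins for the mllib vector types, and only the withStd (no mean) case is shown.

```scala
// Hypothetical stand-ins for mllib's DenseVector and SparseVector.
sealed trait Vec
final case class DenseVec(values: Array[Double]) extends Vec
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

object ScaleSketch {
  // Scales each feature by factor(featureIndex), working on primitive arrays only.
  def scale(v: Vec, factor: Array[Double]): Vec = v match {
    case DenseVec(vs) =>
      val out = vs.clone()
      var i = 0
      while (i < out.length) {
        out(i) *= factor(i)
        i += 1
      }
      DenseVec(out)
    case SparseVec(size, indices, vs) =>
      // Only the stored (non-zero) entries are touched, so sparsity is preserved.
      val out = vs.clone()
      var i = 0
      while (i < out.length) {
        out(i) *= factor(indices(i))
        i += 1
      }
      SparseVec(size, indices, out)
  }
}
```

Subtracting the mean is left out of the sparse branch on purpose, since shifting would turn every zero entry into a non-zero one.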

Benchmark with mnist8m dataset:

Before,
DenseVector withMean and withStd: 50.97secs
DenseVector withMean and withoutStd: 42.11secs
DenseVector withoutMean and withStd: 8.75secs
SparseVector withoutMean and withStd: 5.437secs

With this PR,
DenseVector withMean and withStd: 5.76secs
DenseVector withMean and withoutStd: 5.28secs
DenseVector withoutMean and withStd: 5.30secs
SparseVector withoutMean and withStd: 1.27secs

Note that without the local reference copies of the factor and shift arrays,
the runtime is about three times slower:

DenseVector withMean and withStd: 18.15secs
DenseVector withMean and withoutStd: 18.05secs
DenseVector withoutMean and withStd: 18.54secs
SparseVector withoutMean and withStd: 2.01secs

The following code,

while (i < size) {
   values(i) = (values(i) - shift(i)) * factor(i)
   i += 1
}

will generate the bytecode

   L13
    LINENUMBER 106 L13
   FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I] []
    ILOAD 7
    ILOAD 6
    IF_ICMPGE L14
   L15
    LINENUMBER 107 L15
    ALOAD 5
    ILOAD 7
    ALOAD 5
    ILOAD 7
    DALOAD
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.shift ()[D
    ILOAD 7
    DALOAD
    DSUB
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
    ILOAD 7
    DALOAD
    DMUL
    DASTORE
   L16
    LINENUMBER 108 L16
    ILOAD 7
    ICONST_1
    IADD
    ISTORE 7
    GOTO L13

, while with the local reference of the shift and factor arrays, the bytecode will be

   L14
    LINENUMBER 107 L14
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
    ASTORE 9
   L15
    LINENUMBER 108 L15
   FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector [D org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I [D] []
    ILOAD 8
    ILOAD 7
    IF_ICMPGE L16
   L17
    LINENUMBER 109 L17
    ALOAD 6
    ILOAD 8
    ALOAD 6
    ILOAD 8
    DALOAD
    ALOAD 2
    ILOAD 8
    DALOAD
    DSUB
    ALOAD 9
    ILOAD 8
    DALOAD
    DMUL
    DASTORE
   L18
    LINENUMBER 110 L18
    ILOAD 8
    ICONST_1
    IADD
    ISTORE 8
    GOTO L15

You can see that with local references, both arrays are held in local variable slots, so the JVM can load them with ALOAD instead of calling INVOKESPECIAL on every iteration.
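
For readers who want to try this outside of Spark, the local-reference pattern can be sketched as follows. This is a simplified illustration, not the actual StandardScalerModel; the class name `HoistSketch` and its constructor are assumptions. Whether member reads compile to accessor calls depends on the Scala compiler version.

```scala
class HoistSketch(mean: Array[Double], variance: Array[Double]) {
  // Plain private members: in the Scala version discussed here, reads of these
  // compile to calls to generated accessor methods.
  private val shift: Array[Double] = mean.clone()
  private val factor: Array[Double] =
    variance.map(v => if (v != 0.0) 1.0 / math.sqrt(v) else 0.0)

  // Standardizes `values` in place: (x - mean) / stddev, or 0 when the variance is 0.
  def transformInPlace(values: Array[Double]): Unit = {
    // Read each member once, outside the loop, so the loop body only
    // touches local variables.
    val localShift = shift
    val localFactor = factor
    val size = values.length
    var i = 0
    while (i < size) {
      values(i) = (values(i) - localShift(i)) * localFactor(i)
      i += 1
    }
  }
}
```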

SparkQA · 2014-11-25T00:05:11Z

Test build #23800 has started for PR 3435 at commit 5bffd3d.

This patch merges cleanly.

SparkQA · 2014-11-25T00:10:04Z

Test build #23801 has started for PR 3435 at commit fc795e4.

This patch merges cleanly.

SparkQA · 2014-11-25T00:50:32Z

Test build #23803 has started for PR 3435 at commit 9c51eef.

This patch merges cleanly.

mengxr · 2014-11-25T01:13:15Z

@dbtsai Did you measure the performance gain from the following change?

Have a local reference to shift and factor array so JVM can locate the value with one operation call.

Could you post the generated bytecode?

SparkQA · 2014-11-25T01:25:07Z

Test build #23805 has started for PR 3435 at commit cdb5cef.

This patch merges cleanly.

SparkQA · 2014-11-25T01:32:30Z

Test build #23800 has finished for PR 3435 at commit 5bffd3d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-25T01:32:33Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23800/

SparkQA · 2014-11-25T01:37:46Z

Test build #23801 has finished for PR 3435 at commit fc795e4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-25T01:37:49Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23801/

SparkQA · 2014-11-25T02:13:38Z

Test build #23803 has finished for PR 3435 at commit 9c51eef.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-25T02:13:41Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23803/

SparkQA · 2014-11-25T02:51:52Z

Test build #23805 has finished for PR 3435 at commit cdb5cef.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-25T02:51:56Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23805/

dbtsai · 2014-11-25T03:21:52Z

@mengxr

Without the local reference copies of the factor and shift arrays, the runtime is about three times slower:

DenseVector withMean and withStd: 18.15secs
DenseVector withMean and withoutStd: 18.05secs
DenseVector withoutMean and withStd: 18.54secs
SparseVector withoutMean and withStd: 2.01secs

The following code,

while (i < size) {
   values(i) = (values(i) - shift(i)) * factor(i)
   i += 1
}

will generate the bytecode

   L13
    LINENUMBER 106 L13
   FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I] []
    ILOAD 7
    ILOAD 6
    IF_ICMPGE L14
   L15
    LINENUMBER 107 L15
    ALOAD 5
    ILOAD 7
    ALOAD 5
    ILOAD 7
    DALOAD
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.shift ()[D
    ILOAD 7
    DALOAD
    DSUB
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
    ILOAD 7
    DALOAD
    DMUL
    DASTORE
   L16
    LINENUMBER 108 L16
    ILOAD 7
    ICONST_1
    IADD
    ISTORE 7
    GOTO L13

, while with the local reference of the shift and factor arrays, the bytecode will be

   L14
    LINENUMBER 107 L14
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
    ASTORE 9
   L15
    LINENUMBER 108 L15
   FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector [D org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I [D] []
    ILOAD 8
    ILOAD 7
    IF_ICMPGE L16
   L17
    LINENUMBER 109 L17
    ALOAD 6
    ILOAD 8
    ALOAD 6
    ILOAD 8
    DALOAD
    ALOAD 2
    ILOAD 8
    DALOAD
    DSUB
    ALOAD 9
    ILOAD 8
    DALOAD
    DMUL
    DASTORE
   L18
    LINENUMBER 110 L18
    ILOAD 8
    ICONST_1
    IADD
    ISTORE 8
    GOTO L15

You can see that with local references, both arrays are held in local variable slots, so the JVM can load them with ALOAD instead of calling INVOKESPECIAL on every iteration.

dbtsai · 2014-11-25T03:24:01Z

PS, we may want to go through the mllib codebase and find things like this. This issue impacts performance quite a lot.

mengxr · 2014-11-25T03:57:08Z

@dbtsai What if we mark factor and shift as private[this]?

dbtsai · 2014-11-25T04:19:57Z

Wow, with

  private[this] val factor: Array[Double] = {
    val f = Array.ofDim[Double](variance.size)
    var i = 0
    while (i < f.size) {
      f(i) = if (variance(i) != 0.0) 1.0 / math.sqrt(variance(i)) else 0.0
      i += 1
    }
    f
  }

  private[this] val shift: Array[Double] = mean.toArray

and

            while (i < size) {
              values(i) = (values(i) - shift(i)) * factor(i)
              i += 1
            }

, I got different bytecode, as follows:

   L14
    LINENUMBER 108 L14
   FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector [D org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I] []
    ILOAD 8
    ILOAD 7
    IF_ICMPGE L15
   L16
    LINENUMBER 109 L16
    ALOAD 6
    ILOAD 8
    ALOAD 6
    ILOAD 8
    DALOAD
    ALOAD 0
    GETFIELD org/apache/spark/mllib/feature/StandardScalerModel.shift : [D
    ILOAD 8
    DALOAD
    DSUB
    ALOAD 0
    GETFIELD org/apache/spark/mllib/feature/StandardScalerModel.factor : [D
    ILOAD 8
    DALOAD
    DMUL
    DASTORE
   L17
    LINENUMBER 110 L17
    ILOAD 8
    ICONST_1
    IADD
    ISTORE 8
    GOTO L14

It's slightly slower than the local reference version.
DenseVector withMean and withStd: 5.92secs
DenseVector withMean and withoutStd: 5.36secs
DenseVector withoutMean and withStd: 5.51secs
SparseVector withoutMean and withStd: 1.30secs

Instead of calling INVOKESPECIAL, it's now calling GETFIELD.
What's the difference between private[this] and private? Also, it doesn't
work with private[this] lazy val, which generates the same bytecode
as private lazy val. As a result, shift and factor will always be evaluated
when we create the model.

mengxr · 2014-11-25T04:48:33Z

By default, Scala generates Java methods for members, no matter whether you use val or def. That's why you saw invokespecial for shift and factor. But if a member is marked as private[this], it generates the code as a private field, and hence you see getfield in the bytecode.

I like using private[this] instead of having local references for code simplicity, but no strong preference. The current form looks good to me.
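
A small sketch for comparing the two modifiers side by side. The class `FieldAccessSketch` and its members are hypothetical. Compiling it and running `javap -p -c FieldAccessSketch` should show which form each read takes; the exact bytecode may differ across Scala compiler versions.

```scala
class FieldAccessSketch(variance: Array[Double]) {
  // Class-private: per the explanation above, reads go through a generated method.
  private val viaAccessor: Array[Double] = variance.clone()

  // Object-private: per the explanation above, reads become direct field loads.
  private[this] val viaField: Array[Double] = variance.clone()

  def sumViaAccessor: Double = {
    var total = 0.0
    var i = 0
    while (i < viaAccessor.length) {
      total += viaAccessor(i)
      i += 1
    }
    total
  }

  def sumViaField: Double = {
    var total = 0.0
    var i = 0
    while (i < viaField.length) {
      total += viaField(i)
      i += 1
    }
    total
  }
}
```

Both methods compute the same result; only the way the member is reached inside the loop differs.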

mengxr · 2014-11-25T04:52:39Z

mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala

We don't need lazy here, because mean.toArray is not expensive.

SparkQA · 2014-11-25T05:27:54Z

Test build #23817 has started for PR 3435 at commit daf2b06.

This patch merges cleanly.

SparkQA · 2014-11-25T06:48:08Z

Test build #23817 has finished for PR 3435 at commit daf2b06.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-25T06:48:11Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23817/

mengxr · 2014-11-25T07:43:09Z

mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala

shift is only used in this branch. Shall we just put val shift = mean.toArray here instead of having a member variable?

Oh, I'll change it back to lazy since it will not be evaluated in those branches which don't use shift. I don't want to create shift array/object for each sample since shift will always be the same.

shift only holds a reference to mean.values. We don't really need to define it as a member and make it lazy. It should give the same performance if we only define it inside the if branch.

For other vector implementations, toArray can be very expensive. For example, toArray on a sparse vector has to create a new array object and loop through all the non-zero values. Having a single lazy shift member prevents that from happening for every sample.
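
The trade-off argued for here can be sketched like this. `LazyShiftSketch`, its constructor parameters, and the `conversions` counter are assumptions made for illustration. The sparse mean is passed as raw size/indices/values instead of an mllib vector so that the example stays self-contained.

```scala
class LazyShiftSketch(meanSize: Int, meanIndices: Array[Int], meanValues: Array[Double]) {
  // Counts how many times the sparse-to-dense conversion actually runs.
  var conversions = 0

  // Evaluated at most once, and only if some code path reads it.
  lazy val shift: Array[Double] = {
    conversions += 1
    val dense = new Array[Double](meanSize)
    var k = 0
    while (k < meanIndices.length) {
      dense(meanIndices(k)) = meanValues(k)
      k += 1
    }
    dense
  }

  // Returns a centered copy of `values`; the first call triggers the conversion.
  def center(values: Array[Double]): Array[Double] = {
    val localShift = shift
    val out = values.clone()
    var i = 0
    while (i < out.length) {
      out(i) -= localShift(i)
      i += 1
    }
    out
  }
}
```

A model that never centers its input never pays for the conversion, and one that does pays once instead of once per sample.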

SparkQA · 2014-11-25T08:00:06Z

Test build #23825 has started for PR 3435 at commit 85885a9.

This patch merges cleanly.

SparkQA · 2014-11-25T09:27:01Z

Test build #23825 has finished for PR 3435 at commit 85885a9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-25T09:27:04Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23825/

mengxr · 2014-11-25T19:07:47Z

LGTM. Merged into master and branch-1.2. Thanks!

Author: DB Tsai

Closes #3435 from dbtsai/standardscaler and squashes the following commits:

85885a9 [DB Tsai] revert to have lazy in shift array.
daf2b06 [DB Tsai] Address the feedback
cdb5cef [DB Tsai] small change
9c51eef [DB Tsai] style
fc795e4 [DB Tsai] update
5bffd3d [DB Tsai] first commit

(cherry picked from commit bf1a6aa)
Signed-off-by: Xiangrui Meng

first commit

5bffd3d

update

fc795e4

style

9c51eef

small change

cdb5cef

mengxr reviewed Nov 25, 2014

Address the feedback

daf2b06

mengxr reviewed Nov 25, 2014

revert to have lazy in shift array.

85885a9

asfgit closed this in bf1a6aa Nov 25, 2014

dbtsai deleted the standardscaler branch November 25, 2014 21:33

dbtsai mentioned this pull request Dec 9, 2014

[SPARK-2309][MLlib] Generalize the binary logistic regression into multinomial logistic regression #1379

Merged
