Conversation

@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Oct 12, 2020

What changes were proposed in this pull request?

1. Use maxBlockSizeInMB instead of blockSize (#rows) to control the stacking of vectors.
2. Infer an appropriate maxBlockSizeInMB when it is set to 0.

Why are the changes needed?

The performance gain depends mainly on the number of non-zero entries (nnz) in each block.

f2jBLAS

Duration (milliseconds). Columns 0.0625 through 128 are blockSizeInMB values.

| Dataset | branch 3.0 impl | 0.0625 | 0.125 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| epsilon(100%) | 326481 | 26143 | 25710 | 24726 | 25395 | 25840 | 26846 | 25927 | 27431 | 26190 | 26056 | 26347 | 27204 |
| epsilon3000(67%) | 455247 | 35893 | 34366 | 34985 | 38387 | 38901 | 40426 | 40044 | 39161 | 38767 | 39965 | 39523 | 39108 |
| epsilon4000(50%) | 306390 | 42256 | 41164 | 43748 | 48638 | 50892 | 50986 | 51091 | 51072 | 51289 | 51652 | 53312 | 52146 |
| epsilon5000(40%) | 307619 | 43639 | 42992 | 44743 | 50800 | 51939 | 51871 | 52190 | 53850 | 52607 | 51062 | 52509 | 51570 |
| epsilon10000(20%) | 310070 | 58371 | 55921 | 56317 | 56618 | 53694 | 52131 | 51768 | 51728 | 52233 | 51881 | 51653 | 52440 |
| epsilon20000(10%) | 316565 | 109193 | 95121 | 82764 | 69653 | 60764 | 56066 | 53371 | 52822 | 52872 | 52769 | 52527 | 53508 |
| epsilon200000(1%) | 336181 | 1569721 | 1069355 | 673718 | 375043 | 218230 | 145393 | 110926 | 94327 | 87039 | 83926 | 81890 | 81787 |

Speedup (relative to the branch 3.0 implementation)

| Dataset | branch 3.0 impl | 0.0625 | 0.125 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| epsilon(100%) | 1 | 12.48827602 | 12.69859977 | 13.20395535 | 12.85611341 | 12.63471362 | 12.16125307 | 12.59231689 | 11.90189931 | 12.46586483 | 12.5299739 | 12.39158158 | 12.00121306 |
| epsilon3000(67%) | 1 | 12.68344803 | 13.2470174 | 13.01263399 | 11.85940553 | 11.70270687 | 11.26124276 | 11.36866946 | 11.62500958 | 11.74315784 | 11.39114225 | 11.51853351 | 11.64076404 |
| epsilon4000(50%) | 1 | 7.250804619 | 7.443154212 | 7.003520161 | 6.299395534 | 6.020396133 | 6.00929667 | 5.996946625 | 5.999177632 | 5.973795551 | 5.931812902 | 5.747111345 | 5.875618456 |
| epsilon5000(40%) | 1 | 7.049176196 | 7.155261444 | 6.875243055 | 6.055492126 | 5.92269778 | 5.930462108 | 5.894213451 | 5.712516249 | 5.847491779 | 6.024421292 | 5.858405226 | 5.965076595 |
| epsilon10000(20%) | 1 | 5.312055644 | 5.544786395 | 5.505797539 | 5.4765269 | 5.774760681 | 5.947900481 | 5.98960748 | 5.994239097 | 5.93628549 | 5.976561747 | 6.002942714 | 5.912852784 |
| epsilon20000(10%) | 1 | 2.899132728 | 3.328024306 | 3.824911797 | 4.544886796 | 5.209745902 | 5.64629187 | 5.931404695 | 5.993052137 | 5.987384627 | 5.999071425 | 6.026710073 | 5.916218136 |
| epsilon200000(1%) | 1 | 0.214166084 | 0.314377358 | 0.498993644 | 0.896379882 | 1.540489392 | 2.312222734 | 3.03067811 | 3.563995463 | 3.862417997 | 4.005683578 | 4.105275369 | 4.110445425 |

OpenBLAS

Duration (milliseconds). Columns 0.0625 through 128 are blockSizeInMB values.

| Dataset | branch 3.0 impl | 0.0625 | 0.125 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| epsilon(100%) | 299119 | 26047 | 25049 | 25239 | 28001 | 35138 | 36438 | 36279 | 36114 | 35111 | 35428 | 36295 | 35197 |
| epsilon3000(67%) | 439798 | 33321 | 34423 | 34336 | 38906 | 51756 | 54138 | 54085 | 53412 | 54766 | 54425 | 54221 | 54842 |
| epsilon4000(50%) | 302963 | 42960 | 40678 | 43483 | 48254 | 50888 | 54990 | 52647 | 51947 | 51843 | 52891 | 53410 | 52020 |
| epsilon5000(40%) | 303569 | 44225 | 44961 | 45065 | 51768 | 52776 | 51930 | 53587 | 53104 | 51833 | 52138 | 52574 | 53756 |
| epsilon10000(20%) | 307403 | 58447 | 55993 | 56757 | 56694 | 54038 | 52734 | 52073 | 52051 | 52150 | 51986 | 52407 | 52390 |
| epsilon20000(10%) | 313344 | 107580 | 94679 | 83329 | 70226 | 60996 | 57130 | 55461 | 54641 | 52712 | 52541 | 53101 | 53312 |
| epsilon200000(1%) | 334679 | 1642726 | 1073148 | 654481 | 364974 | 213881 | 140248 | 107579 | 91757 | 85090 | 81940 | 80492 | 80250 |

Speedup (relative to the branch 3.0 implementation)

| Dataset | branch 3.0 impl | 0.0625 | 0.125 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| epsilon(100%) | 1 | 11.48381771 | 11.94135494 | 11.85146004 | 10.68243991 | 8.512692811 | 8.208985125 | 8.244962651 | 8.282632774 | 8.519238985 | 8.443011178 | 8.241328007 | 8.498423161 |
| epsilon3000(67%) | 1 | 13.19882356 | 12.7762833 | 12.80865564 | 11.30411762 | 8.497526857 | 8.123646976 | 8.131607655 | 8.234067251 | 8.030493372 | 8.080808452 | 8.111211523 | 8.01936472 |
| epsilon4000(50%) | 1 | 7.052211359 | 7.44783421 | 6.967389555 | 6.278505409 | 5.953525389 | 5.509419895 | 5.754610899 | 5.832155851 | 5.843855487 | 5.728063376 | 5.672402172 | 5.823971549 |
| epsilon5000(40%) | 1 | 6.86419446 | 6.751829363 | 6.736247642 | 5.864027971 | 5.752027437 | 5.845734643 | 5.664974714 | 5.716499699 | 5.856674319 | 5.822413595 | 5.774127896 | 5.647164968 |
| epsilon10000(20%) | 1 | 5.259517169 | 5.490025539 | 5.416124883 | 5.422143437 | 5.688645028 | 5.829313157 | 5.903308816 | 5.905803923 | 5.894592522 | 5.913188166 | 5.865685882 | 5.867589235 |
| epsilon20000(10%) | 1 | 2.912660346 | 3.309540658 | 3.760323537 | 4.461937174 | 5.137123746 | 5.48475407 | 5.649807973 | 5.734594901 | 5.944452876 | 5.963799699 | 5.900905821 | 5.87755102 |
| epsilon200000(1%) | 1 | 0.203733915 | 0.311866583 | 0.511365494 | 0.916994087 | 1.564790701 | 2.38633706 | 3.111006795 | 3.647449241 | 3.933235398 | 4.084439834 | 4.157916315 | 4.170454829 |
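
The Speedup rows appear to be the branch 3.0 duration divided by the duration measured at each block size. A minimal Scala sketch of that arithmetic, using three durations copied from the f2jBLAS epsilon(100%) row; the object and method names are illustrative and do not come from the PR:

```scala
object SpeedupExample {
  // Durations in milliseconds, copied from the f2jBLAS epsilon(100%) row.
  val baselineMs: Double = 326481.0 // branch 3.0 implementation
  val blockedMs: Seq[Double] = Seq(26143.0, 25710.0, 24726.0) // blockSizeInMB = 0.0625, 0.125, 0.25

  // Speedup is the baseline duration divided by the new duration.
  def speedup(baseline: Double, duration: Double): Double = baseline / duration

  val speedups: Seq[Double] = blockedMs.map(d => speedup(baselineMs, d))

  def main(args: Array[String]): Unit =
    speedups.foreach(s => println(f"$s%.4f"))
}
```

A value below 1, as in the epsilon200000(1%) rows at small block sizes, means the blocked implementation ran slower than the baseline.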

Does this PR introduce any user-facing change?

Yes, the param blockSize is replaced by blockSizeInMB in master.

How was this patch tested?

Added test suites and a performance test (results attached to the ticket).

@zhengruifeng
Contributor Author

ping @WeichenXu123

@zero323 I sent a new PR here, thanks for reviewing. I tried to verify the consistency of the annotations locally, but the following command failed:

mypy --no-incremental --config python/mypy.ini python/pyspark
python/pyspark/ml/linalg/__init__.pyi:25: error: misplaced type annotation

I installed mypy with sudo apt install mypy on Ubuntu 18.04.
I am not very familiar with mypy; do I need to configure it somewhere?

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34257/

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34257/

@SparkQA

SparkQA commented Oct 12, 2020

Test build #129653 has finished for PR 30009 at commit eb0cf6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait HasBlockSizeInMB extends Params
  • class HasBlockSizeInMB(Params):

@zero323
Member

zero323 commented Oct 12, 2020

@zero323 I sent a new PR here, thanks for reviewing. I tried to verify the consistency of the annotations locally, but the following command failed:

mypy --no-incremental --config python/mypy.ini python/pyspark
python/pyspark/ml/linalg/__init__.pyi:25: error: misplaced type annotation

I installed mypy with sudo apt install mypy on Ubuntu 18.04.
I am not very familiar with mypy; do I need to configure it somewhere?

No additional configuration should be required, but the version from the Ubuntu repositories is pretty old, and at first glance it doesn't support error codes (the [import] part).

Personally I'd recommend either venv or miniconda, but if you want a quick fix, installing pip and doing a user install should do the trick:

sudo apt purge mypy
sudo apt install python3-pip
pip install mypy

I've checked things on my side (mypy 0.790, current stable), for both master and this PR, and things look good.

Contributor

@WeichenXu123 WeichenXu123 left a comment

Made first pass.
Overall good.

@zhengruifeng
Contributor Author

@zero323 Yes, that is because the version installed via sudo apt install mypy is too old (0.560).
pip install mypy works for me. Thank you!

@SparkQA

SparkQA commented Oct 13, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34330/

@SparkQA

SparkQA commented Oct 13, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34330/

@SparkQA

SparkQA commented Oct 13, 2020

Test build #129724 has finished for PR 30009 at commit 9cd1053.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34346/

@SparkQA

SparkQA commented Oct 13, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34346/

@SparkQA

SparkQA commented Oct 13, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34348/

@SparkQA

SparkQA commented Oct 13, 2020

Test build #129740 has finished for PR 30009 at commit 08cf27d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2020

Test build #129742 has finished for PR 30009 at commit 9245263.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34348/

Contributor

@WeichenXu123 WeichenXu123 left a comment

Let's simplify logic:

new Iterator[T] {
  override def hasNext: Boolean = rowIter.hasNext
  override def next(): T = {
    val buff = ..
    var buffNnz = 0L
    while (rowIter.hasNext && estimateSize(...) < maxMemUsage) {
      val row = rowIter.next()
      buff.append(row)
      buffNnz += ...
    }
    // The block's memory usage may slightly exceed the threshold, which is not a big issue.
    // This also ensures that each block holds one row even if a single row exceeds the block limit.
    InstanceBlock.fromBuff(buff)
  }
}
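
The suggestion above can be fleshed out into a self-contained sketch. This is not the code merged in the PR: Row, estimateSize and blockify are hypothetical stand-ins for Spark's Instance, its memory estimate and InstanceBlock, and the byte costs (8 bytes per dense value, 12 bytes per sparse non-zero) are assumptions chosen for illustration.

```scala
import scala.collection.mutable.ArrayBuffer

object BlockifySketch {
  // Hypothetical row type: a feature array standing in for a Spark ML instance.
  final case class Row(features: Array[Double]) {
    def nnz: Int = features.count(_ != 0.0)
  }

  // Assumed size estimate in bytes: the cheaper of a dense layout
  // (8 bytes per value) and a sparse layout (12 bytes per non-zero).
  def estimateSize(dim: Int, nnz: Long, numRows: Int): Long =
    math.min(8L * dim * numRows, 12L * nnz)

  // Groups rows into blocks whose estimated size stays close to maxMemUsage bytes.
  def blockify(rows: Iterator[Row], maxMemUsage: Long): Iterator[Seq[Row]] =
    new Iterator[Seq[Row]] {
      override def hasNext: Boolean = rows.hasNext

      override def next(): Seq[Row] = {
        if (!rows.hasNext) throw new NoSuchElementException("next on empty iterator")
        val buff = ArrayBuffer.empty[Row]
        var buffNnz = 0L
        var dim = 0
        // Always take the first row, then keep adding rows while the
        // estimate for the rows collected so far is below the limit.
        while (rows.hasNext &&
            (buff.isEmpty || estimateSize(dim, buffNnz, buff.size) < maxMemUsage)) {
          val row = rows.next()
          buff += row
          buffNnz += row.nnz
          dim = row.features.length
        }
        buff.toSeq
      }
    }
}
```

Because the size check runs before each row is added, a block can overshoot the limit by at most one row, and a row larger than the limit ends up in a block of its own.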

@zhengruifeng
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 14, 2020

Test build #129756 has finished for PR 30009 at commit df02e98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 14, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34362/

@SparkQA

SparkQA commented Oct 14, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34362/

@zhengruifeng
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 15, 2020

Test build #129786 has finished for PR 30009 at commit df02e98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34393/

@SparkQA

SparkQA commented Oct 15, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34393/

Contributor

@WeichenXu123 WeichenXu123 left a comment

Let's paste the benchmark results into the PR description.

Have you benchmarked on other BLAS implementations besides f2jBLAS?

s"which may hurt performance in high-level BLAS.")
}
if (actualBlockSizeInMB == 0) {
val avgNNZ = summarizer.numNonzeros.activeIterator.map(_._2 / summarizer.count).sum
Contributor

Will the additional summarizer consume extra time?

Contributor Author

Yes, one more metric, numNonZeros, will be computed.
Since it still needs only one pass, I think the additional time should not be significant.

Comment on lines 186 to 202
if (dim <= avgNNZ * 3) {
0.25
} else {
64.0
}
Contributor

Please document why these values were chosen.

Contributor Author

The current strategy is quite simple; I think we may use a more complex cost model in the future if necessary.
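
For readers following the thread, the two quoted snippets can be combined into a small standalone sketch. It only mirrors the expressions shown in the diff excerpts above; the object name, method names and the plain-array input are assumptions made for illustration, not the PR's actual structure.

```scala
object BlockSizeHeuristic {
  // Average number of non-zeros per row: each feature's non-zero count
  // divided by the number of rows, summed over all features.
  def avgNnzPerRow(nnzPerFeature: Array[Double], count: Long): Double =
    nnzPerFeature.map(_ / count).sum

  // Mirrors the quoted threshold: when the dimension is at most three times
  // the average nnz per row, use a small block (0.25 MB), otherwise 64 MB.
  def inferBlockSizeInMB(dim: Int, avgNnz: Double): Double =
    if (dim <= avgNnz * 3) 0.25 else 64.0

  def main(args: Array[String]): Unit = {
    // Fully dense data: 100 rows, 2000 features, every value non-zero.
    val denseAvg = avgNnzPerRow(Array.fill(2000)(100.0), 100L)
    println(inferBlockSizeInMB(2000, denseAvg))
    // Same average nnz spread over 200000 dimensions.
    println(inferBlockSizeInMB(200000, denseAvg))
  }
}
```

The condition dim <= avgNNZ * 3 is equivalent to roughly one third or more of the entries in a row being non-zero. If the percentages in the benchmark labels are read as densities, that split looks consistent with the tables, where the 40% and denser datasets run fastest at small block sizes.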

@WeichenXu123
Contributor

@mengxr Do you want to take a look ?


// instances larger than maxMemUsage
val bigInstance = Instance(-1.0, 2.0, Vectors.dense(Array.fill(10000)(1.0)))
InstanceBlock.blokifyWithMaxMemUsage(Iterator.fill(10)(bigInstance), 64).size
Contributor

Verify block contains 1 row.

intercept[IllegalArgumentException] {
InstanceBlock.blokifyWithMaxMemUsage(Iterator.apply(instance1, bigInstance), 64).size
}
}
Contributor

Add a test:

  • Generate a list that mixes sparse and dense instances (some segments dense, others very sparse), and verify that each block's size does not exceed the blockMem limit by too much. (For example: (actual block mem size) / config <= 1.1?)
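
A rough sketch of what such a check could look like, working only with per-row byte sizes rather than real Spark instances. Everything here is hypothetical: the object name, the greedy grouping rule (a block closes once its total reaches the limit) and the row sizes are assumptions for illustration, not the test added in the PR.

```scala
object MixedBlockCheck {
  // Greedily groups per-row byte sizes; a block closes once its total reaches the limit.
  def blockBytes(rowBytes: Seq[Long], limit: Long): Vector[Long] = {
    val (closed, open) = rowBytes.foldLeft((Vector.empty[Long], 0L)) {
      case ((blocks, current), bytes) =>
        val total = current + bytes
        if (total >= limit) (blocks :+ total, 0L) else (blocks, total)
    }
    // Keep the trailing, partially filled block if there is one.
    if (open > 0L) closed :+ open else closed
  }

  def main(args: Array[String]): Unit = {
    val limit = 10000L
    // Dense segment (800 bytes per row), sparse segment (24 bytes per row), dense again.
    val rows = Seq.fill(50)(800L) ++ Seq.fill(1000)(24L) ++ Seq.fill(50)(800L)
    val blocks = blockBytes(rows, limit)
    // Largest ratio of actual block size to the configured limit.
    println(blocks.map(_.toDouble / limit).max)
  }
}
```

Under this grouping rule a block overshoots the limit by less than its largest row, so a 1.1 bound can only hold when every row is at most 10% of the configured limit.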

@zhengruifeng
Contributor Author

Have you benchmarked on other BLAS implementations besides f2jBLAS?

@WeichenXu123 Both f2jBLAS and OpenBLAS were benchmarked, and the results are recorded in the results Excel file.

@zhengruifeng zhengruifeng force-pushed the adaptively_blockify_linear_svc_II branch from a82e5f5 to a69ca83 Compare November 12, 2020 03:25
@zhengruifeng zhengruifeng added PYSPARK and removed CORE labels Nov 12, 2020
@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35564/

@SparkQA

SparkQA commented Nov 12, 2020

Test build #130958 has finished for PR 30009 at commit a82e5f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35564/

@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35566/

@SparkQA

SparkQA commented Nov 12, 2020

Test build #130960 has finished for PR 30009 at commit a69ca83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait HasMaxBlockSizeInMB extends Params
  • class HasMaxBlockSizeInMB(Params):

@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35566/

@zhengruifeng
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 12, 2020

Test build #130969 has finished for PR 30009 at commit a69ca83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait HasMaxBlockSizeInMB extends Params
  • class HasMaxBlockSizeInMB(Params):

@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35575/

@zhengruifeng zhengruifeng changed the title [SPARK-32907][ML] adaptively blockify instances - LinearSVC [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC Nov 12, 2020
@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35575/

@zhengruifeng
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 12, 2020

Test build #130977 has finished for PR 30009 at commit a69ca83.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait HasMaxBlockSizeInMB extends Params
  • class HasMaxBlockSizeInMB(Params):

@zhengruifeng
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35582/

@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35582/

@SparkQA

SparkQA commented Nov 12, 2020

Test build #130981 has finished for PR 30009 at commit a69ca83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait HasMaxBlockSizeInMB extends Params
  • class HasMaxBlockSizeInMB(Params):

@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35587/

@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35587/

Contributor

@WeichenXu123 WeichenXu123 left a comment

LGTM

@WeichenXu123
Contributor

Merged to master. Thanks!

@zhengruifeng zhengruifeng deleted the adaptively_blockify_linear_svc_II branch November 12, 2020 11:57
@zhengruifeng
Contributor Author

Thanks @WeichenXu123 @mengxr @zero323 for review!
