[SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC #30009

zhengruifeng · 2020-10-12T02:39:01Z

What changes were proposed in this pull request?

1. Use maxBlockSizeInMB instead of blockSize (number of rows) to control the stacking of vectors;
2. Infer an appropriate maxBlockSizeInMB if it is set to 0.

Why are the changes needed?

The performance gain is mainly related to the number of non-zeros (nnz) in a block.

f2jBLAS
Duration(millisecond)	branch 3.0 Impl	blockSizeInMB=0.0625	blockSizeInMB=0.125	blockSizeInMB=0.25	blockSizeInMB=0.5	blockSizeInMB=1	blockSizeInMB=2	blockSizeInMB=4	blockSizeInMB=8	blockSizeInMB=16	blockSizeInMB=32	blockSizeInMB=64	blockSizeInMB=128
epsilon(100%)	326481	26143	25710	24726	25395	25840	26846	25927	27431	26190	26056	26347	27204
epsilon3000(67%)	455247	35893	34366	34985	38387	38901	40426	40044	39161	38767	39965	39523	39108
epsilon4000(50%)	306390	42256	41164	43748	48638	50892	50986	51091	51072	51289	51652	53312	52146
epsilon5000(40%)	307619	43639	42992	44743	50800	51939	51871	52190	53850	52607	51062	52509	51570
epsilon10000(20%)	310070	58371	55921	56317	56618	53694	52131	51768	51728	52233	51881	51653	52440
epsilon20000(10%)	316565	109193	95121	82764	69653	60764	56066	53371	52822	52872	52769	52527	53508
epsilon200000(1%)	336181	1569721	1069355	673718	375043	218230	145393	110926	94327	87039	83926	81890	81787


	Speedup
epsilon(100%)	1	12.48827602	12.69859977	13.20395535	12.85611341	12.63471362	12.16125307	12.59231689	11.90189931	12.46586483	12.5299739	12.39158158	12.00121306
epsilon3000(67%)	1	12.68344803	13.2470174	13.01263399	11.85940553	11.70270687	11.26124276	11.36866946	11.62500958	11.74315784	11.39114225	11.51853351	11.64076404
epsilon4000(50%)	1	7.250804619	7.443154212	7.003520161	6.299395534	6.020396133	6.00929667	5.996946625	5.999177632	5.973795551	5.931812902	5.747111345	5.875618456
epsilon5000(40%)	1	7.049176196	7.155261444	6.875243055	6.055492126	5.92269778	5.930462108	5.894213451	5.712516249	5.847491779	6.024421292	5.858405226	5.965076595
epsilon10000(20%)	1	5.312055644	5.544786395	5.505797539	5.4765269	5.774760681	5.947900481	5.98960748	5.994239097	5.93628549	5.976561747	6.002942714	5.912852784
epsilon20000(10%)	1	2.899132728	3.328024306	3.824911797	4.544886796	5.209745902	5.64629187	5.931404695	5.993052137	5.987384627	5.999071425	6.026710073	5.916218136
epsilon200000(1%)	1	0.214166084	0.314377358	0.498993644	0.896379882	1.540489392	2.312222734	3.03067811	3.563995463	3.862417997	4.005683578	4.105275369	4.110445425

OpenBLAS
Duration(millisecond)	branch 3.0 Impl	blockSizeInMB=0.0625	blockSizeInMB=0.125	blockSizeInMB=0.25	blockSizeInMB=0.5	blockSizeInMB=1	blockSizeInMB=2	blockSizeInMB=4	blockSizeInMB=8	blockSizeInMB=16	blockSizeInMB=32	blockSizeInMB=64	blockSizeInMB=128
epsilon(100%)	299119	26047	25049	25239	28001	35138	36438	36279	36114	35111	35428	36295	35197
epsilon3000(67%)	439798	33321	34423	34336	38906	51756	54138	54085	53412	54766	54425	54221	54842
epsilon4000(50%)	302963	42960	40678	43483	48254	50888	54990	52647	51947	51843	52891	53410	52020
epsilon5000(40%)	303569	44225	44961	45065	51768	52776	51930	53587	53104	51833	52138	52574	53756
epsilon10000(20%)	307403	58447	55993	56757	56694	54038	52734	52073	52051	52150	51986	52407	52390
epsilon20000(10%)	313344	107580	94679	83329	70226	60996	57130	55461	54641	52712	52541	53101	53312
epsilon200000(1%)	334679	1642726	1073148	654481	364974	213881	140248	107579	91757	85090	81940	80492	80250


	Speedup
epsilon(100%)	1	11.48381771	11.94135494	11.85146004	10.68243991	8.512692811	8.208985125	8.244962651	8.282632774	8.519238985	8.443011178	8.241328007	8.498423161
epsilon3000(67%)	1	13.19882356	12.7762833	12.80865564	11.30411762	8.497526857	8.123646976	8.131607655	8.234067251	8.030493372	8.080808452	8.111211523	8.01936472
epsilon4000(50%)	1	7.052211359	7.44783421	6.967389555	6.278505409	5.953525389	5.509419895	5.754610899	5.832155851	5.843855487	5.728063376	5.672402172	5.823971549
epsilon5000(40%)	1	6.86419446	6.751829363	6.736247642	5.864027971	5.752027437	5.845734643	5.664974714	5.716499699	5.856674319	5.822413595	5.774127896	5.647164968
epsilon10000(20%)	1	5.259517169	5.490025539	5.416124883	5.422143437	5.688645028	5.829313157	5.903308816	5.905803923	5.894592522	5.913188166	5.865685882	5.867589235
epsilon20000(10%)	1	2.912660346	3.309540658	3.760323537	4.461937174	5.137123746	5.48475407	5.649807973	5.734594901	5.944452876	5.963799699	5.900905821	5.87755102
epsilon200000(1%)	1	0.203733915	0.311866583	0.511365494	0.916994087	1.564790701	2.38633706	3.111006795	3.647449241	3.933235398	4.084439834	4.157916315	4.170454829
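
The Speedup rows appear to be the branch 3.0 duration divided by the duration at each block size. A minimal Python check of that reading, using three values copied from the Duration tables above (the helper name `speedup` is only for illustration):

```python
def speedup(baseline_ms: int, new_ms: int) -> float:
    """Speedup = baseline duration / new duration (both in milliseconds)."""
    return baseline_ms / new_ms


# Values copied from the Duration tables above (blockSizeInMB=0.0625 column).
f2j_dense = speedup(326481, 26143)       # f2jBLAS, epsilon(100%)
f2j_sparse = speedup(336181, 1569721)    # f2jBLAS, epsilon200000(1%)
openblas_dense = speedup(299119, 26047)  # OpenBLAS, epsilon(100%)

print(f2j_dense, f2j_sparse, openblas_dense)
```

A value below 1 (as for the 1% dataset with very small blocks) means the blockified path was slower than the branch 3.0 implementation.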

Does this PR introduce any user-facing change?

Yes, the param blockSize is replaced by maxBlockSizeInMB in master.

How was this patch tested?

Added test suites and a performance test (results attached in the ticket).

zhengruifeng · 2020-10-12T02:45:14Z

ping @WeichenXu123

@zero323 I sent a new PR here, thanks for reviewing. I tried to verify the consistency of the annotations locally, but the following command failed:

mypy --no-incremental --config python/mypy.ini python/pyspark
python/pyspark/ml/linalg/__init__.pyi:25: error: misplaced type annotation

I installed mypy via sudo apt install mypy on Ubuntu 18.04.
I am not very familiar with mypy; do I need to configure it somewhere?

SparkQA · 2020-10-12T03:22:09Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34257/

SparkQA · 2020-10-12T03:39:16Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34257/

SparkQA · 2020-10-12T04:08:22Z

Test build #129653 has finished for PR 30009 at commit eb0cf6b.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait HasBlockSizeInMB extends Params
class HasBlockSizeInMB(Params):

zero323 · 2020-10-12T07:03:04Z

@zero323 I sent a new PR here, thanks for reviewing. I tried to verify the consistency of the annotations locally, but the following command failed:
mypy --no-incremental --config python/mypy.ini python/pyspark
python/pyspark/ml/linalg/__init__.pyi:25: error: misplaced type annotation
I installed mypy via sudo apt install mypy on Ubuntu 18.04.
I am not very familiar with mypy; do I need to configure it somewhere?

No additional configuration should be required, but the version shipped with Ubuntu is pretty old, and at first glance it doesn't support error codes (the [import] part).

Personally I'd recommend either venv or miniconda, but if you want a quick fix, installing pip and doing a user install should do the trick:

sudo apt purge mypy
sudo apt install python3-pip
pip install mypy

I've checked things on my side (mypy 0.790, current stable), for both master and this PR, and things look good.

WeichenXu123

Made first pass.
Overall good.

mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala

zhengruifeng · 2020-10-12T10:12:09Z

@zero323 Yes, that is because the version installed via sudo apt install mypy is too old (0.560).
pip install mypy works for me. Thank you!

SparkQA · 2020-10-13T04:00:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34330/

SparkQA · 2020-10-13T04:17:31Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34330/

SparkQA · 2020-10-13T04:28:59Z

Test build #129724 has finished for PR 30009 at commit 9cd1053.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-13T10:23:07Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34346/

SparkQA · 2020-10-13T10:42:15Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34346/

SparkQA · 2020-10-13T10:42:30Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34348/

SparkQA · 2020-10-13T10:48:58Z

Test build #129740 has finished for PR 30009 at commit 08cf27d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-13T10:58:05Z

Test build #129742 has finished for PR 30009 at commit 9245263.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-13T11:06:33Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34348/

WeichenXu123

Let's simplify the logic:

new Iterator[T] {
  override def hasNext: Boolean = rowIter.hasNext
  override def next(): T = {
     val buff = ..
     var buffNnz = 0L
     while (rowIter.hasNext && estimateSize(...) < maxMemUsage) {
        val row = rowIter.next()
        buff.append(row)
        buffNnz += ...
     }
     // The block's memory usage may slightly exceed the threshold, which is not a big issue.
     // This also ensures that each block holds at least one row, even if a single row exceeds the block limit.
     InstanceBlock.fromBuff(buff)
  }
}
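
For readers more at home in PySpark, the same greedy idea can be sketched in plain Python. This is only an illustration under stated assumptions: `estimate_size` is a hypothetical cost model (8-byte values, 4-byte indices), not Spark's actual estimate, and rows are plain Python sequences rather than Instance objects.

```python
from typing import Iterable, Iterator, List, Sequence

BYTES_PER_DOUBLE = 8  # assumption: values are stored as 64-bit floats
BYTES_PER_INT = 4     # assumption: indices are stored as 32-bit ints


def estimate_size(num_rows: int, nnz: int) -> int:
    """Hypothetical sparse-layout estimate: values, column indices, row pointers."""
    return nnz * (BYTES_PER_DOUBLE + BYTES_PER_INT) + (num_rows + 1) * BYTES_PER_INT


def blockify(rows: Iterable[Sequence[float]],
             max_mem_usage: int) -> Iterator[List[Sequence[float]]]:
    """Greedily stack rows; emit a block once its estimate reaches the limit."""
    buff: List[Sequence[float]] = []
    buff_nnz = 0
    for row in rows:
        # The row is appended first and the size checked afterwards, so a block
        # may slightly exceed the limit but always holds at least one row.
        buff.append(row)
        buff_nnz += sum(1 for value in row if value != 0.0)
        if estimate_size(len(buff), buff_nnz) >= max_mem_usage:
            yield buff
            buff, buff_nnz = [], 0
    if buff:  # emit the final, possibly under-filled block
        yield buff


blocks = list(blockify([[1.0, 2.0, 3.0, 4.0]] * 5, max_mem_usage=100))
print([len(block) for block in blocks])
```

Checking the size after appending is equivalent to the loop condition in the suggestion above as long as an empty buffer is estimated below the limit, and it avoids needing a hasNext-style lookahead, which Python iterators do not provide.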

mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala

zhengruifeng · 2020-10-14T08:25:43Z

retest this please

SparkQA · 2020-10-14T20:46:09Z

Test build #129756 has finished for PR 30009 at commit df02e98.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-14T21:58:27Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34362/

SparkQA · 2020-10-14T22:15:42Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34362/

zhengruifeng · 2020-10-15T02:12:18Z

retest this please

SparkQA · 2020-10-15T04:09:47Z

Test build #129786 has finished for PR 30009 at commit df02e98.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-15T04:24:27Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34393/

SparkQA · 2020-10-15T04:42:40Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34393/

WeichenXu123

Let's paste the benchmark results into the PR description.

Have you benchmarked on other BLAS implementations besides f2jBLAS?

WeichenXu123 · 2020-10-16T03:12:00Z

mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala

-          s"which may hurt performance in high-level BLAS.")
-      }
+    if (actualBlockSizeInMB == 0) {
+      val avgNNZ = summarizer.numNonzeros.activeIterator.map(_._2 / summarizer.count).sum


Will the additional summarizer consume time?

Yes, one more metric, numNonZeros, will be computed.
Since it still needs only one pass, I think the additional time should not be significant.
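
To make the expression in the diff concrete: summing the per-feature nonzero counts, each divided by the row count, gives the average number of nonzeros per row. A small Python sketch, assuming rows are plain sequences of floats (the helper names `summarize` and `avg_nnz` are hypothetical, not Spark APIs):

```python
from typing import Iterable, List, Sequence, Tuple


def summarize(rows: Iterable[Sequence[float]], dim: int) -> Tuple[int, List[int]]:
    """Single pass over the data: row count and per-feature nonzero counts."""
    count = 0
    num_nonzeros = [0] * dim
    for row in rows:
        count += 1
        for j, value in enumerate(row):
            if value != 0.0:
                num_nonzeros[j] += 1
    return count, num_nonzeros


def avg_nnz(count: int, num_nonzeros: Sequence[int]) -> float:
    """Same shape as the diff's expression: sum over features of nonzeros / count."""
    return sum(n / count for n in num_nonzeros)


data = [
    [1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 2.0, 3.0],
]
count, num_nonzeros = summarize(data, dim=3)
print(count, num_nonzeros, avg_nnz(count, num_nonzeros))
```

Both statistics come out of the same loop over the rows, which is why the extra metric does not require a second pass.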

WeichenXu123 · 2020-10-16T03:12:39Z

mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala

+    if (dim <= avgNNZ * 3) {
+      0.25
+    } else {
+      64.0
+    }


Can you document why these values were chosen?

The current strategy is quite simple; I think we may use a more complex cost model in the future if necessary.
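
As a rough illustration of the branch in the diff above, here it is in Python with the thresholds copied as shown. The function name is hypothetical, and the assumption that every epsilon variant in the benchmark keeps about 2000 nonzeros per row is inferred from the percentages in the dataset labels, not stated in the thread.

```python
def infer_max_block_size_in_mb(dim: int, avg_nnz: float) -> float:
    """Mirror of the reviewed branch: small blocks for dense data, large for sparse."""
    if dim <= avg_nnz * 3:
        return 0.25  # at least about one third of the entries are nonzero
    return 64.0      # sparser data: allow much larger blocks


# Assumed dimensions for the benchmark datasets, each with ~2000 nonzeros per row.
for dim in (2000, 3000, 4000, 5000, 10000, 20000, 200000):
    print(dim, infer_max_block_size_in_mb(dim, 2000.0))
```

Under that assumption the datasets with density of 40% or more get 0.25 MB and the sparser ones get 64 MB, which is roughly where the Speedup tables in the description peak.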

WeichenXu123 · 2020-10-16T03:15:44Z

@mengxr Do you want to take a look ?

WeichenXu123 · 2020-10-16T03:19:40Z

mllib/src/test/scala/org/apache/spark/ml/feature/InstanceSuite.scala

+
+    // instances larger than maxMemUsage
+    val bigInstance = Instance(-1.0, 2.0, Vectors.dense(Array.fill(10000)(1.0)))
+    InstanceBlock.blokifyWithMaxMemUsage(Iterator.fill(10)(bigInstance), 64).size


Verify that each block contains 1 row.

WeichenXu123 · 2020-10-16T03:21:40Z

mllib/src/test/scala/org/apache/spark/ml/feature/InstanceSuite.scala

+    intercept[IllegalArgumentException] {
+      InstanceBlock.blokifyWithMaxMemUsage(Iterator.apply(instance1, bigInstance), 64).size
+    }
+  }


Add a test:

Generate a mixed list of sparse and dense instances (a list in which some segments are dense but others are very sparse), and verify that each block's size won't exceed the blockMem limit by too much (for example, (actual block mem size) / config <= 1.1?).
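
One way such a check could look, sketched in plain Python rather than against InstanceBlock. The per-row cost model `row_bytes` and the helper `block_sizes` are hypothetical; the point is only that with greedy stacking the overshoot is bounded by the size of a single row.

```python
from typing import Iterable, List


def row_bytes(nnz: int) -> int:
    """Hypothetical per-row cost: 8-byte value + 4-byte index per nonzero, plus a pointer."""
    return nnz * 12 + 4


def block_sizes(nnz_per_row: Iterable[int], max_bytes: int) -> List[int]:
    """Greedy stacking: close a block once its estimated size reaches max_bytes."""
    sizes: List[int] = []
    current = 0
    for nnz in nnz_per_row:
        current += row_bytes(nnz)
        if current >= max_bytes:
            sizes.append(current)
            current = 0
    if current:
        sizes.append(current)
    return sizes


# Mixed list: a dense segment, a very sparse segment, then dense again.
mixed = [1000] * 200 + [5] * 20000 + [1000] * 200
limit = 1 << 20  # 1 MiB
ratios = [size / limit for size in block_sizes(mixed, limit)]
print(max(ratios))

# Rows that each exceed the limit end up in blocks of exactly one row.
oversized = block_sizes([10000] * 10, 64)
print(len(oversized))
```

The 1.1 bound only holds when a single row is small relative to the limit; the oversized case shows the other extreme, where every block is one row and the ratio is unbounded.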

zhengruifeng · 2020-10-16T03:45:48Z

Have you benchmarked on other BLAS implementations besides f2jBLAS?

@WeichenXu123 Both f2jBLAS and OpenBLAS were benchmarked, and the results are recorded in the Excel file.

SparkQA · 2020-11-12T03:44:37Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35564/

SparkQA · 2020-11-12T04:00:51Z

Test build #130958 has finished for PR 30009 at commit a82e5f5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-12T04:05:42Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35564/

SparkQA · 2020-11-12T04:17:05Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35566/

SparkQA · 2020-11-12T04:38:13Z

Test build #130960 has finished for PR 30009 at commit a69ca83.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait HasMaxBlockSizeInMB extends Params
class HasMaxBlockSizeInMB(Params):

SparkQA · 2020-11-12T04:42:54Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35566/

zhengruifeng · 2020-11-12T05:29:58Z

retest this please

SparkQA · 2020-11-12T06:36:31Z

Test build #130969 has finished for PR 30009 at commit a69ca83.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait HasMaxBlockSizeInMB extends Params
class HasMaxBlockSizeInMB(Params):

SparkQA · 2020-11-12T06:42:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35575/

SparkQA · 2020-11-12T07:05:54Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35575/

zhengruifeng · 2020-11-12T07:40:19Z

retest this please

SparkQA · 2020-11-12T08:05:02Z

Test build #130977 has finished for PR 30009 at commit a69ca83.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait HasMaxBlockSizeInMB extends Params
class HasMaxBlockSizeInMB(Params):

zhengruifeng · 2020-11-12T08:09:25Z

retest this please

SparkQA · 2020-11-12T08:22:15Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35582/

SparkQA · 2020-11-12T08:54:09Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35582/

SparkQA · 2020-11-12T09:19:37Z

Test build #130981 has finished for PR 30009 at commit a69ca83.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait HasMaxBlockSizeInMB extends Params
class HasMaxBlockSizeInMB(Params):

SparkQA · 2020-11-12T09:34:50Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35587/

SparkQA · 2020-11-12T10:03:56Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35587/

WeichenXu123

LGTM

WeichenXu123 · 2020-11-12T11:17:45Z

Merged to master. Thanks!

zhengruifeng · 2020-11-12T11:57:56Z

Thanks @WeichenXu123 @mengxr @zero323 for review!

zhengruifeng added the ML label Oct 12, 2020

WeichenXu123 reviewed Oct 12, 2020

WeichenXu123 reviewed Oct 13, 2020

WeichenXu123 reviewed Oct 16, 2020

zhengruifeng force-pushed the adaptively_blockify_linear_svc_II branch from a82e5f5 to a69ca83 Compare November 12, 2020 03:25

github-actions bot added CORE PYTHON labels Nov 12, 2020

zhengruifeng added PYSPARK and removed CORE labels Nov 12, 2020

zhengruifeng changed the title ~~[SPARK-32907][ML] adaptively blockify instances - LinearSVC~~ [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC Nov 12, 2020

WeichenXu123 approved these changes Nov 12, 2020

WeichenXu123 closed this in a288716 Nov 12, 2020

zhengruifeng deleted the adaptively_blockify_linear_svc_II branch November 12, 2020 11:57

zhengruifeng mentioned this pull request Dec 18, 2020

[SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM #28229

Closed
