-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC #30009
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC #30009
Conversation
|
ping @WeichenXu123 @zero323 I send a new PR here, thanks for reviewing. I tried to verify consistency of annotations locally, but the following cmd failed: I installed |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #129653 has finished for PR 30009 at commit
|
No additional configuration should be required, but the version from Ubuntu errors is pretty old, and at first glance it doesn't support error codes ( Personally I'd recommend either venv or miniconda, but if you want quick fix, installing pip and making user install should do the trick I've checked things on my side (mypy 0.790, current stable), for both master and this PR, and things look good. |
WeichenXu123
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made first pass.
Overall good.
mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
Outdated
Show resolved
Hide resolved
mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
Outdated
Show resolved
Hide resolved
mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
Outdated
Show resolved
Hide resolved
|
@zero323 Yes, that is because the version installed via |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #129724 has finished for PR 30009 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Kubernetes integration test starting |
|
Test build #129740 has finished for PR 30009 at commit
|
|
Test build #129742 has finished for PR 30009 at commit
|
|
Kubernetes integration test status success |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's simplify logic:
new Iterator[T] {
override def hasNext: Boolean = rowIter.hasNext()
override def next(): T = {
val buff = ..
val buffNnz = 0
while (rowIter.hasNext() && estimateSize(...) < maxMemUsage) {
val row = rowIter.next()
buff.append(row)
nnz += ...
}
// the block mem usage may slightly exceed threshold, not a big issue.
// and this ensure even if one row exceed block limit, each block has one row
InstanceBlock.fromBuff(buff)
}
}
mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
Outdated
Show resolved
Hide resolved
|
retest this please |
|
Test build #129756 has finished for PR 30009 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
retest this please |
|
Test build #129786 has finished for PR 30009 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
WeichenXu123
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's paste benchmark result on the PR description.
Have you benchmark on other BLAS besides f2jBLAS ?
| s"which may hurt performance in high-level BLAS.") | ||
| } | ||
| if (actualBlockSizeInMB == 0) { | ||
| val avgNNZ = summarizer.numNonzeros.activeIterator.map(_._2 / summarizer.count).sum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will the additional summarizer consume time ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, one more metric numNonZeros will be computed.
Since it still need only one pass, I think the additional time should not be significant.
| if (dim <= avgNNZ * 3) { | ||
| 0.25 | ||
| } else { | ||
| 64.0 | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document why choose the value ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current strategy is quitely simple, I think we may use a complex costmodel if necessay in the future.
|
@mengxr Do you want to take a look ? |
|
|
||
| // instances larger than maxMemUsage | ||
| val bigInstance = Instance(-1.0, 2.0, Vectors.dense(Array.fill(10000)(1.0))) | ||
| InstanceBlock.blokifyWithMaxMemUsage(Iterator.fill(10)(bigInstance), 64).size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verify block contains 1 row.
| intercept[IllegalArgumentException] { | ||
| InstanceBlock.blokifyWithMaxMemUsage(Iterator.apply(instance1, bigInstance), 64).size | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add test:
- Generate a sparse and dense instance mixed list (a list which some segment is dense but others are very sparse), verify each block size won't exceed the blockMem limit too much. (Such as: (actual block mem size)/confg <= 1.1 ?)
@WeichenXu123 both f2jBlas and openBlas were benchmarked, and recorded in the result excel file. |
a82e5f5 to
a69ca83
Compare
|
Kubernetes integration test starting |
|
Test build #130958 has finished for PR 30009 at commit
|
|
Kubernetes integration test status success |
|
Kubernetes integration test starting |
|
Test build #130960 has finished for PR 30009 at commit
|
|
Kubernetes integration test status failure |
|
retest this please |
|
Test build #130969 has finished for PR 30009 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
retest this please |
|
Test build #130977 has finished for PR 30009 at commit
|
|
retest this please |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #130981 has finished for PR 30009 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
WeichenXu123
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Merged to master. Thanks! |
|
Thanks @WeichenXu123 @mengxr @zero323 for review! |
What changes were proposed in this pull request?
1, use
maxBlockSizeInMBinstead ofblockSize(#rows) to control the stacking of vectors;2, infer an appropriate
maxBlockSizeInMBif set 0;Why are the changes needed?
the performance gain is mainly related to the nnz of block.
Does this PR introduce any user-facing change?
yes, param
blockSize->blockSizeInMBin masterHow was this patch tested?
added testsuites and performance test (result attached in ticket)