-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-5186] [MLLIB] Vector.equals and Vector.hashCode are very inefficient #3997
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can one of the admins verify this patch? |
|
add to whitelist |
|
ok to test |
|
Test build #25399 has started for PR 3997 at commit
|
|
Test build #25399 has finished for PR 3997 at commit
|
|
Test PASSed. |
|
Thanks, can someone help review please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You must override hashCode too!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @srowen for the comment. Glad to discuss it with someone.
Vector: override def hashCode(): Int = util.Arrays.hashCode(this.toArray)
I understand it's the general guideline to override hashCode at the same time.
Yet intentionally or not, the original code promises that DenseVector and SparseVector would return the same results of equals and hashCode for the same array content. And that makes some senses.
As in the description of the PR, I don’t want to introduce breaking changes. And if we want to keep the original design, the current implementation of hashCode in Vector is one of the best choices. That’s why hashCode was intentionally left out of the PR. (maybe I should add some comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was thinking of performance too. That is a fair point of course about not changing the semantics. I think the hashCode impl could be changed in both places to produce the same result while being faster for sparse vectors: what if the hash were defined only over nonzero entries? I think equals could likewise be sped up also for the case of comparing sparse to dense? that is I think this can be taken a step further than just optimizing sparse == sparse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on @srowen 's suggestion. We can design a hashing scheme based on the size and nonzero entries (with their indices). But I think we could do that in a separate PR. Does it sound good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the nonzero entries idea did cross my mind. Maybe it's overcautious that I think it might become a complexity if we want to have another kind of Vector in the future, which don't have handy internal structure to scan for the nonzero entries. Again, this can be overcautious.
And the dense == sparse idea looks good, maybe that suits into a util method better as it would not introduce the existence of DenseVector to SparseVector and vice versa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just realized that this did change the behavior of the original equals, because we allow values to have explicit zeros. Maybe we can use Vectors.sqdist(this, other) == 0 to implement equals().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice suggestion. This should be your contribution and I'll close the PR. Thanks.
|
There's a better idea from mengxr. |
|
@hhbyyh Although I don't wish to speak for @mengxr I strongly suspect he would welcome you implementing the idea. I'm not sure we should use |
|
Agree with @srowen ! @hhbyyh It would be great if you want to re-open this PR and implement a faster and correct |
|
Thanks, I'll try on it. |
|
Test build #25607 has started for PR 3997 at commit
|
|
Test build #25607 has finished for PR 3997 at commit
|
|
Test FAILed. |
|
Just send an update. I didn't use Current implementation is still based on the comparison for indices and values, just with the handling of the explicit 0. I gave some tests to the implementation and add a few ut. Any comment will be welcome! |
|
Test build #25608 has started for PR 3997 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered to myself whether this could be simplified to not have while (true), the dummy Exception, etc. The best I could do was with a helper function:
...
var k1 = nextNonzero(this.values, 0)
var k2 = nextNonzero(v.values, 0)
while (k1 < this.values.size && k2 < v.values.size) {
if (this.indices(k1) != v.indices(k2) || this.values(k1) != v.values(k2)) {
return false
}
k1 = nextNonzero(this.values, k1 + 1)
k2 = nextNonzero(v.values, k2 + 1)
}
return (k1 == this.values.size && k2 == v.values.size)
...
def nextNonzero(values: Array[Double], from: Int): Int = {
var index = from
while (index < this.values.size && this.values(index) == 0.0) index += 1
index
}
I'm not sure it's better, just food for thought.
So the idea would be to specialize hashCode as well, and also handle DenseVector right? and even remove the implementations in the parent?
|
Test build #25608 has finished for PR 3997 at commit
|
|
Test PASSed. |
|
@srowen Thanks a lot for the improvement. The code do look cleaner and the helper function I'll use it if you don't mind. And |
|
@hhbyyh Oops yes that's a typo. The helper function should refer to the local argument |
|
@hhbyyh @srowen There are some performance issues if we use unnecessary index lookup. Having many var ii0 = this.indices
var vv0 = this.values
var ii1 = this.indices
var vv1 = this.values
var j0 = 0
var j1 = 0
var i0 = 0
var i1 = 0
var v0 = 0.0
var v1 = 0.0
var pj0 = -1
var pj1 = -1
var allEqual = true
while(allEqual && j0 < sz0 && j1 < sz1) {
if (pj0 < j0) {
i0 = ii0(j0)
v0 = vv0(j0)
pj0 = j0
}
if (pj1 < j1) {
i1 = ii1(j1)
v1 = vv1(j1)
pj1 = j1
}
if (i0 == i1) {
allEqual &&= v0 == v1
j0 += 1
j1 += 1
} else if (i0 < i1) {
allEqual &&= v0 == 0.0
j0 += 1
} else {
allEqual &&= v1 == 0.0
j1 += 1
}
while (allEqual & j0 < sz0) {
allEqual &&= vv0(j0) == 0.0
j0 += 1
}
while (allEqual & j1 < sz1) {
allEqual &&= vv1(j1) == 0.0
j1 += 1
}
allEqual |
|
Test build #25645 has started for PR 3997 at commit
|
|
Oh I didn't see your comment before the update. @mengxr I surely found a lot of code using the pattern (first assign locally, then access) in Spark, and we can follow it. Just to be honest, I'm not sure I completely understand the reason, is it for sparing a memory addressing? I ran some local perf test and got no obvious difference. Would you please share some insight? Thanks a lot! |
|
Test build #25645 has finished for PR 3997 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I like this version. Still pretty clear, and I think it's fine to optimize this a bit. There's still an opportunity to optimize sparse vs dense comparison, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, about the sparse vs dense.
My current thought is put a new function in Vectors, maybe something like sqdist(sparse, dense), and call it from Vector.equals, does it sound good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean a function structured like sqdist, not a distance-based function right? yes that sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, and I found the sparse vs dense is actually quite similar to sparse vs sparse, I'm trying to unify them.
|
Test build #25663 has finished for PR 3997 at commit
|
|
Test PASSed. |
|
Test build #25714 has started for PR 3997 at commit
|
|
Test build #25714 has finished for PR 3997 at commit
|
|
Test PASSed. |
|
Sent an update unifying the "sparse vs sparse" and "sparse vs dense". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah looking good. I have only minor things to say, which don't necessarily need to change. For example the utility method could be called Vectors.equals. Should there even be a catch-all case? I wonder if failing fast is good, if no other implementations are expected now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for review, @srowen
Yes, I can change the name. Just like the name in util.Arrays.equals.
And for the catch-all case, I was thinking util.Arrays.equals is also fail-fast and it can cover the dense vs dense case well, and with some extensiblity. Maybe I don't catch your point. Let me know if still we should change this. Thanks
|
Test build #25739 has started for PR 3997 at commit
|
|
Test build #25739 has finished for PR 3997 at commit
|
|
Test PASSed. |
|
test this please |
|
Test build #25843 has started for PR 3997 at commit
|
|
Test build #25843 has finished for PR 3997 at commit
|
|
Test PASSed. |
|
LGTM. Merged into master. Thanks! |
|
Thanks |
…icient JIRA Issue: https://issues.apache.org/jira/browse/SPARK-5186 Currently SparseVector is using the inherited equals from Vector, which will create a full-size array for even the sparse vector. The pull request contains a specialized equals optimization that improves on both time and space. 1. The implementation will be consistent with the original. Especially it will keep equality comparison between SparseVector and DenseVector. Author: Yuhao Yang <[email protected]> Author: Yuhao Yang <[email protected]> Closes apache#3997 from hhbyyh/master and squashes the following commits: 0d9d130 [Yuhao Yang] function name change and ut update 93f0d46 [Yuhao Yang] unify sparse vs dense vectors 985e160 [Yuhao Yang] improve locality for equals bdf8789 [Yuhao Yang] improve equals and rewrite hashCode for Vector a6952c3 [Yuhao Yang] fix scala style for comments 50abef3 [Yuhao Yang] fix ut for sparse vector with explicit 0 f41b135 [Yuhao Yang] iterative equals for sparse vector 5741144 [Yuhao Yang] Specialized equals for SparseVector
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-5186
Currently SparseVector is using the inherited equals from Vector, which will create a full-size array for even the sparse vector. The pull request contains a specialized equals optimization that improves on both time and space.