[SPARK-7140][MLLIB] only scan the first 16 entries in Vector.hashCode #5697
Conversation
You don't mean to have a `case` here, do you? This should be just a closure, not a partial function.
Test build #30951 has finished for PR 5697 at commit
Should SparseVector use the first 16 non-zero values instead of the first 16 entries? For sparse data, it would not be surprising for a bunch of vectors to have all zeros in the first 16 elements.
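To make the collision concern concrete, here is a toy Python sketch (not Spark code). The `prefix_hash` and `make_vector` helpers are hypothetical; the sketch only assumes a hash that inspects the first 16 entries of a vector and skips zeros. Under that assumption, any two same-sized vectors whose nonzeros all lie past index 15 hash to the same value.

```python
def prefix_hash(values, limit=16):
    """Hypothetical hash that only inspects the first `limit` entries."""
    result = 31 + len(values)
    for index, value in enumerate(values[:limit]):
        if value != 0:  # zeros contribute nothing to the hash
            result = 31 * result + index
            result = 31 * result + hash(float(value))
    return result & 0xFFFFFFFF  # keep the result in 32 bits


def make_vector(size, nonzeros):
    """Build a dense list of floats from an {index: value} dict."""
    values = [0.0] * size
    for index, value in nonzeros.items():
        values[index] = value
    return values


a = make_vector(100, {40: 1.0, 70: 2.0})
b = make_vector(100, {55: 3.0})
c = make_vector(100, {3: 1.0})

# a and b differ, but both are all zeros in the first 16 entries.
collides = prefix_hash(a) == prefix_hash(b)
# c has a nonzero at index 3, so its hash differs from a's.
differs = prefix_hash(a) != prefix_hash(c)
```

Collisions like this are permitted by the hash contract; they only cost performance when such vectors are used as hash keys.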
A dense vector may contain many zeros at the beginning, in which case the cost is still high if we look for the first 16 nonzeros. The hashCode is not used in any MLlib code. This change is just to reduce the overhead of the Pyrolite serializer.
OK, sounds fine. Curious: Why do DenseVector and SparseVector need to implement the same hash code? I was thinking DenseVectors could stay as they are (the first 16 elements), while SparseVectors could look at the first 16 non-zeros to reduce collisions. But I'm not familiar with this code.
This is a contract in Java (http://en.wikipedia.org/wiki/Java_hashCode()):
The general contract for overridden implementations of this method is that they behave in a way consistent with the same object's equals() method: that a given object must consistently report the same hash value (unless it is changed so that the new version is no longer considered "equal" to the old), and that two objects which equals() says are equal must report the same hash value.
So sv == dv -> sv.## == dv.##.
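As an illustration of that contract, the following is a minimal Python analogue, not the Spark implementation. The classes `ToyVector`, `ToyDense`, and `ToySparse` are hypothetical, and the hashing rule (scan only indices 0 to 15, skip zeros) is an assumption based on this PR's title. Because equality and hashing are defined once in the shared base class and both ignore explicit zeros, a dense and a sparse vector with the same contents compare equal and hash equal.

```python
class ToyVector:
    """Hypothetical base class: one equals/hash definition for all subclasses."""

    def active(self):
        """Yield (index, value) pairs in increasing index order."""
        raise NotImplementedError

    def nonzeros(self):
        return [(i, v) for i, v in self.active() if v != 0]

    def __eq__(self, other):
        if not isinstance(other, ToyVector):
            return NotImplemented
        return self.size == other.size and self.nonzeros() == other.nonzeros()

    def __hash__(self):
        result = 31 + self.size
        for index, value in self.active():
            if index >= 16:
                break  # only entries at indices 0..15 are scanned
            if value != 0:  # skip explicit zeros so sparse and dense agree
                result = 31 * result + index
                result = 31 * result + hash(value)
        return result & 0xFFFFFFFF


class ToyDense(ToyVector):
    def __init__(self, values):
        self.values = [float(v) for v in values]
        self.size = len(self.values)

    def active(self):
        return enumerate(self.values)


class ToySparse(ToyVector):
    def __init__(self, size, indices, values):
        self.size = size
        self.indices = list(indices)
        self.values = [float(v) for v in values]

    def active(self):
        return zip(self.indices, self.values)


dv = ToyDense([0.0, 3.0, 0.0, 4.0])
sv = ToySparse(4, [1, 3], [3.0, 4.0])
# Same contents as sv, but with an explicitly stored zero at index 0.
sv_explicit_zero = ToySparse(4, [0, 1, 3], [0.0, 3.0, 4.0])
```

If the sparse class instead hashed its first 16 stored nonzeros while the dense class hashed its first 16 entries, two equal vectors could report different hashes, which is what the contract forbids.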
Test build #712 has finished for PR 5697 at commit
The Python SerDe calls `Object.hashCode`, which is very expensive for Vectors. It is not necessary to scan the whole vector, especially for large ones. In this PR, we only scan the first 16 nonzeros. srowen

Author: Xiangrui Meng

Closes #5697 from mengxr/SPARK-7140 and squashes the following commits:

2abc86d [Xiangrui Meng] typo
8fb7d74 [Xiangrui Meng] update impl
1ebad60 [Xiangrui Meng] only scan the first 16 nonzeros in Vector.hashCode

(cherry picked from commit b14cd23)
Signed-off-by: Xiangrui Meng
Merged into master and branch-1.3.
The Python SerDe calls `Object.hashCode`, which is very expensive for Vectors. It is not necessary to scan the whole vector, especially for large ones. In this PR, we only scan the first 16 nonzeros. @srowen