[SPARK-7140][MLLIB] only scan the first 16 entries in Vector.hashCode #5697

mengxr · 2015-04-25T05:33:21Z

The Python SerDe calls Object.hashCode, which is very expensive for Vectors. It is not necessary to scan the whole vector, especially for large ones. In this PR, we only scan the first 16 nonzeros. @srowen

rxin · 2015-04-25T07:14:16Z

mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala

You don't mean to have a `case` here, do you? This only needs a closure, not a partial function.
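To illustrate the point about `case`, here is a small sketch. `foreachActive` below is a hypothetical stand-in defined on a plain array (an assumption for illustration, not the MLlib method), and both helper names are made up. When the parameter is a two-argument function, the `case` form and the plain function literal behave the same, so the `case` keyword is not needed.

```scala
object ClosureStyle {
  // Hypothetical helper: visits every (index, value) pair of an array.
  def foreachActive(values: Array[Double])(f: (Int, Double) => Unit): Unit = {
    var i = 0
    while (i < values.length) {
      f(i, values(i))
      i += 1
    }
  }

  // Pattern-matching form: compiles, but `case` adds nothing here.
  def weightedSumWithCase(values: Array[Double]): Double = {
    var sum = 0.0
    foreachActive(values) { case (i, v) => sum += i * v }
    sum
  }

  // Plain function literal: same behavior, reads as an ordinary closure.
  def weightedSumWithClosure(values: Array[Double]): Double = {
    var sum = 0.0
    foreachActive(values) { (i, v) => sum += i * v }
    sum
  }
}
```

Both helpers return the same result for the same input; the difference is purely stylistic.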

SparkQA · 2015-04-25T08:00:52Z

Test build #30951 has finished for PR 5697 at commit 1ebad60.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch adds the following new dependencies:
- tachyon-0.6.4.jar
- tachyon-client-0.6.4.jar
This patch removes the following dependencies:
- tachyon-0.5.0.jar
- tachyon-client-0.5.0.jar

jkbradley · 2015-04-25T23:39:23Z

mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala

Should SparseVector use the first 16 non-zero values instead of the first 16 entries? For sparse data, it would not be surprising for a bunch of vectors to have all zeros in the first 16 elements.

mengxr · 2015-04-26

A dense vector may contain many zeros at the beginning, so the cost is still high if we look for the first 16 nonzeros. The hashCode is not used in any MLlib code. This change is only meant to reduce the overhead of the Pyrolite serializer.

jkbradley · 2015-04-27

OK, sounds fine. Curious: Why do DenseVector and SparseVector need to implement the same hash code? I was thinking DenseVectors could stay as they are (the first 16 elements), while SparseVectors could look at the first 16 non-zeros to reduce collisions. But I'm not familiar with this code.

mengxr · 2015-04-27

This is a contract in Java (http://en.wikipedia.org/wiki/Java_hashCode()):

The general contract for overridden implementations of this method is that they behave in a way consistent with the same object's equals() method: that a given object must consistently report the same hash value (unless it is changed so that the new version is no longer considered "equal" to the old), and that two objects which equals() says are equal must report the same hash value.

So `sv == dv` implies `sv.## == dv.##`.
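A minimal sketch of how one shared hash over the first 16 entries can satisfy this contract for both representations. The types `Dense` and `Sparse`, the function `prefixHash`, and the mixing constants are assumptions for illustration, not the MLlib classes or the code merged in this PR. The idea is to visit only entries whose index is below 16 and to skip zeros, so an explicit zero in a dense vector and a missing entry in a sparse vector contribute the same thing (nothing).

```scala
object VectorHashSketch {
  // Hypothetical minimal vector types; not the MLlib classes.
  sealed trait Vec {
    def size: Int
    // Stored (index, value) pairs in ascending index order.
    def active: Iterator[(Int, Double)]
  }

  final class Dense(val values: Array[Double]) extends Vec {
    def size: Int = values.length
    def active: Iterator[(Int, Double)] =
      values.iterator.zipWithIndex.map { case (v, i) => (i, v) }
  }

  // Assumes `indices` is sorted ascending and has the same length as `values`.
  final class Sparse(val size: Int, val indices: Array[Int], val values: Array[Double])
      extends Vec {
    def active: Iterator[(Int, Double)] = indices.iterator.zip(values.iterator)
  }

  // Hash only entries with index < limit; stop at the first index beyond it.
  def prefixHash(v: Vec, limit: Int = 16): Int = {
    var result = 31 + v.size
    v.active.takeWhile { case (i, _) => i < limit }.foreach { case (i, x) =>
      // Skip zeros so dense and sparse forms of the same vector agree.
      if (x != 0.0) {
        val bits = java.lang.Double.doubleToLongBits(x)
        result = 31 * result + i
        result = 31 * result + (bits ^ (bits >>> 32)).toInt
      }
    }
    result
  }
}
```

With this scheme, a dense vector and its sparse equivalent hash to the same value, and the work is bounded by the first 16 positions regardless of vector length. The trade-off raised above still applies: vectors that differ only at index 16 or later collide.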

SparkQA · 2015-04-27T01:56:44Z

Test build #712 has finished for PR 5697 at commit 2abc86d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

The Python SerDe calls `Object.hashCode`, which is very expensive for Vectors. It is not necessary to scan the whole vector, especially for large ones. In this PR, we only scan the first 16 nonzeros. srowen

Author: Xiangrui Meng <[email protected]>

Closes #5697 from mengxr/SPARK-7140 and squashes the following commits:

2abc86d [Xiangrui Meng] typo
8fb7d74 [Xiangrui Meng] update impl
1ebad60 [Xiangrui Meng] only scan the first 16 nonzeros in Vector.hashCode

(cherry picked from commit b14cd23)
Signed-off-by: Xiangrui Meng <[email protected]>

mengxr · 2015-04-28T17:07:38Z

Merged into master and branch-1.3.

- 1ebad60: only scan the first 16 nonzeros in Vector.hashCode
- rxin reviewed Apr 25, 2015
- 8fb7d74: update impl
- jkbradley reviewed Apr 25, 2015
- 2abc86d: typo

mengxr changed the title ~~[SPARK-7140][MLLIB] only scan the first 16 nonzeros in Vector.hashCode~~ [SPARK-7140][MLLIB] only scan the first 16 entries in Vector.hashCode Apr 26, 2015

asfgit closed this in b14cd23 Apr 28, 2015
