@mengxr mengxr commented Apr 25, 2015

The Python SerDe calls Object.hashCode, which is very expensive for Vectors. It is not necessary to scan the whole vector, especially for large ones. In this PR, we only scan the first 16 nonzeros. @srowen

Contributor

You don't mean to have a `case` here, do you? This should be a plain closure, not a partial function.

SparkQA commented Apr 25, 2015

Test build #30951 has finished for PR 5697 at commit 1ebad60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch adds the following new dependencies:
    • tachyon-0.6.4.jar
    • tachyon-client-0.6.4.jar
  • This patch removes the following dependencies:
    • tachyon-0.5.0.jar
    • tachyon-client-0.5.0.jar

Member

Should SparseVector use the first 16 non-zero values instead of the first 16 entries? For sparse data, it would not be surprising for a bunch of vectors to have all zeros in the first 16 elements.

Contributor Author

A dense vector may contain many zeros at the beginning, so the cost is still high if we look for the first 16 nonzeros. The hashCode is not used in any MLlib code. This is just to reduce the overhead of the Pyrolite serializer.
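
To make the cost argument concrete, here is a minimal Java sketch (not Spark code; the class and method names are hypothetical). It counts how many slots of a dense array must be visited before 16 nonzeros are found when all nonzeros sit at the tail, and compares that with a fixed scan of the first 16 entries.

```java
public class ScanCost {
    // Hypothetical helper: returns how many array slots are visited
    // before `limit` nonzero values have been seen (or the array ends).
    static int slotsVisitedForNonzeros(double[] values, int limit) {
        int visited = 0;
        int found = 0;
        while (visited < values.length && found < limit) {
            if (values[visited] != 0.0) {
                found++;
            }
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        // A dense vector of one million entries whose only nonzeros
        // are the last 16 values.
        double[] dense = new double[1_000_000];
        for (int i = dense.length - 16; i < dense.length; i++) {
            dense[i] = 1.0;
        }

        // Looking for the first 16 nonzeros walks the whole array.
        System.out.println(slotsVisitedForNonzeros(dense, 16));

        // Scanning the first 16 entries is bounded regardless of content.
        System.out.println(Math.min(dense.length, 16));
    }
}
```

In this worst case the nonzero-based scan touches every slot, while the entry-based scan always touches at most 16.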

Member

OK, sounds fine. Curious: Why do DenseVector and SparseVector need to implement the same hash code? I was thinking DenseVectors could stay as they are (the first 16 elements), while SparseVectors could look at the first 16 non-zeros to reduce collisions. But I'm not familiar with this code.

Contributor Author

This is a contract in Java (http://en.wikipedia.org/wiki/Java_hashCode()):

The general contract for overridden implementations of this method is that they behave in a way consistent with the same object's equals() method: that a given object must consistently report the same hash value (unless it is changed so that the new version is no longer considered "equal" to the old), and that two objects which equals() says are equal must report the same hash value.

So `sv == dv` implies `sv.## == dv.##`.
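
The contract can be illustrated with a small Java sketch. This is not the code merged in this PR; the class name, the `MAX_SCAN` constant, and the mixing function are assumptions chosen for illustration. The idea is that both representations hash only the nonzero entries whose index is below 16, so a dense vector and a sparse vector holding the same values produce the same hash.

```java
public class VectorHashSketch {
    // Assumed scan limit, taken from the "first 16 entries" in the PR title.
    static final int MAX_SCAN = 16;

    // Mixes one nonzero (index, value) pair into the running hash.
    static int mix(int h, int index, double value) {
        h = 31 * h + index;
        return 31 * h + Double.hashCode(value);
    }

    // Hash of a dense vector: visit at most the first MAX_SCAN entries
    // and skip zeros so that the result matches the sparse form.
    static int hashDense(double[] values) {
        int h = 31 + values.length;
        int end = Math.min(values.length, MAX_SCAN);
        for (int i = 0; i < end; i++) {
            if (values[i] != 0.0) {
                h = mix(h, i, values[i]);
            }
        }
        return h;
    }

    // Hash of a sparse vector: `indices` must be sorted ascending.
    // Stop as soon as an index reaches MAX_SCAN; skip explicit zeros.
    static int hashSparse(int size, int[] indices, double[] values) {
        int h = 31 + size;
        for (int k = 0; k < indices.length && indices[k] < MAX_SCAN; k++) {
            if (values[k] != 0.0) {
                h = mix(h, indices[k], values[k]);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // The same logical vector in dense and sparse form.
        double[] dense = new double[100];
        dense[3] = 1.5;
        dense[40] = -2.0;
        int[] indices = {3, 40};
        double[] values = {1.5, -2.0};

        // Equal vectors must report equal hashes.
        System.out.println(hashDense(dense) == hashSparse(100, indices, values));
    }
}
```

The trade-off raised earlier in the thread still applies to this sketch: vectors of the same size whose nonzeros all lie at index 16 or beyond hash to the same value. That is allowed by the contract, which only requires equal objects to share a hash, but it means more collisions for such data.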

@mengxr changed the title from "[SPARK-7140][MLLIB] only scan the first 16 nonzeros in Vector.hashCode" to "[SPARK-7140][MLLIB] only scan the first 16 entries in Vector.hashCode" on Apr 26, 2015
SparkQA commented Apr 27, 2015

Test build #712 has finished for PR 5697 at commit 2abc86d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@asfgit asfgit closed this in b14cd23 Apr 28, 2015
asfgit pushed a commit that referenced this pull request Apr 28, 2015
The Python SerDe calls `Object.hashCode`, which is very expensive for Vectors. It is not necessary to scan the whole vector, especially for large ones. In this PR, we only scan the first 16 nonzeros. srowen

Author: Xiangrui Meng <[email protected]>

Closes #5697 from mengxr/SPARK-7140 and squashes the following commits:

2abc86d [Xiangrui Meng] typo
8fb7d74 [Xiangrui Meng] update impl
1ebad60 [Xiangrui Meng] only scan the first 16 nonzeros in Vector.hashCode

(cherry picked from commit b14cd23)
Signed-off-by: Xiangrui Meng <[email protected]>

mengxr commented Apr 28, 2015

Merged into master and branch-1.3.

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 14, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015