[SPARK-5186] [MLLIB] Vector.equals and Vector.hashCode are very inefficient #3997

hhbyyh · 2015-01-12T05:23:08Z

JIRA Issue: https://issues.apache.org/jira/browse/SPARK-5186

Currently SparseVector is using the inherited equals from Vector, which will create a full-size array for even the sparse vector. The pull request contains a specialized equals optimization that improves on both time and space.

The implementation stays consistent with the original. In particular, it preserves equality comparison between SparseVector and DenseVector.
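To make the cost described above concrete, here is a minimal sketch of an array-based comparison. The types `Vec`, `Dense`, `Sparse` and the object `InheritedEquality` are hypothetical stand-ins for illustration, not the MLlib classes.

```scala
import java.util.Arrays

// Hypothetical stand-ins for the vector types, for illustration only.
sealed trait Vec {
  def size: Int
  def toArray: Array[Double]
}

final class Dense(val values: Array[Double]) extends Vec {
  def size: Int = values.length
  def toArray: Array[Double] = values
}

final class Sparse(val size: Int, val indices: Array[Int], val values: Array[Double]) extends Vec {
  // Materializes all `size` entries, even when only a few are stored.
  def toArray: Array[Double] = {
    val data = new Array[Double](size)
    var k = 0
    while (k < indices.length) {
      data(indices(k)) = values(k)
      k += 1
    }
    data
  }
}

object InheritedEquality {
  // Array-based comparison: correct, but O(size) time and space for sparse inputs.
  def arrayEquals(a: Vec, b: Vec): Boolean = Arrays.equals(a.toArray, b.toArray)
}
```

A sparse vector with one stored entry and a logical size of several million would allocate a full array on every such comparison, which is the overhead a specialized equals avoids.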

AmplabJenkins · 2015-01-12T05:27:08Z

Can one of the admins verify this patch?

mengxr · 2015-01-12T05:52:53Z

add to whitelist

mengxr · 2015-01-12T05:52:57Z

ok to test

SparkQA · 2015-01-12T05:57:34Z

Test build #25399 has started for PR 3997 at commit 5741144.

This patch merges cleanly.

SparkQA · 2015-01-12T07:07:04Z

Test build #25399 has finished for PR 3997 at commit 5741144.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-12T07:07:08Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25399/

hhbyyh · 2015-01-12T07:20:38Z

Thanks, can someone help review please?

srowen · 2015-01-12T10:53:29Z

mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala

You must override hashCode too!

Thanks @srowen for the comment. Glad to discuss it with someone.

Vector: override def hashCode(): Int = util.Arrays.hashCode(this.toArray)

I understand it's the general guideline to override hashCode at the same time.
Yet, intentionally or not, the original code guarantees that DenseVector and SparseVector return the same equals and hashCode results for the same array content. And that makes some sense.

As in the description of the PR, I don’t want to introduce breaking changes. And if we want to keep the original design, the current implementation of hashCode in Vector is one of the best choices. That’s why hashCode was intentionally left out of the PR. (maybe I should add some comment)

Yeah, I was thinking of performance too. That is a fair point, of course, about not changing the semantics. I think the hashCode impl could be changed in both places to produce the same result while being faster for sparse vectors: what if the hash were defined only over nonzero entries? I think equals could likewise be sped up for the case of comparing sparse to dense. That is, I think this can be taken a step further than just optimizing sparse == sparse.

+1 on @srowen 's suggestion. We can design a hashing scheme based on the size and nonzero entries (with their indices). But I think we could do that in a separate PR. Does it sound good?

Yes, the nonzero entries idea did cross my mind. Maybe I'm being overcautious, but it might add complexity if we later want another kind of Vector that doesn't have a handy internal structure to scan for nonzero entries.

And the dense == sparse idea looks good. Maybe it fits better in a util method, since that would avoid making SparseVector aware of DenseVector and vice versa.
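The hashing idea discussed in this thread (size plus nonzero entries with their indices) can be sketched as follows. This is only an illustration under stated assumptions: the object `NonzeroHash`, its method names, and the mixing formula are hypothetical, and sparse indices are assumed to be sorted in ascending order so that both representations visit entries in the same order.

```scala
object NonzeroHash {
  // Folds one (index, value) pair into the running hash. The formula is an
  // arbitrary choice for this sketch, not taken from the PR.
  private def mix(h: Int, index: Int, value: Double): Int = {
    val bits = java.lang.Double.doubleToLongBits(value)
    31 * (31 * h + index) + (bits ^ (bits >>> 32)).toInt
  }

  // Dense input: scan every position but hash only the nonzero ones.
  def hashDense(values: Array[Double]): Int = {
    var h = values.length
    var i = 0
    while (i < values.length) {
      if (values(i) != 0.0) h = mix(h, i, values(i))
      i += 1
    }
    h
  }

  // Sparse input: scan only stored entries, skipping explicit zeros.
  def hashSparse(size: Int, indices: Array[Int], values: Array[Double]): Int = {
    var h = size
    var k = 0
    while (k < indices.length) {
      if (values(k) != 0.0) h = mix(h, indices(k), values(k))
      k += 1
    }
    h
  }
}
```

Because zeros never contribute, a dense vector, a sparse vector, and a sparse vector carrying explicit zeros all hash identically when they hold the same logical content.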

mengxr · 2015-01-13T20:47:53Z

mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala

Just realized that this did change the behavior of the original equals, because we allow values to have explicit zeros. Maybe we can use Vectors.sqdist(this, other) == 0 to implement equals().

Nice suggestion. This should be your contribution and I'll close the PR. Thanks.

hhbyyh · 2015-01-14T00:31:18Z

There's a better idea from mengxr.

srowen · 2015-01-14T16:56:42Z

@hhbyyh Although I don't wish to speak for @mengxr I strongly suspect he would welcome you implementing the idea. I'm not sure we should use sqdist since it looks like it doesn't require vectors of the same size, and that seems essential for equals and hashCode? But I think it can indeed handle the cases of sparse vs dense similarly and efficiently.

mengxr · 2015-01-15T01:58:57Z

Agree with @srowen! @hhbyyh It would be great if you want to re-open this PR and implement a faster and correct Vector.equals. If sqdist doesn't check vector sizes, we should add checks there. It would also be nice to add a unit test in VectorsSuite to show that it works for two sparse vectors that have different values arrays but are equal.
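The explicit-zero case mentioned above can be illustrated with a small standalone sketch. The object `ExplicitZeros` and its members are hypothetical names; the vectors are represented as raw index and value arrays rather than MLlib types.

```scala
import java.util.Arrays

object ExplicitZeros {
  // Expands a sparse (indices, values) pair into a full-size array.
  def expand(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    val data = new Array[Double](size)
    var k = 0
    while (k < indices.length) {
      data(indices(k)) = values(k)
      k += 1
    }
    data
  }

  // Both pairs describe the logical vector [0.0, 2.0, 0.0];
  // the second one stores an explicit zero at index 0.
  val idxA: Array[Int] = Array(1)
  val valA: Array[Double] = Array(2.0)
  val idxB: Array[Int] = Array(0, 1)
  val valB: Array[Double] = Array(0.0, 2.0)

  // Comparing the stored arrays directly reports a difference...
  val structurallyEqual: Boolean = Arrays.equals(idxA, idxB) && Arrays.equals(valA, valB)
  // ...while the element-wise contents are identical.
  val logicallyEqual: Boolean = Arrays.equals(expand(3, idxA, valA), expand(3, idxB, valB))
}
```

A specialized equals therefore has to skip stored zeros rather than compare the indices and values arrays verbatim.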

hhbyyh · 2015-01-15T02:07:17Z

Thanks, I'll give it a try.

SparkQA · 2015-01-15T15:57:38Z

Test build #25607 has started for PR 3997 at commit 50abef3.

This patch merges cleanly.

SparkQA · 2015-01-15T15:58:29Z

Test build #25607 has finished for PR 3997 at commit 50abef3.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-15T15:58:30Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25607/

hhbyyh · 2015-01-15T16:06:15Z

Just sent an update. I didn't use sqdist due to performance concerns. The original equals is actually a fail-fast comparison, whereas sqdist will inevitably compute through the whole vectors even if the first element is different. The performance will be hard to accept for scenarios like doc2Vec over a large vocabulary.

The current implementation is still based on comparing indices and values, just with handling for explicit zeros. I ran some tests on the implementation and added a few unit tests. Any comments are welcome!
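The fail-fast concern can be sketched on plain arrays. The object `FailFast` and both method names are hypothetical, and the distance-based variant is a simplified stand-in for a sqdist-style check, not the MLlib function.

```scala
object FailFast {
  // Distance-style check: always visits every element before deciding.
  def equalsViaSquaredDistance(a: Array[Double], b: Array[Double]): Boolean = {
    require(a.length == b.length, "vectors must have the same size")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum == 0.0
  }

  // Fail-fast check: stops at the first differing element.
  // Returns the result together with the number of elements visited.
  def equalsFailFast(a: Array[Double], b: Array[Double]): (Boolean, Int) = {
    if (a.length != b.length) return (false, 0)
    var i = 0
    while (i < a.length) {
      if (a(i) != b(i)) return (false, i + 1)
      i += 1
    }
    (true, a.length)
  }
}
```

When two long vectors differ in their first element, the fail-fast version inspects one element while the distance-based version still walks the full length.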

SparkQA · 2015-01-15T16:32:43Z

Test build #25608 has started for PR 3997 at commit a6952c3.

This patch merges cleanly.

srowen · 2015-01-15T17:32:47Z

mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala

I wondered to myself whether this could be simplified to not have while (true), the dummy Exception, etc. The best I could do was with a helper function:

...
var k1 = nextNonzero(this.values, 0)
var k2 = nextNonzero(v.values, 0)
while (k1 < this.values.size && k2 < v.values.size) {
  if (this.indices(k1) != v.indices(k2) || this.values(k1) != v.values(k2)) {
    return false
  }
  k1 = nextNonzero(this.values, k1 + 1)
  k2 = nextNonzero(v.values, k2 + 1)
}
return (k1 == this.values.size && k2 == v.values.size)
...
def nextNonzero(values: Array[Double], from: Int): Int = {
  var index = from
  while (index < this.values.size && this.values(index) == 0.0) index += 1
  index
}

I'm not sure it's better, just food for thought.

So the idea would be to specialize hashCode as well, and also handle DenseVector right? and even remove the implementations in the parent?

SparkQA · 2015-01-15T17:41:29Z

Test build #25608 has finished for PR 3997 at commit a6952c3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-15T17:41:32Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25608/

hhbyyh · 2015-01-16T02:16:52Z

@srowen Thanks a lot for the improvement. The code does look cleaner, and the helper function nextNonzero can also be used in the hashCode function. Great idea.

I'll use it if you don't mind. And the "this" in nextNonzero can be removed, right?

srowen · 2015-01-16T03:59:10Z

@hhbyyh Oops yes that's a typo. The helper function should refer to the local argument values only!
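With that typo fixed, the helper-based comparison can be written as a standalone sketch on raw arrays. The object `SparseCompare` and the method name `sameNonzeros` are hypothetical; the sketch assumes both vectors have the same logical size and ascending indices.

```scala
object SparseCompare {
  // Returns the first position >= from whose stored value is nonzero,
  // or values.length if there is none. Uses only the local argument.
  def nextNonzero(values: Array[Double], from: Int): Int = {
    var index = from
    while (index < values.length && values(index) == 0.0) index += 1
    index
  }

  // Walks both vectors in lockstep over their nonzero entries only.
  def sameNonzeros(
      i1: Array[Int], v1: Array[Double],
      i2: Array[Int], v2: Array[Double]): Boolean = {
    var k1 = nextNonzero(v1, 0)
    var k2 = nextNonzero(v2, 0)
    while (k1 < v1.length && k2 < v2.length) {
      if (i1(k1) != i2(k2) || v1(k1) != v2(k2)) return false
      k1 = nextNonzero(v1, k1 + 1)
      k2 = nextNonzero(v2, k2 + 1)
    }
    // Equal only if both sides ran out of nonzero entries together.
    k1 == v1.length && k2 == v2.length
  }
}
```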

mengxr · 2015-01-16T07:24:48Z

@hhbyyh @srowen There are some performance issues if we use unnecessary index lookup. Having many other.values(i) calls is slower than val otherValues = other.values and then many otherValues(i) calls. I'm suggesting some code like the following (I didn't try compiling the code):

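// `other` is the sparse vector being compared against `this`.
val ii0 = this.indices
val vv0 = this.values
val ii1 = other.indices
val vv1 = other.values
val sz0 = ii0.length
val sz1 = ii1.length
var j0 = 0
var j1 = 0
var i0 = 0
var i1 = 0
var v0 = 0.0
var v1 = 0.0
// pj0/pj1 remember the last position loaded, so each entry is read only once.
var pj0 = -1
var pj1 = -1
var allEqual = true
while (allEqual && j0 < sz0 && j1 < sz1) {
  if (pj0 < j0) {
    i0 = ii0(j0)
    v0 = vv0(j0)
    pj0 = j0
  }
  if (pj1 < j1) {
    i1 = ii1(j1)
    v1 = vv1(j1)
    pj1 = j1
  }
  if (i0 == i1) {
    // Same index on both sides: the values must match.
    allEqual &&= v0 == v1
    j0 += 1
    j1 += 1
  } else if (i0 < i1) {
    // Entry present only on this side: it must be an explicit zero.
    allEqual &&= v0 == 0.0
    j0 += 1
  } else {
    allEqual &&= v1 == 0.0
    j1 += 1
  }
}
// Any leftover entries on either side must be explicit zeros.
while (allEqual && j0 < sz0) {
  allEqual &&= vv0(j0) == 0.0
  j0 += 1
}
while (allEqual && j1 < sz1) {
  allEqual &&= vv1(j1) == 0.0
  j1 += 1
}
allEqual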

SparkQA · 2015-01-16T08:02:42Z

Test build #25645 has started for PR 3997 at commit bdf8789.

This patch merges cleanly.

hhbyyh · 2015-01-16T08:52:19Z

Oh, I didn't see your comment before the update. @mengxr I did find a lot of code in Spark using that pattern (first assign locally, then access), and we can follow it.

To be honest, I'm not sure I completely understand the reason. Is it to save a memory dereference? I ran some local perf tests and saw no obvious difference. Would you please share some insight? Thanks a lot!

SparkQA · 2015-01-16T09:14:23Z

Test build #25645 has finished for PR 3997 at commit bdf8789.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2015-01-16T15:41:45Z

mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala

Yes I like this version. Still pretty clear, and I think it's fine to optimize this a bit. There's still an opportunity to optimize sparse vs dense comparison, right?

Yes, about the sparse vs dense.
My current thought is to put a new function in Vectors, maybe something like sqdist(sparse, dense), and call it from Vector.equals. Does that sound good?

You mean a function structured like sqdist, not a distance-based function right? yes that sounds good to me.

yes, and I found the sparse vs dense is actually quite similar to sparse vs sparse, I'm trying to unify them.
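One way to picture the unification described here is to view a dense vector as a sparse one whose indices are simply 0 until its length. The following is a sketch under that assumption, not the code from the PR; the object `UnifiedEquals`, its method names, and the choice to pass index lookups as functions are all hypothetical.

```scala
object UnifiedEquals {
  // Compares two (index lookup, values) views, skipping zeros on both sides.
  // index1(k) and index2(k) give the logical index of the k-th stored value.
  def equalsByNonzeros(
      index1: Int => Int, val1: Array[Double],
      index2: Int => Int, val2: Array[Double]): Boolean = {
    val n1 = val1.length
    val n2 = val2.length
    var k1 = 0
    var k2 = 0
    var allEqual = true
    while (allEqual) {
      while (k1 < n1 && val1(k1) == 0.0) k1 += 1
      while (k2 < n2 && val2(k2) == 0.0) k2 += 1
      if (k1 >= n1 || k2 >= n2) {
        // Equal only if both sides are exhausted at the same time.
        return k1 >= n1 && k2 >= n2
      }
      allEqual = index1(k1) == index2(k2) && val1(k1) == val2(k2)
      k1 += 1
      k2 += 1
    }
    allEqual
  }

  // Sparse vs dense: the dense side uses the identity index mapping.
  def sparseVsDense(
      size: Int, indices: Array[Int], values: Array[Double],
      dense: Array[Double]): Boolean =
    size == dense.length &&
      equalsByNonzeros(k => indices(k), values, k => k, dense)

  // Sparse vs sparse reuses the same routine with both index arrays.
  def sparseVsSparse(
      size1: Int, i1: Array[Int], v1: Array[Double],
      size2: Int, i2: Array[Int], v2: Array[Double]): Boolean =
    size1 == size2 && equalsByNonzeros(k => i1(k), v1, k => i2(k), v2)
}
```

The comparison stays fail-fast and never allocates a full-size array for the sparse side.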

SparkQA · 2015-01-16T16:36:59Z

Test build #25663 has finished for PR 3997 at commit 985e160.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-16T16:37:02Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25663/

SparkQA · 2015-01-18T07:07:40Z

Test build #25714 has started for PR 3997 at commit 93f0d46.

This patch merges cleanly.

SparkQA · 2015-01-18T08:13:01Z

Test build #25714 has finished for PR 3997 at commit 93f0d46.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-18T08:13:03Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25714/

hhbyyh · 2015-01-18T09:44:43Z

Sent an update unifying the "sparse vs sparse" and "sparse vs dense".

srowen · 2015-01-18T19:22:09Z

mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala

Yeah looking good. I have only minor things to say, which don't necessarily need to change. For example the utility method could be called Vectors.equals. Should there even be a catch-all case? I wonder if failing fast is good, if no other implementations are expected now.

Thanks for the review, @srowen.
Yes, I can change the name, just like util.Arrays.equals.
As for the catch-all case, I was thinking that util.Arrays.equals is also fail-fast and covers the dense vs dense case well, with some extensibility. Maybe I missed your point. Let me know if we should still change this. Thanks.

SparkQA · 2015-01-19T05:57:42Z

Test build #25739 has started for PR 3997 at commit 0d9d130.

This patch merges cleanly.

SparkQA · 2015-01-19T07:06:17Z

Test build #25739 has finished for PR 3997 at commit 0d9d130.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-19T07:06:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25739/

hhbyyh · 2015-01-19T15:08:15Z

Update added for the unit tests and the function name change. @mengxr @srowen

mengxr · 2015-01-20T20:59:48Z

test this please

SparkQA · 2015-01-20T21:02:50Z

Test build #25843 has started for PR 3997 at commit 0d9d130.

This patch merges cleanly.

SparkQA · 2015-01-20T22:07:31Z

Test build #25843 has finished for PR 3997 at commit 0d9d130.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-20T22:07:34Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25843/

mengxr · 2015-01-20T23:21:04Z

LGTM. Merged into master. Thanks!

hhbyyh · 2015-01-21T01:15:05Z

Thanks

…icient

JIRA Issue: https://issues.apache.org/jira/browse/SPARK-5186

Currently SparseVector is using the inherited equals from Vector, which will create a full-size array for even the sparse vector. The pull request contains a specialized equals optimization that improves on both time and space.

1. The implementation will be consistent with the original. Especially it will keep equality comparison between SparseVector and DenseVector.

Author: Yuhao Yang <[email protected]>
Author: Yuhao Yang <[email protected]>

Closes apache#3997 from hhbyyh/master and squashes the following commits:

0d9d130 [Yuhao Yang] function name change and ut update
93f0d46 [Yuhao Yang] unify sparse vs dense vectors
985e160 [Yuhao Yang] improve locality for equals
bdf8789 [Yuhao Yang] improve equals and rewrite hashCode for Vector
a6952c3 [Yuhao Yang] fix scala style for comments
50abef3 [Yuhao Yang] fix ut for sparse vector with explicit 0
f41b135 [Yuhao Yang] iterative equals for sparse vector
5741144 [Yuhao Yang] Specialized equals for SparseVector

srowen reviewed Jan 12, 2015

Specialized equals for SparseVector

5741144

mengxr reviewed Jan 13, 2015

hhbyyh closed this Jan 14, 2015

hhbyyh reopened this Jan 15, 2015

srowen reviewed Jan 15, 2015

srowen reviewed Jan 16, 2015

hhbyyh added 2 commits January 16, 2015 23:47

fix ut for sparse vector with explicit 0

50abef3

fix scala style for comments

a6952c3

hhbyyh added 2 commits January 17, 2015 15:57

improve equals and rewrite hashCode for Vector

bdf8789

improve locality for equals

985e160

srowen reviewed Jan 18, 2015

unify sparse vs dense vectors

93f0d46

function name change and ut update

0d9d130

asfgit closed this in 2f82c84 Jan 20, 2015
