@hhbyyh
Contributor

@hhbyyh hhbyyh commented Jan 12, 2015

JIRA Issue: https://issues.apache.org/jira/browse/SPARK-5186

Currently SparseVector uses the equals inherited from Vector, which creates a full-size array even for a sparse vector. This pull request adds a specialized equals that improves both time and space.

  1. The implementation stays consistent with the original. In particular, it preserves equality comparison between SparseVector and DenseVector.
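
For illustration, here is a minimal sketch of the inherited behaviour the description refers to. `SimpleVector`, `SimpleDense`, and `SimpleSparse` are hypothetical stand-ins rather than Spark's classes; the only detail taken from this thread is that the inherited equals and hashCode work on the result of `toArray`.

```scala
import java.util.Arrays

// Hypothetical stand-ins for the Vector hierarchy (all names are assumptions).
object InheritedEqualsSketch {
  trait SimpleVector {
    def size: Int
    def toArray: Array[Double]

    // Inherited equality: both sides are materialized as dense arrays.
    override def equals(other: Any): Boolean = other match {
      case v: SimpleVector => Arrays.equals(this.toArray, v.toArray)
      case _ => false
    }

    override def hashCode(): Int = Arrays.hashCode(this.toArray)
  }

  class SimpleDense(val values: Array[Double]) extends SimpleVector {
    def size: Int = values.length
    def toArray: Array[Double] = values
  }

  class SimpleSparse(val size: Int, val indices: Array[Int], val values: Array[Double])
      extends SimpleVector {
    // Allocates `size` doubles even when only a few entries are stored.
    def toArray: Array[Double] = {
      val data = new Array[Double](size)
      var i = 0
      while (i < indices.length) {
        data(indices(i)) = values(i)
        i += 1
      }
      data
    }
  }
}
```

With this shape, comparing two sparse vectors of size one million allocates two arrays of one million doubles, regardless of how few entries are actually stored.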

@AmplabJenkins

Can one of the admins verify this patch?

@mengxr
Contributor

mengxr commented Jan 12, 2015

add to whitelist

@mengxr
Contributor

mengxr commented Jan 12, 2015

ok to test

@SparkQA

SparkQA commented Jan 12, 2015

Test build #25399 has started for PR 3997 at commit 5741144.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 12, 2015

Test build #25399 has finished for PR 3997 at commit 5741144.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25399/
Test PASSed.

@hhbyyh
Contributor Author

hhbyyh commented Jan 12, 2015

Thanks, can someone help review please?

Member

You must override hashCode too!

Contributor Author

Thanks @srowen for the comment. Glad to discuss it with someone.

Vector: override def hashCode(): Int = util.Arrays.hashCode(this.toArray)

I understand it's the general guideline to override hashCode at the same time.
Yet, intentionally or not, the original code promises that DenseVector and SparseVector return the same results from equals and hashCode for the same array content. And that makes sense.

As in the description of the PR, I don’t want to introduce breaking changes. And if we want to keep the original design, the current implementation of hashCode in Vector is one of the best choices. That’s why hashCode was intentionally left out of the PR. (maybe I should add some comment)

Member

Yeah, I was thinking of performance too. That is a fair point, of course, about not changing the semantics. I think the hashCode implementation could be changed in both places to produce the same result while being faster for sparse vectors: what if the hash were defined only over nonzero entries? I think equals could likewise be sped up for the case of comparing sparse to dense. That is, I think this can be taken a step further than just optimizing sparse == sparse.

Contributor

+1 on @srowen's suggestion. We can design a hashing scheme based on the size and the nonzero entries (with their indices). But I think we could do that in a separate PR. Does that sound good?
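
One way such a scheme could look is sketched below, purely as an illustration. The object and method names are hypothetical, and the mixing constants are an arbitrary choice, not what Spark uses. The point is only that a hash built from the size and the nonzero (index, value) pairs gives the same result for a dense vector, a sparse vector, and a sparse vector that stores explicit zeros.

```scala
// Hypothetical sketch: hash over size plus nonzero (index, value) pairs.
object NonzeroHashSketch {
  // `indexAt(k)` gives the vector index of the k-th stored value.
  def nonzeroHash(size: Int, indexAt: Int => Int, values: Array[Double]): Int = {
    var result = 31 + size
    var k = 0
    while (k < values.length) {
      val v = values(k)
      if (v != 0.0) { // stored zeros contribute nothing to the hash
        val bits = java.lang.Double.doubleToLongBits(v)
        result = 31 * result + indexAt(k)
        result = 31 * result + (bits ^ (bits >>> 32)).toInt
      }
      k += 1
    }
    result
  }

  def sparseHash(size: Int, indices: Array[Int], values: Array[Double]): Int =
    nonzeroHash(size, k => indices(k), values)

  // A dense vector stores its k-th value at index k.
  def denseHash(values: Array[Double]): Int =
    nonzeroHash(values.length, k => k, values)
}
```

The cost is proportional to the number of stored entries rather than to the vector size, which is the speed-up being discussed for sparse vectors.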

Contributor Author

Yes, the nonzero entries idea did cross my mind. Maybe I'm being overcautious, but I think it could become a complication if we add another kind of Vector in the future that doesn't have a handy internal structure to scan for nonzero entries.

And the dense == sparse idea looks good. Maybe it fits better in a utility method, since that would avoid making SparseVector aware of DenseVector and vice versa.

Contributor

Just realized that this did change the behavior of the original equals, because we allow values to have explicit zeros. Maybe we can use Vectors.sqdist(this, other) == 0 to implement equals().
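
A small sketch of the explicit-zero problem, assuming a specialized equals that compares the stored indices and values arrays directly (the helper names here are hypothetical). Two sparse representations of the same vector can differ only in whether a zero is stored, so a direct array comparison disagrees with the dense comparison.

```scala
import java.util.Arrays

// Hypothetical helpers illustrating why explicit zeros matter.
object ExplicitZeroSketch {
  // Naive check: compares the stored arrays as they are.
  def naiveSparseEquals(
      i1: Array[Int], v1: Array[Double],
      i2: Array[Int], v2: Array[Double]): Boolean =
    Arrays.equals(i1, i2) && Arrays.equals(v1, v2)

  // Reference check: expand to a dense array first.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    val data = new Array[Double](size)
    indices.zip(values).foreach { case (i, v) => data(i) = v }
    data
  }

  def main(args: Array[String]): Unit = {
    val (ia, va) = (Array(0, 2), Array(1.0, 3.0))         // zero at index 1 is implicit
    val (ib, vb) = (Array(0, 1, 2), Array(1.0, 0.0, 3.0)) // zero at index 1 is stored
    println(naiveSparseEquals(ia, va, ib, vb))                        // false
    println(Arrays.equals(toDense(3, ia, va), toDense(3, ib, vb)))    // true
  }
}
```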

Contributor Author

Nice suggestion. This should be your contribution and I'll close the PR. Thanks.

@hhbyyh
Contributor Author

hhbyyh commented Jan 14, 2015

There's a better idea from mengxr.

@hhbyyh hhbyyh closed this Jan 14, 2015

@srowen
Member

srowen commented Jan 14, 2015

@hhbyyh Although I don't wish to speak for @mengxr I strongly suspect he would welcome you implementing the idea. I'm not sure we should use sqdist since it looks like it doesn't require vectors of the same size, and that seems essential for equals and hashCode? But I think it can indeed handle the cases of sparse vs dense similarly and efficiently.

@mengxr
Contributor

mengxr commented Jan 15, 2015

Agree with @srowen! @hhbyyh It would be great if you want to re-open this PR and implement a faster and correct Vector.equals. If sqdist doesn't check vector sizes, we should add checks there. It would also be nice to add a unit test in VectorsSuite showing that it works for two sparse vectors that have different values arrays but are equal.

@hhbyyh
Contributor Author

hhbyyh commented Jan 15, 2015

Thanks, I'll give it a try.

@hhbyyh hhbyyh reopened this Jan 15, 2015

@SparkQA

SparkQA commented Jan 15, 2015

Test build #25607 has started for PR 3997 at commit 50abef3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 15, 2015

Test build #25607 has finished for PR 3997 at commit 50abef3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25607/
Test FAILed.

@hhbyyh
Contributor Author

hhbyyh commented Jan 15, 2015

Just sent an update. I didn't use sqdist because of a performance concern: the original equals is a fail-fast comparison, whereas sqdist will inevitably compute through the whole vectors even if the first element differs. That performance would be hard to accept for scenarios like doc2Vec over a large vocabulary.
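
The fail-fast concern can be sketched on plain dense arrays, as a simplification. Both functions below are hypothetical and are not Spark's sqdist or equals; each returns its verdict together with the number of elements it visited, so the difference in work is visible.

```scala
// Hypothetical comparison of a distance-style check and a fail-fast check.
object FailFastSketch {
  // Distance-style: always visits every element before answering.
  def equalsViaSqdist(a: Array[Double], b: Array[Double]): (Boolean, Int) = {
    require(a.length == b.length, "sizes must match")
    var sum = 0.0
    var visited = 0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      visited += 1
      i += 1
    }
    (sum == 0.0, visited)
  }

  // Fail-fast: stops at the first mismatching element.
  def equalsFailFast(a: Array[Double], b: Array[Double]): (Boolean, Int) = {
    if (a.length != b.length) return (false, 0)
    var visited = 0
    var i = 0
    while (i < a.length) {
      visited += 1
      if (a(i) != b(i)) return (false, visited)
      i += 1
    }
    (true, visited)
  }
}
```

For two length-4 arrays that differ in the first element, the distance-style check visits all 4 elements while the fail-fast check visits 1.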

The current implementation is still based on comparing indices and values, now with handling for explicit zeros. I ran some tests against the implementation and added a few unit tests. Any comments are welcome!

@SparkQA

SparkQA commented Jan 15, 2015

Test build #25608 has started for PR 3997 at commit a6952c3.

  • This patch merges cleanly.

Member

I wondered to myself whether this could be simplified to not have while (true), the dummy Exception, etc. The best I could do was with a helper function:

...
var k1 = nextNonzero(this.values, 0)
var k2 = nextNonzero(v.values, 0)

while (k1 < this.values.size && k2 < v.values.size) {
  if (this.indices(k1) != v.indices(k2) || this.values(k1) != v.values(k2)) {
    return false
  }
  k1 = nextNonzero(this.values, k1 + 1)
  k2 = nextNonzero(v.values, k2 + 1)
}

return (k1 == this.values.size && k2 == v.values.size) 
...

def nextNonzero(values: Array[Double], from: Int): Int = {
  var index = from
  while (index < this.values.size && this.values(index) == 0.0) index += 1
  index
}

I'm not sure it's better, just food for thought.

So the idea would be to specialize hashCode as well, and also handle DenseVector right? and even remove the implementations in the parent?

@SparkQA

SparkQA commented Jan 15, 2015

Test build #25608 has finished for PR 3997 at commit a6952c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25608/
Test PASSed.

@hhbyyh
Contributor Author

hhbyyh commented Jan 16, 2015

@srowen Thanks a lot for the improvement. The code does look cleaner, and the helper function nextNonzero can also be used in the hashCode function. Great idea.

I'll use it if you don't mind. And the this. prefix inside nextNonzero can be removed, right?

@srowen
Member

srowen commented Jan 16, 2015

@hhbyyh Oops yes that's a typo. The helper function should refer to the local argument values only!
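
With that correction applied, the helper-based comparison could be written as the self-contained sketch below. The enclosing object, the `sparseEquals` name, and the explicit array parameters are assumptions made so the snippet runs on its own; it also assumes the two vectors have already been checked to have the same size.

```scala
// Sketch of the helper-based comparison with the typo fixed (names are assumptions).
object NextNonzeroSketch {
  // First position at or after `from` whose stored value is nonzero,
  // or values.length if there is none. Refers only to the local argument.
  def nextNonzero(values: Array[Double], from: Int): Int = {
    var index = from
    while (index < values.length && values(index) == 0.0) index += 1
    index
  }

  // Compares two sparse representations, skipping explicitly stored zeros.
  def sparseEquals(
      indices1: Array[Int], values1: Array[Double],
      indices2: Array[Int], values2: Array[Double]): Boolean = {
    var k1 = nextNonzero(values1, 0)
    var k2 = nextNonzero(values2, 0)
    while (k1 < values1.length && k2 < values2.length) {
      if (indices1(k1) != indices2(k2) || values1(k1) != values2(k2)) return false
      k1 = nextNonzero(values1, k1 + 1)
      k2 = nextNonzero(values2, k2 + 1)
    }
    // Equal only if both sides ran out of nonzero entries together.
    k1 == values1.length && k2 == values2.length
  }
}
```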

@mengxr
Contributor

mengxr commented Jan 16, 2015

@hhbyyh @srowen There are some performance issues if we use unnecessary index lookup. Having many other.values(i) calls is slower than val otherValues = other.values and then many otherValues(i) calls. I'm suggesting some code like the following (I didn't try compiling the code):

val ii0 = this.indices
val vv0 = this.values
val ii1 = other.indices
val vv1 = other.values
val sz0 = ii0.length
val sz1 = ii1.length
var j0 = 0
var j1 = 0
var i0 = 0
var i1 = 0
var v0 = 0.0
var v1 = 0.0
var pj0 = -1
var pj1 = -1
var allEqual = true
while (allEqual && j0 < sz0 && j1 < sz1) {
  // Reload the cached index and value only when the position has advanced.
  if (pj0 < j0) {
    i0 = ii0(j0)
    v0 = vv0(j0)
    pj0 = j0
  }
  if (pj1 < j1) {
    i1 = ii1(j1)
    v1 = vv1(j1)
    pj1 = j1
  }
  if (i0 == i1) {
    allEqual &&= v0 == v1
    j0 += 1
    j1 += 1
  } else if (i0 < i1) {
    // Entry stored only in this vector: it must be an explicit zero.
    allEqual &&= v0 == 0.0
    j0 += 1
  } else {
    allEqual &&= v1 == 0.0
    j1 += 1
  }
}
// Any leftover stored entries on either side must be explicit zeros.
while (allEqual && j0 < sz0) {
  allEqual &&= vv0(j0) == 0.0
  j0 += 1
}
while (allEqual && j1 < sz1) {
  allEqual &&= vv1(j1) == 0.0
  j1 += 1
}
allEqual

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25645 has started for PR 3997 at commit bdf8789.

  • This patch merges cleanly.

@hhbyyh
Contributor Author

hhbyyh commented Jan 16, 2015

Oh I didn't see your comment before the update. @mengxr I surely found a lot of code using the pattern (first assign locally, then access) in Spark, and we can follow it.

To be honest, I'm not sure I completely understand the reason. Is it to save a memory lookup on each access? I ran some local perf tests and saw no obvious difference. Would you please share some insight? Thanks a lot!
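
For reference, the two access patterns under discussion look roughly like the sketch below. `Holder` and both function names are hypothetical. In Scala, a `val` member of a class is read through a generated accessor method, so `h.values(i)` is an accessor call followed by an array read, while the local alias performs the accessor call once. Whether that difference is measurable depends on what the JIT does with the accessor, which would be consistent with seeing no obvious difference in a local test.

```scala
// Hypothetical sketch of the "assign locally, then access" pattern.
object LocalAliasSketch {
  class Holder(val values: Array[Double])

  // The accessor h.values is evaluated on every iteration.
  def sumViaField(h: Holder): Double = {
    var sum = 0.0
    var i = 0
    while (i < h.values.length) {
      sum += h.values(i)
      i += 1
    }
    sum
  }

  // The array reference and its length are read once before the loop.
  def sumViaLocal(h: Holder): Double = {
    val values = h.values
    val n = values.length
    var sum = 0.0
    var i = 0
    while (i < n) {
      sum += values(i)
      i += 1
    }
    sum
  }
}
```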

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25645 has finished for PR 3997 at commit bdf8789.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Yes I like this version. Still pretty clear, and I think it's fine to optimize this a bit. There's still an opportunity to optimize sparse vs dense comparison, right?

Contributor Author

Yes, about sparse vs dense:
My current thought is to put a new function in Vectors, maybe something like sqdist(sparse, dense), and call it from Vector.equals. Does that sound good?

Member

You mean a function structured like sqdist, not a distance-based function, right? Yes, that sounds good to me.

Contributor Author

Yes, and I found that sparse vs dense is actually quite similar to sparse vs sparse, so I'm trying to unify them.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25663 has finished for PR 3997 at commit 985e160.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25663/
Test PASSed.

@SparkQA

SparkQA commented Jan 18, 2015

Test build #25714 has started for PR 3997 at commit 93f0d46.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 18, 2015

Test build #25714 has finished for PR 3997 at commit 93f0d46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25714/
Test PASSed.

@hhbyyh
Contributor Author

hhbyyh commented Jan 18, 2015

Sent an update unifying the "sparse vs sparse" and "sparse vs dense".
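
One way to picture the unification, as a sketch under stated assumptions rather than the code in this PR: a dense vector can be treated as a sparse one whose k-th stored value sits at index k, so a single routine can serve both cases. All names below are hypothetical.

```scala
// Hypothetical sketch of a single comparison routine for sparse and dense inputs.
object UnifiedEqualsSketch {
  private def skipZeros(values: Array[Double], from: Int): Int = {
    var k = from
    while (k < values.length && values(k) == 0.0) k += 1
    k
  }

  // `index1(k)` / `index2(k)` map the k-th stored value to its vector index.
  def vectorsEqual(
      size1: Int, index1: Int => Int, values1: Array[Double],
      size2: Int, index2: Int => Int, values2: Array[Double]): Boolean = {
    if (size1 != size2) return false
    var k1 = skipZeros(values1, 0)
    var k2 = skipZeros(values2, 0)
    while (k1 < values1.length && k2 < values2.length) {
      if (index1(k1) != index2(k2) || values1(k1) != values2(k2)) return false
      k1 = skipZeros(values1, k1 + 1)
      k2 = skipZeros(values2, k2 + 1)
    }
    k1 == values1.length && k2 == values2.length
  }

  def sparseVsSparse(
      size1: Int, i1: Array[Int], v1: Array[Double],
      size2: Int, i2: Array[Int], v2: Array[Double]): Boolean =
    vectorsEqual(size1, k => i1(k), v1, size2, k => i2(k), v2)

  // The dense side uses the identity mapping from position to index.
  def sparseVsDense(
      size: Int, indices: Array[Int], values: Array[Double],
      dense: Array[Double]): Boolean =
    vectorsEqual(size, k => indices(k), values, dense.length, k => k, dense)
}
```

The comparison stays fail-fast and never allocates a dense copy of the sparse side.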

Member

Yeah looking good. I have only minor things to say, which don't necessarily need to change. For example the utility method could be called Vectors.equals. Should there even be a catch-all case? I wonder if failing fast is good, if no other implementations are expected now.

Contributor Author

Thanks for the review, @srowen.
Yes, I can change the name, just like the name in util.Arrays.equals.
As for the catch-all case, I was thinking that util.Arrays.equals is also fail-fast, covers the dense vs dense case well, and leaves some extensibility. Maybe I didn't catch your point. Let me know if we should still change this. Thanks

@SparkQA

SparkQA commented Jan 19, 2015

Test build #25739 has started for PR 3997 at commit 0d9d130.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 19, 2015

Test build #25739 has finished for PR 3997 at commit 0d9d130.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25739/
Test PASSed.

@hhbyyh
Contributor Author

hhbyyh commented Jan 19, 2015

Added an update with the unit test changes and the function name change. @mengxr @srowen

@mengxr
Contributor

mengxr commented Jan 20, 2015

test this please

@SparkQA

SparkQA commented Jan 20, 2015

Test build #25843 has started for PR 3997 at commit 0d9d130.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 20, 2015

Test build #25843 has finished for PR 3997 at commit 0d9d130.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25843/
Test PASSed.

@mengxr
Contributor

mengxr commented Jan 20, 2015

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in 2f82c84 Jan 20, 2015

@hhbyyh
Contributor Author

hhbyyh commented Jan 21, 2015

Thanks

bomeng pushed a commit to Huawei-Spark/spark that referenced this pull request Jan 22, 2015
…icient

JIRA Issue: https://issues.apache.org/jira/browse/SPARK-5186

Currently SparseVector is using the inherited equals from Vector, which will create a full-size array for even the sparse vector. The pull request contains a specialized equals optimization that improves on both time and space.

1. The implementation will be consistent with the original. Especially it will keep equality comparison between SparseVector and DenseVector.

Author: Yuhao Yang <[email protected]>
Author: Yuhao Yang <[email protected]>

Closes apache#3997 from hhbyyh/master and squashes the following commits:

0d9d130 [Yuhao Yang] function name change and ut update
93f0d46 [Yuhao Yang] unify sparse vs dense vectors
985e160 [Yuhao Yang] improve locality for equals
bdf8789 [Yuhao Yang] improve equals and rewrite hashCode for Vector
a6952c3 [Yuhao Yang] fix scala style for comments
50abef3 [Yuhao Yang] fix ut for sparse vector with explicit 0
f41b135 [Yuhao Yang] iterative equals for sparse vector
5741144 [Yuhao Yang] Specialized equals for SparseVector