-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-31436][ML] MinHash keyDistance optimization #28206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
testcode: results: if both vectors are dense, new impl is 9.09x faster; |
|
Test build #121206 has finished for PR 28206 at commit
|
|
Test build #121207 has finished for PR 28206 at commit
|
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK if tests pass
| xIndex = if (xIter.hasNext) { xSize += 1; xIter.next } else -1 | ||
| } | ||
| } else if (xIndex != -1) { | ||
| while (xIter.hasNext) { xIndex = xIter.next; xSize += 1 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to update xIndex in this case, and yIndex below? looks like you're just counting the remaining size of the iterator. It's possible .size() is as fast from here.
use iter.size for remaining elements
| hashValues.map(Vectors.dense(_)) | ||
| } | ||
|
|
||
| @Since("2.1.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keyDistance was not exposed to the end users, is this since annotation needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it's protected so a little more visible to callers. I wouldn't remove it just for its own sake.
| 1 - intersectionSize.toDouble / unionSize | ||
| } | ||
|
|
||
| @Since("2.1.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
|
Test build #121248 has finished for PR 28206 at commit
|
|
Merged to master |
|
Thanks @srowen for reviewing! |
What changes were proposed in this pull request?
re-impl
keyDistance:if both vectors are dense, new impl is 9.09x faster;
if both vectors are sparse, new impl is 5.66x faster;
if one is dense and the other is sparse, new impl is 7.8x faster;
Why are the changes needed?
current implementation based on set operations is inefficient
Does this PR introduce any user-facing change?
No
How was this patch tested?
existing testsuites