-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-18450][ML] Scala API Change for LSH AND-amplification #17092
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #73550 has finished for PR 17092 at commit
|
|
@jkbradley @MLnick Here is a clean PR. Sorry for messing up the previous one! @merlintang I am happy to continue our discussion here: https://issues.apache.org/jira/browse/SPARK-19771 as OR-AND amplification requires much more changes than SPARK-18450 |
|
@Yunni ok, let us discuss the further optimization step in other ticket. Let me manually check and test this patch, because I have one concern here. I will let you know later. |
|
@Yunni I test this patch locally, it can work, but I have one idea to improve it. We can discuss it in other ticket. |
|
@jkbradley @sethah Please take a review when you have time. Thanks! |
|
Ping. |
|
@MLnick @jkbradley @sethah Could you take a review? Thanks! |
|
@jkbradley @MLnick @sethah @Yunni @merlintang @akatz |
WeichenXu123
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supposing we want to support OR-AND amplification in the future, how will the API be added or changed ? Add an boolean parameter to specify OR-AND / AND-OR ?
and maybe the names of numHashFunctions and numHashTables are a little confusing for users.
| @Since("2.1.0") | ||
| override def setNumHashTables(value: Int): this.type = super.setNumHashTables(value) | ||
|
|
||
| @Since("2.2.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Since("2.4.0")
| @Since("2.1.0") | ||
| override def setNumHashTables(value: Int): this.type = super.setNumHashTables(value) | ||
|
|
||
| @Since("2.2.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
|
ping @Yunni |
What changes were proposed in this pull request?
Implemented a new Param numHashFunctions as the dimension of AND-amplification for Locality Sensitive Hashing. Now the hash of each feature in LSH is an array of size numHashTables while each element in the array is a vector of size numHashFunctions.
Two features are in the same hash bucket iff ANY pair of the vectors are equal (OR-amplification). Two vectors are equal iff ALL pair of the vector entries are equal (AND-amplification).
Will create follow-up PRs for Python API and Doc/Examples.
How was this patch tested?
By running unit tests MinHashLSHSuite and BucketedRandomProjectionLSHSuite.