@Yunni
Contributor

Yunni commented Feb 28, 2017

What changes were proposed in this pull request?

Implemented a new Param, numHashFunctions, as the dimension of AND-amplification for Locality Sensitive Hashing (LSH). The hash of each feature is now an array of size numHashTables, and each element of the array is a vector of size numHashFunctions.

Two features are in the same hash bucket iff ANY pair of corresponding vectors is equal (OR-amplification). Two vectors are equal iff ALL pairs of corresponding entries are equal (AND-amplification).
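
As a rough illustration of the bucket rule described above, here is a minimal pure-Scala sketch that does not depend on Spark. The object name `LshAmplificationSketch`, the `HashSignature` alias, and the use of `Seq[Seq[Int]]` in place of Spark's array of vectors are assumptions made for this example, not code from the patch. It also assumes that vectors are compared table by table, at the same index.

```scala
object LshAmplificationSketch {
  // Hypothetical stand-in for the hash of one feature:
  // the outer Seq has numHashTables entries,
  // each inner Seq has numHashFunctions entries.
  type HashSignature = Seq[Seq[Int]]

  // OR-amplification: a match in ANY table is enough.
  // AND-amplification: within a table, ALL entries must match,
  // which is what structural equality on the inner Seq checks.
  def sameBucket(a: HashSignature, b: HashSignature): Boolean = {
    require(a.length == b.length, "both signatures must use the same numHashTables")
    a.zip(b).exists { case (va, vb) => va == vb }
  }
}
```

With numHashTables = 2 and numHashFunctions = 2, `sameBucket(Seq(Seq(1, 2), Seq(3, 4)), Seq(Seq(9, 9), Seq(3, 4)))` holds because the second table matches in full, while signatures that only agree on single entries within a table do not collide.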

Will create follow-up PRs for Python API and Doc/Examples.

How was this patch tested?

By running unit tests MinHashLSHSuite and BucketedRandomProjectionLSHSuite.

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73550 has finished for PR 17092 at commit 9dd87ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Yunni
Contributor Author

Yunni commented Feb 28, 2017

@jkbradley @MLnick Here is a clean PR. Sorry for messing up the previous one!

@merlintang I am happy to continue our discussion here: https://issues.apache.org/jira/browse/SPARK-19771, as OR-AND amplification requires many more changes than SPARK-18450.

@merlintang

merlintang commented Feb 28, 2017

@Yunni OK, let us discuss the further optimization step in another ticket. Let me manually check and test this patch, because I have one concern here. I will let you know later.

@merlintang

@Yunni I tested this patch locally and it works, but I have one idea to improve it. We can discuss it in another ticket.

@Yunni
Contributor Author

Yunni commented Mar 9, 2017

@jkbradley @sethah Please review when you have time. Thanks!

@Yunni
Contributor Author

Yunni commented Apr 6, 2017

Ping.

@Yunni
Contributor Author

Yunni commented May 6, 2017

@MLnick @jkbradley @sethah Could you review this? Thanks!

@kturgut

kturgut commented Nov 2, 2017

@jkbradley @MLnick @sethah @Yunni @merlintang @akatz
It seems LSH would be a perfect fit for matching patient records, if only I could figure out how to assign different weights to each column of the patient records that I am comparing. For instance, each record may have zero to many identifiers. If the identifiers match exactly, we consider it a solid match. However, if the IDs do not strongly match, we also look at an additional set of fields such as name, birthdate, and address, each at a different weight.
For instance, an exact match on names is stronger than a match with small typos.
To give a different weight to each field we are comparing, would I have to write a custom distance calculator?
Or, as described in http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf, should I do MinHashing first and then LSH as a second step?
AND-OR amplification does not look like it would help here: it only takes the number of hash functions as input, and we do not seem to have control over the sensitivity of the individual hash functions.
I would really appreciate your guidance.
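
For context on the amplification parameters mentioned above, the standard AND-OR collision probability can be sketched as follows. This is only an illustration: it assumes every hash function collides independently with the same probability p for a given pair of records, and the object and method names are hypothetical rather than part of Spark's API.

```scala
object AmplificationProbabilitySketch {
  // Probability that two items share at least one bucket when
  // numHashFunctions hashes are AND-ed inside each table and
  // numHashTables tables are OR-ed together.
  // Assumption: each hash function collides independently with probability p.
  def collisionProbability(p: Double, numHashFunctions: Int, numHashTables: Int): Double = {
    require(p >= 0.0 && p <= 1.0, "p must be a probability")
    require(numHashFunctions >= 1 && numHashTables >= 1, "both counts must be positive")
    val allMatchInOneTable = math.pow(p, numHashFunctions)      // AND step
    1.0 - math.pow(1.0 - allMatchInOneTable, numHashTables)     // OR step
  }
}
```

Under these assumptions, raising numHashFunctions lowers the collision probability and raising numHashTables increases it. Both parameters reshape the same curve over p; neither introduces per-field weights.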

Contributor

WeichenXu123 left a comment

Supposing we want to support OR-AND amplification in the future, how would the API be extended or changed? Would we add a boolean parameter to choose between OR-AND and AND-OR?

Also, the names numHashFunctions and numHashTables may be a little confusing for users.

@Since("2.1.0")
override def setNumHashTables(value: Int): this.type = super.setNumHashTables(value)

@Since("2.2.0")
Contributor

@Since("2.4.0")

@Since("2.1.0")
override def setNumHashTables(value: Int): this.type = super.setNumHashTables(value)

@Since("2.2.0")
Contributor

Ditto.

@HyukjinKwon
Member

ping @Yunni
