Skip to content

Conversation

@softwaredoug
Copy link
Contributor

The minhash algorithm implements locality-sensitive hashing. Basically, every token is hashed into a hash table. The hash table is then sampled to generate a smaller set of representative tokens.

This has several applications

  • Dimensionality reduction that preserves information that can be used in similarity calculations. For example, efficient deduplication over large sets of text, more like this, etc.

The minhash filter has 4 parameters

  • bucket_size (def 512) how many buckets to use in the hash table. Will also be the number of tokens output from the minhash filter. More buckets means slower performance and more space, but greater precision.
  • num_hash_tables (def 1). As hash tables are sampled by looking at the first item in the bucket, adding more hash tables will provide added precision
  • bucket_depth (def 1). How many items should be stored in each bucket?
  • rotation (def true if num_buckets > 1). For any empty buckets, fills them with the contents of the next bucket

Implementation Notes/Questions

  • Are tests needed above the Lucene tests that already exist?
  • Do I need to provide documentation as well?
  • You'll note the fwd'ing to the Lucene filter factory. This is because of an annoying issue with the MinHashFilter. See this Lucene issue

import java.util.Map;


public class MinHashTokenFilterFactory extends AbstractTokenFilterFactory {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be final?

@s1monw
Copy link
Contributor

s1monw commented Sep 6, 2016

Thanks for doing this!! I left some comments, I also wonder if we can get a simple test for all the parameters etc. like org.elasticsearch.index.analysis.KeepFilterFactoryTests. It also seems that you haven't signed the CLA, maybe you used a different mail address than in your git commits?

@s1monw
Copy link
Contributor

s1monw commented Sep 6, 2016

Do I need to provide documentation as well?

that would be fantastic

@s1monw s1monw self-assigned this Sep 6, 2016
@jasontedor
Copy link
Member

Isn't there already an open PR for this, #20206?

@s1monw
Copy link
Contributor

s1monw commented Sep 6, 2016

Isn't there already an open PR for this, #20206?

I agree the one you linked looks more complete, I think we should close this one?

@softwaredoug
Copy link
Contributor Author

@jasontedor so there is! I swear I searched for minhash, now I see it. Sorry about the dup. @s1monw not sure if you have a procedure for closing dups?

@s1monw I'll look into the CLA for future PRs.

@s1monw
Copy link
Contributor

s1monw commented Sep 6, 2016

thanks anyway @softwaredoug ;) doesn't happen too often - keep it coming!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants