[[analysis-minhash-tokenfilter]]
=== MinHash Token Filter

The `min_hash` token filter hashes each token of the token stream and divides
the resulting hashes into buckets, keeping the lowest-valued hashes per
bucket. It then returns these hashes as tokens.

The following are settings that can be set for a `min_hash` token filter.

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`hash_count` |The number of hashes to hash the token stream with. Defaults
to `1`.

|`bucket_count` |The number of buckets to divide the minhashes into. Defaults
to `512`.

|`hash_set_size` |The number of minhashes to keep per bucket. Defaults to `1`.

|`with_rotation` |Whether or not to fill empty buckets with the value of the
first non-empty bucket to its circular right. Only takes effect if
`hash_set_size` is equal to one. Defaults to `true` if `bucket_count` is
greater than one, else `false`.
|=======================================================================

Some points to consider when setting up a `min_hash` filter:

* `min_hash` filter input tokens should typically be k-word shingles produced
by a <<analysis-shingle-tokenfilter,shingle token filter>>. You should
choose `k` large enough so that the probability of any given shingle
occurring in a document is low. At the same time, as
internally each shingle is hashed into a 128-bit hash, you should choose
`k` small enough so that all possible
different k-word shingles can be hashed into the 128-bit hash space with
minimal collision. 5-word shingles typically work well.

* choosing the right settings for `hash_count`, `bucket_count` and
`hash_set_size` needs some experimentation.
** to improve precision, you should increase `bucket_count` or
`hash_set_size`. Higher values of `bucket_count` or `hash_set_size`
provide a stronger guarantee that different tokens are
indexed to different buckets.
** to improve recall, you should increase the `hash_count` parameter. For
example, setting `hash_count=2` will make each token be hashed in
two different ways, thus increasing the number of potential
candidates for search.

* the default settings make the `min_hash` filter produce 512 `min_hash`
tokens for each document, each 16 bytes in size.
Thus, each document's size will be increased by around 8KB
(see the sketch after this list for the arithmetic).

* the `min_hash` filter is used to hash for Jaccard similarity. This means
that it doesn't matter how many times a document contains a certain token,
only whether it contains it or not.
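
As a minimal illustration of the last two points, here is a short Python
sketch (not Elasticsearch code; the function names are invented for this
example) of the token-size arithmetic behind the ~8KB estimate and of the set
semantics of Jaccard similarity:

[source,python]
--------------------------------------------------
def minhash_tokens_bytes(hash_count=1, bucket_count=512, hash_set_size=1,
                         token_bytes=16):
    """Bytes of min_hash tokens kept per document for the given settings."""
    # one token is kept per (hash function, bucket, set slot) combination
    return hash_count * bucket_count * hash_set_size * token_bytes

def jaccard(doc_a, doc_b):
    """Jaccard similarity of two token streams; token frequency is ignored."""
    a, b = set(doc_a), set(doc_b)  # sets discard repeated tokens
    return len(a & b) / len(a | b)

print(minhash_tokens_bytes())                # 8192 bytes, i.e. around 8KB
print(jaccard(["x", "y", "x"], ["x", "y"]))  # 1.0 -- frequency doesn't matter
--------------------------------------------------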

==== Theory
The MinHash token filter allows you to hash documents for similarity search.
Similarity search, or nearest neighbor search, is a complex problem.
A naive solution requires an exhaustive pairwise comparison between a query
document and every document in an index. This is a prohibitive operation
if the index is large. A number of approximate nearest neighbor search
solutions have been developed to make similarity search more practical and
computationally feasible. One of these solutions involves hashing of documents.

Documents are hashed in a way that similar documents are more likely
to produce the same hash code and are put into the same hash bucket,
while dissimilar documents are more likely to be hashed into
different hash buckets. This type of hashing is known as
locality sensitive hashing (LSH).

Depending on what constitutes the similarity between documents,
various LSH functions https://arxiv.org/abs/1408.2927[have been proposed].
For https://en.wikipedia.org/wiki/Jaccard_index[Jaccard similarity], a popular
LSH function is https://en.wikipedia.org/wiki/MinHash[MinHash].
The general idea of the way MinHash produces a signature for a document
is by applying a random permutation over the whole index vocabulary (a random
numbering of the vocabulary), and recording the minimum value of this
permutation for the document (the minimum number among the vocabulary words
that are present in the document). The permutations are run several times;
combining the minimum values for all of them constitutes a
signature for the document.

In practice, instead of random permutations, a number of hash functions
are chosen. A hash function calculates a hash code for each of a
document's tokens and chooses the minimum hash code among them.
The minimum hash codes from all hash functions are combined
to form a signature for the document.
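
A minimal Python sketch of this construction follows (this is not
Elasticsearch's internal implementation; the salted-MD5 hash functions and the
names are invented for illustration). The fraction of positions on which two
signatures agree estimates the Jaccard similarity of the two documents:

[source,python]
--------------------------------------------------
import hashlib

def token_hash(token, seed):
    """One of the simulated hash functions, derived by salting a base hash."""
    digest = hashlib.md5(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(tokens, num_hashes=4):
    """For each hash function, keep the minimum hash over all tokens."""
    return [min(token_hash(t, seed) for t in tokens)
            for seed in range(num_hashes)]

doc1 = ["the quick brown fox", "quick brown fox jumped"]  # e.g. shingles
doc2 = ["the quick brown fox", "quick brown fox leaped"]
sig1, sig2 = minhash_signature(doc1), minhash_signature(doc2)
# matching positions / signature length ~ estimated Jaccard similarity
print(sum(a == b for a, b in zip(sig1, sig2)) / len(sig1))
--------------------------------------------------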

==== Example of setting MinHash Token Filter in Elasticsearch
Here is an example of setting up a `min_hash` filter:

[source,js]
--------------------------------------------------
POST /index1
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingle_filter": { <1>
          "type": "shingle",
          "min_shingle_size": 5,
          "max_shingle_size": 5,
          "output_unigrams": false
        },
        "my_minhash_filter": {
          "type": "min_hash",
          "hash_count": 1, <2>
          "bucket_count": 512, <3>
          "hash_set_size": 1, <4>
          "with_rotation": true <5>
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_shingle_filter",
            "my_minhash_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE
<1> setting a shingle filter with 5-word shingles
<2> setting min_hash filter to hash with 1 hash
<3> setting min_hash filter to hash tokens into 512 buckets
<4> setting min_hash filter to keep only a single smallest hash in each bucket
<5> setting min_hash filter to fill empty buckets with values from neighboring buckets
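
To see the filter in action, you can run the `_analyze` API against
`my_analyzer` (assuming the `index1` setup above); the exact tokens returned
depend on the hash values, so no sample output is shown here:

[source,js]
--------------------------------------------------
GET /index1/_analyze
{
  "analyzer": "my_analyzer",
  "text": "the quick brown fox jumped over the lazy dog"
}
--------------------------------------------------
// NOTCONSOLE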