Commit 5b852fa

Add documentation for min_hash filter (#39671)
Closes #20757

1 parent 8657e6e


docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc

Lines changed: 119 additions & 2 deletions
@@ -1,7 +1,7 @@
 [[analysis-minhash-tokenfilter]]
-=== Minhash Token Filter
+=== MinHash Token Filter
 
-A token filter of type `min_hash` hashes each token of the token stream and divides
+The `min_hash` token filter hashes each token of the token stream and divides
 the resulting hashes into buckets, keeping the lowest-valued hashes per
 bucket. It then returns these hashes as tokens.
 
@@ -20,3 +20,120 @@ The following are settings that can be set for a `min_hash` token filter.
bucket to its circular right. Only takes effect if hash_set_size is equal to one.
Defaults to `true` if bucket_count is greater than one, else `false`.
|=======================================================================

Some points to consider while setting up a `min_hash` filter:

* `min_hash` filter input tokens should typically be k-word shingles produced
from a <<analysis-shingle-tokenfilter,shingle token filter>>. You should
choose `k` large enough so that the probability of any given shingle
occurring in a document is low. At the same time, as
internally each shingle is hashed into a 128-bit hash, you should choose
`k` small enough so that all possible
different k-word shingles can be hashed into 128-bit hashes with
minimal collision. 5-word shingles typically work well.

* choosing the right settings for `hash_count`, `bucket_count` and
`hash_set_size` needs some experimentation.
** to improve precision, you should increase `bucket_count` or
`hash_set_size`. Higher values of `bucket_count` or `hash_set_size`
will provide a higher guarantee that different tokens are
indexed to different buckets.
** to improve recall, you should increase the `hash_count` parameter. For example,
setting `hash_count=2` will make each token be hashed in
two different ways, thus increasing the number of potential
candidates for search.

* the default settings make the `min_hash` filter produce
512 `min_hash` tokens for each document, each 16 bytes in size.
Thus, each document's size will be increased by around 8KB
(512 tokens × 16 bytes).

* the `min_hash` filter is used to hash for Jaccard similarity. This means
that it doesn't matter how many times a document contains a certain token,
only whether it contains it or not (see the sketch after this list).
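
As a minimal illustration of the last point, in plain Python rather than
Elasticsearch code (the shingle lists are invented for the example):
Jaccard similarity compares token *sets*, so repeating a shingle inside a
document does not change the score.

[source,python]
--------------------------------------------------
def jaccard(a, b):
    """|A intersect B| / |A union B| over token sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

doc1 = ["a quick brown fox", "quick brown fox jumps"]
doc2 = ["a quick brown fox", "a quick brown fox",   # duplicated shingle
        "quick brown fox jumps"]
doc3 = ["an idle grey wolf", "idle grey wolf sleeps"]

print(jaccard(doc1, doc2))  # 1.0 -- the duplicate is collapsed by set()
print(jaccard(doc1, doc3))  # 0.0 -- no shingles in common
--------------------------------------------------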

==== Theory
The MinHash token filter allows you to hash documents for similarity search.
Similarity search, or nearest neighbor search, is a complex problem.
A naive solution requires an exhaustive pairwise comparison between a query
document and every document in an index. This is a prohibitive operation
if the index is large. A number of approximate nearest neighbor search
solutions have been developed to make similarity search more practical and
computationally feasible. One of these solutions involves hashing of documents.

Documents are hashed in such a way that similar documents are more likely
to produce the same hash code and are put into the same hash bucket,
while dissimilar documents are more likely to be hashed into
different hash buckets. This type of hashing is known as
locality sensitive hashing (LSH).

Depending on what constitutes the similarity between documents,
various LSH functions https://arxiv.org/abs/1408.2927[have been proposed].
For https://en.wikipedia.org/wiki/Jaccard_index[Jaccard similarity], a popular
LSH function is https://en.wikipedia.org/wiki/MinHash[MinHash].
The general idea of the way MinHash produces a signature for a document
is to apply a random permutation over the whole index vocabulary (a random
numbering of the vocabulary) and record the minimum value of this permutation
for the document (the minimum number of a vocabulary word that is present
in the document). The permutations are run several times;
combining the minimum values from all of them constitutes a
signature for the document.

In practice, instead of random permutations, a number of hash functions
are chosen. Each hash function calculates a hash code for each of a
document's tokens and chooses the minimum hash code among them.
The minimum hash codes from all hash functions are combined
to form a signature for the document.
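
A minimal sketch of this idea, again in plain Python rather than
Elasticsearch code (the hash functions are simulated by salting MD5, and
`hash_count=128` is an arbitrary choice for the illustration):

[source,python]
--------------------------------------------------
import hashlib

def minhash_signature(tokens, hash_count=128):
    """For each (salted) hash function, keep the minimum hash code
    seen across all of the document's tokens."""
    return [
        min(int.from_bytes(hashlib.md5(f"{salt}:{t}".encode()).digest(), "big")
            for t in tokens)
        for salt in range(hash_count)
    ]

def estimated_jaccard(sig1, sig2):
    """The fraction of signature positions where two documents agree
    approximates the Jaccard similarity of their token sets."""
    matches = sum(1 for a, b in zip(sig1, sig2) if a == b)
    return matches / len(sig1)

sig1 = minhash_signature({"a quick brown fox", "quick brown fox jumps"})
sig2 = minhash_signature({"a quick brown fox", "a slow green turtle"})
print(estimated_jaccard(sig1, sig2))  # roughly 1/3, the true Jaccard value
--------------------------------------------------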

==== Example of setting up a MinHash Token Filter in Elasticsearch
Here is an example of setting up a `min_hash` filter:

[source,js]
--------------------------------------------------
POST /index1
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingle_filter": {      <1>
          "type": "shingle",
          "min_shingle_size": 5,
          "max_shingle_size": 5,
          "output_unigrams": false
        },
        "my_minhash_filter": {
          "type": "min_hash",
          "hash_count": 1,          <2>
          "bucket_count": 512,      <3>
          "hash_set_size": 1,       <4>
          "with_rotation": true     <5>
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_shingle_filter",
            "my_minhash_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "fingerprint": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE
<1> setting a shingle filter with 5-word shingles
<2> setting min_hash filter to hash with 1 hash
<3> setting min_hash filter to hash tokens into 512 buckets
<4> setting min_hash filter to keep only a single smallest hash in each bucket
<5> setting min_hash filter to fill empty buckets with values from neighboring buckets
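
To see what the analyzer emits, you can, for example, run it through the
`_analyze` API (the request below assumes the `index1` setup above; the
returned tokens are the raw 16-byte MinHash values, so they are not
human-readable):

[source,js]
--------------------------------------------------
GET /index1/_analyze
{
  "analyzer": "my_analyzer",
  "text": "the quick brown fox jumped over the lazy sleeping dog"
}
--------------------------------------------------
// NOTCONSOLE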
