Description
A potentially useful processor is one that can generate one or more "fingerprints" from an incoming document. This could aid in finding duplicates, detecting plagiarism, or clustering similar documents together.
I think there are two realms of fingerprinting: content fingerprinting and structural fingerprinting.
Content Fingerprinting
This mode hashes the content of fields to generate a fingerprint-per-field and, optionally, a fingerprint that represents all the fields. It could use simple hashing, or perhaps something more sophisticated like MinHash, SimHash, Winnowing or Probabilistic Fingerprinting.
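To make the per-field idea concrete, here is a minimal sketch in Python. It is not a proposed implementation: the function names (`hash_value`, `content_fingerprint`) are hypothetical, truncated SHA-256 stands in for whichever hash the processor would really use, and skipping missing fields is an assumption.

```python
import hashlib
from typing import Any, Dict, List


def hash_value(value: Any) -> int:
    """Hash a value to a 32-bit integer (SHA-256 is only a stand-in hash)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def content_fingerprint(
    doc: Dict[str, Any], fields: List[str], hash_all: bool = False
) -> Dict[str, Any]:
    """Return a copy of doc with one fingerprint_<field> entry per listed field."""
    out = dict(doc)
    hashes = []
    for field in fields:
        if field not in doc:
            continue  # assumption: fields missing from the document are skipped
        h = hash_value(doc[field])
        out[f"fingerprint_{field}"] = h
        hashes.append(h)
    if hash_all and hashes:
        # Combine the per-field hashes, in the configured field order.
        combined = ",".join(str(h) for h in hashes)
        out["fingerprint_all"] = hash_value(combined)
    return out
```

An exact hash like this only catches identical field values. Near-duplicate detection would need one of the similarity-preserving schemes (MinHash, SimHash) in place of `hash_value`.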
The API could look something like:
{
"fingerprint": {
"type": "content",
"fields": ["foo", "bar"],
"hash": "minhash",
"hash_all": true
}
}
That is, we specify the type of fingerprinting we want to do (content), a list of fields to hash, the style of hashing, and whether we should also hash all the hashes together. The output would then be the document plus the new fingerprint fields:
{
"foo": "...",
"bar": "...",
"fingerprint_foo": 1283891,
"fingerprint_bar": 8372038,
"fingerprint_all": 3817273
}
Structural Fingerprinting
The other mode of fingerprinting could be structural in nature (this is the one I'm more interested in, tbh). Instead of fingerprinting the content of fields, we are actually fingerprinting the structure of the document itself. Essentially, we would recursively parse the JSON and hash the keys at each level in the JSON tree. These hashes then become a fingerprint for the structure of the document.
Importantly, this type of fingerprinting ignores the leaf values...we just want to fingerprint the JSON keys themselves.
{
"fingerprint": {
"type": "structure",
"root": ["foo"],
"recursive": true,
"hash": "murmur3",
"hash_all": true
}
}
root: defines where to start recursing, in case you only care about a portion of the document. Could be omitted or set to "*" to process the entire document.
recursive: true if you want the processor to fingerprint all the layers, false if you just want the top-level keys hashed.
hash: murmur, minhash, etc.
hash_all: whether all the hashes should be hashed together to build a final fingerprint.
And the new document:
{
"foo": {
"bar": {
"baz": "buzz"
},
"beep": {
"boop": "bop"
}
},
"fingerprint_level1": 1734,
"fingerprint_level2": 992727,
"fingerprint_level3": 110293,
"fingerprint_all": 235240
}
Instead of a fingerprint-per-field, we now have one per "level".
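A rough sketch of the per-level hashing, again in Python and again only illustrative. The function names are hypothetical and truncated SHA-256 stands in for murmur3. The sketch also makes several assumptions: the root keys themselves count as level 1 (which matches the three levels in the example above), keys are sorted before hashing so that key order does not matter, and arrays are treated as leaves.

```python
import hashlib
from typing import Any, Dict, List, Optional


def hash_keys(keys: List[str]) -> int:
    """Hash a list of keys to a 32-bit integer (SHA-256 stands in for murmur3)."""
    joined = "\x00".join(sorted(keys))  # assumption: key order is irrelevant
    return int.from_bytes(hashlib.sha256(joined.encode("utf-8")).digest()[:4], "big")


def keys_by_level(node: Any, level: int, levels: Dict[int, List[str]], recursive: bool) -> None:
    """Collect object keys per depth. Leaf values are never inspected."""
    if not isinstance(node, dict):
        return  # leaves (and, by assumption, arrays) contribute nothing
    levels.setdefault(level, []).extend(node.keys())
    if recursive:
        for child in node.values():
            keys_by_level(child, level + 1, levels, recursive)


def structure_fingerprint(
    doc: Dict[str, Any],
    root: Optional[List[str]] = None,
    recursive: bool = True,
    hash_all: bool = False,
) -> Dict[str, Any]:
    """Return a copy of doc with one fingerprint_level<N> entry per key level."""
    # With no root, the whole document is processed.
    start = doc if root is None else {k: doc[k] for k in root if k in doc}
    levels: Dict[int, List[str]] = {}
    keys_by_level(start, 1, levels, recursive)

    out = dict(doc)
    level_hashes = []
    for level in sorted(levels):
        h = hash_keys(levels[level])
        out[f"fingerprint_level{level}"] = h
        level_hashes.append(h)
    if hash_all and level_hashes:
        # Hash the per-level hashes together, ordered from the top level down.
        out["fingerprint_all"] = hash_keys([str(h) for h in level_hashes])
    return out
```

Because only keys are hashed, two documents with the same shape but different leaf values get identical fingerprints. Note that pooling all keys at a given depth means the sketch does not distinguish which parent a key belongs to; whether that matters is one of the open design questions.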
I can think of a number of objections to both of these, but this should at least kick off the discussion :)
/cc @martijnvg