Fingerprinting Ingest Processor

A potentially useful processor is one that can generate one or more "fingerprints" from an incoming document.  This could aid in finding duplicates, detecting plagarism, or clustering similar documents together.

I think there are two realms of fingerprinting: content fingerprinting and structural fingerprinting.
### Content Fingerprinting

Hashes the content of fields to generate a fingerprint-per-field, and optionally, a fingerprint that represents all the fields.  Could use simple hashing, or perhaps something more sophisticated like MinHash, SimHash, [Winnowing](http://igm.univ-mlv.fr/~mac/ENS/DOC/sigmod03-1.pdf) or [Probabilistic Fingerprinting](http://www.cs.cmu.edu/~gveda/reports/IITK/cs497_DocFingerprinting.pdf).  

The API could look something like:

``` json
{
  "fingerprint": {
    "type": "content",
    "fields": ["foo", "bar"],
    "hash": "minhash",
    "hash_all": true
  }
}
```

E.g. specify the type of fingerprinting we want to do (`content`), a list of fields to hash, the style of hashing and if we should also hash all the hashes together.  The output would then be the document + new fingerprint fields:

``` json
{
  "foo": "...",
  "bar": "...",
  "fingerprint_foo": 1283891,
  "fingerprint_bar": 8372038,
  "fingerprint_all": 3817273
}
```
### Structural Fingerprinting

The other mode of fingerprinting could be structural in nature (this is the one I'm more interested in, tbh).  Instead of fingerprinting the content of fields, we are actually fingerprinting the structure of the document itself.  Essentially, we would recursively parse the JSON and hash the keys at each level in the JSON tree.  These hashes then become a fingerprint for the structure of the document.

Importantly, this type of fingerprinting ignores the leaf values...we just want to fingerprint the JSON keys themselves.

``` json
{
  "fingerprint": {
    "type": "structure",
    "root": ["foo"],
    "recursive": true,
    "hash": "murmur3",
    "hash_all": true
  }
}
```
- `root`: defines where to start recursing, in case you only care about a portion of the document.  Could be omitted or set to `"*"` to process the entire document
- `recursive` if you want the processor to fingerprint all the layers.  False if you just want the top-level of keys hashed.
- `hash`: murmur, minhash, etc
- `hash_all`: if all the hashes should be hashed together to build a final fingerprint

And the new document:

``` json
{
  "foo": {
    "bar": {
      "baz": "buzz"
    },
    "beep": {
      "boop": "bop"
    }
  },
  "fingerprint_level1": 001734,
  "fingerprint_level2": 992727,
  "fingerprint_level3": 110293,
  "fingerprint_all": 235240
}
```

Instead of a fingerprint-per-field, we now have one per "level".  

I can think of a number of objections to both of these, but this should at least kick off the discussion :)

/cc @martijnvg 


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fingerprinting Ingest Processor #16938

Content Fingerprinting

Structural Fingerprinting

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Fingerprinting Ingest Processor #16938

Description

Content Fingerprinting

Structural Fingerprinting

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions