Fingerprinting Ingest Processor #16938

@polyfractal

Description

A potentially useful processor is one that can generate one or more "fingerprints" from an incoming document. This could aid in finding duplicates, detecting plagiarism, or clustering similar documents together.

I think there are two realms of fingerprinting: content fingerprinting and structural fingerprinting.

Content Fingerprinting

Hashes the content of fields to generate a fingerprint per field and, optionally, a fingerprint that represents all the fields combined. This could use simple hashing, or something more sophisticated like MinHash, SimHash, Winnowing, or probabilistic fingerprinting.

The API could look something like:

{
  "fingerprint": {
    "type": "content",
    "fields": ["foo", "bar"],
    "hash": "minhash",
    "hash_all": true
  }
}

That is, we specify the type of fingerprinting we want (content), the list of fields to hash, the style of hashing, and whether we should also hash all of the hashes together. The output would then be the document plus new fingerprint fields:

{
  "foo": "...",
  "bar": "...",
  "fingerprint_foo": 1283891,
  "fingerprint_bar": 8372038,
  "fingerprint_all": 3817273
}
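
To make the content mode a bit more concrete, here's a rough, standalone Python sketch of the kind of thing such a processor might compute. This is purely illustrative (a real implementation would be a Java ingest processor); the function names and the toy MinHash-style folding are assumptions, not a proposed API:

import hashlib

def field_fingerprint(value, num_hashes=4):
    # Toy MinHash-style signature: shingle the value into word 3-grams,
    # keep the minimum hash per seed, and fold the minima into one integer.
    # A real MinHash would keep the whole signature rather than folding it.
    words = str(value).split()
    shingles = {" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))}
    fingerprint = 0
    for seed in range(num_hashes):
        minimum = min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in shingles
        )
        fingerprint ^= minimum
    return fingerprint

def content_fingerprints(doc, fields, hash_all=True):
    # Emulates the proposed output: fingerprint_<field> for each requested
    # field, plus fingerprint_all if hash_all is set.
    out = dict(doc)
    per_field = {f: field_fingerprint(doc[f]) for f in fields if f in doc}
    for f, fp in per_field.items():
        out[f"fingerprint_{f}"] = fp
    if hash_all and per_field:
        combined = "|".join(str(per_field[f]) for f in sorted(per_field))
        out["fingerprint_all"] = int.from_bytes(
            hashlib.blake2b(combined.encode(), digest_size=8).digest(), "big"
        )
    return out

print(content_fingerprints({"foo": "the quick brown fox", "bar": "hello world"}, ["foo", "bar"]))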

Structural Fingerprinting

The other mode of fingerprinting could be structural in nature (this is the one I'm more interested in, tbh). Instead of fingerprinting the content of fields, we are actually fingerprinting the structure of the document itself. Essentially, we would recursively parse the JSON and hash the keys at each level in the JSON tree. These hashes then become a fingerprint for the structure of the document.

Importantly, this type of fingerprinting ignores the leaf values entirely; we just want to fingerprint the JSON keys themselves.

{
  "fingerprint": {
    "type": "structure",
    "root": ["foo"],
    "recursive": true,
    "hash": "murmur3",
    "hash_all": true
  }
}
  • root: defines where to start recursing, in case you only care about a portion of the document. Could be omitted or set to "*" to process the entire document.
  • recursive: true if you want the processor to fingerprint all the layers; false if you just want the top level of keys hashed.
  • hash: murmur3, minhash, etc.
  • hash_all: whether all the hashes should be hashed together to build a final fingerprint.

And the new document:

{
  "foo": {
    "bar": {
      "baz": "buzz"
    },
    "beep": {
      "boop": "bop"
    }
  },
  "fingerprint_level1": 001734,
  "fingerprint_level2": 992727,
  "fingerprint_level3": 110293,
  "fingerprint_all": 235240
}

Instead of a fingerprint per field, we now have one per "level".
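
Again purely for illustration, a standalone Python sketch of the structural mode, assuming a breadth-first walk that hashes the key names at each depth and ignores leaf values (the root option is not modeled here, and the names are hypothetical):

import hashlib

def structure_fingerprints(doc, recursive=True):
    # Walk the JSON tree breadth-first and hash the set of keys at each depth.
    # Leaf values are never hashed, only the key names.
    fingerprints = {}
    level = 1
    current = [doc]
    while current:
        keys = sorted(k for obj in current for k in obj)
        if not keys:
            break
        digest = hashlib.blake2b("|".join(keys).encode(), digest_size=8).digest()
        fingerprints[f"fingerprint_level{level}"] = int.from_bytes(digest, "big")
        if not recursive:
            break
        # Descend into nested objects only; strings, numbers, arrays of scalars are dropped
        current = [v for obj in current for v in obj.values() if isinstance(v, dict)]
        level += 1
    if fingerprints:
        # hash_all equivalent: fold the per-level hashes into one final fingerprint
        combined = "|".join(str(fingerprints[k]) for k in sorted(fingerprints))
        fingerprints["fingerprint_all"] = int.from_bytes(
            hashlib.blake2b(combined.encode(), digest_size=8).digest(), "big"
        )
    return fingerprints

doc = {"foo": {"bar": {"baz": "buzz"}, "beep": {"boop": "bop"}}}
print({**doc, **structure_fingerprints(doc)})

Running this against the example document above produces one fingerprint per level (three here) plus a combined fingerprint_all, which is the shape of output sketched in the JSON.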

I can think of a number of objections to both of these, but this should at least kick off the discussion :)

/cc @martijnvg
