ingest-attachment support for per document indexed_chars limit #28942

@dadoonet

Coming from this discussion: https://discuss.elastic.co/t/how-to-control-the-indexed-chars-value-on-a-ingest-attachment-pipeline/123073/4

Today we support only a global indexed_chars processor parameter, but in some cases users would like to set this limit per document.
The mapper-attachments plugin used to support this by reading the limit from a meta field in the document sent for indexing.

Here is my proposal.
We should add a setting such as indexed_chars_field that names a document field from which the limit is read.

Then we could define a pipeline like this:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information. Used to parse pdf and office files",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars_field" : "size"
      }
    }
  ]
}

Then index either:

PUT index/doc/1?pipeline=attachment
{
  "data": "BASE64"
}

This will use the default value (or the one defined by indexed_chars).

Or:

PUT index/doc/2?pipeline=attachment
{
  "data": "BASE64",
  "size": 1000 
}
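
The lookup order implied by the two requests above can be sketched as follows. This is only an illustration of the proposal, not the plugin's code: the helper name resolve_indexed_chars is hypothetical, and the precedence (document field first, then the processor's indexed_chars, then the documented default of 100000 characters) is an assumption drawn from the examples in this issue.

```python
from typing import Any, Mapping

# Documented default of the attachment processor's indexed_chars parameter.
DEFAULT_INDEXED_CHARS = 100_000


def resolve_indexed_chars(
    processor_config: Mapping[str, Any],
    document: Mapping[str, Any],
) -> int:
    """Return the character limit for one document (hypothetical helper)."""
    # Start from the processor-wide limit, falling back to the default.
    limit = processor_config.get("indexed_chars", DEFAULT_INDEXED_CHARS)

    # Proposed setting: the name of a document field holding a per-document limit.
    field_name = processor_config.get("indexed_chars_field")
    if field_name is not None and document.get(field_name) is not None:
        # Assumption: a value present in the document overrides the processor-wide limit.
        limit = document[field_name]
    return int(limit)


# Processor configuration and documents taken from the requests in this issue.
pipeline_processor = {"field": "data", "indexed_chars_field": "size"}
doc_1 = {"data": "BASE64"}
doc_2 = {"data": "BASE64", "size": 1000}

limit_1 = resolve_indexed_chars(pipeline_processor, doc_1)  # no "size" field in the document
limit_2 = resolve_indexed_chars(pipeline_processor, doc_2)  # "size" is read from the document
```

With this ordering, a document that omits the field behaves exactly as it does today, so existing pipelines would be unaffected.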

I hope to propose a PR for it soon, unless in the meantime someone rejects this feature request or proposes another implementation.
