
ingest: bulk scripted_upsert runs the script after the pipeline #36745

@jakelandis

Description

#36618 allows a default pipeline to be used with bulk upserts. However, the behavior of a bulk scripted_upsert with a default pipeline is surprising.

Given an index with a default pipeline:

DELETE test
PUT test
{
  "settings": {
    "index.default_pipeline": "bytes"
  }
}
PUT _ingest/pipeline/bytes
{
  "processors": [
    {
      "bytes": {
        "field": "bytes"
      }
    }
  ]
}

Performing a non-bulk upsert works as expected:

POST test/doc/1/_update
{
  "scripted_upsert": true,
  "script": {
    "source": "ctx._source.bytes = '1kb'"
  },
  "upsert": {
    "foo": "bar"
  }
}
GET test/doc/1

results in:

{
...
  "_source" : {
    "bytes" : 1024,
    "foo" : "bar"
  }
}

The script was evaluated first, then the ingest pipeline ran normally. This matches the expectation that the script is always executed.
However, the same request issued through the _bulk API behaves surprisingly.

POST _bulk
{"update":{"_id":"2","_index":"test","_type":"_doc"}}
{"script": "ctx._source.bytes = '1kb'", "upsert":{"foo":"bar"}, "scripted_upsert" : true}

Results in:

{
  "took" : 0,
  "ingest_took" : 7,
  "errors" : true,
  "items" : [
    {
      "index" : {
        "_index" : null,
        "_type" : null,
        "_id" : null,
        "status" : 500,
        "error" : {
          "type" : "exception",
          "reason" : "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [bytes] not present as part of path [bytes]",
          "caused_by" : {
            "type" : "illegal_argument_exception",
            "reason" : "java.lang.IllegalArgumentException: field [bytes] not present as part of path [bytes]",
            "caused_by" : {
              "type" : "illegal_argument_exception",
              "reason" : "field [bytes] not present as part of path [bytes]"
            }
          },
          "header" : {
            "processor_type" : "bytes"
          }
        }
      }
    }
  ]
}

This is because, in the bulk path, the script only executes AFTER the pipeline. The script does still run, but since it runs after the pipeline, any data it computes is not available to the ingest pipeline.
For example, moving the data that the processor cares about into the upsert document (and out of the script) makes it work as expected:

POST _bulk
{"update":{"_id":"2","_index":"test","_type":"_doc"}}
{"script": "ctx._source.foo = 'bar'", "upsert":{"bytes":"1kb"}, "scripted_upsert" : true}
