Highlighting Error with span_field_masking Requires Indexing Offsets Unexpectedly #101804

@ahoogol

Elasticsearch Version

8.10.4

Installed Plugins

No response

Java Version

bundled

OS Version

Elastic Cloud - GCP - Iowa (us-central1)

Problem Description

I encountered an issue when using the span_field_masking feature in Elasticsearch. When attempting to use the highlighter with this feature, the following error is thrown:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "field 'text' was indexed without offsets, cannot highlight"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "test_mask",
        "node": "jUZ9p0ZtR6-xYevegW6O_Q",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "field 'text' was indexed without offsets, cannot highlight"
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "field 'text' was indexed without offsets, cannot highlight",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "field 'text' was indexed without offsets, cannot highlight"
      }
    }
  },
  "status": 400
}

Highlighting works as expected if I set "index_options": "offsets" in the mapping of the masked field 'stem'. However, I don't understand why the highlighter requires indexed offsets in this case instead of re-analyzing the text to compute them at search time. My concern is that indexing offsets increases the index size, which I want to avoid.
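
For reference, a minimal sketch of the workaround mapping, assuming the same fields and analyzers as in the reproduction steps. The index name `test_mask_offsets` is hypothetical, and the only substantive change is the `index_options` line on `stem`. As far as I know, `index_options` cannot be changed on an existing field, so this would likely mean creating a new index and reindexing:

```console
PUT test_mask_offsets
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "whitespace"
      },
      "stem": {
        "type": "text",
        "analyzer": "whitespace",
        "index_options": "offsets"
      }
    }
  }
}
```

The trade-off is the larger index mentioned above, since offsets are then stored in the postings of `stem`.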

Steps to Reproduce

PUT test_mask
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "whitespace"
      },
      "stem": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}

PUT test_mask/_doc/1
{
  "text": "a _ a b",
  "stem": "_ b _ _"
}

GET test_mask/_search
{
  "query": {
    "span_near": {
      "clauses": [
        {
          "span_term": {
            "text": {
              "value": "a"
            }
          }
        },
        {
          "span_field_masking": {
            "field": "text", 
            "query": {
              "span_term": {
                "stem": {
                  "value": "b"
                }
              }
            }
          }
        }
      ],
      "slop": 0,
      "in_order": true
    }
  },
  "highlight": {
    "pre_tags": "(", 
    "post_tags": ")", 
    "fields": {
      "*": {}
    },
    "type": "unified"
  }
}

Expected result

I was expecting the highlight to look like this:

"highlight": {
  "text": [
    "(a) (_) a b"
  ]
}
