Skip to content

Conversation

@jimczi
Copy link
Contributor

@jimczi jimczi commented May 12, 2023

This PR adapts the unified highlighter to use the Weight#matches mode by default when possible. This is the default mode in Lucene for some time now. For cases where the matches mode won't work (nested and parent-child queries),
the matches mode is disabled automatically.
I didn't expose an option to explicitly disable this mode because that should be seen as an internal implementation detail. With this change, matches that span multiple terms are highlighted together (something that users asked for years) and the clauses that don't match the document are ignored.

Note that this new mode is enabled only when require_field_match is true and the query doesn't contain:

  • A nested query.
  • A parent-join query.
  • A query that targets a runtime field.

Closes #29561

This PR adapts the unified highlighter to use the Weight#matches API by default when possible.
This is the default mode in Lucene for some time now. For cases where the matches API won't work (nested and parent-child queries),
 the matches mode is disabled automatically.
I didn't expose an  option to explicitly disable this mode because that should be seen as an internal implementation detail.
With this change, matches that span multiple terms are highlighted together (something that users asked for years) and the
clauses that don't match the document are ignored.
@jimczi jimczi added >enhancement :Search Relevance/Highlighting How a query matched a document Team:Search Meta label for search team v8.9.0 labels May 12, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Do you think we need to worry about backwards compatibility, or can we just count this is an improvement?

@jimczi
Copy link
Contributor Author

jimczi commented May 12, 2023

Do you think we need to worry about backwards compatibility, or can we just count this is an improvement?

That would be my preference yep, happy to discuss (and adapt) if we think it should be considered differently.

@jimczi
Copy link
Contributor Author

jimczi commented May 12, 2023

@romseygeek I disabled the Matches mode if a runtime field is queried or if require_field_match is set to false. In both scenario we're unable to extract the offsets from the original weight. I don't see how we could handle these cases with the current implementation but that shouldn't prevent this change imo.

@jimczi
Copy link
Contributor Author

jimczi commented Aug 8, 2023

@romseygeek, as discussed I added an undocumented index setting to disable the weight matches.
Can you take another look?

@jimczi jimczi changed the title Use the Weight#matches API for highlighting by default Use the Weight#matches mode for highlighting by default Aug 8, 2023
@jimczi jimczi merged commit 28a504d into elastic:main Aug 9, 2023
@jimczi jimczi deleted the unified_highlighter_matches branch August 9, 2023 01:44
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Aug 9, 2023
jimczi added a commit that referenced this pull request Aug 9, 2023
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Aug 9, 2023
@TristanMa
Copy link

@jimczi I know the likelihood is low but is there any chance we can get this ER backported to 8.4? I have a customer who runs a multi-tenant cluster and cannot perform the upgrade to 8.10 in the near term and they have an escalated customer situation on their end.

I've set expectations that the chance of a backport is low but since you mentioned it as a possibility in an earlier comment just wanted to follow up.

@jimczi
Copy link
Contributor Author

jimczi commented Feb 13, 2024

I know the likelihood is low but is there any chance we can get this ER backported to 8.4?

Unfortunately not since 8.4.x is not released anymore.

@TristanMa
Copy link

@jimczi thanks for the quick response!

gustavolimav pushed a commit to gustavolimav/liferay-portal that referenced this pull request Feb 27, 2024
…asticsearch version

In elasticsearch 8.10.2 and higher versions highligh behaviour is different
more info here: elastic/elasticsearch#96068

https://liferay.atlassian.net/browse/LPD-2141
gustavolimav pushed a commit to gustavolimav/liferay-portal that referenced this pull request Feb 27, 2024
…asticsearch version

In elasticsearch 8.10.2 and higher versions highligh behaviour is different
more info here: elastic/elasticsearch#96068

https://liferay.atlassian.net/browse/LPD-2141
gustavolimav pushed a commit to gustavolimav/liferay-portal that referenced this pull request Feb 28, 2024
…asticsearch version

In elasticsearch 8.10.2 and higher versions highligh behaviour is different
more info here: elastic/elasticsearch#96068

https://liferay.atlassian.net/browse/LPD-2141
gustavolimav pushed a commit to gustavolimav/liferay-portal that referenced this pull request Feb 28, 2024
…asticsearch version

In elasticsearch 8.10.2 and higher versions highligh behaviour is different
more info here: elastic/elasticsearch#96068

https://liferay.atlassian.net/browse/LPD-2141
liferay-continuous-integration pushed a commit to liferay-continuous-integration/liferay-portal that referenced this pull request Mar 1, 2024
…asticsearch version

In elasticsearch 8.10.2 and higher versions highligh behaviour is different
more info here: elastic/elasticsearch#96068

https://liferay.atlassian.net/browse/LPD-2141
brianchandotcom pushed a commit to brianchandotcom/liferay-portal that referenced this pull request Mar 5, 2024
…asticsearch version

In elasticsearch 8.10.2 and higher versions highligh behaviour is different
more info here: elastic/elasticsearch#96068

https://liferay.atlassian.net/browse/LPD-2141
@CarlosLoboZamarro
Copy link

CarlosLoboZamarro commented Feb 6, 2025

Hello @jimczi, I've identified an error that I believe is related to this improvement.

The error started occurring from version 8.10.0, with version 8.9.2 being the last one that works correctly.

The issue happens when a search is performed using the unified highlight type on a field containing an array of texts, and the search matches the end and beginning of two different phrases within the array.

I'll explain it with an example.

(These tests are performed in version 8.17.0. But the same issue also occurs in version 8.10.0)

Create index:

  • It has a single field field1, and we set position_increment_gap to 0:
PUT error-case
{
  "settings": {
    "number_of_shards": "1",
    "number_of_replicas": "1"
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {      
      "field1": {
        "type": "text",
        "position_increment_gap": 0
      }
    }
  }
}

Create docuemnt 1:

PUT error-case/_doc/1
{
    "field1": [
     	"-(GRITAI)",
     	"Do you know anything about dynamite?",
     	"-Of course.",
     	"Make sure this blows up the rails."
    ]
}

Create document 2:

PUT error-case/_doc/2
{
    "field1": [
     	"If you say things to a person that turn out not to be true",
     	"the person is inclined to think that he or she has been deceived."
    ]
}

We perform a simple search:

GET error-case/_search
{
  "query": {
    "match_phrase": {
      "field1": {
        "query": "blows up the rails"
      }
    }
  },
  "highlight": {
    "fields": {
      "field1": {
        "type": "unified"
      }
    }
  }
}

Response OK:

  • It retrieves the highlighted text over the entire match instead of doing it for each individual word.
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 2.526125,
    "hits": [
      {
        "_index": "error-case",
        "_id": "1",
        "_score": 2.526125,
        "_source": {
          "field1": [
            "-(GRITAI)",
            "Do you know anything about dynamite?",
            "-Of course.",
            "Make sure this blows up the rails."
          ]
        },
        "highlight": {
          "field1": [
            "Make sure this <em>blows up the rails</em>."
          ]
        }
      }
    ]
  }
}

We perform a search that matches the end and beginning of the phrases in doc 2 to be true the person:
[
"If you say things to a person that turn out not to be true",
"the person is inclined to think that he or she has been deceived."
]

Request:

GET error-case/_search
{
  "query": {
    "match_phrase": {
      "field1": {
        "query": "to be true the person"
      }
    }
  },
  "highlight": {
    "fields": {
      "field1": {
        "type": "unified"
      }
    }
  }
}

Response ERROR:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_state_exception",
        "reason": "first() should not be called in this context"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "error-case",
        "node": "yooSs9yzSXCWHCpaxvpC_g",
        "reason": {
          "type": "illegal_state_exception",
          "reason": "first() should not be called in this context"
        }
      }
    ],
    "caused_by": {
      "type": "illegal_state_exception",
      "reason": "first() should not be called in this context",
      "caused_by": {
        "type": "illegal_state_exception",
        "reason": "first() should not be called in this context"
      }
    }
  },
  "status": 500
}

The error in the log:

[2025-02-06T12:41:14,782][WARN ][r.suppressed             ] [prees01] path: /error-case/_search, params: {pretty=true, index=error-case}, status: 500
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:693) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:383) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:725) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:476) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:337) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:64) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:75) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:32) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:54) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:647) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.transport.TransportService$UnregisterChildTransportResponseHandler.handleException(TransportService.java:1786) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1510) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1644) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1619) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:44) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:45) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:152) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:29) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:34) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1023) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27) ~[elasticsearch-8.17.0.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1575) ~[?:?]
Caused by: org.elasticsearch.ElasticsearchException$1: first() should not be called in this context
	at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:706) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:381) ~[elasticsearch-8.17.0.jar:?]
	... 22 more
Caused by: java.lang.IllegalStateException: first() should not be called in this context
	at org.elasticsearch.lucene.search.uhighlight.BoundedBreakIteratorScanner.first(BoundedBreakIteratorScanner.java:165) ~[elasticsearch-8.17.0.jar:?]
	at org.apache.lucene.search.uhighlight.SplittingBreakIterator.following(SplittingBreakIterator.java:183) ~[lucene-highlighter-9.12.0.jar:?]
	at org.apache.lucene.search.uhighlight.FieldHighlighter.highlightOffsetsEnums(FieldHighlighter.java:186) ~[lucene-highlighter-9.12.0.jar:?]
	at org.elasticsearch.lucene.search.uhighlight.CustomFieldHighlighter.highlightOffsetsEnums(CustomFieldHighlighter.java:118) ~[elasticsearch-8.17.0.jar:?]
	at org.apache.lucene.search.uhighlight.FieldHighlighter.highlightFieldForDoc(FieldHighlighter.java:86) ~[lucene-highlighter-9.12.0.jar:?]
	at org.elasticsearch.lucene.search.uhighlight.CustomFieldHighlighter.highlightFieldForDoc(CustomFieldHighlighter.java:75) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.lucene.search.uhighlight.CustomUnifiedHighlighter.highlightField(CustomUnifiedHighlighter.java:152) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.search.fetch.subphase.highlight.DefaultHighlighter.highlight(DefaultHighlighter.java:87) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.search.fetch.subphase.highlight.HighlightPhase$1.process(HighlightPhase.java:70) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.search.fetch.FetchPhase$1.nextDoc(FetchPhase.java:182) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.search.fetch.FetchPhaseDocsIterator.iterate(FetchPhaseDocsIterator.java:90) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.search.fetch.FetchPhase.buildSearchHits(FetchPhase.java:194) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:84) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:774) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:719) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:573) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:79) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:76) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:101) ~[elasticsearch-8.17.0.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27) ~[elasticsearch-8.17.0.jar:?]
	... 6 more

The error does not occur if we set "number_of_fragments": 0, but it returns a result that is not as expected.

GET error-case/_search
{
  "query": {
    "match_phrase": {
      "field1": {
        "query": "to be true the person"
      }
    }
  },
  "highlight": {
    "fields": {
      "field1": {
        "type": "unified",
        "number_of_fragments": 0
      }
    }
  }
}

Reponse:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 2.6749716,
    "hits": [
      {
        "_index": "error-case",
        "_id": "2",
        "_score": 2.6749716,
        "_source": {
          "field1": [
            "If you say things to a person that turn out not to be true",
            "the person is inclined to think that he or she has been deceived."
          ]
        },
        "highlight": {
          "field1": [
            """If you say things to a person that turn out not <em>to be true[Null]the person</em>"""
          ]
        }
      }
    ]
  }
}

error_case_null

On the other hand, if we run the same query by with "require_field_match": false the result is the expected because in this case the unified mode witch it is use is the old:

GET error-case/_search
{
  "query": {
    "match_phrase": {
      "field1": {
        "query": "to be true the person"
      }
    }
  },
  "highlight": {
    "fields": {
      "field1": {
        "type": "unified",
        "number_of_fragments": 0,
        "require_field_match": false
      }
    }
  }
}

Response:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 2.6749716,
    "hits": [
      {
        "_index": "error-case",
        "_id": "2",
        "_score": 2.6749716,
        "_source": {
          "field1": [
            "If you say things to a person that turn out not to be true",
            "the person is inclined to think that he or she has been deceived."
          ]
        },
        "highlight": {
          "field1": [
            "If you say things to a person that turn out not <em>to</em> <em>be</em> <em>true</em>",
            "<em>the</em> <em>person</em> is inclined to think that he or she has been deceived."
          ]
        }
      }
    ]
  }
}

If we run the same test in version 8.9.2, the errors do not occur and the expected result is obtained by highlighting the end and beginning of the two sentences of the array:

GET error-case/_search
{
  "query": {
    "match_phrase": {
      "field1": {
        "query": "to be true the person"
      }
    }
  },
  "highlight": {
    "fields": {
      "field1": {
      }
    }
  }
} 

Response OK

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 2.6749716,
    "hits": [
      {
        "_index": "error-case",
        "_id": "2",
        "_score": 2.6749716,
        "_source": {
          "field1": [
            "If you say things to a person that turn out not to be true",
            "the person is inclined to think that he or she has been deceived."
          ]
        },
        "highlight": {
          "field1": [
            "If you say things to a person that turn out not <em>to</em> <em>be</em> <em>true</em>",
            "<em>the</em> <em>person</em> is inclined to think that he or she has been deceived."
          ]
        }
      }
    ]
  }
}

Could you please tell me if this is a bug?
If so, is there any workaround, or could it be resolved in version 8.17.x?

On the other hand, the highlighted result is different when the query includes a nested type or not. This implies that when queries are made that should return the same highlighted field, they are not doing so.
To avoid that inconsistency in the results, it would be good to be able to set whether to use one mode of unified or another.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement :Search Relevance/Highlighting How a query matched a document Team:Search Meta label for search team v8.10.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Highlighter breaks phrases

6 participants