Skip to content

Conversation

@jimczi
Copy link
Contributor

@jimczi jimczi commented Nov 25, 2019

By default the unified highlighter splits the input into passages using
a sentence break iterator. However we don't check if the field is tokenized
or not so keyword field also applies the break iterator even though they can
only match on the entire content. This means that by default we'll split the
content of a keyword field on sentence break if the requested number of fragments
is set to a value different than 0 (default to 5). This commit changes this behavior
to ignore the break iterator on non-tokenized fields (keyword) in order to always
highlight the entire values. The number of requested fragments controls the number of
matched values that are returned but the boundary_scanner_type is now ignored.
Note that this is the behavior in 6x but some refactoring of the Lucene's highlighter
exposed this bug in Elasticsearch 7x.

By default the unified highlighter splits the input into passages using
a sentence break iterator. However we don't check if the field is tokenized
or not so `keyword` field also applies the break iterator even though they can
only match on the entire content. This means that by default we'll split the
content of a `keyword` field on sentence break if the requested number of fragments
is set to a value different than 0 (default to 5). This commit changes this behavior
to ignore the break iterator on non-tokenized fields (keyword) in order to always
highlight the entire values. The number of requested fragments control the number of
matched values are returned but the boundary_scanner_type is now ignored.
Note that this is the behavior in 6x but some refactoring of the Lucene's highlighter
exposed this bug in Elasticsearch 7x.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Highlighting)

Copy link
Contributor

@mayya-sharipova mayya-sharipova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jimczi , makes sense

@jimczi jimczi merged commit 871408f into elastic:master Dec 4, 2019
jimczi added a commit that referenced this pull request Dec 4, 2019
By default the unified highlighter splits the input into passages using
a sentence break iterator. However we don't check if the field is tokenized
or not so `keyword` field also applies the break iterator even though they can
only match on the entire content. This means that by default we'll split the
content of a `keyword` field on sentence break if the requested number of fragments
is set to a value different than 0 (default to 5). This commit changes this behavior
to ignore the break iterator on non-tokenized fields (keyword) in order to always
highlight the entire values. The number of requested fragments control the number of
matched values are returned but the boundary_scanner_type is now ignored.
Note that this is the behavior in 6x but some refactoring of the Lucene's highlighter
exposed this bug in Elasticsearch 7x.
jimczi added a commit that referenced this pull request Dec 4, 2019
jimczi added a commit that referenced this pull request Dec 4, 2019
jimczi added a commit that referenced this pull request Dec 4, 2019
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
By default the unified highlighter splits the input into passages using
a sentence break iterator. However we don't check if the field is tokenized
or not so `keyword` field also applies the break iterator even though they can
only match on the entire content. This means that by default we'll split the
content of a `keyword` field on sentence break if the requested number of fragments
is set to a value different than 0 (default to 5). This commit changes this behavior
to ignore the break iterator on non-tokenized fields (keyword) in order to always
highlight the entire values. The number of requested fragments control the number of
matched values are returned but the boundary_scanner_type is now ignored.
Note that this is the behavior in 6x but some refactoring of the Lucene's highlighter
exposed this bug in Elasticsearch 7x.
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants