-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add BreakIteratorBoundaryScanner support #23248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This commit adds a boundary_scanner property to the search highlight
request (defaults to "simple", per current behavior) so the user can
specify "break_iterator" and its type ("sentence", "word", "line" and
"character"). Also adds "boundary_scanner_locale" to define which one
should be used when highlighting the text.
|
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
jimczi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shaie
The change looks good, I left some comments regarding the BWC. Also the docs are missing. Can you add a section in the docs explaining how these new options can be used and their limitations ?
| order(in.readOptionalWriteable(Order::readFromStream)); | ||
| highlightFilter(in.readOptionalBoolean()); | ||
| forceSource(in.readOptionalBoolean()); | ||
| boundaryScannerType(in.readOptionalWriteable(BoundaryScannerType::readFromStream)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this PR can go to 5.x can you add a check on the version, like this:
if (in.getVersion().onOrAfter(Version.V_5_4_0_UNRELEASED)) {
boundaryScannerType(in.readOptionalWriteable(BoundaryScannerType::readFromStream));
}
This will ensure that a node in 5.4 can safely send request to a 5.3 node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I didn't see this pattern in the files I changed, so had no idea :). I will do that in all 4 places you commented on.
| if (in.readBoolean()) { | ||
| boundaryChars(in.readString().toCharArray()); | ||
| } | ||
| boundaryScannerLocale(in.readOptionalString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
| out.writeOptionalWriteable(order); | ||
| out.writeOptionalBoolean(highlightFilter); | ||
| out.writeOptionalBoolean(forceSource); | ||
| out.writeOptionalWriteable(boundaryScannerType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and here ...
| out.writeBoolean(hasBoundaryScannerLocale); | ||
| if (hasBoundaryScannerLocale) { | ||
| out.writeString(boundaryScannerLocale.toLanguageTag()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and the last one ;)
| } | ||
|
|
||
| public static enum BoundaryScannerType implements Writeable { | ||
| SIMPLE, BREAK_ITERATOR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BREAK_ITERATOR is not reflecting what this type is doing. Maybe just SENTENCE ? Or SENTENCE_BREAK ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually exactly what it means. As I wrote in the PR comment, I wanted to get initial feedback. Now that I see you're OK with the general approach, I want to add a "break_iterator_type" property which defaults to "SENTENCE" but also supports "WORD", "CHARACTER" and "LINE".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it thanks. Though we could avoid creating another option and merge the logic in a single one ? The boundary scanner type could be any of:
- simple
- sentence
- word
The fact that we use a break iterator for sentence and word is just an implementation detail ?
Although "CHARACTER" seems weird, should we really expose it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we have to expose all options, was just thinking "if Lucene supports it ..". But having said that, from an API perspective this may be weird, so I'll take your proposal and add "boundary_scanner" with values "simple", "sentence" and "word". I'll do that.
| if (boundaryScannerType.toUpperCase(Locale.ROOT).equals(BREAK_ITERATOR.name())) { | ||
| return BoundaryScannerType.BREAK_ITERATOR; | ||
| } | ||
| return SIMPLE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's too lenient, we should fail if the type is unknown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed the same approach as in Order though I agree with you. Just thought this is how ES prefers to do things. I can simplify the method to use valueOf(boundaryScannerType.toUpperCase(Locale.ROOT))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
|
@jimczi I believe I've addressed all your comments. Hope I addressed the BW ones as you intended! |
|
@jimczi I only now noticed your comment about adding something to the docs. Can you point me to where that is? |
|
Thanks @shaie ! You can check the current docs for the FVH here: https://github.com/elastic/elasticsearch/blob/master/docs/reference/search/request/highlighting.asciidoc#fast-vector-highlighter |
|
@jimczi I updated the docs and also renamed |
jimczi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @shaie !
I agree chars for the boundary scanner type makes more sense.
I'll check that the build is ok and that the tests pass.
|
@elasticmachine please test it |
|
@jimczi, the build failed but seems weird exception, something about Git plugin .. |
|
Yes it's just a problem with our CI. I'll test locally :( |
|
K thanks @jimczi. Let me know if there are any issues, though I did run some tests locally and they passed (ones I thought are related to the change). |
|
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
| boolean hasBoundaryScannerLocale = boundaryScannerLocale != null; | ||
| out.writeBoolean(hasBoundaryScannerLocale); | ||
| if (hasBoundaryScannerLocale) { | ||
| out.writeString(boundaryScannerLocale.toLanguageTag()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out.writeOptionalString() since you use in.readOptionalString in the readFrom ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, notice that since boundaryScannerLocale is a Locale, I first write a boolean, and then write the String if it's not null. If I wanted to use writeOptionalString I would need to do something like:
out.writeOptionalString(boundaryScannerLocale == null ? null : boundaryScannerLocale.toLanguageTag())
Do you think that's preferred?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for reference @jimczi, I borrowed this pattern from the code just above those lines which handles boundaryChars:
boolean hasBounaryChars = boundaryChars != null;
out.writeBoolean(hasBounaryChars);
if (hasBounaryChars) {
out.writeString(String.valueOf(boundaryChars));
}
So before I change the code I wrote, does this change your opinion? not that I mind changing the code, just prefer the code to be consistent with the rest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind as long as we use writeString/readString and writeOptionalString/readOptionalString consistently.
So you can maybe just change the readFrom to explicitly use readBoolean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK done.
| } | ||
| } | ||
|
|
||
| public static enum BoundaryScannerType implements Writeable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes our code checker unhappy since enum are always static in this context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point!
jimczi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The build fails for me. I've left some comments regarding the failing codes. You can test your change locally with gradle clean check to make sure that the build is ok.
| if (randomBoolean()) { | ||
| highlightBuilder.boundaryScannerType(randomFrom(BoundaryScannerType.values())); | ||
| } else { | ||
| // also test the string setter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this fails the build, the toString is not correctly applied
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, fixed.
| } | ||
| if (fieldOptions.boundaryScannerType == null) { | ||
| fieldOptions.boundaryScannerType = globalOptions.boundaryScannerType; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The merge of the boundaryScannerLocale is missing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that. Fixed.
|
I'm running |
|
@jimczi I pushed a commit which addresses all your comments and all tests pass. Actually, when running I can rerun, but if you want to start the tests too, wanted to let you know. |
|
Thanks @shaie ! |
This commit adds a boundary_scanner property to the search highlight request so the user can specify different boundary scanners: * `chars` (default, current behavior) * `word` Use a WordBreakIterator * `sentence` Use a SentenceBreakIterator This commit also adds "boundary_scanner_locale" to define which locale should be used when scanning the text.
|
Thanks @jimczi! |
* master: (26 commits) CLI: Fix prompting for yes/no to handle console returning null (elastic#23320) Tests: Fix reproduce line for packagingTest (elastic#23365) Build: Remove extra copies of netty license (elastic#23361) [TEST] Removes timeout based wait_for_active_shards REST test (elastic#23360) [TEST] increase timeout slightly in wait_for_active_shards test to allow for index creation cluster state update to be processed before ensuring the wait times out Handle snapshot repository's missing index.latest Adding equals/hashCode to MainResponse (elastic#23352) Always restore the ThreadContext for operations delayed due to a block (elastic#23349) Add support for named xcontent parsers to high level REST client (elastic#23328) Add unit tests for ParentToChildAggregator (elastic#23305) Fix after last merge with master and apply last comments [INGEST] Lazy load the geoip databases. disable BWC tests for the highlighters, need a new 5.x build to make it work Expose WordDelimiterGraphTokenFilter (elastic#23327) Test that buildCredentials returns correct clazz (elastic#23334) Add BreakIteratorBoundaryScanner support for FVH (elastic#23248) Prioritize listing index-N blobs over index.latest in reading snapshots (elastic#23333) Test: Fix hdfs test fixture setup on windows delete and index tests can share some part of the code Remove createSampleDocument method and use the sync'ed index method ...
This commit adds a boundary_scanner property to the search highlight
request (defaults to "simple", per current behavior) so the user can
specify "break_iterator" and its type ("sentence", "word", "line" and
"character"). Also adds "boundary_scanner_locale" to define which one
should be used when highlighting the text.