Skip to content

Conversation

@shaie
Copy link
Contributor

@shaie shaie commented Feb 19, 2017

This commit adds a boundary_scanner property to the search highlight
request (defaults to "simple", per current behavior) so the user can
specify "break_iterator" and its type ("sentence", "word", "line" and
"character"). Also adds "boundary_scanner_locale" to define which one
should be used when highlighting the text.

This commit adds a boundary_scanner property to the search highlight
request (defaults to "simple", per current behavior) so the user can
specify "break_iterator" and its type ("sentence", "word", "line" and
"character"). Also adds "boundary_scanner_locale" to define which one
should be used when highlighting the text.
@elasticmachine
Copy link
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@jimczi jimczi self-requested a review February 20, 2017 11:09
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shaie
The change looks good, I left some comments regarding the BWC. Also the docs are missing. Can you add a section in the docs explaining how these new options can be used and their limitations ?

order(in.readOptionalWriteable(Order::readFromStream));
highlightFilter(in.readOptionalBoolean());
forceSource(in.readOptionalBoolean());
boundaryScannerType(in.readOptionalWriteable(BoundaryScannerType::readFromStream));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR can go to 5.x can you add a check on the version, like this:

 if (in.getVersion().onOrAfter(Version.V_5_4_0_UNRELEASED)) { 
    boundaryScannerType(in.readOptionalWriteable(BoundaryScannerType::readFromStream));
}

This will ensure that a node in 5.4 can safely send request to a 5.3 node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I didn't see this pattern in the files I changed, so had no idea :). I will do that in all 4 places you commented on.

if (in.readBoolean()) {
boundaryChars(in.readString().toCharArray());
}
boundaryScannerLocale(in.readOptionalString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

out.writeOptionalWriteable(order);
out.writeOptionalBoolean(highlightFilter);
out.writeOptionalBoolean(forceSource);
out.writeOptionalWriteable(boundaryScannerType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here ...

out.writeBoolean(hasBoundaryScannerLocale);
if (hasBoundaryScannerLocale) {
out.writeString(boundaryScannerLocale.toLanguageTag());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and the last one ;)

}

public static enum BoundaryScannerType implements Writeable {
SIMPLE, BREAK_ITERATOR;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BREAK_ITERATOR is not reflecting what this type is doing. Maybe just SENTENCE ? Or SENTENCE_BREAK ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually exactly what it means. As I wrote in the PR comment, I wanted to get initial feedback. Now that I see you're OK with the general approach, I want to add a "break_iterator_type" property which defaults to "SENTENCE" but also supports "WORD", "CHARACTER" and "LINE".

Copy link
Contributor

@jimczi jimczi Feb 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it thanks. Though we could avoid creating another option and merge the logic in a single one ? The boundary scanner type could be any of:

  • simple
  • sentence
  • word

The fact that we use a break iterator for sentence and word is just an implementation detail ?
Although "CHARACTER" seems weird, should we really expose it ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have to expose all options, was just thinking "if Lucene supports it ..". But having said that, from an API perspective this may be weird, so I'll take your proposal and add "boundary_scanner" with values "simple", "sentence" and "word". I'll do that.

if (boundaryScannerType.toUpperCase(Locale.ROOT).equals(BREAK_ITERATOR.name())) {
return BoundaryScannerType.BREAK_ITERATOR;
}
return SIMPLE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's too lenient, we should fail if the type is unknown.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the same approach as in Order though I agree with you. Just thought this is how ES prefers to do things. I can simplify the method to use valueOf(boundaryScannerType.toUpperCase(Locale.ROOT))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

@shaie
Copy link
Contributor Author

shaie commented Feb 22, 2017

@jimczi I believe I've addressed all your comments. Hope I addressed the BW ones as you intended!

@shaie
Copy link
Contributor Author

shaie commented Feb 22, 2017

@jimczi I only now noticed your comment about adding something to the docs. Can you point me to where that is?

@jimczi
Copy link
Contributor

jimczi commented Feb 22, 2017

Thanks @shaie ! You can check the current docs for the FVH here: https://github.com/elastic/elasticsearch/blob/master/docs/reference/search/request/highlighting.asciidoc#fast-vector-highlighter
I'll review your changes shortly.

@shaie
Copy link
Contributor Author

shaie commented Feb 23, 2017

@jimczi I updated the docs and also renamed SIMPLE to CHARS, which I think makes more sense?

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @shaie !
I agree chars for the boundary scanner type makes more sense.
I'll check that the build is ok and that the tests pass.

@jimczi
Copy link
Contributor

jimczi commented Feb 23, 2017

@elasticmachine please test it

@shaie
Copy link
Contributor Author

shaie commented Feb 23, 2017

@jimczi, the build failed but seems weird exception, something about Git plugin ..

@jimczi
Copy link
Contributor

jimczi commented Feb 23, 2017

Yes it's just a problem with our CI. I'll test locally :(

@shaie
Copy link
Contributor Author

shaie commented Feb 23, 2017

K thanks @jimczi. Let me know if there are any issues, though I did run some tests locally and they passed (ones I thought are related to the change).

@elasticmachine
Copy link
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

boolean hasBoundaryScannerLocale = boundaryScannerLocale != null;
out.writeBoolean(hasBoundaryScannerLocale);
if (hasBoundaryScannerLocale) {
out.writeString(boundaryScannerLocale.toLanguageTag());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out.writeOptionalString() since you use in.readOptionalString in the readFrom ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, notice that since boundaryScannerLocale is a Locale, I first write a boolean, and then write the String if it's not null. If I wanted to use writeOptionalString I would need to do something like:

out.writeOptionalString(boundaryScannerLocale == null ? null : boundaryScannerLocale.toLanguageTag())

Do you think that's preferred?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for reference @jimczi, I borrowed this pattern from the code just above those lines which handles boundaryChars:

        boolean hasBounaryChars = boundaryChars != null;
        out.writeBoolean(hasBounaryChars);
        if (hasBounaryChars) {
            out.writeString(String.valueOf(boundaryChars));
        }

So before I change the code I wrote, does this change your opinion? not that I mind changing the code, just prefer the code to be consistent with the rest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind as long as we use writeString/readString and writeOptionalString/readOptionalString consistently.
So you can maybe just change the readFrom to explicitly use readBoolean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK done.

}
}

public static enum BoundaryScannerType implements Writeable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes our code checker unhappy since enum are always static in this context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point!

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build fails for me. I've left some comments regarding the failing codes. You can test your change locally with gradle clean check to make sure that the build is ok.

if (randomBoolean()) {
highlightBuilder.boundaryScannerType(randomFrom(BoundaryScannerType.values()));
} else {
// also test the string setter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fails the build, the toString is not correctly applied

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, fixed.

}
if (fieldOptions.boundaryScannerType == null) {
fieldOptions.boundaryScannerType = globalOptions.boundaryScannerType;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The merge of the boundaryScannerLocale is missing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that. Fixed.

@shaie
Copy link
Contributor Author

shaie commented Feb 23, 2017

I'm running gradle clean check and will push a new commit to the PR once it finishes. Sorry for the noise @jimczi

@shaie
Copy link
Contributor Author

shaie commented Feb 23, 2017

@jimczi I pushed a commit which addresses all your comments and all tests pass. Actually, when running gradle clean check only one test failed: org.elasticsearch.client.transport.NodeDisconnectIT.testNotifyOnDisconnect, but it doesn't look related to this PR and also the failures looked like:

 1> Caused by: org.elasticsearch.transport.NodeNotConnectedException: [node_s0][127.0.0.1:9500] Node not connected
  1>    at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:594) ~[main/:?]
  1>    at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:114) ~[main/:?]
  1>    at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:509) ~[main/:?]
  1>    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:485) ~[main/:?]
  1>    ... 19 more

I can rerun, but if you want to start the tests too, wanted to let you know.

@jimczi jimczi merged commit eeac6d2 into elastic:master Feb 23, 2017
@jimczi
Copy link
Contributor

jimczi commented Feb 23, 2017

Thanks @shaie !
I'll merge to 5.x now.

jimczi pushed a commit that referenced this pull request Feb 23, 2017
This commit adds a boundary_scanner property to the search highlight
request so the user can specify different boundary scanners:

* `chars` (default,  current behavior)
* `word` Use a WordBreakIterator
* `sentence` Use a SentenceBreakIterator

This commit also adds "boundary_scanner_locale" to define which locale
should be used when scanning the text.
@shaie shaie deleted the break-iterator-scanner branch February 24, 2017 08:31
@shaie
Copy link
Contributor Author

shaie commented Feb 24, 2017

Thanks @jimczi!

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 25, 2017
* master: (26 commits)
  CLI: Fix prompting for yes/no to handle console returning null (elastic#23320)
  Tests: Fix reproduce line for packagingTest (elastic#23365)
  Build: Remove extra copies of netty license (elastic#23361)
  [TEST] Removes timeout based wait_for_active_shards REST test (elastic#23360)
  [TEST] increase timeout slightly in wait_for_active_shards test to allow for index creation cluster state update to be processed before ensuring the wait times out
  Handle snapshot repository's missing index.latest
  Adding equals/hashCode to MainResponse (elastic#23352)
  Always restore the ThreadContext for operations delayed due to a block (elastic#23349)
  Add support for named xcontent parsers to high level REST client (elastic#23328)
  Add unit tests for ParentToChildAggregator (elastic#23305)
  Fix after last merge with master and apply last comments
  [INGEST] Lazy load the geoip databases.
  disable BWC tests for the highlighters, need a new 5.x build to make it work
  Expose WordDelimiterGraphTokenFilter (elastic#23327)
  Test that buildCredentials returns correct clazz (elastic#23334)
  Add BreakIteratorBoundaryScanner support for FVH (elastic#23248)
  Prioritize listing index-N blobs over index.latest in reading snapshots (elastic#23333)
  Test: Fix hdfs test fixture setup on windows
  delete and index tests can share some part of the code
  Remove createSampleDocument method and use the sync'ed index method
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants