Skip to content

Conversation

@jimczi
Copy link
Contributor

@jimczi jimczi commented Mar 1, 2017

This commit introduce a new break iterator (a BoundedBreakIterator) designed for the unified highlighter that is able to limit the size of fragments produced by generic break iterator like sentence.
The unified highlighter now supports boundary_scanner with the following mode: words and sentence. The sentence mode uses the new bounded break iterator in order to limit the size of the sentence to fragment_length.
When sentences bigger than fragment_length are produced, this mode will break the sentence at the next word boundary after fragment_length is reached.

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few minors. As I said, I'm not a fan of using BreakIterator for something so narrow but this is a big improvement so it is worth getting in regardless.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like BreakIterator for this. It is a much wider interface than the highlighter needs. But it is what we have to work with. So we shall!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be hard checks? Or have a message associated with them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd make these a hard check and throw a new IllegalArgumentException("Usage doesn't look like UnifiedHighlighter").

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably worth commenting why this is needed too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noMatchSize + pos?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I reading this right, this is the only change to the file. Everything else is just word wrapping?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra space

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a non-random test just so it is easier to see? One with a "normal" sentence and maybe one with a long one? Just for illustration purposes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you have enough in CustomUnifiedHighlighterTests, I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd throw an exception if it is > 0 rather than ignore it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value is 100 so it feels weird to set an explicit fragment_length:0 to make this mode works ?

jimczi added 2 commits March 17, 2017 16:23
This commit introduce a new break iterator (a BoundedBreakIterator) designed for the unified highlighter
 that is able to limit the size of fragments produced by generic break iterator like `sentence`.
The `unified` highlighter now supports `boundary_scanner` which can `words` or `sentence`.
The `sentence` mode will use the bounded break iterator in order to limit the size of the sentence to `fragment_length`.
When sentences bigger than `fragment_length` are produced, this mode will break the sentence at the next word boundary **after**
 `fragment_length` is reached.
@jimczi jimczi merged commit b8c352f into elastic:master Mar 17, 2017
@jimczi jimczi deleted the bounded_break_iterator branch March 17, 2017 17:10
@jimczi
Copy link
Contributor Author

jimczi commented Mar 17, 2017

Thanks @nik9000

jimczi added a commit that referenced this pull request Mar 17, 2017
* Add support for fragment_length in the unified highlighter

This commit introduce a new break iterator (a BoundedBreakIterator) designed for the unified highlighter
 that is able to limit the size of fragments produced by generic break iterator like `sentence`.
The `unified` highlighter now supports `boundary_scanner` which can `words` or `sentence`.
The `sentence` mode will use the bounded break iterator in order to limit the size of the sentence to `fragment_length`.
When sentences bigger than `fragment_length` are produced, this mode will break the sentence at the next word boundary **after**
 `fragment_length` is reached.
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 17, 2017
* master:
  Eclipse: move print margin to 100 columns
  Add support for fragment_length in the unified highlighter (elastic#23431)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 17, 2017
* master:
  Eclipse: move print margin to 100 columns
  Add support for fragment_length in the unified highlighter (elastic#23431)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 17, 2017
* single-node-discovery:
  Eclipse: move print margin to 100 columns
  Add support for fragment_length in the unified highlighter (elastic#23431)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants