Conversation

@andyb-elastic
Contributor

This exposes Lucene's new Simple Pattern Tokenizer and Simple Pattern Split Tokenizer, which use a restricted subset of regular expressions for faster tokenization. They're annotated as experimental in Lucene and are documented as such here.

For #23363
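
For illustration, here is a minimal sketch of driving the underlying Lucene tokenizer directly (not code from this PR; it assumes lucene-analyzers-common 6.5+ on the classpath):

import java.io.StringReader;
import org.apache.lucene.analysis.pattern.SimplePatternTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SimplePatternDemo {
    public static void main(String[] args) throws Exception {
        // Emit only the text matching the restricted regex: here, runs of
        // exactly three digits.
        try (SimplePatternTokenizer tokenizer = new SimplePatternTokenizer("[0-9]{3}")) {
            tokenizer.setReader(new StringReader("fd-786-335-514-x"));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.println(term); // prints 786, 335, 514
            }
            tokenizer.end();
        }
    }
}

The split variant inverts this: the same pattern would act as the separator, emitting the text between the digit runs.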

Register these experimental tokenizers. Their default patterns
are both set to the empty string. These tokenizers only seem
useful if there is a pattern the user has in mind, so there
aren't really "sensible" defaults. However, tokenizer factories
are instantiated at index creation time, so they blow up if
there's no default pattern.

Add a rest test and entries in the reference for each tokenizer

For #23363
super(indexSettings, name, settings);

String pattern = settings.get("pattern", "");
if (pattern == null) {
Member

This can never happen since you specified a default value on the previous line.

Contributor Author

Good point; if it's basically dead code then it's best to take it out. I think I was concerned about the user passing "pattern": null in the settings explicitly, but it looks like Settings#get(String, String) handles that case.
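
For what it's worth, a sketch of the constructor with the dead branch removed (the signature here is assumed, not copied from the PR):

SimplePatternTokenizerFactory(IndexSettings indexSettings, Environment environment,
                              String name, Settings settings) {
    super(indexSettings, name, settings);
    // Settings#get(String, String) already falls back to the default when the
    // value is missing or explicitly null, so no null check is needed.
    String pattern = settings.get("pattern", "");
    this.pattern = pattern;
}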

========================================
A badly written regular expression could run very slowly or even throw a
StackOverflowError and cause the node it is running on to exit suddenly.
Member

I believe this isn't true for this tokenizer. The nice thing about Lucene's regexes is that badly written regexes tend to explode up front with TooComplexToDeterminizeException. That isn't to say that every regex that avoids that exception is free of the problem, just that I don't think we need a "beware" here. So I think we can drop this warning entirely from this tokenizer.

In fact, this is the real reason we like this tokenizer. It might be faster than the old pattern one, but the reason we want this is that it is much safer.
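
A minimal sketch of that fail-fast behavior, using Lucene's automaton API directly (the pattern is an arbitrary adversarial example, not something from this PR):

import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;
import org.apache.lucene.util.automaton.TooComplexToDeterminizeException;

public class RegexFailFastDemo {
    public static void main(String[] args) {
        // (a|b)*a(a|b){64} needs on the order of 2^64 DFA states, far past
        // the default cap, so determinization fails at pattern-compile time
        // instead of tokenizing slowly later.
        try {
            new RegExp("(a|b)*a(a|b){64}")
                    .toAutomaton(Operations.DEFAULT_MAX_DETERMINIZED_STATES);
        } catch (TooComplexToDeterminizeException e) {
            System.out.println("rejected up front: " + e.getMessage());
        }
    }
}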

Contributor Author

Yeah, I was on the fence about including that since you can still (as far as I can tell) write regexes that do a lot of backtracking with its feature set. If it will cause an exception at regex-compile time then it's probably best to take this out.

Earlier someone made a good point about how if there are too many admonition blocks, then people don't pay as much attention to them (and they're definitely more important in the places that use Java regexes).

For reference, this is what the Lucene code has to say about it:

// TODO: the matcher here is naive and does have N^2 adversarial cases that are unlikely to arise in practice, e.g. if the pattern is
// aaaaaaaaaab and the input is aaaaaaaaaaa, the work we do here is N^2 where N is the number of a's. This is because on failing to match
// a token, we skip one character forward and try again. A better approach would be to compile something like this regexp
// instead: .* | , because that automaton would not "forget" all the as it had already seen, and would be a single pass
// through the input.
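
Roughly, the naive strategy that comment describes looks like the following sketch (hypothetical helper names, not the actual Lucene code):

// Hypothetical sketch of the naive matching loop described above.
static void tokenize(String input) {
    int offset = 0;
    while (offset < input.length()) {
        int end = matchLongestAt(input, offset); // hypothetical: run the DFA from 'offset'
        if (end > offset) {
            emitToken(input, offset, end);       // hypothetical: record the matched token
            offset = end;                        // jump past the token
        } else {
            offset++;                            // no match: advance one char and retry --
        }                                        // this retry is the N^2 adversarial case
    }
}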

* under the License.
*/

package org.elasticsearch.index.analysis;
Member

I think I'd like to put these in the analysis-common module. We are (slowly) in the process of moving all the analysis components there that depend on lucene-analyzers-common.jar.

Contributor Author

Sounds good, will do

Fixes for code review

Take out admonition blocks in reference detail pages on
these tokenizers because Lucene's regexes are better protected
against being too complex or causing deep stacks.

Move these tokenizers to the common-analysis module because
that's where we're relocating code that depends on
lucene-analyzers-common.

For #23363
@andyb-elastic
Contributor Author

@nik9000 @jasontedor made the suggested changes. I think I found the right place to hook into CommonAnalysisPlugin; let me know if I missed something.
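
Presumably the hook is AnalysisPlugin#getTokenizers, along these lines (a sketch; the exact body and map type are assumptions based on this thread):

// inside CommonAnalysisPlugin, which implements AnalysisPlugin
@Override
public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
    Map<String, AnalysisProvider<TokenizerFactory>> tokenizers = new TreeMap<>();
    // constructor references satisfy AnalysisProvider's
    // (IndexSettings, Environment, String, Settings) factory signature
    tokenizers.put("simplepattern", SimplePatternTokenizerFactory::new);
    tokenizers.put("simplepatternsplit", SimplePatternSplitTokenizerFactory::new);
    return tokenizers;
}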

@andyb-elastic
Contributor Author

This is the test that failed; it doesn't seem obviously related:

Tests with failures:
  - org.elasticsearch.index.translog.TranslogTests.testWithRandomException

retest this please jenkins

@jasontedor
Member

@andy-elastic That test is indeed not related; you can ignore it since it's spurious. It relates to #25133, so you can merge master into your branch to pick up the fix that was already pushed.

super(indexSettings, name, settings);

String pattern = settings.get("pattern", "");
this.pattern = pattern;
Member

Is the shadowing local really needed? Just assign directly.

Contributor Author

Definitely not necessary, fixed that. Also merged master back in.
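
So the assignment presumably now reads simply:

this.pattern = settings.get("pattern", ""); // direct assignment, no shadowing local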

@andyb-elastic
Contributor Author

@jasontedor I'll go ahead and merge this in an hour or so unless you have more notes (I'm guessing this is all set but don't want to cowboy-merge my first PR).

Member

@jasontedor left a comment

I left some more comments.

.put("standard", StandardTokenizerFactory.class)
.put("thai", ThaiTokenizerFactory.class)
.put("uax29urlemail", UAX29URLEmailTokenizerFactory.class)
.put("whitespace", WhitespaceTokenizerFactory.class)
Member

@jasontedor Jun 12, 2017

This formatting is annoying, since it has to be maintained every time a tokenizer is added or removed. I appreciate you trying to maintain it but maybe we should take this opportunity to just format it normally so we do not have to worry about that?

Contributor Author

Totally agree; I'm not a fan of this style either, for that reason.

Contributor Author

Fixed it in just that one map, in the interest of keeping the diff to only relevant parts.

Member

Yes, that is much preferred, thank you.

[horizontal]
`pattern`::

A http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.
Member

Is this additional whitespace here intentional?

[horizontal]
`pattern`::

A http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.
Member

Is this additional whitespace here intentional?

Contributor Author

Do you mean the empty line or the leading whitespace (or both)? Yes, but for no other reason than that it was like that in the other asciidoc files I looked at. It seems like the leading whitespace is the convention. I don't really have a strong opinion about readability here, so I'm fine with removing the empty line at least.

Member

Yes, I was wondering why you did not simply have:

[horizontal]
`pattern`:: A...

matches using the same restricted regular expression subset, see the
<<analysis-simplepatternsplit-tokenizer,`simplepatternsplit`>> tokenizer.

This tokenizer uses http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
Member

I wonder about linking to a specific version of the docs. Are these going to rapidly go stale? Can we link based on the Lucene version constant (lucene_version_path)?

Contributor Author

Yeah, I wasn't sure what to do here; this just seemed least bad since I saw it in other places. Using that version property is much better, but it looks like that URL doesn't exist yet for lucene_version_path = 7_0_0, where we are right now.

My initial impression is that having that link broken until Lucene's 7.0.0 release is probably better than having it possibly be out of date forever, especially if Lucene just isn't hosting a version 7 javadoc right now (I'll do a little more looking).

So I'll use the version property unless we really want to avoid that broken link (my reading of the docs project is that it only complains about broken links within the book, not external URLs).

Member

That's correct, and I agree with the preference.

[horizontal]
`pattern`::

A http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.
Member

I wonder about linking to a specific version of the docs. Are these going to rapidly go stale? Can we link based on the Lucene version constant (lucene_version_path)?

subset, see the <<analysis-simplepattern-tokenizer,`simplepattern`>>
tokenizer.

This tokenizer uses http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
Member

I wonder about linking to a specific version of the docs. Are these going to rapidly go stale? Can we link based on the Lucene version constant (lucene_version_path)?

[horizontal]
`pattern`::

A http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.
Member

I wonder about linking to a specific version of the docs. Are these going to rapidly go stale? Can we link based on the Lucene version constant (lucene_version_path)?

Make links to Lucene javadocs relative to the
lucene-core-javadoc property so they'll stay up to date
as we change Lucene versions.

Whitespace formatting in tokenizer docs

Whitespace formatting in AnalysisFactoryTestCase so that
we don't have to change spacing every time we edit that map

Clearer usage in the header for simplepatternsplit's section

For #23363
@andyb-elastic
Contributor Author

Made those whitespace and Lucene javadoc link changes.

Member

@jasontedor left a comment

LGTM.

@andyb-elastic
Contributor Author

Thanks @jasontedor. Just to be clear, this should go in 5.x, right?

@andyb-elastic andyb-elastic merged commit 48696ab into elastic:master Jun 13, 2017
@andyb-elastic
Contributor Author

Ah, looks like the common analysis plugin doesn't exist in 5.x. Where these changes would go in 5.x is different enough that I'm gonna infer that it shouldn't go in there.

@jasontedor
Member

I think that 6.0.0 only (master) is fine for this.

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jun 14, 2017
* master: (27 commits)
  Refactor TransportShardBulkAction.executeUpdateRequest and add tests
  Make sure range queries are correctly profiled. (elastic#25108)
  Test: allow setting socket timeout for rest client (elastic#25221)
  Migration docs for elastic#25080 (elastic#25218)
  Remove `discovery.type` BWC layer from the EC2/Azure/GCE plugins elastic#25080
  When stopping via systemd only kill the JVM, not its control group (elastic#25195)
  Remove PrefixAnalyzer, because it is no longer used.
  Internal: Remove Strings.cleanPath (elastic#25209)
  Docs: Add note about which secure settings are valid (elastic#25212)
  Indices.rollover/10_basic should refresh to make the doc visible in lucene stats
  Port support for commercial GeoIP2 databases from Logstash. (elastic#24889)
  [DOCS] Add ML node to node.asciidoc (elastic#24495)
  expose simple pattern tokenizers (elastic#25159)
  Test: add setting to change request timeout for rest client (elastic#25201)
  Fix secure repository-hdfs tests on JDK 9
  Add target_field parameter to gsub, join, lowercase, sort, split, trim, uppercase (elastic#24133)
  Add Cross Cluster Search support for scroll searches (elastic#25094)
  Adapt skip version in rest-api-spec/test/indices.rollover/20_max_doc_condition.yml
  Rollover max docs should only count primaries (elastic#24977)
  Add remote cluster infrastructure to fetch discovery nodes. (elastic#25123)
  ...
@clintongormley clintongormley changed the title expose simplepattern and simplepatternsplit tokenizers Expose simplepattern and simplepatternsplit tokenizers Jun 19, 2017
@clintongormley
Contributor

@andy-elastic sorry for the late remark, but could you rename these to simple_pattern and simple_pattern_split please? That's more consistent with e.g. path_hierarchy and uax_url_email.

@andyb-elastic
Contributor Author

rename these to simple_pattern and simple_pattern_split

Sure thing, I much prefer that. I'd seen it as simplepattern in the code in a few places and just went with that.

@andyb-elastic
Contributor Author

I see why simplepattern was already in there: in the analysis factory tests, tokenizers have their names without underscores because some of the tests look them up by SPI name.

I opened #25300 for that change; let me know if there's a more appropriate way to get it in than another PR.
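
For context, Lucene's analysis SPI resolves factories by a condensed, case-insensitive name, so the test lookup is presumably along these lines (a sketch, not the actual test code):

// Lucene's SPI name has no underscores, which is why the map key is
// "simplepattern" rather than "simple_pattern".
Class<? extends org.apache.lucene.analysis.util.TokenizerFactory> clazz =
        org.apache.lucene.analysis.util.TokenizerFactory.lookupClass("simplepattern");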

@nik9000
Member

nik9000 commented Jun 19, 2017 via email

andyb-elastic added a commit that referenced this pull request Jun 19, 2017
Changed names to be snake case for consistency

Related to #25159, original issue #23363
@andyb-elastic
Contributor Author

@clintongormley I merged #25300, which changes their names to snake case.

@clintongormley
Contributor

thanks @andy-elastic
