Conversation

@andyb-elastic
Contributor

This exposes Lucene's new Simple Pattern Tokenizer and Simple Pattern Split Tokenizer, which use a restricted subset of regular expressions for faster tokenization. They're annotated as experimental in Lucene and are documented as such here.

For #23363
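
For illustration, here is a minimal sketch of driving the underlying Lucene tokenizer directly (not code from this PR; it assumes lucene-analyzers-common 6.5+ on the classpath):

import java.io.StringReader;
import org.apache.lucene.analysis.pattern.SimplePatternTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SimplePatternDemo {
    public static void main(String[] args) throws Exception {
        // Emit only the text matching the restricted regex: here, runs of
        // exactly three digits.
        try (SimplePatternTokenizer tokenizer = new SimplePatternTokenizer("[0-9]{3}")) {
            tokenizer.setReader(new StringReader("fd-786-335-514-x"));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.println(term); // prints 786, 335, 514
            }
            tokenizer.end();
        }
    }
}

The split variant inverts this: the same pattern would act as the separator, emitting the text between the digit runs.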

Register these experimental tokenizers. Their default patterns
are both set to the empty string. These tokenizers only seem
useful if there is a pattern the user has in mind, so there
aren't really "sensible" defaults. However, tokenizer factories
are instantiated at index creation time, so they blow up if
there's no default pattern.

Add a rest test and entries in the reference for each tokenizer

For #23363
super(indexSettings, name, settings);

String pattern = settings.get("pattern", "");
if (pattern == null) {
Member

This can never happen since you specified a default value on the previous line.

Contributor Author

Good point; if it's basically dead code then it's best to take it out. I think I was concerned about the user passing "pattern": null in the settings explicitly, but it looks like Settings#get(String, String) handles that case.
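
For what it's worth, a sketch of the constructor with the dead branch removed (the signature here is assumed, not copied from the PR):

SimplePatternTokenizerFactory(IndexSettings indexSettings, Environment environment,
                              String name, Settings settings) {
    super(indexSettings, name, settings);
    // Settings#get(String, String) already falls back to the default when the
    // value is missing or explicitly null, so no null check is needed.
    String pattern = settings.get("pattern", "");
    this.pattern = pattern;
}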

========================================
A badly written regular expression could run very slowly or even throw a
StackOverflowError and cause the node it is running on to exit suddenly.
Member

I believe this isn't true for this tokenizer. The nice thing about Lucene's regexes is that badly written regexes tend to explode up front with TooComplexToDeterminizeException. That isn't to say that every regex that avoids that exception is free of the problem, just that I don't think we need a "beware" here. So I think we can drop this warning entirely from this tokenizer.

In fact, this is the real reason we like this tokenizer. It might be faster than the old pattern one, but the reason we want this is that it is much safer.
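
A minimal sketch of that fail-fast behavior, using Lucene's automaton API directly (the pattern is an arbitrary adversarial example, not something from this PR):

import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;
import org.apache.lucene.util.automaton.TooComplexToDeterminizeException;

public class RegexFailFastDemo {
    public static void main(String[] args) {
        // (a|b)*a(a|b){64} needs on the order of 2^64 DFA states, far past
        // the default cap, so determinization fails at pattern-compile time
        // instead of tokenizing slowly later.
        try {
            new RegExp("(a|b)*a(a|b){64}")
                    .toAutomaton(Operations.DEFAULT_MAX_DETERMINIZED_STATES);
        } catch (TooComplexToDeterminizeException e) {
            System.out.println("rejected up front: " + e.getMessage());
        }
    }
}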

Contributor Author

Yeah, I was on the fence about including that since you can still (as far as I can tell) write regexes that do a lot of backtracking with its feature set. If it will cause an exception at regex-compile time then it's probably best to take this out.

Earlier someone made a good point about how if there are too many admonition blocks, then people don't pay as much attention to them (and they're definitely more important in the places that use Java regexes).

For reference, this is what the Lucene code has to say about it:

// TODO: the matcher here is naive and does have N^2 adversarial cases that are unlikely to arise in practice, e.g. if the pattern is
// aaaaaaaaaab and the input is aaaaaaaaaaa, the work we do here is N^2 where N is the number of a's. This is because on failing to match
// a token, we skip one character forward and try again. A better approach would be to compile something like this regexp
// instead: .* | , because that automaton would not "forget" all the as it had already seen, and would be a single pass
// through the input.
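
Roughly, the naive strategy that comment describes looks like the following sketch (hypothetical helper names, not the actual Lucene code):

// Hypothetical sketch of the naive matching loop described above.
static void tokenize(String input) {
    int offset = 0;
    while (offset < input.length()) {
        int end = matchLongestAt(input, offset); // hypothetical: run the DFA from 'offset'
        if (end > offset) {
            emitToken(input, offset, end);       // hypothetical: record the matched token
            offset = end;                        // jump past the token
        } else {
            offset++;                            // no match: advance one char and retry --
        }                                        // this retry is the N^2 adversarial case
    }
}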

* under the License.
*/

package org.elasticsearch.index.analysis;
Member

I think I'd like to put these in the analysis-common module. We are (slowly) in the process of moving all the analysis components there that depend on lucene-analyzers-common.jar.

Contributor Author

Sounds good, will do

Fixes for code review

Take out admonition blocks in reference detail pages on
these tokenizers because Lucene's regexes are better protected
against being too complex or causing deep stacks.

Move these tokenizers to the common-analysis module because
that's where we're relocating code that depends on
lucene-analyzers-common.

For #23363
@andyb-elastic
Contributor Author

@nik9000 @jasontedor made the suggested changes. I think I found the right place to hook into CommonAnalysisPlugin; let me know if I missed something.
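
Presumably the hook is AnalysisPlugin#getTokenizers, along these lines (a sketch; the exact body and map type are assumptions based on this thread):

// inside CommonAnalysisPlugin, which implements AnalysisPlugin
@Override
public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
    Map<String, AnalysisProvider<TokenizerFactory>> tokenizers = new TreeMap<>();
    // constructor references satisfy AnalysisProvider's
    // (IndexSettings, Environment, String, Settings) factory signature
    tokenizers.put("simplepattern", SimplePatternTokenizerFactory::new);
    tokenizers.put("simplepatternsplit", SimplePatternSplitTokenizerFactory::new);
    return tokenizers;
}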

@andyb-elastic
Contributor Author

This is the test that failed; it doesn't seem obviously related:

Tests with failures:
  - org.elasticsearch.index.translog.TranslogTests.testWithRandomException

retest this please jenkins

@jasontedor
Member

@andy-elastic That test is indeed not related; you can ignore it since it's spurious. It relates to #25133, so you can merge master into your branch to pick up the fix that was already pushed.

super(indexSettings, name, settings);

String pattern = settings.get("pattern", "");
this.pattern = pattern;
Member

Is the shadowing local really needed? Just assign directly.

Contributor Author

Definitely not necessary, fixed that. Also merged master back in.
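
So the assignment presumably now reads simply:

this.pattern = settings.get("pattern", ""); // direct assignment, no shadowing local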

@andyb-elastic
Contributor Author

@jasontedor I'll go ahead and merge this in an hour or so unless you have more notes (I'm guessing this is all set but don't want to cowboy-merge my first PR).

Member

@jasontedor left a comment

I left some more comments.

.put("standard", StandardTokenizerFactory.class)
.put("thai", ThaiTokenizerFactory.class)
.put("uax29urlemail", UAX29URLEmailTokenizerFactory.class)
.put("whitespace", WhitespaceTokenizerFactory.class)
Member

@jasontedor Jun 12, 2017

This formatting is annoying, since it has to be maintained every time a tokenizer is added or removed. I appreciate you trying to maintain it but maybe we should take this opportunity to just format it normally so we do not have to worry about that?

Contributor Author

Totally agree; I'm not a fan of this style either, for that reason.

Contributor Author

Fixed it in just that one map, in the interest of keeping the diff to only relevant parts.

Member

Yes, that is much preferred, thank you.

[horizontal]
`pattern`::

A http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.
Member

Is this additional whitespace here intentional?

[horizontal]
`pattern`::

A http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.
Member

Is this additional whitespace here intentional?

Contributor Author

Do you mean the empty line or the leading whitespace (or both)? Yes, but for no other reason than that it was like that in the other asciidoc files I looked at. It seems like the leading whitespace is the convention. I don't really have a strong opinion about readability here, so I'm fine with removing the empty line at least.

Member

Yes, I was wondering why you did not simply have:

[horizontal]
`pattern`:: A...

matches using the same restricted regular expression subset, see the
<<analysis-simplepatternsplit-tokenizer,`simplepatternsplit`>> tokenizer.

This tokenizer uses http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
Member

I wonder about linking to a specific version of the docs. Are these going to rapidly go stale? Can we link based on the Lucene version constant (lucene_version_path)?

Contributor Author

Yeah, I wasn't sure what to do here; this just seemed least bad since I saw it in other places. Using that version property is much better, but it looks like that URL doesn't exist yet for lucene_version_path = 7_0_0, where we are right now.

My initial impression is that having that link broken until Lucene's 7.0.0 release is probably better than having it possibly be out of date forever, especially if Lucene just isn't hosting a version 7 javadoc right now (I'll do a little more looking).

So I'll use the version property unless we really want to avoid that broken link (my reading of the docs project is that it only complains about broken links within the book, not external URLs).

Member

That's correct, and I agree with the preference.

[horizontal]
`pattern`::

A http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.
Member

I wonder about linking to a specific version of the docs. Are these going to rapidly go stale? Can we link based on the Lucene version constant (lucene_version_path)?

subset, see the <<analysis-simplepattern-tokenizer,`simplepattern`>>
tokenizer.

This tokenizer uses http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
Member

I wonder about linking to a specific version of the docs. Are these going to rapidly go stale? Can we link based on the Lucene version constant (lucene_version_path)?

[horizontal]
`pattern`::

A http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.
Member

I wonder about linking to a specific version of the docs. Are these going to rapidly go stale? Can we link based on the Lucene version constant (lucene_version_path)?

Make links to Lucene javadocs relative to the
lucene-core-javadoc property so they'll stay up to date
as we change Lucene versions.

Whitespace formatting in tokenizer docs

Whitespace formatting in AnalysisFactoryTestCase so that
we don't have to change spacing every time we edit that map

Clearer usage in the header for simplepatternsplit's section

For #23363
@andyb-elastic
Contributor Author

Made those whitespace and Lucene javadoc link changes.

Member

@jasontedor left a comment

LGTM.

@andyb-elastic
Contributor Author

Thanks @jasontedor. Just to be clear, this should go in 5.x, right?

@andyb-elastic andyb-elastic merged commit 48696ab into elastic:master Jun 13, 2017
@andyb-elastic
Contributor Author

Ah, looks like the common analysis plugin doesn't exist in 5.x. Where these changes would go in 5.x is different enough that I'm gonna infer that it shouldn't go in there.

@jasontedor
Member

I think that 6.0.0 only (master) is fine for this.

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jun 14, 2017
* master: (27 commits)
  Refactor TransportShardBulkAction.executeUpdateRequest and add tests
  Make sure range queries are correctly profiled. (elastic#25108)
  Test: allow setting socket timeout for rest client (elastic#25221)
  Migration docs for elastic#25080 (elastic#25218)
  Remove `discovery.type` BWC layer from the EC2/Azure/GCE plugins elastic#25080
  When stopping via systemd only kill the JVM, not its control group (elastic#25195)
  Remove PrefixAnalyzer, because it is no longer used.
  Internal: Remove Strings.cleanPath (elastic#25209)
  Docs: Add note about which secure settings are valid (elastic#25212)
  Indices.rollover/10_basic should refresh to make the doc visible in lucene stats
  Port support for commercial GeoIP2 databases from Logstash. (elastic#24889)
  [DOCS] Add ML node to node.asciidoc (elastic#24495)
  expose simple pattern tokenizers (elastic#25159)
  Test: add setting to change request timeout for rest client (elastic#25201)
  Fix secure repository-hdfs tests on JDK 9
  Add target_field parameter to gsub, join, lowercase, sort, split, trim, uppercase (elastic#24133)
  Add Cross Cluster Search support for scroll searches (elastic#25094)
  Adapt skip version in rest-api-spec/test/indices.rollover/20_max_doc_condition.yml
  Rollover max docs should only count primaries (elastic#24977)
  Add remote cluster infrastructure to fetch discovery nodes. (elastic#25123)
  ...
@clintongormley clintongormley changed the title expose simplepattern and simplepatternsplit tokenizers Expose simplepattern and simplepatternsplit tokenizers Jun 19, 2017
@clintongormley
Contributor

@andy-elastic sorry for the late remark, but could you rename these to simple_pattern and simple_pattern_split please? That's more consistent with e.g. path_hierarchy and uax_url_email.

@andyb-elastic
Contributor Author

rename these to simple_pattern and simple_pattern_split

Sure thing, I much prefer that. I'd seen it as simplepattern in the code in a few places and just went with that.

@andyb-elastic
Contributor Author

I see why simplepattern was already in there: in the analysis factory tests, tokenizers have their names without underscores because some of the tests look them up by SPI name.

I opened #25300 for that change; let me know if there's a more appropriate way to get it in than another PR.
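
For context, Lucene's analysis SPI resolves factories by a condensed, case-insensitive name, so the test lookup is presumably along these lines (a sketch, not the actual test code):

// Lucene's SPI name has no underscores, which is why the map key is
// "simplepattern" rather than "simple_pattern".
Class<? extends org.apache.lucene.analysis.util.TokenizerFactory> clazz =
        org.apache.lucene.analysis.util.TokenizerFactory.lookupClass("simplepattern");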

@nik9000
Member

nik9000 commented Jun 19, 2017 via email

andyb-elastic added a commit that referenced this pull request Jun 19, 2017
Changed names to be snake case for consistency

Related to #25159, original issue #23363
@andyb-elastic
Contributor Author

@clintongormley I merged #25300, which changes their names to snake case.

@clintongormley
Contributor

thanks @andy-elastic
