Expose simplepattern and simplepatternsplit tokenizers #25159
Conversation
Register these experimental tokenizers. Their default patterns are both set to the empty string: these tokenizers are only useful when the user has a pattern in mind, so there aren't really "sensible" defaults. However, tokenizer factories are instantiated at index creation time, so they blow up if there's no default pattern. Add a REST test and entries in the reference for each tokenizer. For #23363
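For context, a minimal sketch of what such a factory might look like, based on the snippets discussed below (not the exact committed code; class and package names are assumptions). The empty-string default keeps index creation from failing when the user supplies no pattern:

```java
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.SimplePatternTokenizer;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;

// Sketch of a tokenizer factory along the lines discussed in this PR.
public class SimplePatternTokenizerFactory extends AbstractTokenizerFactory {
    private final String pattern;

    public SimplePatternTokenizerFactory(IndexSettings indexSettings, Environment environment,
                                         String name, Settings settings) {
        super(indexSettings, name, settings);
        pattern = settings.get("pattern", ""); // no "sensible" default pattern exists
    }

    @Override
    public Tokenizer create() {
        return new SimplePatternTokenizer(pattern);
    }
}
```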
```java
super(indexSettings, name, settings);

String pattern = settings.get("pattern", "");
if (pattern == null) {
```
This can never happen since you specified a default value on the previous line.
Good point; if it's basically dead code then it's best to take it out. I think I was concerned about the user explicitly passing `"pattern": null` in the settings, but it looks like `Settings#get(String, String)` handles that case.
```asciidoc
========================================
A badly written regular expression could run very slowly or even throw a
StackOverflowError and cause the node it is running on to exit suddenly.
```
I believe this isn't true for this tokenizer. The nice thing about Lucene's regexes is that badly written ones tend to explode up front with `TooComplexToDeterminizeException`. That isn't to say that every regex that doesn't throw it is free of the problem, just that I don't think you need a "beware" here. So I think we can drop this warning entirely from this tokenizer.
In fact, this is the real reason we like this tokenizer. It might be faster than the old pattern one, but the reason we want it is that it is much safer.
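For illustration, a sketch of that up-front failure mode using Lucene's automaton APIs (the pattern shown is a made-up example and may or may not actually trip the state limit):

```java
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;
import org.apache.lucene.util.automaton.TooComplexToDeterminizeException;

public class RegexCompileCheck {
    public static void main(String[] args) {
        String pattern = "(aa*)*b"; // hypothetical pattern
        try {
            // Determinization is bounded; pathological patterns fail here,
            // at compile time, before any input is matched.
            Automaton a = new RegExp(pattern)
                    .toAutomaton(Operations.DEFAULT_MAX_DETERMINIZED_STATES);
            System.out.println("compiled to " + a.getNumStates() + " states");
        } catch (TooComplexToDeterminizeException e) {
            System.out.println("rejected up front: " + e);
        }
    }
}
```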
Yeah, I was on the fence about including that, since you can still (as far as I can tell) write regexes that do a lot of backtracking with its feature set. If it causes an exception at regex compile time, then it's probably best to take this out.
Earlier someone made a good point that if there are too many admonition blocks, people don't pay as much attention to them (and they're definitely more important in the places that use Java regexes).
For reference, this is what the Lucene code has to say about it:

```java
// TODO: the matcher here is naive and does have N^2 adversarial cases that are unlikely to arise in practice, e.g. if the pattern is
// aaaaaaaaaab and the input is aaaaaaaaaaa, the work we do here is N^2 where N is the number of a's. This is because on failing to match
// a token, we skip one character forward and try again. A better approach would be to compile something like this regexp
// instead: .* | <pattern>, because that automaton would not "forget" all the as it had already seen, and would be a single pass
// through the input.
```
```java
 * under the License.
 */

package org.elasticsearch.index.analysis;
```
I think I'd like to put these in the analysis-common module. We are (slowly) in the process of moving all the analysis components there that depend on lucene-analyzers-common.jar.
Sounds good, will do
Fixes for code review. Take out the admonition blocks in the reference detail pages for these tokenizers, because Lucene's regexes are better protected against being too complex or causing deep stacks. Move these tokenizers to the common-analysis module, because that's where we're relocating code that depends on lucene-analyzers-common. For #23363
@nik9000 @jasontedor made the suggested changes. I think I found the right place to hook into CommonAnalysisPlugin, let me know if I missed something.
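Presumably the hook in question looks something like this sketch (assuming the `AnalysisPlugin#getTokenizers` extension point; the factory class names follow the discussion above):

```java
import java.util.Map;
import java.util.TreeMap;

import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.analysis.AnalysisModule.AnalysisProvider;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;

// Sketch of registering the two tokenizers from the common analysis module.
public class CommonAnalysisPlugin extends Plugin implements AnalysisPlugin {
    @Override
    public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
        Map<String, AnalysisProvider<TokenizerFactory>> tokenizers = new TreeMap<>();
        tokenizers.put("simplepattern", SimplePatternTokenizerFactory::new);
        tokenizers.put("simplepatternsplit", SimplePatternSplitTokenizerFactory::new);
        return tokenizers;
    }
}
```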
This is the test that failed; it doesn't seem obviously related.
retest this please jenkins
@andy-elastic That test is indeed not related; you can ignore it since it's spurious. Relates #25133, so you can merge master into your branch to pick up the fix that was already pushed.
```java
super(indexSettings, name, settings);

String pattern = settings.get("pattern", "");
this.pattern = pattern;
```
Is the shadowing local really needed? Just assign directly.
Definitely not necessary, fixed that. Also merged master back in.
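Presumably the cleanup is just the direct assignment, something like:

```java
// Sketch of the cleaned-up constructor body: no intermediate local.
this.pattern = settings.get("pattern", "");
```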
Fix for code review: clean up unnecessary variables. For #23363
@jasontedor I'll go ahead and merge this in an hour or so unless you have more notes (I'm guessing this is all set, but I don't want to cowboy-merge my first PR).
jasontedor left a comment
I left some more comments.
```java
.put("standard",      StandardTokenizerFactory.class)
.put("thai",          ThaiTokenizerFactory.class)
.put("uax29urlemail", UAX29URLEmailTokenizerFactory.class)
.put("whitespace",    WhitespaceTokenizerFactory.class)
```
This formatting is annoying, since it has to be maintained every time a tokenizer is added or removed. I appreciate you trying to maintain it, but maybe we should take this opportunity to just format it normally so we do not have to worry about that?
Totally agree, I'm not a fan of this style either for that reason
Fixed it in just that one map, in the interest of keeping the diff to only relevant parts
Yes, that is much preferred, thank you.
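For what it's worth, "formatting it normally" presumably just means dropping the column alignment, so nothing needs realigning when entries change:

```java
// Single-space formatting: adding or removing an entry touches one line only.
.put("standard", StandardTokenizerFactory.class)
.put("thai", ThaiTokenizerFactory.class)
.put("uax29urlemail", UAX29URLEmailTokenizerFactory.class)
.put("whitespace", WhitespaceTokenizerFactory.class)
```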
```asciidoc
[horizontal]
`pattern`::

    A http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.
```
Is this additional whitespace here intentional?
```asciidoc
[horizontal]
`pattern`::

    A http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.
```
Is this additional whitespace here intentional?
Do you mean the empty line or the leading whitespace (or both)? Yes, but for no other reason than that it was like that in the other asciidoc files I looked at. It seems like the leading whitespace is the convention. I don't have a strong opinion about readability here, so I'm fine with removing at least the empty line.
Yes, I was wondering why you did not simply have:

```asciidoc
[horizontal]
`pattern`:: A...
```
```asciidoc
matches using the same restricted regular expression subset, see the
<<analysis-simplepatternsplit-tokenizer,`simplepatternsplit`>> tokenizer.

This tokenizer uses http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
```
I wonder about linking to a specific version of the docs. Are these going to rapidly go stale? Can we link based on the Lucene version constant (lucene_version_path)?
Yeah, I wasn't sure what to do here; this just seemed least-bad since I saw it in other places. Using that version property is much better, but it looks like that URL doesn't exist yet for lucene_version_path = 7_0_0, which is where we are right now.
My initial impression is that having that link broken until Lucene's 7.0.0 release is probably better than having it possibly be out of date forever, especially if Lucene just isn't hosting a version 7 javadoc right now (I'll do a little more looking).
So I'll use the version property unless we really want to avoid that broken link (my reading of the docs project is that it only complains about broken links within the book, not external URLs).
That's correct, and I agree with the preference.
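For reference, the property-based link presumably ends up looking something like `{lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression]`, using the `lucene-core-javadoc` property named in the commit message below; the exact attribute spelling is an assumption here.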
```asciidoc
[horizontal]
`pattern`::

    A http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.
```
I wonder about linking to a specific version of the docs. Are these going to rapidly go stale? Can we link based on the Lucene version constant (lucene_version_path)?
```asciidoc
subset, see the <<analysis-simplepattern-tokenizer,`simplepattern`>>
tokenizer.

This tokenizer uses http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
```
I wonder about linking to a specific version of the docs. Are these going to rapidly go stale? Can we link based on the Lucene version constant (lucene_version_path)?
```asciidoc
[horizontal]
`pattern`::

    A http://lucene.apache.org/core//6_5_1/core/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.
```
I wonder about linking to a specific version of the docs. Are these going to rapidly go stale? Can we link based on the Lucene version constant (lucene_version_path)?
Make links to Lucene javadocs relative to the lucene-core-javadoc property so they'll stay up to date as we change Lucene versions. Whitespace formatting in tokenizer docs. Whitespace formatting in AnalysisFactoryTestCase so that we don't have to change spacing every time we edit that map. Clearer usage in the header for simplepatternsplit's section. For #23363
Made those whitespace and Lucene javadoc link changes.
jasontedor left a comment
LGTM.
Thanks @jasontedor. Just to be clear, this should go in 5.x, right?
Ah, looks like the common analysis plugin doesn't exist in 5.x. Where these changes would go in 5.x is different enough that I'm gonna infer that it shouldn't go in there.
I think that 6.0.0 only (master) is fine for this.
* master: (27 commits)
  Refactor TransportShardBulkAction.executeUpdateRequest and add tests
  Make sure range queries are correctly profiled. (elastic#25108)
  Test: allow setting socket timeout for rest client (elastic#25221)
  Migration docs for elastic#25080 (elastic#25218)
  Remove `discovery.type` BWC layer from the EC2/Azure/GCE plugins elastic#25080
  When stopping via systemd only kill the JVM, not its control group (elastic#25195)
  Remove PrefixAnalyzer, because it is no longer used.
  Internal: Remove Strings.cleanPath (elastic#25209)
  Docs: Add note about which secure settings are valid (elastic#25212)
  Indices.rollover/10_basic should refresh to make the doc visible in lucene stats
  Port support for commercial GeoIP2 databases from Logstash. (elastic#24889)
  [DOCS] Add ML node to node.asciidoc (elastic#24495)
  expose simple pattern tokenizers (elastic#25159)
  Test: add setting to change request timeout for rest client (elastic#25201)
  Fix secure repository-hdfs tests on JDK 9
  Add target_field parameter to gsub, join, lowercase, sort, split, trim, uppercase (elastic#24133)
  Add Cross Cluster Search support for scroll searches (elastic#25094)
  Adapt skip version in rest-api-spec/test/indices.rollover/20_max_doc_condition.yml
  Rollover max docs should only count primaries (elastic#24977)
  Add remote cluster infrastructure to fetch discovery nodes. (elastic#25123)
  ...
@andy-elastic sorry for the late remark, but could you rename these to `simple_pattern` and `simple_pattern_split`?
Sure thing, much prefer that. I'd seen it as ...
I see why `simplepattern` was already in there: in the analysis factory tests, tokenizers have their [names without spaces](https://github.com/elastic/elasticsearch/blob/master/test/framework/src/main/java/org/elasticsearch/indices/analysis/AnalysisFactoryTestCase.java#L124-L137) because some of the tests look them up with [an SPI name](https://github.com/andy-elastic/elasticsearch/blob/428e70758ac6895ac995f4315412f4d3729aea9b/test/framework/src/main/java/org/elasticsearch/indices/analysis/AnalysisFactoryTestCase.java#L370-L370).
I opened #25300 for that change; let me know if there's a more appropriate way to get it in than another PR.
You can override the SPI lookup if you don't like the names that it makes you use. It is really just there for convenience.
@clintongormley I merged #25300, which changes their names to snake case.
thanks @andy-elastic
This exposes Lucene's new Simple Pattern Tokenizer and Simple Pattern Split Tokenizer, which use a restricted subset of regular expressions for faster tokenization. They're annotated as experimental in Lucene and are documented as such here.
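For a concrete sense of the behavior, here is a minimal sketch that drives Lucene's `SimplePatternTokenizer` directly (assumes `lucene-analyzers-common` on the classpath; the pattern and input are made-up examples):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.pattern.SimplePatternTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SimplePatternDemo {
    public static void main(String[] args) throws Exception {
        // Tokens are whatever the (Lucene RegExp) pattern matches; here,
        // runs of exactly three digits.
        try (SimplePatternTokenizer tokenizer = new SimplePatternTokenizer("[0-9]{3}")) {
            tokenizer.setReader(new StringReader("fd-786-335-514-x"));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.println(term); // prints 786, then 335, then 514
            }
            tokenizer.end();
        }
    }
}
```

The split variant inverts this: its pattern matches the separators between tokens rather than the tokens themselves.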
For #23363