Allow plugins to build "pre-configured" token filters #24223

nik9000 · 2017-04-20T20:02:47Z

This changes the way we register "pre-built" token filters so that
plugins can declare them, and it starts moving the "pre-built"
token filters out of core. It doesn't finish the job because doing
so would make the change unreviewably large, so this PR includes
a shim that keeps the "old" way of registering "pre-built" token
filters around.

The Lowercase token filter is special because it interacts with
the lowercase tokenizer. I'm not sure exactly what to do about
that, so for now I'm leaving it alone with the intent of figuring
it out in a followup.

This is a part of #23658

To keep the PR small I'm breaking it off here.

nik9000 · 2017-04-20T20:04:09Z

core/src/main/java/org/elasticsearch/indices/analysis/PreBuiltTokenFilters.java

-    },
-
    // Extended Token Filters
    SNOWBALL(CachingStrategy.ONE) {


I stopped migrating here in an attempt to keep the PR reviewable.

nik9000 · 2017-04-20T20:05:26Z

core/src/main/java/org/elasticsearch/indices/analysis/PreBuiltTokenFilters.java

+
    private interface MultiTermAwareTokenFilterFactory extends TokenFilterFactory, MultiTermAwareComponent {}

    public synchronized TokenFilterFactory getTokenFilterFactory(final Version version) {


I've preserved this method entirely for PreBuiltAnalyer.LOWERCASE#getMultiTermComponent. I'll figure out how to move away from that at some point but I thought that too should wait for another PR.

nik9000 · 2017-04-20T20:07:03Z

core/src/main/java/org/elasticsearch/index/analysis/PreBuiltTokenFilterFactoryProvider.java

+
+    private interface MultiTermAwareTokenFilterFactory extends TokenFilterFactory, MultiTermAwareComponent {}
+
+    private synchronized TokenFilterFactory getTokenFilterFactory(final Version version) {


I've lifted this method pretty much verbatim from PreBuiltTokenFilters in an attempt to keep the changes small.

nik9000 · 2017-04-20T20:07:49Z

core/src/main/java/org/elasticsearch/indices/analysis/AnalysisModule.java

+        NamedRegistry<PreBuiltTokenFilterSpec> preBuiltTokenFilters = new NamedRegistry<>("pre built token_filter");
+
+        // Add filters available in lucene-core
+        preBuiltTokenFilters.register("lowercase", new PreBuiltTokenFilterSpec(true, CachingStrategy.LUCENE, (inputs, version) ->


I debated moving lowercase out of core and just keeping standard but there isn't really a technical reason to do that so I decided not to for now.

nik9000 · 2017-04-20T20:08:33Z

core/src/main/java/org/elasticsearch/indices/analysis/AnalysisModule.java

+         * lucene-analyzers-common so "stop" is defined in the analysis-common module. */
+
+        // Add token filers declared in PreBuiltTokenFilters until they have all been migrated
+        for (PreBuiltTokenFilters preBuilt : PreBuiltTokenFilters.values()) {


This is the shim that I'll drop in a followup.

nik9000 · 2017-04-20T20:09:36Z

...lysis-common/src/test/java/org/elasticsearch/analysis/common/CommonAnalysisFactoryTests.java

+    @Override
+    protected Map<String, Class<?>> getPreBuiltTokenFilters() {
+        Map<String, Class<?>> filters = new TreeMap<>(super.getPreBuiltTokenFilters());
+        filters.put("asciifolding", null);


I've sort of preserved this null behavior from the original test case. It is a bit funky and magical and I'm tempted to remove it but I've kept it for now to get opinions on it.

nik9000 · 2017-04-20T20:10:43Z

.../analysis-common/src/test/resources/rest-api-spec/test/analysis-common/40_token_filters.yaml

    - match:  { tokens.0.token: Musee d'Orsay }

+    - do:
+        indices.analyze:


I added this when I was debugging around for examples. It isn't strictly needed for this PR but I figure an extra example doesn't hurt.

nik9000 · 2017-04-20T20:11:28Z

test/framework/src/main/java/org/elasticsearch/indices/analysis/AnalysisFactoryTestCase.java

                continue;
            }
+            if (luceneFactory == null) {
+                luceneFactory = TokenFilterFactory.lookupClass(toCamelCase(name));


This is the other side of that magic null behavior that I wasn't super comfortable with. It is convenient though.
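To make the null behavior in the two comments above concrete, here is a minimal sketch of the fallback as I read it: a null entry in the expected map means "look the Lucene factory up by name instead". Everything here is a stand-in. The `resolve` helper, the string-valued maps, and this particular `toCamelCase` are assumptions for illustration, not the code in AnalysisFactoryTestCase.

```java
import java.util.HashMap;
import java.util.Map;

public class LookupSketch {
    /** Assumed behavior: turn snake_case into camelCase, leave other names untouched. */
    static String toCamelCase(String name) {
        StringBuilder sb = new StringBuilder();
        boolean upperNext = false;
        for (char c : name.toCharArray()) {
            if (c == '_') {
                upperNext = true;
            } else if (upperNext) {
                sb.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Resolves the factory expected for a filter name. A null value in
     * {@code expected} triggers the by-name lookup in {@code luceneByName}
     * (a hypothetical stand-in for Lucene's factory lookup).
     */
    static String resolve(String name, Map<String, String> expected, Map<String, String> luceneByName) {
        String factory = expected.get(name);
        if (factory == null) {
            factory = luceneByName.get(toCamelCase(name));
        }
        if (factory == null) {
            throw new IllegalStateException("no factory found for [" + name + "]");
        }
        return factory;
    }

    public static void main(String[] args) {
        Map<String, String> expected = new HashMap<>();
        expected.put("asciifolding", null); // null: defer to the by-name lookup
        Map<String, String> lucene = new HashMap<>();
        lucene.put("asciifolding", "ASCIIFoldingFilterFactory");
        System.out.println(resolve("asciifolding", expected, lucene));
    }
}
```

The convenience is that the test does not have to spell out the class for every filter; the cost is that a null in the map carries meaning that is easy to miss.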

nik9000 · 2017-04-20T20:13:27Z

As it stands this PR is a net positive in lines of code but I think that is because of the shims that I've left in to keep the PR a manageable size to review. I think it'd be somewhere close to net 0 line difference if it weren't for that.

nik9000 · 2017-04-25T21:58:23Z

I talked to @rjernst and he suggested replacing the PreBuiltTokenFilterSpec with having the plugin just return the thing that we'll ultimately build with it. We also brainstormed about renaming "pre-built". We decided that "preconfigured" would be a better name because the components are configured on startup and it more closely reflects the user experience.
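A rough sketch of the shape this suggestion leads to, assuming a plugin hands back finished, named filters and the registry just collects them. The `Filter` and `Plugin` types, the `register` helper, and the filter names are hypothetical stand-ins that work on plain strings rather than Lucene token streams; they are not the classes in this PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

public class PreConfiguredSketch {
    /** Hypothetical stand-in for a pre-configured token filter: a name plus a ready-to-use transformation. */
    static final class Filter {
        final String name;
        final UnaryOperator<String> create;

        Filter(String name, UnaryOperator<String> create) {
            this.name = name;
            this.create = create;
        }
    }

    /** Hypothetical stand-in for the plugin extension point: the plugin returns the finished filters. */
    interface Plugin {
        List<Filter> getPreConfiguredTokenFilters();
    }

    /** Collects the filters from every plugin by name, rejecting duplicate names. */
    static Map<String, Filter> register(List<Plugin> plugins) {
        Map<String, Filter> registry = new TreeMap<>();
        for (Plugin plugin : plugins) {
            for (Filter filter : plugin.getPreConfiguredTokenFilters()) {
                if (registry.putIfAbsent(filter.name, filter) != null) {
                    throw new IllegalArgumentException("token filter [" + filter.name + "] already registered");
                }
            }
        }
        return registry;
    }

    public static void main(String[] args) {
        Plugin core = () -> Collections.singletonList(
                new Filter("lowercase", s -> s.toLowerCase(Locale.ROOT)));
        Plugin common = () -> Collections.singletonList(
                new Filter("uppercase", s -> s.toUpperCase(Locale.ROOT)));
        Map<String, Filter> registry = register(Arrays.asList(core, common));
        System.out.println(registry.keySet());
        System.out.println(registry.get("lowercase").create.apply("Musee"));
    }
}
```

The point of the design is that there is no intermediate "spec" object to translate: whatever the plugin returns is what ends up in the registry.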

nik9000 · 2017-04-27T19:58:24Z

@rjernst I think this might be ready for another look.

jpountz · 2017-05-03T12:46:47Z

core/src/main/java/org/elasticsearch/index/analysis/PreConfiguredTokenFilter.java

+    public PreConfiguredTokenFilter(String name, boolean useFilterForMultitermQueries,
+            PreBuiltCacheFactory.CachingStrategy cachingStrategy, Function<TokenStream, TokenStream> create) {
+        this(name, useFilterForMultitermQueries, cachingStrategy, (input, version) -> create.apply(input));
+        // TODO why oh why aren't these all CachingStrategy.ONE? They *can't* vary based on version because they don't get it, right?!


Sounds correct to me, let's hardcode CachingStrategy.ONE?

I'd been thinking of doing it in a followup. Any objections?

Not at all, this PR is large already.
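The reasoning in this thread can be sketched as follows: if the function that builds the filter never sees the version, a per-version cache only stores copies of the same thing, so a single cached instance is enough. The `Cache` class, the two-value `CachingStrategy` enum, and the string version keys are simplified assumptions, not the real PreBuiltCacheFactory.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class CachingSketch {
    /** Simplified stand-in: one shared instance, or one instance per version. */
    enum CachingStrategy { ONE, PER_VERSION }

    static final class Cache<T> {
        private final CachingStrategy strategy;
        private final Map<String, T> byVersion = new HashMap<>();
        private T single;
        int builds = 0; // how many times the build function actually ran

        Cache(CachingStrategy strategy) {
            this.strategy = strategy;
        }

        synchronized T get(String version, Function<String, T> build) {
            if (strategy == CachingStrategy.ONE) {
                if (single == null) {
                    single = build.apply(version);
                    builds++;
                }
                return single;
            }
            return byVersion.computeIfAbsent(version, v -> {
                builds++;
                return build.apply(v);
            });
        }
    }

    public static void main(String[] args) {
        // The build function ignores the version, like the Function<TokenStream, TokenStream> constructor.
        Function<String, String> versionBlind = version -> "filter";

        Cache<String> one = new Cache<>(CachingStrategy.ONE);
        one.get("5.4.0", versionBlind);
        one.get("6.0.0", versionBlind);

        Cache<String> perVersion = new Cache<>(CachingStrategy.PER_VERSION);
        perVersion.get("5.4.0", versionBlind);
        perVersion.get("6.0.0", versionBlind);

        System.out.println("ONE builds: " + one.builds);
        System.out.println("PER_VERSION builds: " + perVersion.builds);
    }
}
```

With a version-blind build function the per-version cache does extra work and holds extra instances for no difference in behavior, which is the argument for hardcoding ONE in that constructor.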

nik9000 · 2017-05-04T14:19:26Z

@rjernst can you have another look at this one?

rjernst

Looks good, thanks for the name change, I think it makes the purpose much clearer (please remember to change the commit message on merge). I left a few suggestions, nothing critical.

rjernst · 2017-05-09T03:47:43Z

core/src/main/java/org/elasticsearch/index/analysis/AnalysisRegistry.java

            this.charFilterFactories = Collections.unmodifiableMap(charFilterFactories);
-            this.tokenFilterFactories = Collections.unmodifiableMap(tokenFilterFactories);
            this.tokenizerFactories = Collections.unmodifiableMap(tokenizerFactories);
+            tokenFilterFactories = preConfiguredTokenFilters;


Maybe later (followup, as cleanup after these are all done) these members could be renamed to preConfigured*

rjernst · 2017-05-09T03:51:52Z

core/src/main/java/org/elasticsearch/indices/analysis/AnalysisModule.java

+         * lucene-analyzers-common so "stop" is defined in the analysis-common
+         * module. */
+
+        // Add token filers declared in PreBuiltTokenFilters until they have all been migrated


typo: filers -> filters

rjernst · 2017-05-09T03:55:15Z

core/src/test/java/org/elasticsearch/index/analysis/BuiltInAnalysisFactoryTests.java


-public class AnalysisFactoryTests extends AnalysisFactoryTestCase {
-    // tests are inherited and nothing needs to be defined here
+public class BuiltInAnalysisFactoryTests extends AnalysisFactoryTestCase {


Can this be PreConfiguredAnalyzerTests? Just a thought, could be later.

Allows plugins to register pre-configured tokenizers. Many of the decisions are the same as those in #24223, #24572, and #24223. This only migrates the lowercase tokenizer, but I figure that is a good start because it proves out the features.

nik9000 added 13 commits April 20, 2017 12:06

Begin

ecf6b46

More

1a110bd

Further

baaa06d

Fix more

9decf6a

More tests

0b54ba2

More

2d09531

More

3abc5b6

Convert NOCOMMIT to TODO

6a06eac

Convert another NOCOMMIT to TODO

1995bea

To keep the PR small I'm breaking it off here.

One more NOCOMMIT

e43fc1f

Yet another

fad8fb3

Cleanup

c47cafa

Try and keep caching strategy

b54a8ce

nik9000 requested review from abeyad and rjernst April 20, 2017 20:02

nik9000 added the v6.0.0-alpha1 and :Search Relevance/Analysis labels Apr 20, 2017

nik9000 commented Apr 20, 2017

nik9000 added the review label Apr 20, 2017

nik9000 mentioned this pull request Apr 20, 2017

Move analysis components to a module #23658

Open

75 tasks

clintongormley added the >non-issue label Apr 25, 2017

Merge branch 'master' into prebuilt_token_filter

9c2e236

nik9000 removed the review label Apr 27, 2017

nik9000 added 5 commits April 27, 2017 15:29

Start converting to PreConfiguredTokenFilter

3cd577d

Cleanup

605a898

Eclipse likes these but javac doesn't

025ffe6

More rename

45746ea

More renames

34fb2cf

clintongormley added v6.0.0 and removed v6.0.0-alpha1 labels May 3, 2017

jpountz reviewed May 3, 2017

rjernst approved these changes May 9, 2017

nik9000 added 4 commits May 9, 2017 12:27

Merge branch 'master' of github.com:elastic/elasticsearch

0400342

Merge branch 'master' into prebuilt_token_filter

5a238d2

Rename class to be more clear

324ac70

And add some more comments

nik9000 changed the title ~~Allow plugins to build "pre-built" token filters~~ Allow plugins to build "pre-configured" token filters May 9, 2017

nik9000 merged commit bb06d8e into elastic:master May 9, 2017

nik9000 mentioned this pull request May 17, 2017

Allow plugins to register pre-configured tokenizers #24751

Merged

clintongormley added v6.0.0-alpha2 and removed v6.0.0 labels Jun 6, 2017

nik9000 deleted the prebuilt_token_filter branch June 7, 2017 14:52



4 participants