-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Allow plugins to build "pre-configured" token filters #24223
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
To keep the PR small I'm breaking it off here.
| }, | ||
|
|
||
| // Extended Token Filters | ||
| SNOWBALL(CachingStrategy.ONE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I stopped migrating here in an attempt to keep the PR reviewable.
|
|
||
| private interface MultiTermAwareTokenFilterFactory extends TokenFilterFactory, MultiTermAwareComponent {} | ||
|
|
||
| public synchronized TokenFilterFactory getTokenFilterFactory(final Version version) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've preserved this method entirely for PreBuiltAnalyer.LOWERCASE#getMultiTermComponent. I'll figure out how to move away from that at some point but I thought that too should wait for another PR.
|
|
||
| private interface MultiTermAwareTokenFilterFactory extends TokenFilterFactory, MultiTermAwareComponent {} | ||
|
|
||
| private synchronized TokenFilterFactory getTokenFilterFactory(final Version version) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've lifted this method pretty much verbatim from PreBuiltTokenFilters in an attempt to keep the changes small.
| NamedRegistry<PreBuiltTokenFilterSpec> preBuiltTokenFilters = new NamedRegistry<>("pre built token_filter"); | ||
|
|
||
| // Add filters available in lucene-core | ||
| preBuiltTokenFilters.register("lowercase", new PreBuiltTokenFilterSpec(true, CachingStrategy.LUCENE, (inputs, version) -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I debated moving lowercase out of core and just keeping standard but there isn't really a technical reason to do that so I decided not to for now.
| * lucene-analyzers-common so "stop" is defined in the analysis-common module. */ | ||
|
|
||
| // Add token filers declared in PreBuiltTokenFilters until they have all been migrated | ||
| for (PreBuiltTokenFilters preBuilt : PreBuiltTokenFilters.values()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the shim that I'll drop in a followup.
| @Override | ||
| protected Map<String, Class<?>> getPreBuiltTokenFilters() { | ||
| Map<String, Class<?>> filters = new TreeMap<>(super.getPreBuiltTokenFilters()); | ||
| filters.put("asciifolding", null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've sort of preserved this null behavior from the original test case. It is a bit funky and magical and I'm tempted to remove it but I've kept it for now to get opinions on it.
| - match: { tokens.0.token: Musee d'Orsay } | ||
|
|
||
| - do: | ||
| indices.analyze: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this when I was debugging around for examples. It isn't strictly needed for this PR but I figure an extra example doesn't hurt.
| continue; | ||
| } | ||
| if (luceneFactory == null) { | ||
| luceneFactory = TokenFilterFactory.lookupClass(toCamelCase(name)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the other side of that magic null behavior that I wasn't super comfortable with. It is convenient though.
|
As it stands this PR is a net positive in lines of code but I think that is because of the shims that I've left in to keep the PR a manageable size to review. I think it'd be somewhere close to net 0 line difference if it weren't for that. |
|
I talked to @rjernst and he suggested moving the replacing the |
|
@rjernst I think this might be ready for another look. |
| public PreConfiguredTokenFilter(String name, boolean useFilterForMultitermQueries, | ||
| PreBuiltCacheFactory.CachingStrategy cachingStrategy, Function<TokenStream, TokenStream> create) { | ||
| this(name, useFilterForMultitermQueries, cachingStrategy, (input, version) -> create.apply(input)); | ||
| // TODO why oh why aren't these all CachingStrategy.ONE? They *can't* vary based on version because they don't get it, right?! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds correct to me, let's hardcode CachingStrategy.ONE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd been thinking of doing it in a follow. Any objections?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not at all, this PR is large already
|
@rjernst can you have another look at this one? |
rjernst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for the name change, I think it makes the purpose much clearer (please remember to change the commit message on merge). I left a few suggestions, nothing critical.
| this.charFilterFactories = Collections.unmodifiableMap(charFilterFactories); | ||
| this.tokenFilterFactories = Collections.unmodifiableMap(tokenFilterFactories); | ||
| this.tokenizerFactories = Collections.unmodifiableMap(tokenizerFactories); | ||
| tokenFilterFactories = preConfiguredTokenFilters; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe later (followup, as cleanup after these are all done) these members could be renamed to preConfigured*
| * lucene-analyzers-common so "stop" is defined in the analysis-common | ||
| * module. */ | ||
|
|
||
| // Add token filers declared in PreBuiltTokenFilters until they have all been migrated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: filers -> filters
|
|
||
| public class AnalysisFactoryTests extends AnalysisFactoryTestCase { | ||
| // tests are inherited and nothing needs to be defined here | ||
| public class BuiltInAnalysisFactoryTests extends AnalysisFactoryTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be PreConfiguredAnalyzerTests? Just a thought, could be later.
This test waited 10 seconds for a refresh listener to appear in the stats. It turns out that in our NFS testing infrastructure this can take a lot longer than 10 seconds. The error reported here: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+nfs/257/consoleFull has it taking something like 15 seconds. This bumps the timeout to a solid minute. Closes elastic#24417
And add some more comments
This changes the way we register "pre-built" token filters so that
plugins can declare them and starts to move all of the "pre-built"
token filters out of core. It doesn't finish the job because doing
so would make the change unreviewably large. So this PR includes
a shim that keeps the "old" way of registering "pre-built" token
filters around.
The Lowercase token filter is special because there is a "special"
interaction between it and the lowercase tokenizer. I'm not sure
exactly what to do about it so for now I'm leaving it alone with
the intent of figuring out what to do with it in a followup.
This is a part of #23658