@nik9000 nik9000 commented Apr 20, 2017

This changes the way we register "pre-built" token filters so that
plugins can declare them and starts to move all of the "pre-built"
token filters out of core. It doesn't finish the job because doing
so would make the change unreviewably large. So this PR includes
a shim that keeps the "old" way of registering "pre-built" token
filters around.

The Lowercase token filter is special because there is a "special"
interaction between it and the lowercase tokenizer. I'm not sure
exactly what to do about it so for now I'm leaving it alone with
the intent of figuring out what to do with it in a followup.

This is a part of #23658
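
To make the registration flow described above concrete, here is a minimal, self-contained sketch in plain Java. It is not the code in this PR: `FilterRegistry`, `LegacyFilters`, and the use of `UnaryOperator<String>` as a stand-in for a token filter are all assumptions made for illustration. It shows plugin-declared filters being registered by name, plus a shim that copies over everything still declared the "old" way.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

public class PreBuiltRegistrySketch {
    /** Hypothetical stand-in for the "old" enum-based registration. */
    enum LegacyFilters {
        REVERSE(s -> new StringBuilder(s).reverse().toString()),
        TRIM(String::trim);

        final UnaryOperator<String> filter;

        LegacyFilters(UnaryOperator<String> filter) {
            this.filter = filter;
        }
    }

    /** Hypothetical registry that rejects duplicate names. */
    static final class FilterRegistry {
        private final Map<String, UnaryOperator<String>> filters = new LinkedHashMap<>();

        void register(String name, UnaryOperator<String> filter) {
            if (filters.putIfAbsent(name, filter) != null) {
                throw new IllegalArgumentException("token filter [" + name + "] already registered");
            }
        }

        UnaryOperator<String> get(String name) {
            return filters.get(name);
        }
    }

    static FilterRegistry build(Map<String, UnaryOperator<String>> fromPlugins) {
        FilterRegistry registry = new FilterRegistry();
        // Filters declared by plugins are registered first.
        fromPlugins.forEach(registry::register);
        // Shim: copy over everything that is still declared the old way.
        for (LegacyFilters legacy : LegacyFilters.values()) {
            registry.register(legacy.name().toLowerCase(Locale.ROOT), legacy.filter);
        }
        return registry;
    }

    public static void main(String[] args) {
        Map<String, UnaryOperator<String>> plugin = new LinkedHashMap<>();
        plugin.put("lowercase", s -> s.toLowerCase(Locale.ROOT));
        FilterRegistry registry = build(plugin);
        System.out.println(registry.get("lowercase").apply("Musee"));
        System.out.println(registry.get("reverse").apply("abc"));
    }
}
```

Once every legacy entry has been migrated to a plugin, the loop over the enum can be deleted without touching the registry itself, which is the point of keeping the shim separate.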

@nik9000 nik9000 requested review from abeyad and rjernst April 20, 2017 20:02
@nik9000 nik9000 added the v6.0.0-alpha1 and :Search Relevance/Analysis (How text is split into tokens) labels Apr 20, 2017
},

// Extended Token Filters
SNOWBALL(CachingStrategy.ONE) {
Member Author

I stopped migrating here in an attempt to keep the PR reviewable.


private interface MultiTermAwareTokenFilterFactory extends TokenFilterFactory, MultiTermAwareComponent {}

public synchronized TokenFilterFactory getTokenFilterFactory(final Version version) {
Member Author

I've preserved this method entirely for PreBuiltAnalyer.LOWERCASE#getMultiTermComponent. I'll figure out how to move away from that at some point but I thought that too should wait for another PR.


private interface MultiTermAwareTokenFilterFactory extends TokenFilterFactory, MultiTermAwareComponent {}

private synchronized TokenFilterFactory getTokenFilterFactory(final Version version) {
Member Author

I've lifted this method pretty much verbatim from PreBuiltTokenFilters in an attempt to keep the changes small.

NamedRegistry<PreBuiltTokenFilterSpec> preBuiltTokenFilters = new NamedRegistry<>("pre built token_filter");

// Add filters available in lucene-core
preBuiltTokenFilters.register("lowercase", new PreBuiltTokenFilterSpec(true, CachingStrategy.LUCENE, (inputs, version) ->
Member Author

I debated moving lowercase out of core and just keeping standard but there isn't really a technical reason to do that so I decided not to for now.

* lucene-analyzers-common so "stop" is defined in the analysis-common module. */

// Add token filers declared in PreBuiltTokenFilters until they have all been migrated
for (PreBuiltTokenFilters preBuilt : PreBuiltTokenFilters.values()) {
Member Author

This is the shim that I'll drop in a followup.

@Override
protected Map<String, Class<?>> getPreBuiltTokenFilters() {
    Map<String, Class<?>> filters = new TreeMap<>(super.getPreBuiltTokenFilters());
    filters.put("asciifolding", null);
Member Author

I've sort of preserved this null behavior from the original test case. It is a bit funky and magical and I'm tempted to remove it but I've kept it for now to get opinions on it.

  - match: { tokens.0.token: Musee d'Orsay }

  - do:
      indices.analyze:
Member Author

I added this when I was debugging around for examples. It isn't strictly needed for this PR but I figure an extra example doesn't hurt.

        continue;
    }
    if (luceneFactory == null) {
        luceneFactory = TokenFilterFactory.lookupClass(toCamelCase(name));
Member Author

This is the other side of that magic null behavior that I wasn't super comfortable with. It is convenient though.
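
The null convention discussed here can be sketched in isolation. The following is a simplified illustration, not the test case's actual code: `lookupByName` is a hypothetical stand-in for a by-name factory lookup, factories are represented as plain strings, and `toCamelCase` is a guess at what a helper of that name does. A `null` value in the map means "derive the expected factory from the filter's name".

```java
import java.util.Map;
import java.util.TreeMap;

public class NullFallbackSketch {
    /** Converts snake_case to CamelCase, e.g. "word_delimiter" becomes "WordDelimiter". */
    static String toCamelCase(String s) {
        StringBuilder out = new StringBuilder();
        boolean upperNext = true;
        for (char c : s.toCharArray()) {
            if (c == '_') {
                upperNext = true;
            } else {
                out.append(upperNext ? Character.toUpperCase(c) : c);
                upperNext = false;
            }
        }
        return out.toString();
    }

    /** Hypothetical stand-in for a lookup of a factory by its name. */
    static String lookupByName(String camelCaseName) {
        return camelCaseName + "FilterFactory";
    }

    /** A null explicit factory means: fall back to a lookup derived from the name. */
    static String resolve(String name, String explicitFactory) {
        return explicitFactory != null ? explicitFactory : lookupByName(toCamelCase(name));
    }

    public static void main(String[] args) {
        Map<String, String> filters = new TreeMap<>();
        filters.put("asciifolding", null);                  // resolved from the name
        filters.put("lowercase", "LowerCaseFilterFactory"); // declared explicitly
        filters.forEach((name, factory) -> System.out.println(name + " -> " + resolve(name, factory)));
    }
}
```

The convenience is that most entries need no explicit mapping; the cost is that `null` carries a meaning a reader cannot see from the map alone, which is why it feels magical.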

nik9000 commented Apr 20, 2017

As it stands this PR is a net positive in lines of code but I think that is because of the shims that I've left in to keep the PR a manageable size to review. I think it'd be somewhere close to net 0 line difference if it weren't for that.

nik9000 commented Apr 25, 2017

I talked to @rjernst and he suggested dropping the PreBuiltTokenFilterSpec in favor of having the plugin just return the thing that we'll ultimately build with it. We also brainstormed about renaming "pre-built". We decided that "pre-configured" would be a better name because the components are configured on startup and it more closely reflects the user experience.

@nik9000 nik9000 removed the review label Apr 27, 2017

nik9000 commented Apr 27, 2017

@rjernst I think this might be ready for another look.

public PreConfiguredTokenFilter(String name, boolean useFilterForMultitermQueries,
        PreBuiltCacheFactory.CachingStrategy cachingStrategy, Function<TokenStream, TokenStream> create) {
    this(name, useFilterForMultitermQueries, cachingStrategy, (input, version) -> create.apply(input));
    // TODO why oh why aren't these all CachingStrategy.ONE? They *can't* vary based on version because they don't get it, right?!
Contributor

Sounds correct to me, let's hardcode CachingStrategy.ONE?

Member Author

I'd been thinking of doing it in a followup. Any objections?

Contributor

not at all, this PR is large already
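
The reasoning in this thread can be illustrated with a small sketch. This is not `PreBuiltCacheFactory`; `SingleCache` and `PerVersionCache` are hypothetical names, and versions are plain strings for simplicity. If the creation function never receives the version, caching per version can only ever produce equivalent instances, so a single cached instance is enough.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

public class CachingStrategySketch {
    /** One cached instance per version: needed only if creation depends on the version. */
    static final class PerVersionCache<T> {
        private final Map<String, T> byVersion = new HashMap<>();
        private final Function<String, T> create;

        PerVersionCache(Function<String, T> create) {
            this.create = create;
        }

        synchronized T get(String version) {
            return byVersion.computeIfAbsent(version, create);
        }
    }

    /** A single cached instance: enough when creation never sees the version. */
    static final class SingleCache<T> {
        private final Supplier<T> create;
        private T cached;

        SingleCache(Supplier<T> create) {
            this.create = create;
        }

        synchronized T get(String version) {
            if (cached == null) {
                cached = create.get(); // the version argument is deliberately ignored
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        int[] builds = {0};
        SingleCache<String> one = new SingleCache<>(() -> "filter#" + (++builds[0]));
        PerVersionCache<String> perVersion = new PerVersionCache<>(v -> "filter@" + v);

        // The single cache hands back the same instance regardless of version.
        System.out.println(one.get("5.4.0") == one.get("6.0.0"));
        // The per-version cache builds a distinct value for each version.
        System.out.println(perVersion.get("5.4.0").equals(perVersion.get("6.0.0")));
        // Number of times the version-independent component was built.
        System.out.println(builds[0]);
    }
}
```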

nik9000 commented May 4, 2017

@rjernst can you have another look at this one?

Member

@rjernst rjernst left a comment

Looks good, thanks for the name change, I think it makes the purpose much clearer (please remember to change the commit message on merge). I left a few suggestions, nothing critical.

this.charFilterFactories = Collections.unmodifiableMap(charFilterFactories);
this.tokenFilterFactories = Collections.unmodifiableMap(tokenFilterFactories);
this.tokenizerFactories = Collections.unmodifiableMap(tokenizerFactories);
tokenFilterFactories = preConfiguredTokenFilters;
Member

Maybe later (followup, as cleanup after these are all done) these members could be renamed to preConfigured*

* lucene-analyzers-common so "stop" is defined in the analysis-common
* module. */

// Add token filers declared in PreBuiltTokenFilters until they have all been migrated
Member

typo: filers -> filters


public class AnalysisFactoryTests extends AnalysisFactoryTestCase {
// tests are inherited and nothing needs to be defined here
public class BuiltInAnalysisFactoryTests extends AnalysisFactoryTestCase {
Member

Can this be PreConfiguredAnalyzerTests? Just a thought, could be later.

nik9000 added 4 commits May 9, 2017 12:27
This test waited 10 seconds for a refresh listener to appear in
the stats. It turns out that in our NFS testing infrastructure this can
take a lot longer than 10 seconds. The error reported here:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+nfs/257/consoleFull
has it taking something like 15 seconds. This bumps the timeout
to a solid minute.

Closes elastic#24417
And add some more comments
@nik9000 nik9000 changed the title from Allow plugins to build "pre-built" token filters to Allow plugins to build "pre-configured" token filters May 9, 2017
@nik9000 nik9000 merged commit bb06d8e into elastic:master May 9, 2017
nik9000 added a commit that referenced this pull request May 19, 2017
Allows plugins to register pre-configured tokenizers. Many
of the decisions are the same as those in #24223, #24572,
and #24223. This only migrates the lowercase tokenizer but
I figure that is a good start because it proves out the features.
@nik9000 nik9000 deleted the prebuilt_token_filter branch June 7, 2017 14:52