Move analysis components to a module #23658

We'd like to move the analyzers from Elasticsearch core into a module. They would still ship with Elasticsearch, just not in the Elasticsearch jar. We like this for a few reasons:
1. It reduces the size of the high-level REST client and the transport client, which don't need to reference analyzers.
2. It proves that analysis plugins are first-class citizens, because the module sets up its analyzers through the plugin API.
3. It forces us to develop features a little more generically, without relying on specific analyzers, which is a good thing if we are going to have a first-class plugin API.
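
As a rough illustration of reason 2, here is a minimal, self-contained sketch of core discovering token filters only through an extension point. Every type in it (`SketchAnalysisPlugin`, `CommonAnalysisSketch`, `SketchAnalysisRegistry`) is a hypothetical stand-in invented for this example, not Elasticsearch's real `AnalysisPlugin` or `AnalysisRegistry`, and a token filter is simplified to a function over a single string.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a plugin extension point (not the real AnalysisPlugin).
interface SketchAnalysisPlugin {
    // Maps a token filter name to a function that transforms one token.
    Map<String, Function<String, String>> getTokenFilters();
}

// Plays the role of the analysis module: it contributes filters only via the extension point.
class CommonAnalysisSketch implements SketchAnalysisPlugin {
    @Override
    public Map<String, Function<String, String>> getTokenFilters() {
        Map<String, Function<String, String>> filters = new LinkedHashMap<>();
        filters.put("lowercase", token -> token.toLowerCase(Locale.ROOT));
        filters.put("uppercase", token -> token.toUpperCase(Locale.ROOT));
        filters.put("reverse", token -> new StringBuilder(token).reverse().toString());
        return filters;
    }
}

// Plays the role of core: it references no concrete filter, only what plugins register.
class SketchAnalysisRegistry {
    private final Map<String, Function<String, String>> tokenFilters = new LinkedHashMap<>();

    SketchAnalysisRegistry(List<SketchAnalysisPlugin> plugins) {
        for (SketchAnalysisPlugin plugin : plugins) {
            for (Map.Entry<String, Function<String, String>> entry : plugin.getTokenFilters().entrySet()) {
                // Reject duplicate names so two plugins cannot silently shadow each other.
                if (tokenFilters.putIfAbsent(entry.getKey(), entry.getValue()) != null) {
                    throw new IllegalArgumentException("token filter [" + entry.getKey() + "] registered twice");
                }
            }
        }
    }

    String apply(String filterName, String token) {
        Function<String, String> filter = tokenFilters.get(filterName);
        if (filter == null) {
            throw new IllegalArgumentException("unknown token filter [" + filterName + "]");
        }
        return filter.apply(token);
    }
}
```

With this shape, `new SketchAnalysisRegistry(List.of(new CommonAnalysisSketch())).apply("lowercase", "Quick")` yields `quick`, while a registry built from an empty plugin list rejects every filter name. That is the property the move is after: core keeps working without compile-time knowledge of any specific analysis component.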

At this point I propose we move analysis components a few at a time. Before moving a component, claim it in the list below. We're doing this directly in master and 5.x, so there is no need for a long-running branch.

Keep in mind when claiming components that moving the code is not time-consuming, but fixing the tests that rely on the components might be.

Misc
-------
* [x] Allow plugins to build "pre-built" analysis components. This blocks a number of the analyzers below.
  * [x] Token filters #24223 #24572
  * [x] Analyzers #31095
  * [x] Tokenizers #24751 #24863
  * [x] Char filters #25000
* [ ] Remove core's dependency on `lucene-analyzers-common.jar`

Analyzers
--------
* [x] Standard Analyzer (This one will stay in core. It isn't part of `lucene-analyzers-common.jar`, and keeping it in core makes testing easier.)
* [x] Simple Analyzer
* [x] Whitespace Analyzer
* [x] Stop Analyzer
* [x] Keyword Analyzer
* [x] Pattern Analyzer #31095
* [x] Language Analyzers #31143 #31300
* [x] Fingerprint Analyzer #31095
* [x] Standard html strip Analyzer #31095

Tokenizers
--------
* [x] Standard Tokenizer (I believe this one will also stay in core for the same reasons Standard Analyzer is staying.)
* [x] Letter Tokenizer (#30538)
* [x] Lowercase Tokenizer (#30538)
* [x] Whitespace Tokenizer (#30538)
* [x] UAX URL Email Tokenizer (#30538)
* [x] Classic Tokenizer (#30538)
* [x] Thai Tokenizer (#30538)
* [x] N-Gram Tokenizer (#30538)
* [x] Edge N-Gram Tokenizer (#30538)
* [x] Keyword Tokenizer (#30642)
* [x] Pattern Tokenizer (#30538)
* [x] Path Tokenizer (#30538)

Token Filters
--------
* [x] Standard Token Filter (This will stay in core, because `StandardFilter` is part of lucene-core)
* [x] ASCII Folding Token Filter (#23614)
* [x] Flatten Graph Token Filter @martijnvg (#25214)
* [x] Length Token Filter @martijnvg (#25214)
* [x] Lowercase Token Filter @martijnvg (#25214)
* [x] Uppercase Token Filter @martijnvg (#25214)
* [x] NGram Token Filter @martijnvg (#25214)
* [x] Edge NGram Token Filter @martijnvg (#25214)
* [x] Porter Stem Token Filter @martijnvg (#24948)
* [ ] Shingle Token Filter (trickier because `PhraseSuggestionBuilder` uses `ShingleTokenFilterFactory`'s getters)
* [x] Stop Token Filter (can remain in core as it uses classes from lucene-core and lucene-suggest jars)
* [x] Word Delimiter Token Filter (#23614)
* [x] Stemmer Token Filter (#25384)
* [x] Stemmer Override Token Filter (#25384)
* [x] Keyword Marker Token Filter @martijnvg (#24948)
* [x] Keyword Repeat Token Filter (Hasn't been exposed yet as a token filter)
* [x] KStem Token Filter (#25384)
* [x] Snowball Token Filter @martijnvg (#24948)
* [x] Phonetic Token Filter (Is already in its own module, `analysis-phonetic`)
* [x] Synonym Token Filter (trickier because `CustomAnalyzerProvider`, which is used by the analyze API, depends on it, and this token filter relies on `AnalysisRegistry`) #33868
* [x] Synonym Graph Token Filter (trickier because `CustomAnalyzerProvider`, which is used by the analyze API, depends on it, and this token filter relies on `AnalysisRegistry`) #33868
* [x] Compound Word Token Filter (#25384)
* [x] Reverse Token Filter (#25384)
* [x] Elision Token Filter (#25384)
* [x] Truncate Token Filter (#25384)
* [x] Unique Token Filter @martijnvg (#25214)
* [x] Pattern Capture Token Filter (#25580)
* [x] Pattern Replace Token Filter (#25580)
* [x] Trim Token Filter @martijnvg (#24948)
* [x] Limit Token Count Token Filter (#25580)
* [ ] Hunspell Token Filter (trickier because of its infra, `AnalysisPlugin#getHunspellDictionaries()`)
* [x] Common Grams Token Filter (#25580)
* [x] Normalization Token Filter (#25715)
* [x] CJK Width Token Filter (#25715)
* [x] CJK Bigram Token Filter (#25715)
* [x] Delimited Payload Token Filter (#25784)
* [x] Keep Words Token Filter (#25784)
* [x] Keep Types Token Filter (#25784)
* [x] Classic Token Filter (#25784)
* [x] Apostrophe Token Filter (#25784)
* [x] Decimal Digit Token Filter (#25784)
* [x] Fingerprint Token Filter (#25784)
* [x] Minhash Token Filter (#25784)
* [x] Scandinavian Folding Token Filter (#25784)
* [x] Language Stem Token Filters (Arabic, Brazilian, Czech, Dutch, French, German, Russian) (#26042)

Character Filters
---------
* [x] HTML Strip Char Filter #24261
* [x] Mapping Char Filter #24261
* [x] Pattern Replace Char Filter #24261
