Labels
:Search Relevance/Analysis, >refactoring, Meta, Team:Search Relevance, help wanted, adoptme
Description
We'd like to move the analyzers from Elasticsearch core into a module. They would still ship with Elasticsearch, just not with the Elasticsearch jar. We like this for a few reasons:
- It reduces the size of the high level rest client and the transport client. They don't need to reference analyzers.
- It proves that analysis plugins are first class citizens by consuming the plugin API for setting up the analyzers.
- It forces us to develop features a little more generically, not relying on specific analyzers, which is a good thing if you are going to have a first class plugin API.
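To make the second and third reasons concrete, here is a minimal, self-contained sketch of the idea: core holds only a registry keyed by name, and a module contributes the actual components through an extension point. Every type and method name in it (`AnalysisPluginSketch`, `CommonAnalysisSketch`, `Registry`, `analyze`) is hypothetical and chosen for illustration; it does not reproduce the real Elasticsearch plugin API, and tokens are modeled as plain strings.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical sketch: none of these types are Elasticsearch classes.
class AnalysisModuleSketch {

    /** Stand-in for the plugin extension point that supplies token filters by name. */
    interface AnalysisPluginSketch {
        Map<String, UnaryOperator<List<String>>> getTokenFilters();
    }

    /** Plays the role of a module: the filters live here, not in "core". */
    static class CommonAnalysisSketch implements AnalysisPluginSketch {
        @Override
        public Map<String, UnaryOperator<List<String>>> getTokenFilters() {
            Map<String, UnaryOperator<List<String>>> filters = new LinkedHashMap<>();
            filters.put("lowercase", tokens -> tokens.stream()
                    .map(t -> t.toLowerCase(Locale.ROOT))
                    .collect(Collectors.toList()));
            filters.put("reverse", tokens -> tokens.stream()
                    .map(t -> new StringBuilder(t).reverse().toString())
                    .collect(Collectors.toList()));
            return filters;
        }
    }

    /** Plays the role of core: it only knows names, never concrete filter classes. */
    static class Registry {
        private final Map<String, UnaryOperator<List<String>>> filters = new LinkedHashMap<>();

        Registry(List<AnalysisPluginSketch> plugins) {
            for (AnalysisPluginSketch plugin : plugins) {
                plugin.getTokenFilters().forEach((name, filter) -> {
                    // Reject duplicates so two plugins cannot silently shadow each other.
                    if (filters.putIfAbsent(name, filter) != null) {
                        throw new IllegalArgumentException("token filter [" + name + "] already registered");
                    }
                });
            }
        }

        List<String> apply(String name, List<String> tokens) {
            UnaryOperator<List<String>> filter = filters.get(name);
            if (filter == null) {
                throw new IllegalArgumentException("unknown token filter [" + name + "]");
            }
            return filter.apply(tokens);
        }
    }

    /** Convenience entry point: build a registry from the one sketch module and run a filter. */
    static List<String> analyze(String filterName, List<String> tokens) {
        Registry registry = new Registry(List.of(new CommonAnalysisSketch()));
        return registry.apply(filterName, tokens);
    }

    public static void main(String[] args) {
        System.out.println(analyze("lowercase", List.of("Quick", "FOX"))); // prints [quick, fox]
    }
}
```

Because `Registry` never names a concrete filter, any test or feature written against it has to work for whatever a plugin registers, which is the "develop features a little more generically" pressure described above.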
At this point I propose we move analysis components a few at a time. Claim the components you'd like to move before doing the move using the list below. We're doing this directly in master and 5.x. There is no need for a long running branch for this.
Keep in mind when claiming components that moving the code is not time consuming but fixing tests that rely on the components might be.
Misc
- Allow plugins to build "pre-built" analysis components. This blocks a number of the analyzers below.
  - Token filters: Allow plugins to build "pre-configured" token filters #24223, Make PreConfiguredTokenFilter harder to misuse #24572
  - Analyzers: Make PreBuiltAnalyzerProviderFactory plugable via AnalysisPlugin and #31095
  - Tokenizers: Allow plugins to register pre-configured tokenizers #24751, Move pre-configured "keyword" tokenizer to the analysis-common module #24863
  - Char filters: Plugins can register pre-configured char filters #25000
- Remove core's dependency on `lucene-analyzers-common.jar`
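As a rough illustration of the first item in this list, the sketch below assumes that a "pre-built" or "pre-configured" component is one that can be referenced by name without any per-index settings, so a single instance can be built once and shared. That reading is an assumption, and the names `PreConfiguredSketch`, `PreConfiguredFilter`, `get`, and `lowercase` are hypothetical; this is not the signature of the real `PreConfiguredTokenFilter` class.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical sketch: these types only imitate the idea, not the Elasticsearch classes.
class PreConfiguredSketch {

    /** A named filter that takes no settings; it is created lazily and then reused. */
    static final class PreConfiguredFilter {
        final String name;
        private final Supplier<UnaryOperator<List<String>>> create;
        private UnaryOperator<List<String>> cached;
        int builds = 0; // counts how often the underlying filter was actually built

        PreConfiguredFilter(String name, Supplier<UnaryOperator<List<String>>> create) {
            this.name = name;
            this.create = create;
        }

        synchronized UnaryOperator<List<String>> get() {
            if (cached == null) {
                cached = create.get();
                builds++;
            }
            return cached;
        }
    }

    /** What a plugin might hand to core: a ready-to-use "lowercase" filter. */
    static PreConfiguredFilter lowercase() {
        return new PreConfiguredFilter("lowercase", () -> tokens -> tokens.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        PreConfiguredFilter filter = lowercase();
        System.out.println(filter.get().apply(List.of("Foo", "BAR")));
        filter.get(); // second lookup reuses the cached instance
        System.out.println("builds = " + filter.builds);
    }
}
```

If plugins could not supply components of this kind, analyzers that rely on them by name would have to stay in core, which is why this item blocks several of the moves listed below.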
Analyzers
- Standard Analyzer (This one will stay in core. It isn't part of `lucene-analyzers-common.jar`, and keeping it in core makes testing easier.)
- Simple Analyzer
- Whitespace Analyzer
- Stop Analyzer
- Keyword Analyzer
- Pattern Analyzer Make PreBuiltAnalyzerProviderFactory plugable via AnalysisPlugin and #31095
- Language Analyzers Move number of language analyzers to analysis-common module #31143 Move language analyzers from server to analysis-common module. #31300
- Fingerprint Analyzer Make PreBuiltAnalyzerProviderFactory plugable via AnalysisPlugin and #31095
- Standard html strip Analyzer Make PreBuiltAnalyzerProviderFactory plugable via AnalysisPlugin and #31095
Tokenizers
- Standard Tokenizer (I believe this one will also stay in core for the same reasons Standard Analyzer is staying.)
- Letter Tokenizer (Move tokenizers to analysis common module #30538)
- Lowercase Tokenizer (Move tokenizers to analysis common module #30538)
- Whitespace Tokenizer (Move tokenizers to analysis common module #30538)
- UAX URL Email Tokenizer (Move tokenizers to analysis common module #30538)
- Classic Tokenizer (Move tokenizers to analysis common module #30538)
- Thai Tokenizer (Move tokenizers to analysis common module #30538)
- N-Gram Tokenizer (Move tokenizers to analysis common module #30538)
- Edge N-Gram Tokenizer (Move tokenizers to analysis common module #30538)
- Keyword Tokenizer (Move keyword tokenizer to analysis-common module #30642)
- Pattern Tokenizer (Move tokenizers to analysis common module #30538)
- Path Tokenizer (Move tokenizers to analysis common module #30538)
Token Filters
- Standard Token Filter (This will stay in core, because `StandardFilter` is part of lucene-core)
- ASCII Folding Token Filter (Start building analysis-common module #23614)
- Flatten Graph Token Filter @martijnvg (Move more token filters to analysis-common module #25214)
- Length Token Filter @martijnvg (Move more token filters to analysis-common module #25214)
- Lowercase Token Filter @martijnvg (Move more token filters to analysis-common module #25214)
- Uppercase Token Filter @martijnvg (Move more token filters to analysis-common module #25214)
- NGram Token Filter @martijnvg (Move more token filters to analysis-common module #25214)
- Edge NGram Token Filter @martijnvg (Move more token filters to analysis-common module #25214)
- Porter Stem Token Filter @martijnvg (Move several token filters to common-analysis module #24948)
- Shingle Token Filter (trickier: `PhraseSuggestionBuilder` uses `ShingleTokenFilterFactory`'s getters)
- Stop Token Filter (can remain in core as it uses classes from lucene-core and lucene-suggest jars)
- Word Delimiter Token Filter (Start building analysis-common module #23614)
- Stemmer Token Filter (Move more token filters to analysis-common module #25384)
- Stemmer Override Token Filter (Move more token filters to analysis-common module #25384)
- Keyword Marker Token Filter @martijnvg (Move several token filters to common-analysis module #24948)
- Keyword Repeat Token Filter (Hasn't been exposed yet as a token filter)
- KStem Token Filter (Move more token filters to analysis-common module #25384)
- Snowball Token Filter @martijnvg (Move several token filters to common-analysis module #24948)
- Phonetic Token Filter (Is already in its own module, `analysis-phonetic`)
- Synonym Token Filter (trickier: `CustomAnalyzerProvider`, which is used by the analyze API, depends on it, and this token filter relies on `AnalysisRegistry`) Remove special-casing of Synonym filters in AnalysisRegistry #33868
- Synonym Graph Token Filter (trickier: `CustomAnalyzerProvider`, which is used by the analyze API, depends on it, and this token filter relies on `AnalysisRegistry`) Remove special-casing of Synonym filters in AnalysisRegistry #33868
- Compound Word Token Filter (Move more token filters to analysis-common module #25384)
- Reverse Token Filter (Move more token filters to analysis-common module #25384)
- Elision Token Filter (Move more token filters to analysis-common module #25384)
- Truncate Token Filter (Move more token filters to analysis-common module #25384)
- Unique Token Filter @martijnvg (Move more token filters to analysis-common module #25214)
- Pattern Capture Token Filter (Move more token filters to analysis-common module #25580)
- Pattern Replace Token Filter (Move more token filters to analysis-common module #25580)
- Trim Token Filter @martijnvg (Move several token filters to common-analysis module #24948)
- Limit Token Count Token Filter (Move more token filters to analysis-common module #25580)
- Hunspell Token Filter (trickier: because of its infra, `AnalysisPlugin#getHunspellDictionaries()`)
- Common Grams Token Filter (Move more token filters to analysis-common module #25580)
- Normalization Token Filter (Move more token filters to analysis-common module #25715)
- CJK Width Token Filter (Move more token filters to analysis-common module #25715)
- CJK Bigram Token Filter (Move more token filters to analysis-common module #25715)
- Delimited Payload Token Filter (Move more token filters to analysis-common module #25784)
- Keep Words Token Filter (Move more token filters to analysis-common module #25784)
- Keep Types Token Filter (Move more token filters to analysis-common module #25784)
- Classic Token Filter (Move more token filters to analysis-common module #25784)
- Apostrophe Token Filter (Move more token filters to analysis-common module #25784)
- Decimal Digit Token Filter (Move more token filters to analysis-common module #25784)
- Fingerprint Token Filter (Move more token filters to analysis-common module #25784)
- Minhash Token Filter (Move more token filters to analysis-common module #25784)
- Scandinavian folding token filter (Move more token filters to analysis-common module #25784)
- Language stem token filters (arabic, brazilian, czech, dutch, french, german, russian) (Move more token filters to analysis-common module #26042)
Character Filters
- HTML Strip Char Filter Move char filters into analysis-common #24261
- Mapping Char Filter Move char filters into analysis-common #24261
- Pattern Replace Char Filter Move char filters into analysis-common #24261