Remove _analyzer

### Background

The most important rule when specifying analyzers is that the analyzer used at index time needs to be essentially the same as the analyzer used at query time.  If completely different analyzers were used, you would either index terms that could never be found at query time, or query for terms that could never exist in the index.  The API for specifying analyzers on fields does allow index and query analyzers to be set separately.  This enables things like synonyms, where you may want to add synonyms only at index time (which yields cheaper queries later), or only within the query (which is more flexible, since synonyms can be changed dynamically).
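
The synonym trade-off can be illustrated with a toy model in which an analyzer is just a function from text to a set of terms. This is a minimal sketch, not Elasticsearch code: the synonym table, the analyzer functions, and the `matches` helper are all hypothetical.

```python
# Toy model: an "analyzer" is a function from text to a set of terms.
# SYNONYMS and every function below are hypothetical, for illustration only.
SYNONYMS = {"quick": {"fast"}, "fast": {"quick"}}


def base_analyzer(text):
    """Lowercase the text and split it on whitespace."""
    return set(text.lower().split())


def synonym_analyzer(text):
    """Base analysis plus synonym expansion."""
    terms = base_analyzer(text)
    for term in list(terms):
        terms |= SYNONYMS.get(term, set())
    return terms


def matches(indexed_terms, query_terms):
    """A document matches if it shares at least one term with the query."""
    return bool(indexed_terms & query_terms)


doc = "Quick brown fox"

# Synonyms at index time: more terms in the index, a plain query.
index_time = matches(synonym_analyzer(doc), base_analyzer("fast"))

# Synonyms at query time: a plain index, an expanded query.
query_time = matches(base_analyzer(doc), synonym_analyzer("fast"))

# No synonyms on either side: "fast" cannot find "quick".
neither = matches(base_analyzer(doc), base_analyzer("fast"))
```

Both `index_time` and `query_time` are true because the two sides still agree on the underlying terms; only `neither` fails to match.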
### `_analyzer` today

Today we have multiple ways to specify which analyzer will be used on text fields.  At indexing time, they are checked in the following order:
1. analyzer for the field
2. `_analyzer` (proposed to remove here)
3. type level default analyzer (will be removed in #8874)
4. index level default analyzer
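
The lookup order above can be modeled as "first configured value wins". The function below is a hypothetical sketch of that precedence, not the actual mapping code; its name and parameters are assumptions made for illustration.

```python
def resolve_index_analyzer(field_analyzer=None, doc_analyzer=None,
                           type_default=None, index_default=None):
    """Return the first analyzer name that is set, in precedence order.

    doc_analyzer stands in for the value read from `_analyzer`;
    None means "not configured at this level".
    """
    candidates = (field_analyzer, doc_analyzer, type_default, index_default)
    for name in candidates:
        if name is not None:
            return name
    raise ValueError("no analyzer configured at any level")


# A document-level `_analyzer` overrides the index default...
per_document = resolve_index_analyzer(doc_analyzer="german",
                                      index_default="standard")

# ...but an analyzer set on the field itself takes priority over it.
per_field = resolve_index_analyzer(field_analyzer="english",
                                   doc_analyzer="german",
                                   index_default="standard")
```

With levels 2 and 3 removed, only the field analyzer and the index default would remain as inputs.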

`_analyzer` is a special field in the document which specifies the name of an analyzer to use as the default for that document.  This means that the same field can be analyzed with a completely different analyzer from one document to the next.  The typical use case for this is working with documents in many languages, where each document contains a field specifying which language its main data (e.g. subject and body fields) is in.  Then at query time, either a single query is used with a “magic” analyzer over that field, or a combination of queries is used, one for every analyzer the data may have been indexed with.
### Problems with `_analyzer`

The typical use case for `_analyzer` has many problems:
- If using the single analyzer at query time approach, the “magic” analyzer is never good enough.  It cannot possibly cover all the terms that may have been produced by each language's analyzer, so some terms are never matchable.  For example, “die” in German would be a stop word, while the same word in English is simply a regular word.  Either the magic analyzer removes “die” (in which case English documents about dying cannot be found) or it keeps it, and German documents that contained it cannot match (since the indexing process removed that term).
- Scoring will be skewed. Text relevance models rely on term statistics to weigh the importance of a term for a given document versus the importance of that term for the entire index.  Because some words may analyze to the same term in different languages, the frequencies of terms can be skewed, which will distort where documents matching these words appear in query results.
- Mappings code is already complicated, and this feature makes it even harder to follow where analyzers are set in code.
- Having multiple ways to set the analyzer used at index or query time is also confusing for users, as they have to decide which way is “better”.
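
The “die” problem from the first point can be reproduced with two toy analyzers. This is a sketch under simplifying assumptions: the stop word set is a tiny illustrative subset, and the analyzers, `search` helper, and document ids are hypothetical.

```python
GERMAN_STOP_WORDS = {"die", "der", "das"}  # tiny illustrative subset


def english_analyzer(text):
    """Keeps every lowercased token, including "die"."""
    return text.lower().split()


def german_analyzer(text):
    """Drops German stop words, including "die"."""
    return [t for t in text.lower().split() if t not in GERMAN_STOP_WORDS]


def search(index, query_terms):
    """Return the ids of documents sharing at least one term with the query."""
    wanted = set(query_terms)
    return sorted(doc_id for doc_id, terms in index.items() if terms & wanted)


# Each document is indexed with the analyzer for its own language.
index = {
    "en-1": set(english_analyzer("heroes never die")),
    "de-1": set(german_analyzer("die Katze trinkt Milch")),
}

# A "magic" query analyzer that keeps "die" finds only the English document.
hits_keep = search(index, english_analyzer("die"))

# A "magic" query analyzer that drops "die" produces no terms, so nothing matches.
hits_drop = search(index, german_analyzer("die"))
```

Neither choice of query analyzer is consistent with both documents, because the term “die” was treated differently for each of them at index time.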
### Proposal

I propose removing `_analyzer` and the associated “analyzer” setting of the match query.  Removing these, along with the type level default in #8874, will simplify specifying analyzers considerably.  There would be no loss of end functionality, since better results can be achieved with multiple fields.
### Alternatives to `_analyzer`
- One alternative to dealing with multiple languages is to use n-grams. While this can itself be tricky to deal with, it is worth mentioning.
- An alternative that is more directly in line with the current uses of `_analyzer` is having one field per language.  This requires slight modifications to client code for indexing (copy data into the field for the appropriate language, instead of specifying the language in a field) and querying (query the appropriate language field, instead of selecting an analyzer for that language).  This will produce better results, since documents from other languages cannot accidentally appear in the results, and relevance results should be as expected for that language.
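
The client-side change for the one-field-per-language approach might look like the sketch below. The field naming scheme (`subject_en`, `body_de`, ...), the `language` key, the supported language list, and both helper functions are assumptions for illustration; each per-language field would be mapped with the analyzer for its language.

```python
SUPPORTED_LANGUAGES = ("en", "de")  # hypothetical set of languages


def to_indexable(doc):
    """Copy the text into per-language fields instead of sending a language field."""
    lang = doc["language"]
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError("unsupported language: " + lang)
    return {
        "subject_" + lang: doc["subject"],
        "body_" + lang: doc["body"],
    }


def build_query(text, lang):
    """Query the field for the given language rather than picking an analyzer."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError("unsupported language: " + lang)
    return {"query": {"match": {"body_" + lang: text}}}


source = {"language": "de", "subject": "Katzen", "body": "Die Katze trinkt Milch"}
indexable = to_indexable(source)      # only subject_de and body_de are populated
query = build_query("Katze", "de")    # targets body_de only
```

Because a German query only ever touches `body_de`, English documents cannot appear in its results, and term statistics for that field come from German text alone.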


Remove _analyzer #9279

Background

`_analyzer` today

Problems with `_analyzer`

Proposal

Alternatives to `_analyzer`

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Remove _analyzer #9279

Description

Background

_analyzer today

Problems with _analyzer

Proposal

Alternatives to _analyzer

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`_analyzer` today

Problems with `_analyzer`

Alternatives to `_analyzer`