Remove _analyzer #9279

@rjernst

Background

The most important thing about specifying analyzers is that the analyzer used at index time needs to be basically the same as the analyzer used at query time. If completely different analyzers were used, you would either produce terms that could never be found at query time, or query for terms that could never exist in the index. The API for specifying analyzers on fields does allow setting index and query analyzers separately. This is to allow things like synonyms, where you may want to add synonyms only at index time (yields cheaper queries later), or only within the query (more flexible since synonyms can be changed dynamically).

_analyzer today

Today we have multiple ways to specify which analyzer will be used on text fields. At indexing time, the order to check is as follows:

  1. analyzer for the field
  2. _analyzer (proposed to remove here)
  3. type level default analyzer (will be removed in #8874, "Remove type-level analyzer, index_analyzer, search_analyzer")
  4. index level default analyzer

_analyzer is a special field in the document which specifies the name of an analyzer to use as the default for that document. This means that the same field for one document can use a completely different analyzer than another document. The typical use case for this is working with documents in many languages, where each document contains a field specifying which language its main data is in (e.g. subject and body fields). Then at query time, either a single query is used with a “magic” analyzer over that field, or a conjunction of queries that use every analyzer the data may have been indexed with.
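The index-time lookup order listed above can be pictured as a first-match chain. The following is only a minimal sketch of that ordering, not the actual mapping code; the function name `resolve_index_analyzer` and its parameters are hypothetical and exist purely for illustration.

```python
from typing import Optional


def resolve_index_analyzer(
    field_analyzer: Optional[str],
    doc_analyzer: Optional[str],   # value of the document's _analyzer field, if any
    type_default: Optional[str],
    index_default: Optional[str],
) -> str:
    """Return the first analyzer name that is set, checked in the listed order."""
    for candidate in (field_analyzer, doc_analyzer, type_default, index_default):
        if candidate is not None:
            return candidate
    raise ValueError("no analyzer configured at any level")


# Two documents in the same index, same field, different _analyzer values:
# the field ends up analyzed differently per document.
first = resolve_index_analyzer(None, "german", None, "standard")
second = resolve_index_analyzer(None, "english", None, "standard")
```

In this sketch an analyzer set directly on the field always wins, so `_analyzer` only takes effect for fields that have no analyzer of their own.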

Problems with _analyzer

The typical use case for _analyzer has many problems:

  • If using the single analyzer at query time approach, the “magic” analyzer is never good enough. It cannot possibly cover all the terms that may have been produced by each language's analyzer, so some terms are never matchable. For example, “die” in German would be a stop word, while the same word in English is simply a regular word. Either the magic analyzer removes “die” (in which case English documents about dying cannot be found) or it keeps it, and German documents that contained it cannot match (since the indexing process removed that term).
  • Scoring will be skewed. Text relevance models rely on term statistics to weigh the importance of a term for a given document versus the importance of that term for the entire index. Because some words may analyze to the same term in different languages, the frequencies of terms can be skewed, which will distort where documents matching these words appear in query results.
  • Mappings code is already complicated, and this feature further complicates following the logic of where Analyzers are set in code.
  • Having multiple ways to set the analyzer used at index or query time is also confusing for users, as they have to decide which way is “better”.
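The “die” example from the first point can be reproduced with a toy model. This is a sketch under heavy assumptions: the analyzers below only lowercase, split on whitespace, and drop stop words, the stop word lists are tiny stand-ins, and the document ids are made up. Real language analyzers do much more, but the mismatch shows up the same way.

```python
def make_analyzer(stop_words):
    """Build a toy analyzer: lowercase, split on whitespace, drop stop words."""
    def analyze(text):
        return [term for term in text.lower().split() if term not in stop_words]
    return analyze


english = make_analyzer({"the"})               # "die" is a regular word
german = make_analyzer({"die", "der", "das"})  # "die" is a stop word

# Each document is indexed with its own language's analyzer, as with _analyzer.
docs = {
    "en-1": ("heroes die", english),
    "de-1": ("die katze", german),
}
index = {doc_id: set(analyze(text)) for doc_id, (text, analyze) in docs.items()}


def search(query, query_analyzer):
    """Return ids of documents sharing at least one term with the analyzed query."""
    terms = query_analyzer(query)
    return sorted(d for d, indexed in index.items() if any(t in indexed for t in terms))


magic_keeps_die = make_analyzer(set())
magic_drops_die = make_analyzer({"die"})

hits_when_kept = search("die", magic_keeps_die)      # only the English document
hits_when_dropped = search("die", magic_drops_die)   # nothing at all
```

Whichever choice the single query-time analyzer makes, one group of documents is left unreachable for that word.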

Proposal

I propose to remove _analyzer, and the associated “analyzer” setting of the match query. Removing these, along with the type level default in #8874, will simplify specifying analyzers considerably. There would be no loss of end functionality, since better results can be achieved with multiple fields.

Alternatives to _analyzer

  • One alternative to dealing with multiple languages is to use n-grams. While this can itself be tricky to deal with, it is worth mentioning.
  • An alternative that is more directly in line with the current uses of _analyzer is having one field per language. This requires slight modifications to client code for indexing (copy data into the field for the appropriate language, instead of specifying the language in a field) and querying (query the appropriate language field, instead of selecting an analyzer for that language). This will produce better results since documents from other languages cannot accidentally appear in the results, and relevance results should be as expected for that language.
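The client-side change for the one-field-per-language approach might look roughly like the sketch below. The field names (`body_en`, `body_de`) and both helper functions are hypothetical, and the sketch assumes each of those fields has been mapped with the analyzer for its language. Only the shape of the match query body follows the query DSL.

```python
# Hypothetical mapping from language code to the field that holds that language's text.
LANGUAGE_FIELDS = {"en": "body_en", "de": "body_de"}


def build_document(text, language):
    """Copy the text into the field for its language instead of setting _analyzer."""
    return {LANGUAGE_FIELDS[language]: text}


def build_query(text, language):
    """Query the field for the language instead of picking an analyzer on the query."""
    return {"query": {"match": {LANGUAGE_FIELDS[language]: text}}}


doc = build_document("die katze", "de")
query = build_query("katze", "de")
```

Because a German query only touches the German field, English documents cannot appear in its results, and term statistics stay per language.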
