
Merging the terms from multiple sub-analyzers #1128

@ofavre


Multi-field is great, but searching with multiple analyzers against a single field is simpler and better.
If you have a multi-lingual index where each document records its source language, you can analyze the text fields with a specialized analyzer chosen by the detected language (maybe even using the _analyzer.path functionality).
But what happens when the language is misdetected somehow, either at index or at query time? Aggressive stemming can have devastating effects.

In such a scenario, having the original words indexed in parallel with the stemmed ones would help. Having them in the same field would even let phrase/slop queries work properly.
Currently, the only way to get multiple terms at the same position in ElasticSearch is the synonym token filter, which is of no use for stemming.

I've been working on a way to merge the terms that multiple analyzers output.
Say you want to use both a simple analyzer and any of the specialized language-specific analyzers, or any other combination.
My plugin can make it as simple as the following index setting:

index:
  analysis:
    analyzer:
      # An analyzer using both the "simple" analyzer and the sophisticated "english" analyzer, combining the resulting terms
      combo_en:
        type: combo
        sub_analyzers: [simple, english]
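For illustration, a mapping could then reference the combined analyzer like any other analyzer. The type and field names below are hypothetical, shown in the same YAML style as the settings above:

```yaml
# Hypothetical mapping applying the combo analyzer to a string field
mappings:
  document:
    properties:
      body:
        type: string
        analyzer: combo_en
```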

Here is a simple example of what it does:

# What the "simple" analyzer outputs
curl -XGET 'localhost:9200/yakaz/_analyze?pretty=1&analyzer=simple' -d 'An example'
{
  "tokens" : [ {
    "token" : "an",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "example",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 2
  } ]
}
# What the "english" analyzer outputs
curl -XGET 'localhost:9200/yakaz/_analyze?pretty=1&analyzer=english' -d 'An example'
{
  "tokens" : [ {
    "token" : "exampl",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

# Now what our combined analyzer outputs
curl -XGET 'localhost:9200/yakaz/_analyze?pretty=1&analyzer=combo_en' -d 'An example'
{
  "tokens" : [ {
    "token" : "an",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "example",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "exampl",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

Terms are sorted by position, then by start/end offset, so that under reasonable assumptions consumers can treat the output like that of a classical analyzer.
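The merge itself amounts to an ordered union of the sub-analyzers' token streams. Here is a rough sketch of that ordering rule in Python (illustrative only; the plugin's actual implementation is Java working on Lucene TokenStreams):

```python
# Sketch of the combo analyzer's merge rule (illustrative, not the plugin's Java code).
# Each token is a tuple: (term, start_offset, end_offset, type, position).

def combo_merge(*token_streams):
    """Concatenate the sub-analyzers' tokens, then sort by position,
    then by start/end offset. Python's sort is stable, so tokens that
    tie on all keys keep their sub-analyzer order."""
    merged = [tok for stream in token_streams for tok in stream]
    merged.sort(key=lambda t: (t[4], t[1], t[2]))
    return merged

# Reproducing the "An example" walkthrough above:
simple_tokens = [("an", 0, 2, "word", 1), ("example", 3, 10, "word", 2)]
english_tokens = [("exampl", 3, 10, "<ALPHANUM>", 2)]

print(combo_merge(simple_tokens, english_tokens))
# [('an', 0, 2, 'word', 1), ('example', 3, 10, 'word', 2), ('exampl', 3, 10, '<ALPHANUM>', 2)]
```

Note that "example" and "exampl" share position 2 and the same offsets, which is exactly what makes phrase/slop queries keep working against the merged stream.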

Here is the good news: you can find my implementation at https://github.com/ofavre/elasticsearch/tree/combo-analyzer-v0.16.4 (based on the released ElasticSearch version 0.16.4).

EDIT: It is finally available as a plugin, thanks to jprante: https://github.com/yakaz/elasticsearch-analysis-combo.
