Preconfigured edge_ngram tokenizer has incorrect defaults #43582

@romseygeek

Description

The docs state:

With the default settings, the `edge_ngram` tokenizer treats the initial text as a
single token and produces N-grams with minimum length `1` and maximum length
`2`:

This is correct if you define a new tokenizer of type `edge_ngram`, like so:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "my_ngram"
        }
      },
      "tokenizer" : {
        "my_ngram" : {
          "type" : "edge_ngram"
        }
      }
    }
  }
}
GET test/_analyze
{
  "analyzer" : "default",
  "text" : "test"
}
{
  "tokens" : [
    {
      "token" : "t",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "te",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    }
  ]
}
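The documented behavior above can be sketched in a few lines of Python. This is a minimal illustration of edge n-gram generation, not the actual Lucene implementation; the function name and signature are hypothetical:

```python
def edge_ngrams(text, min_gram=1, max_gram=2):
    """Generate edge n-grams anchored at the start of the text,
    with lengths from min_gram up to max_gram (capped at len(text))."""
    return [text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)]

# Documented defaults (min_gram=1, max_gram=2) for the input "test":
print(edge_ngrams("test"))               # ['t', 'te']

# The preconfigured tokenizer behaves as if max_gram were 1:
print(edge_ngrams("test", max_gram=1))   # ['t']
```

With the documented defaults the sketch reproduces the two tokens `t` and `te` shown in the `_analyze` response above, while a `max_gram` of 1 reproduces the single-token output of the preconfigured tokenizer shown below.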

However, if you instead use the preconfigured `edge_ngram` tokenizer, you only get n-grams of size 1:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "edge_ngram"
        }
      }
    }
  }
}
GET test/_analyze
{
  "analyzer" : "default",
  "text" : "test"
}
{
  "tokens" : [
    {
      "token" : "t",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    }
  ]
}

We should change the preconfigured tokenizer to match the documentation.
