Closed
Labels: `:Search Relevance/Analysis` (How text is split into tokens), `Team:Search Relevance` (Meta label for the Search Relevance team in Elasticsearch)
Description
The docs state:

> With the default settings, the `edge_ngram` tokenizer treats the initial text as a single token and produces N-grams with minimum length `1` and maximum length `2`.
This is correct if you define a new tokenizer of type `edge_ngram`, like so:
```json
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "my_ngram"
        }
      },
      "tokenizer": {
        "my_ngram": {
          "type": "edge_ngram"
        }
      }
    }
  }
}

GET test/_analyze
{
  "analyzer": "default",
  "text": "test"
}
```
This returns two tokens, as documented:

```json
{
  "tokens" : [
    {
      "token" : "t",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "te",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    }
  ]
}
```
However, if you instead use the pre-configured `edge_ngram` tokenizer, you only get n-grams of size 1:
```json
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "edge_ngram"
        }
      }
    }
  }
}

GET test/_analyze
{
  "analyzer": "default",
  "text": "test"
}
```
This time only a single token comes back:

```json
{
  "tokens" : [
    {
      "token" : "t",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    }
  ]
}
```
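As a workaround until the built-in tokenizer is updated, a custom `edge_ngram` tokenizer with `min_gram` and `max_gram` set explicitly to the documented defaults produces the expected output (the tokenizer name `my_ngram` here is just illustrative):

```json
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": { "tokenizer": "my_ngram" }
      },
      "tokenizer": {
        "my_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 2
        }
      }
    }
  }
}
```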
We should change the pre-configured tokenizer to match the documentation.