Description
I recently saw an issue where an analyzer chain was set up to stem the input and then apply a synonym filter afterwards.
To also keep the unstemmed tokens in the output (and apply synonyms to them as well, if possible), a keyword_repeat filter was used.
However, this already fails at index creation, because the synonyms in the filter are validated by running them through the analysis chain:
PUT /index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "optimised => optimized"
          ]
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "blogs_synonyms_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "keyword_repeat",
            "light_english_stemmer",
            "my_synonyms"
          ]
        }
      }
    }
  }
}
Gives:
"type": "illegal_argument_exception",
"reason": "failed to build synonyms",
"caused_by": {
"type": "parse_exception",
"reason": "Invalid synonym rule at line 1",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "term: optimised analyzed to a token (optimise) with position increment != 1 (got: 0)"
}
}
I also tried using a multiplexer like so, but that runs into a similar issue:
PUT /index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "optimised => optimized"
          ]
        },
        "my_multiplexer": {
          "type": "multiplexer",
          "filters": ["light_english_stemmer"]
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "blogs_synonyms_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_multiplexer",
            "my_synonyms"
          ]
        }
      }
    }
  }
}
I'm wondering whether I'm using this the wrong way, or whether there are other ways to achieve a similar effect.
I'm also trying to understand what the position checks in SynonymMap#analyze that cause this rejection are supposed to prevent,
and whether those checks could be skipped for tokens generated by keyword_repeat or multiplexer.
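
As a rough illustration of the mechanism, here is a toy Python model of why the rule is rejected. It is a sketch under assumptions, not the actual Lucene code: `Token`, `keyword_repeat`, `toy_stemmer` and `check_synonym_term` are hypothetical names, the stemmer is a stand-in that only strips a trailing "d", and the check merely mirrors the condition reported in the error message above (every token produced for a synonym term must have a position increment of 1).

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Token:
    text: str
    pos_inc: int = 1       # position increment relative to the previous token
    keyword: bool = False  # keyword-marked tokens are skipped by the stemmer


def keyword_repeat(tokens: Iterable[Token]) -> Iterator[Token]:
    """Emit each token twice: a keyword-marked copy, then a copy stacked
    at the same position (position increment 0)."""
    for t in tokens:
        yield Token(t.text, t.pos_inc, keyword=True)
        yield Token(t.text, 0, keyword=False)


def toy_stemmer(tokens: Iterable[Token]) -> Iterator[Token]:
    """Hypothetical stand-in for a real stemmer: turns a trailing 'ed' into 'e'
    on tokens that are not keyword-marked."""
    for t in tokens:
        if not t.keyword and t.text.endswith("ed"):
            yield Token(t.text[:-1], t.pos_inc, t.keyword)
        else:
            yield t


def check_synonym_term(term: str, with_repeat: bool) -> List[str]:
    """Run one synonym term through the toy chain and reject any token
    whose position increment is not 1 (assumed to mirror the reported check)."""
    stream: Iterable[Token] = [Token(term)]
    if with_repeat:
        stream = keyword_repeat(stream)
    stream = toy_stemmer(stream)

    out = []
    for t in stream:
        if t.pos_inc != 1:
            raise ValueError(
                f"term: {term} analyzed to a token ({t.text}) "
                f"with position increment != 1 (got: {t.pos_inc})"
            )
        out.append(t.text)
    return out
```

In this model, `check_synonym_term("optimised", with_repeat=False)` passes, while `with_repeat=True` raises on the stemmed copy because it is stacked on the same position as the keyword-marked original. That matches the shape of the error above, although the real check may have additional conditions.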