keyword_repeat and multiplexer don't play well with subsequent synonym filters #33609

@cbuescher

I recently saw an issue where an analyzer chain was set up to perform some stemming on the input and then apply a synonym filter afterwards.
In order to also keep the unstemmed tokens in the output (and apply synonyms to them as well, if possible), a keyword_repeat filter was used, but
this already leads to errors on index creation, because the synonyms in the filter are validated by running them through the analysis chain:

PUT /index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "optimised => optimized"
          ]
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "blogs_synonyms_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "keyword_repeat",
            "light_english_stemmer",
            "my_synonyms"
          ]
        }
      }
    }
  }
}

Index creation fails with:

    "type": "illegal_argument_exception",
    "reason": "failed to build synonyms",
    "caused_by": {
      "type": "parse_exception",
      "reason": "Invalid synonym rule at line 1",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "term: optimised analyzed to a token (optimise) with position increment != 1 (got: 0)"
      }
    }
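
To check where the zero position increment comes from, the first part of the chain can be run through the _analyze API on its own. This is only a sketch: it inlines the stemmer definition instead of referencing the index-level light_english_stemmer filter, and it assumes a version in which _analyze accepts inline filter definitions.

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "keyword_repeat",
    {
      "type": "stemmer",
      "language": "light_english"
    }
  ],
  "text": "optimised"
}
```

Since keyword_repeat emits every token twice at the same position (once marked as a keyword, once not), the response should list both the unstemmed and the stemmed form with the same "position" value. That stacked token is what the synonym parser reports as a position increment of 0.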

I also tried using a multiplexer like so, but that runs into similar issues:

PUT /index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "optimised => optimized"
          ]
        },
        "my_multiplexer": {
          "type": "multiplexer",
          "filters": ["light_english_stemmer"]
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "blogs_synonyms_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_multiplexer",
            "my_synonyms"
          ]
        }
      }
    }
  }
}

I'm wondering if I'm using this the wrong way, or if there are other ways to achieve a similar effect.
I'm also trying to understand what the position checks in SynonymMap#analyze that cause this rejection are supposed to prevent,
and whether those checks could be omitted for tokens generated by keyword_repeat or multiplexer.
