A shingle filter before synonym filter causes index creation/open failure when multi-word synonyms are used #36129

@adoerrES

Elasticsearch version (bin/elasticsearch --version): 6.5.1 and earlier 6.x versions (tested in 6.4.2 as well)

Version: 6.5.1, Build: default/tar/8c58350/2018-11-16T02:22:42.182257Z, JVM: 1.8.0_181

Plugins installed: [] None

JVM version (java -version):

java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

OS version (uname -a if on a Unix-like system):

Reproduced in Elastic SaaS and on Mac:

16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 21 20:07:39 PDT 2018; root:xnu-3789.73.14~1/RELEASE_X86_64 x86_64

Description of the problem including expected versus actual behavior:

Index creation and opening fail when an analyzer's filter chain places a shingle filter before a synonym filter AND the synonym list contains multi-word synonyms with whitespace (like "eagle claw, eagleclaw").

For example, creating or opening an index with the following settings:

PUT testindex
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": [
              "panamanian, panama",
              "eagle claw, eagleclaw"
            ],
            "tokenizer": "keyword"
          },
          "my_shingle": {
            "max_shingle_size": "6",
            "min_shingle_size": "2",
            "output_unigrams": "true",
            "type": "shingle"
          }
        },
        "analyzer": {
          "with_synonyms": {
            "filter": [
              "lowercase",
              "my_shingle",
              "synonym"
            ],
            "type": "custom",
            "tokenizer": "standard"
          }
        }
      }
    }
  }
}

...yields this error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "failed to build synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "failed to build synonyms",
    "caused_by": {
      "type": "parse_exception",
      "reason": "Invalid synonym rule at line 2",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "term: eagle claw analyzed to a token (eagle claw) with position increment != 1 (got: 0)"
      }
    }
  },
  "status": 400
}
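
One plausible reading of this error, assuming the synonym rules are analyzed through the tokenizer and filters that precede the synonym filter in the chain: with output_unigrams enabled, the shingle filter emits the shingle "eagle claw" at the same position as the unigram "eagle", that is, with a position increment of 0. The toy model below is only a sketch of that behavior in Python, not Lucene's actual code; the function names are hypothetical and `str.split` is a crude stand-in for the standard tokenizer plus lowercase filter.

```python
def shingle_with_unigrams(tokens, min_size=2, max_size=6):
    """Toy model of a shingle filter with output_unigrams=true.

    Returns (term, position_increment) pairs. Each unigram advances the
    position by 1; every shingle starting at that unigram is stacked on
    the same position (increment 0).
    """
    out = []
    for i, tok in enumerate(tokens):
        out.append((tok, 1))
        for size in range(min_size, max_size + 1):
            if i + size > len(tokens):
                break
            out.append((" ".join(tokens[i:i + size]), 0))
    return out


def check_synonym_term(term):
    """Mimic the check named in the error: every token needs increment 1."""
    tokens = term.lower().split()  # assumption: stand-in for standard + lowercase
    for text, pos_inc in shingle_with_unigrams(tokens):
        if pos_inc != 1:
            raise ValueError(
                f"term: {term} analyzed to a token ({text}) "
                f"with position increment != 1 (got: {pos_inc})"
            )
    return True
```

Under this model a single-word term such as "panama" passes, while "eagle claw" fails on the stacked shingle, which matches the message reported for line 2 of the synonym list.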

Workarounds:

Either remove all multi-word synonym entries (such as "eagle claw, eagleclaw") from the index settings, which is undesirable, or change the order of the filter chain so that the synonym filter comes before the shingle filter, like so:

(excerpt)

        "analyzer": {
          "with_synonyms": {
            "filter": [
              "lowercase",
              "synonym",
              "my_shingle"
            ],
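
For settings that are generated or maintained programmatically, the reordering workaround could be applied with a small helper. The following is a minimal sketch in Python, assuming the settings body is held as a plain dict shaped like the one in this report; `move_synonyms_before_shingles` is a hypothetical name, and the helper only handles filters whose type is exactly "synonym" or "shingle".

```python
import copy


def move_synonyms_before_shingles(settings, analyzer_name):
    """Return a copy of `settings` in which, for the named analyzer, any
    synonym filter that follows a shingle filter is moved in front of it."""
    fixed = copy.deepcopy(settings)
    analysis = fixed["settings"]["index"]["analysis"]
    filter_defs = analysis.get("filter", {})
    chain = analysis["analyzer"][analyzer_name]["filter"]

    def kind(name):
        # Built-in filters such as "lowercase" have no custom definition,
        # so fall back to the filter name itself.
        return filter_defs.get(name, {}).get("type", name)

    shingle_positions = [i for i, n in enumerate(chain) if kind(n) == "shingle"]
    if not shingle_positions:
        return fixed  # nothing to reorder
    first = shingle_positions[0]
    late_synonyms = [n for n in chain[first:] if kind(n) == "synonym"]
    rest = [n for n in chain[first:] if kind(n) != "synonym"]
    analysis["analyzer"][analyzer_name]["filter"] = chain[:first] + late_synonyms + rest
    return fixed


# Trimmed version of the settings from this report.
settings = {"settings": {"index": {"analysis": {
    "filter": {
        "synonym": {"type": "synonym",
                    "synonyms": ["panamanian, panama", "eagle claw, eagleclaw"]},
        "my_shingle": {"type": "shingle", "min_shingle_size": "2",
                       "max_shingle_size": "6", "output_unigrams": "true"},
    },
    "analyzer": {"with_synonyms": {
        "type": "custom", "tokenizer": "standard",
        "filter": ["lowercase", "my_shingle", "synonym"]}},
}}}}

fixed = move_synonyms_before_shingles(settings, "with_synonyms")
print(fixed["settings"]["index"]["analysis"]["analyzer"]["with_synonyms"]["filter"])
# prints ['lowercase', 'synonym', 'my_shingle']
```

The helper works on a deep copy, so the original dict is left untouched and the two chains can be compared before the corrected body is sent.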
