Daitch-Mokotoff soundex gives incorrect results when it should return multiple encodings #28211

@bkazez

Elasticsearch version: Version: 6.1.1, Build: bd92e7f/2017-12-17T20:23:25.338Z, JVM: 1.8.0_144

Plugins installed: [analysis-icu, analysis-phonetic]

JVM version: java version "1.8.0_144"

OS version: Darwin Kernel Version 17.3.0

Description of the problem including expected versus actual behavior:

The Daitch-Mokotoff analyzer returns only one token when the input has several valid encodings and it should return one token per encoding.

Steps to reproduce:

...
        "analyzer_daitch_mokotoff": {
          "type": "custom",
          "tokenizer": "lowercase",
          "filter": [
            "daitch_mokotoff"
          ]
        }

curl -XGET 'http://localhost:9200/indexname/_analyze?pretty' -H 'Content-Type: application/json' -d'{
  "analyzer": "analyzer_daitch_mokotoff",
  "text": "CHAUPTMAN"
}'

This should return both 573660 (ch sounding like kh) and 473660 (ch sounding like tch), but it returns only 473660.

{
  "tokens" : [
    {
      "token" : "473660",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    }
  ]
}

See Daitch-Mokotoff soundex spec here: http://www.avotaynu.com/soundex.htm
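
To illustrate why two encodings are expected, here is a minimal Python sketch of the branching step. It is not the plugin's implementation: the rule table is a hand-picked subset covering only the letters in CHAUPTMAN, the vowel set is simplified, and the names `RULES`, `VOWELS` and `dm_codes` are made up for this example.

```python
# Toy subset of Daitch-Mokotoff rules (assumption: only the letters of CHAUPTMAN).
# Each pattern maps to (code at start of word, code before a vowel, code elsewhere).
# "|" separates alternative codes; an empty string means "not coded".
RULES = {
    "CH": ("5|4", "5|4", "5|4"),
    "AU": ("0", "7", ""),
    "A": ("0", "", ""),
    "P": ("7", "7", "7"),
    "T": ("3", "3", "3"),
    "M": ("6", "6", "6"),
    "N": ("6", "6", "6"),
}
VOWELS = set("AEIOU")  # simplified vowel set for this sketch


def dm_codes(word):
    """Return every 6-digit code the toy rule table produces for `word`."""
    word = word.upper()
    branches = [("", None)]  # (code so far, last replacement seen)
    i = 0
    while i < len(word):
        # Prefer the longest pattern that matches at position i.
        for length in (2, 1):
            pattern = word[i:i + length]
            if len(pattern) == length and pattern in RULES:
                break
        else:
            raise ValueError("no toy rule for %r" % word[i])

        next_char = word[i + length:i + length + 1]
        if i == 0:
            context = 0  # start of word
        elif next_char in VOWELS:
            context = 1  # before a vowel
        else:
            context = 2  # any other position

        # Every alternative forks each existing branch.
        alternatives = RULES[pattern][context].split("|")
        forked = []
        for code, last in branches:
            for alt in alternatives:
                # Identical codes in direct succession are coded once.
                forked.append((code if alt == last else code + alt, alt))
        branches = forked
        i += length

    # Pad with zeros (or truncate) to six digits and drop duplicates.
    result = []
    for code, _ in branches:
        padded = (code + "000000")[:6]
        if padded not in result:
            result.append(padded)
    return result


print(dm_codes("CHAUPTMAN"))
```

Under these assumptions the leading CH forks the encoding into two branches, so the word yields two codes rather than one, which is the behaviour the analyzer output above is missing.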

Until this is fixed, the D-M soundex feature in the phonetic plugin is not usable.
