Possible to index duplicate documents with same id and routing id. #31976

@kylelyk

Elasticsearch version: 6.2.4

Plugins installed: []

JVM version: 1.8.0_172

OS version: MacOS (Darwin Kernel Version 15.6.0)

Description of the problem including expected versus actual behavior:
Over the past few months, we've been seeing completely identical documents pop up that have the same id, type, and routing id. We use custom routing to get parent-child joins working correctly, and we make sure to delete the existing documents when re-indexing them to avoid two copies of the same document on the same shard. We use Bulk API calls to delete and index the documents. The indexTime field below is set by the service that indexes the document into ES and, as you can see, the two documents were indexed about 1 second apart. This problem only seems to happen on our production server, which has more traffic and 1 read replica, and it's only ever 2 documents that are duplicated on what I believe to be a single shard.

The problem can be fixed by deleting the existing documents with that id and re-indexing the document, which is odd since that is exactly what the indexing service does in the first place.
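For readers unfamiliar with the pattern, here is a minimal sketch of what a delete-then-index Bulk API request body could look like. It is not the indexing service's actual code, which is not shown in this report. The function name `build_reindex_bulk_body` is hypothetical, and the use of `routing` as the metadata key is an assumption about the 6.x bulk format.

```python
import json


def build_reindex_bulk_body(index, doc_type, doc_id, routing, source):
    """Build an NDJSON bulk body that deletes a document, then indexes it again.

    Hypothetical helper for illustration only. The "routing" metadata key
    is an assumption about the Elasticsearch 6.x bulk format.
    """
    meta = {"_index": index, "_type": doc_type, "_id": doc_id, "routing": routing}
    lines = [
        json.dumps({"delete": meta}),  # action line: remove the existing copy
        json.dumps({"index": meta}),   # action line: index the new copy
        json.dumps(source),            # source line belonging to the index action
    ]
    # The Bulk API expects newline-delimited JSON ending with a newline.
    return "\n".join(lines) + "\n"
```

Both actions carry the same id and routing value, so they target the same shard. The resulting string would be sent as the body of a `POST /_bulk` request.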

Queries:
GET /my-index/_search

{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "id",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}

Response:
{
  "took": 2588,
  "timed_out": false,
  "_shards": {
    "total": 4,
    "successful": 4,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 15430904,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "duplicateCount": {
      "doc_count_error_upper_bound": 4,
      "sum_other_doc_count": 15430801,
      "buckets": [
        {
          "key": "746004ff8168bbe5672605fad34704a5",
          "doc_count": 2,
          "duplicateDocuments": {
            "hits": {
              "total": 2,
              "max_score": 1,
              "hits": [
                {
                  "_index": "my-index",
                  "_type": "ce",
                  "_id": "746004ff8168bbe5672605fad34704a5",
                  "_score": 1,
                  "_routing": "746004ff8168bbe5672605fad34704a5",
                  "_source": {
                    "indexTime": 1531249623788
                  }
                },
                {
                  "_index": "my-index",
                  "_type": "ce",
                  "_id": "746004ff8168bbe5672605fad34704a5",
                  "_score": 1,
                  "_routing": "746004ff8168bbe5672605fad34704a5",
                  "_source": {
                    "indexTime": 1531249622605
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}
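
To make the timing gap in a response like this easier to read, the following sketch summarizes each duplicate bucket. It assumes the response has already been parsed into a Python dict and that the aggregation names match the query in this report. The function name `summarize_duplicates` is hypothetical.

```python
def summarize_duplicates(response, agg_name="duplicateCount",
                         hits_name="duplicateDocuments"):
    """Summarize each duplicate bucket of a parsed terms + top_hits response.

    Hypothetical helper for illustration; the default aggregation names
    are taken from the query in this report.
    """
    summaries = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        times = sorted(hit["_source"]["indexTime"] for hit in hits)
        summaries.append({
            "id": bucket["key"],
            "copies": bucket["doc_count"],
            # Distinct routing values seen among the returned hits.
            "routings": sorted({hit.get("_routing", "") for hit in hits}),
            # Gap between the earliest and latest indexTime, in milliseconds.
            "index_time_spread_ms": times[-1] - times[0],
        })
    return summaries
```

For the bucket above, the two indexTime values differ by 1183 ms, which matches the "about 1 second apart" observation. Note that `top_hits` returns only a limited number of hits per bucket, so the spread covers the returned hits rather than necessarily every copy.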

Labels: :Distributed Indexing/Engine (Anything around managing Lucene and the Translog in an open shard.)