Check existing data for duplicate field docs in the migration assistance APIs and Migration Assistant #36629

@geekpete

Describe the feature:

Elasticsearch 5.x allows duplicate fields to be indexed into documents:
#19614

This was fixed in 6.x by enforcing strict duplicate validation:
#22073

A user who upgrades a cluster from 5.x to 6.x with documents that contain duplicate fields currently gets no warning that this data may become unusable after the upgrade for any operation that requires JSON parsing, including ?pretty, indexing, and scripts.

The error message example is:

"caused_by": { "type": "json_parse_exception",
"reason" : "Duplicate field 'FILECONTENT'\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@5d3ea628; line: 99, column: 14]"...

This scenario can occur when a custom indexing client inadvertently creates documents with duplicate fields. Judging by the few reports I've seen of users hitting duplicate field issues, it may be an edge case, but when it does occur it's a bad situation.
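
As a rough illustration of what a pre-upgrade check could look for, here is a minimal Python sketch that reports duplicate keys in a raw JSON document body. It is not part of the migration assistance APIs; the function name `find_duplicate_fields` is hypothetical, and it assumes the raw `_source` text is available client-side.

```python
import json
from collections import Counter
from typing import List


def find_duplicate_fields(raw_json: str) -> List[str]:
    """Return the names of keys that occur more than once in any JSON object.

    Hypothetical helper: it only inspects one raw document body and does not
    talk to a cluster.
    """
    duplicates = []

    def hook(pairs):
        # `pairs` holds every (key, value) of one object, repeats included.
        counts = Counter(key for key, _ in pairs)
        duplicates.extend(key for key, n in counts.items() if n > 1)
        return dict(pairs)

    json.loads(raw_json, object_pairs_hook=hook)
    return duplicates
```

Python's default parser silently keeps the last value for a repeated key, which is why the hook is needed to see the repeats at all.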

Once upgraded to 6.x, there are limited options for repairing the problem documents, because reindex hits the duplicate field error too. This happens even with the escape hatch that disables validation (es.json.strict_duplicate_detection=false), as it doesn't disable validation for reindex operations.
For this reason, repairs are probably best done on the 5.x cluster before upgrading.

It seems that the fix is either to update in place using the existing _source (e.g. with the update script ctx._source = ctx._source;) or to reindex the documents, which will create new documents without the duplicate fields.
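
A sketch of how the in-place option might be expressed as an update-by-query request, written in Python only to build the request. The helper name `build_rewrite_request` and the index name are hypothetical, and the `inline` script key is an assumption to verify against the exact 5.x version in use. Whether this no-op assignment really rewrites the stored source without duplicates is as tentative here as in the paragraph above.

```python
import json


def build_rewrite_request(index):
    """Return (method, path, body) for an update-by-query that reassigns _source.

    `index` is a placeholder index name. Nothing is sent; the caller chooses
    the HTTP client and should try this on a test index first.
    """
    body = {
        "script": {
            # Assumed key name for the script text on 5.x; verify per version.
            "inline": "ctx._source = ctx._source;",
            "lang": "painless",
        }
    }
    return "POST", "/{}/_update_by_query".format(index), json.dumps(body)
```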

One other thing to consider: where the values of a duplicate field differ, a more custom script or solution might be needed to choose which value to keep, and it gets messy then.
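
To make the "which value to keep" choice concrete, here is a minimal client-side Python sketch with an explicit policy. The function name `dedupe_fields` and the `keep` parameter are hypothetical, and a real repair might need per-field rules rather than one global policy.

```python
import json


def dedupe_fields(raw_json, keep="last"):
    """Parse a raw JSON document, resolving repeated keys by policy.

    keep="first" retains the first value seen for a repeated key,
    keep="last" retains the final one. Hypothetical helper, not an
    Elasticsearch API.
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")

    def hook(pairs):
        result = {}
        for key, value in pairs:
            if key in result and keep == "first":
                continue  # a value is already stored; ignore later repeats
            result[key] = value
        return result

    return json.loads(raw_json, object_pairs_hook=hook)
```

For example, `dedupe_fields('{"a": 1, "a": 2}', keep="first")` returns `{"a": 1}`, while the default policy returns `{"a": 2}`.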

So users should at least be warned and, if possible, presented with either automatic or manual repair options.
