Conversation

@droberts195

Switch from a 32 bit Java hash to a 128 bit Murmur hash for
creating document IDs from by/over/partition field values.
The 32 bit Java hash was not sufficiently unique, and could
produce identical numbers for relatively common combinations
of by/partition field values such as L018/128 and L017/228.

Fixes #50613
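
Roughly, the change to the ID hashing looks like this (a standalone sketch rather than the actual diff; it uses Guava's murmur3_128, whereas the production code may use a different MurmurHash3 implementation):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

import com.google.common.hash.Hashing;

class ResultIdHashSketch {

    // Old style: a 32 bit Java hash of the by/over/partition field values.
    // With only 2^32 possible values, collisions across a job's field
    // combinations are a realistic risk.
    static int javaHash(String... fieldValues) {
        return String.join("/", fieldValues).hashCode();
    }

    // New style: a 128 bit MurmurHash3 of the same values, rendered as an
    // unsigned decimal string for embedding in the result document ID.
    static String murmurHash(String... fieldValues) {
        byte[] hash = Hashing.murmur3_128()
                .hashString(String.join("/", fieldValues), StandardCharsets.UTF_8)
                .asBytes();
        return new BigInteger(1, hash).toString();
    }
}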

David Roberts added 2 commits January 3, 2020 14:50
Switch from a 32 bit Java hash to a 128 bit Murmur hash for creating document IDs from by/over/partition field values.
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@benwtrent
Member

In the case of a job re-running from a snapshot, we delete results directly, correct? And do not rely on the IDs being the same between the two runs?

@droberts195
Author

In the case of a job re-running from a snapshot, we delete results directly, correct?

We do if it's an explicit revert with delete_intervening_results set to true. If not, we try to start from just after the last observed input or result, which generally means we don't create duplicate results - see

long lastEndTime = Math.max(latestFinalBucketEndTimeMs, latestRecordTimeMs);

This latter scenario will always be the case during a failover from one node to another.

It is actually possible for a model plot document to exist that's more recent than the restart time. This is because model plot documents are written before bucket documents - see https://github.com/elastic/ml-cpp/blob/a9c468cf8b991b8d30f1a9ba2846ff90edaa8bcc/lib/api/CAnomalyJob.cc#L626-L629

So in the worst case we'd persist some model plot documents that got indexed because a bulk request filled up, then the node would be killed before the corresponding bucket or data counts documents could be indexed, the job would restart on a different node and redo the same bucket, and we'd end up with duplicate model plot documents for that bucket. This would only be a problem if the old node was running a version from before this fix and the new node a version from after it. I think it's worth tolerating this unlikely, single-bucket problem to fix the problem of entire partitions never having any model plot.

We do explicit deletes in the case of interim results, so those won't be a problem - see

// Delete any existing interim results generated by a Flush command
// which have not been replaced or superseded by new results.
LOGGER.trace("[{}] Deleting interim results", jobId);
persister.deleteInterimResults(jobId);
deleteInterimRequired = false;

Member

@benwtrent benwtrent left a comment

I think it would be nice to have a worst-case ID length test where every possible value is at its configurable maximum.

Looking at the maximum possible value for the murmur hash: "170141183460469231722463931679029329919", which is the 128-bit value whose two long halves are both Long.MAX_VALUE. Rendered as a decimal string, that is 39 UTF-8 bytes.

I think we are way under the limit, but it would be nice for such a test to cover that extreme (and unlikely) case.
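
For reference, that number and its length can be reproduced with a few lines of standalone Java (not code from this PR):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

class MaxHashLength {
    public static void main(String[] args) {
        // The 128 bit value whose two long halves are both Long.MAX_VALUE.
        BigInteger max = BigInteger.valueOf(Long.MAX_VALUE)
                .shiftLeft(64)
                .or(BigInteger.valueOf(Long.MAX_VALUE));
        String decimal = max.toString();
        System.out.println(decimal);                                          // 170141183460469231722463931679029329919
        System.out.println(decimal.getBytes(StandardCharsets.UTF_8).length);  // 39
    }
}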

@droberts195
Author

I think we are way under the limit, but it would be nice for such a test to cover that extreme

I added a test and we're under 200 bytes in total in the worst case.
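
The shape of that test is roughly the following (a simplified sketch, not the actual test; the job ID cap and the ID layout here are assumptions):

import static org.junit.Assert.assertTrue;

import java.nio.charset.StandardCharsets;

import org.junit.Test;

public class WorstCaseIdLengthSketchTests {

    // Elasticsearch rejects _id values longer than 512 bytes, so that is the hard ceiling.
    private static final int MAX_ID_BYTES = 512;

    @Test
    public void testWorstCaseDocumentIdLength() {
        String jobId = "x".repeat(64); // assumed job ID length cap (Java 11+ String.repeat)
        String valuesHash = "170141183460469231722463931679029329919"; // 39 digit worst case from above
        // Illustrative layout: jobId_resultType_timestamp_bucketSpan_detectorIndex_valuesHash_sequenceNum
        String id = jobId + "_record_" + Long.MAX_VALUE + "_" + Long.MAX_VALUE + "_"
                + Integer.MAX_VALUE + "_" + valuesHash + "_" + Integer.MAX_VALUE;
        int lengthInBytes = id.getBytes(StandardCharsets.UTF_8).length;
        // Comes out well under 200 bytes for this layout, and far below the 512 byte limit.
        assertTrue("worst-case ID is " + lengthInBytes + " bytes", lengthInBytes < MAX_ID_BYTES);
    }
}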

droberts195 merged commit 1adf4c2 into elastic:master Jan 7, 2020
droberts195 deleted the test_id_uniqueness branch January 7, 2020 10:23
droberts195 pushed a commit that referenced this pull request Jan 7, 2020
Switch from a 32 bit Java hash to a 128 bit Murmur hash for creating document IDs from by/over/partition field values.
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
Switch from a 32 bit Java hash to a 128 bit Murmur hash for creating document IDs from by/over/partition field values.

Development

Successfully merging this pull request may close these issues.

[ML] Result document IDs are not sufficiently unique
