Identify documents by their `_id`. #24460

jpountz · 2017-05-03T12:10:10Z

Now that indices have a single type by default, we can move to the next step
and identify documents using their _id rather than the _uid.

One notable change in this commit is that I made deletions implicitly create
types. This helps with the live version map in the case that documents are
deleted before the first type is introduced. Otherwise there would be no way
to differentiate DELETE index/foo/1 followed by PUT index/foo/1 from
DELETE index/bar/1 followed by PUT index/foo/1, even though those are
different if versioning is involved.
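
As a rough illustration of one reading of the paragraph above, here is a toy model in which versions are tracked per `_id` only and a delete registers the index's single type. The class `SingleTypeIndexSketch` and all of its members are hypothetical and are not taken from the Elasticsearch code base. It only shows how the two request sequences become distinguishable once a delete can introduce a type.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: this is NOT Elasticsearch's engine, mapper service or live version map. */
public class SingleTypeIndexSketch {
    // The single type of the index; null until a request introduces it (assumption).
    private String type;
    // Versions keyed by _id alone, since the type is no longer part of the key.
    private final Map<String, Long> liveVersions = new HashMap<>();

    // Registers the type on first use and rejects any other type afterwards.
    private void ensureType(String requestType) {
        if (type == null) {
            type = requestType; // implicit type creation, also triggered by deletes
        } else if (!type.equals(requestType)) {
            throw new IllegalArgumentException(
                "index already has type [" + type + "], rejecting [" + requestType + "]");
        }
    }

    /** Records a delete and returns the new version of the id. */
    public long delete(String requestType, String id) {
        ensureType(requestType);
        return liveVersions.merge(id, 1L, Long::sum);
    }

    /** Records a write and returns the new version of the id. */
    public long index(String requestType, String id) {
        ensureType(requestType);
        return liveVersions.merge(id, 1L, Long::sum);
    }
}
```

In this model, `DELETE index/foo/1` followed by `PUT index/foo/1` bumps the version of id `1` twice, while `DELETE index/bar/1` followed by `PUT index/foo/1` is rejected at the second request because the delete already introduced type `bar`.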

s1monw

I left an initial set of comments, engine stuff looks good

s1monw · 2017-05-05T15:07:45Z

core/src/main/java/org/elasticsearch/action/bulk/TransportShardBulkAction.java

I think we should add a final member to IndexSettings that holds the value of this setting. I don't think it can change in any way, can it?... just checked, it's final, so let's just make it a first class citizen.
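
A minimal sketch of what this suggestion could look like, assuming a settings holder that reads the value once at construction time. `IndexSettingsSketch`, the key `index.mapping.single_type` and the default of `true` are assumptions made for illustration; the real `IndexSettings` class has a different constructor and API.

```java
import java.util.Map;

/** Hypothetical stand-in for a settings holder; not the real IndexSettings. */
public final class IndexSettingsSketch {
    // Assumed setting key and default, used for illustration only.
    static final String SINGLE_TYPE_KEY = "index.mapping.single_type";

    // Final member: resolved once, since the setting cannot change for the index.
    private final boolean singleType;

    public IndexSettingsSketch(Map<String, String> rawSettings) {
        this.singleType = Boolean.parseBoolean(rawSettings.getOrDefault(SINGLE_TYPE_KEY, "true"));
    }

    /** Callers read the cached value instead of re-parsing the raw settings. */
    public boolean isSingleType() {
        return singleType;
    }
}
```

Call sites such as the bulk action would then ask `isSingleType()` rather than look the setting up on every request.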

s1monw · 2017-05-05T15:08:50Z

core/src/main/java/org/elasticsearch/action/bulk/TransportShardBulkAction.java

here you should use the seq number of the request I think.. it's not necessarily a pre 6.0 index?

indeed it will actually be a 6.x index most of the time. I used the seq no of the primary response, is it the right thing to do?

s1monw · 2017-05-05T15:11:12Z

core/src/main/java/org/elasticsearch/common/lucene/uid/PerThreadIDVersionAndSeqNoLookup.java

I didn't look very closely here but can we keep it to one enum somehow and pass in the field mapper name as a ctor argument? it should stay the same across the lifetime of an index?

I wonder if we can make that decision ahead of time when we create the engine?

I tried to make changes around those lines. Information is kept in static threadlocals so we can't really pass the information at construction time, but it should be at least better validated now.

s1monw · 2017-05-05T15:14:17Z

core/src/main/java/org/elasticsearch/index/fielddata/UidIndexFieldData.java

can you be more clear what we do here? it's really a wrapper around DV not in-memory FieldData right?

make the class final?

actually it IS a wrapper around in-memory fielddata. See changes in the docs for instance where I deprecated fielddata access on the _uid and _id fields.

jpountz · 2017-05-09T08:16:54Z

@s1monw Thanks for looking, I made some changes.

s1monw

I left some minors here LGTM otherwise

s1monw · 2017-05-09T12:15:31Z

core/src/main/java/org/elasticsearch/index/fieldvisitor/FieldsVisitor.java

can we assert that it has only one type at most?

s1monw · 2017-05-09T12:17:45Z

core/src/main/java/org/elasticsearch/index/fielddata/UidIndexFieldData.java

this TODO worries me a bit, I think we use this access by default for instance when we slice scrolls? Should we add docvalues to _id as well? I wonder if we should open a new issue and make it a blocker to ensure we make a call here?

there is one already, adding doc values was a bit controversial #11887

s1monw · 2017-05-09T12:19:05Z

core/src/main/java/org/elasticsearch/index/mapper/IdFieldMapper.java

I might miss something but we are on the ID field that doesn't have a type? why are we using Uid.createUidsForTypesAndIds(context.queryTypes(), values)?

This is only triggered when indexOptions() == IndexOptions.NONE, which means it is a 5.x index, I'll add a comment.

thanks that was confusing to me

s1monw · 2017-05-09T12:22:22Z

core/src/main/java/org/elasticsearch/index/mapper/IdFieldMapper.java

you have IndexSettings available when you create IdFieldMapper I wonder if we can just pass a boolean down to the ctor if we have more than one type?

s1monw · 2017-05-09T12:24:17Z

core/src/main/java/org/elasticsearch/index/mapper/UidFieldMapper.java

same as what I mentioned above, we have IndexSettings available here, can we maybe pass a boolean down? also maybe we should just reference a static method from IdFieldMapper here?

s1monw · 2017-05-09T12:29:20Z

core/src/main/java/org/elasticsearch/search/slice/SliceBuilder.java

as a followup I wonder if we can use the seq ID for this now since it has docvalues? That would be a much better default IMO, at least for 6.0.0 indices

I wonder whether there is an expectation that a given seed always gives the same score to a given id, regardless of whether it was updated. I remember Robert arguing that we should just use the Lucene doc id for random score generation.

slicing is not about scoring, it's about partitioning the data afaik? with lucene doc IDs you can not resume a slice since you might hit a different replica when you resume. I wonder if that is a different problem here... for random score I agree we might just go down that path by default and if somebody needs reproducibility we can still use something like the ID or seq ids?
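
To make the partitioning point concrete, here is a small sketch that assigns a document to a slice from a value that is identical on every shard copy. `SliceSketch` is a hypothetical class, and `String.hashCode()` is only a stand-in hash, not what `SliceBuilder` actually uses.

```java
/** Hypothetical sketch of slice assignment; not the SliceBuilder implementation. */
public final class SliceSketch {
    private SliceSketch() {}

    /** Returns the slice in [0, maxSlices) that the given document id falls into. */
    public static int sliceOf(String id, int maxSlices) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(id.hashCode(), maxSlices);
    }

    /** True if the document belongs to the requested slice. */
    public static boolean matches(String id, int sliceId, int maxSlices) {
        return sliceOf(id, maxSlices) == sliceId;
    }
}
```

Because the `_id` is the same on the primary and on every replica, a resumed slice selects the same documents whichever copy serves it. A Lucene doc ID offers no such guarantee, since each copy may assign doc IDs differently.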

s1monw

LGTM still.. you got a sysout in your last commit that you might wanna remove...

s1monw · 2017-05-09T12:47:52Z

...main/java/org/elasticsearch/search/aggregations/metrics/cardinality/HyperLogLogPlusPlus.java

    }

+    public static void main(String[] args) {
+        System.out.println(precisionFromThreshold(50));


leftover I think

This was introduced in elastic#24460: the constructor of `Translog.Delete` that takes a `StreamInput` does not set the type and id. To make it a bit more robust, I made fields final so that forgetting to set them would make the compiler complain.
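
A hedged sketch of the robustness argument: when the fields are `final`, every constructor, including the deserializing one, must assign them or the class does not compile. `DeleteOpSketch` is a hypothetical class that uses `java.io.DataInput` and `DataOutput` in place of Elasticsearch's stream classes; it is not the real `Translog.Delete`.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/** Hypothetical translog-style delete operation; not the real Translog.Delete. */
public final class DeleteOpSketch {
    // Final fields: omitting an assignment in any constructor is a compile error.
    private final String type;
    private final String id;
    private final long version;

    public DeleteOpSketch(String type, String id, long version) {
        this.type = type;
        this.id = id;
        this.version = version;
    }

    /** Deserializing constructor: must read the fields in the order writeTo emits them. */
    public DeleteOpSketch(DataInput in) throws IOException {
        this.type = in.readUTF();
        this.id = in.readUTF();
        this.version = in.readLong();
    }

    public void writeTo(DataOutput out) throws IOException {
        out.writeUTF(type);
        out.writeUTF(id);
        out.writeLong(version);
    }

    public String type() { return type; }
    public String id() { return id; }
    public long version() { return version; }
}
```

Removing the `this.type = ...` line from the `DataInput` constructor makes the compiler report that the final field may not have been initialized, which is the safety net the fix relies on.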

jpountz added the :Search Foundations/Mapping and >enhancement labels May 3, 2017

jpountz force-pushed the fix/do_not_include_type_in_uid branch from 3775490 to 038b09e Compare May 3, 2017 14:48

s1monw self-requested a review May 4, 2017 15:35

jpountz force-pushed the fix/do_not_include_type_in_uid branch 2 times, most recently from 352001a to 57728e2 Compare May 5, 2017 08:56

s1monw suggested changes May 5, 2017

s1monw approved these changes May 9, 2017

jpountz added 3 commits May 9, 2017 14:43

Apply @s1monw 's feedback.

7548b91

iter

2a5e282

jpountz force-pushed the fix/do_not_include_type_in_uid branch from ceda51d to 2a5e282 Compare May 9, 2017 12:43

s1monw approved these changes May 9, 2017

jpountz added 3 commits May 9, 2017 14:51

iter

5cdf686

iter

2bf6788

Extract terms from TermInSetQuery for machine learning.

8ae7798

jpountz merged commit a72eaa8 into elastic:master May 9, 2017

jpountz deleted the fix/do_not_include_type_in_uid branch May 9, 2017 14:33

rjernst added a commit that referenced this pull request May 9, 2017

Fix ids query test when none or ALL type is used

53f6d94

See #24460

jpountz mentioned this pull request May 10, 2017

type and id are lost upon serialization of Translog.Delete. #24586

Merged

clintongormley added the v6.0.0 label May 15, 2017

clintongormley added v6.0.0-alpha2 and removed v6.0.0 labels Jun 6, 2017

jpountz mentioned this pull request Jul 4, 2017

Fix the documentation to state that the _id field is indexed. #25540

Merged
