Identify documents by their _id.
#24460
s1monw
left a comment
I left an initial set of comments, engine stuff looks good
I think we should add a final member to IndexSettings that holds the value of this setting. I don't think it can change in any way, can it? I just checked: it's final, so let's make it a first-class citizen.
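A minimal sketch of what this suggestion could look like, assuming a simplified stand-in for IndexSettings backed by a plain map. The class name `IndexSettingsSketch`, the accessor `isSingleType()`, and the setting key are hypothetical and chosen only for illustration; the real class and setting may differ.

```java
import java.util.Map;

// Hypothetical, simplified stand-in for IndexSettings; not the real Elasticsearch class.
final class IndexSettingsSketch {
    // Assumed setting key, used here only for illustration.
    static final String SINGLE_TYPE_KEY = "index.mapping.single_type";

    // Read once at construction. The field is final because the setting
    // cannot change during the lifetime of the index.
    private final boolean singleType;

    IndexSettingsSketch(Map<String, String> settings) {
        // Defaults to true, matching "indices have a single type by default".
        this.singleType = Boolean.parseBoolean(settings.getOrDefault(SINGLE_TYPE_KEY, "true"));
    }

    boolean isSingleType() {
        return singleType;
    }
}
```

Callers would then ask the settings object directly instead of re-parsing the raw setting each time they need it.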
here you should use the seq number of the request I think.. it's not necessarily a pre 6.0 index?
indeed it will actually be a 6.x index most of the time. I used the seq no of the primary response, is it the right thing to do?
I didn't look very closely here, but can we keep it to one enum somehow and pass in the field mapper name as a ctor argument? It should stay the same across the lifetime of an index?
I wonder if we can make that decision ahead of time when we create the engine?
I tried to make changes around those lines. Information is kept in static threadlocals so we can't really pass the information at construction time, but it should be at least better validated now.
Can you be clearer about what we do here? It's really a wrapper around DV, not in-memory FieldData, right?
make the class final?
actually it IS a wrapper around in-memory fielddata. See changes in the docs for instance where I deprecated fielddata access on the _uid and _id fields.
@s1monw Thanks for looking, I made some changes.
s1monw
left a comment
I left some minor comments here; LGTM otherwise.
can we assert that it has only one type at most?
this TODO worries me a bit, I think we use this access by default for instance when we slice scrolls? Should we add docvalues to _id as well? I wonder if we should open a new issue and make it a blocker to ensure we make a call here?
there is one already, adding doc values was a bit controversial #11887
I might be missing something, but we are on the ID field, which doesn't have a type? Why are we using Uid.createUidsForTypesAndIds(context.queryTypes(), values)?
This is only triggered when indexOptions() == IndexOptions.NONE, which means it is a 5.x index, I'll add a comment.
thanks that was confusing to me
you have IndexSettings available when you create IdFieldMapper I wonder if we can just pass a boolean down to the ctor if we have more than one type?
Same as what I mentioned above: we have IndexSettings available here, so can we maybe pass a boolean down? Also, maybe we should just reference a static method from IdFieldMapper here?
as a followup I wonder if we can use the seq ID for this now since it has docvalues? That would be a much better default IMO, at least for 6.0.0 indices
I wonder whether there is an expectation that a given seed always gives the same score to a given id, regardless of whether it was updated. I remember Robert arguing that we should just use the Lucene doc id for random score generation.
Slicing is not about scoring, it's about partitioning the data, afaik? With Lucene doc IDs you cannot resume a slice, since you might hit a different replica when you resume. I wonder if that is a different problem here... For random score I agree we might just go down that path by default, and if somebody needs reproducibility we can still use something like the ID or seq IDs?
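To illustrate the partitioning concern, here is a minimal sketch of assigning documents to slices from a value that is identical on every shard copy. The class `SliceSketch` and the method `sliceOf` are hypothetical and are not the Elasticsearch slice implementation; the point is only that a slice derived from the `_id` does not depend on which replica serves the request, whereas a Lucene doc ID may differ between copies.

```java
// Hypothetical helper; not the Elasticsearch slice implementation.
final class SliceSketch {
    // Assigns a document to a slice using only its _id. String.hashCode() is
    // specified by the Java language, so every copy of the shard computes the
    // same slice for the same document.
    static int sliceOf(String id, int numSlices) {
        if (numSlices <= 0) {
            throw new IllegalArgumentException("numSlices must be positive");
        }
        // floorMod keeps the result in [0, numSlices) even for negative hashes.
        return Math.floorMod(id.hashCode(), numSlices);
    }
}
```

The same idea would apply to any other per-document value that is stable across copies, such as the sequence number mentioned above.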
Now that indices have a single type by default, we can move to the next step and identify documents using their `_id` rather than the `_uid`. One notable change in this commit is that I made deletions implicitly create types. This helps with the live version map in the case that documents are deleted before the first type is introduced. Otherwise there would be no way to differentiate `DELETE index/foo/1` followed by `PUT index/foo/1` from `DELETE index/bar/1` followed by `PUT index/foo/1`, even though those are different if versioning is involved.
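The scenario in this description can be sketched with a toy model. This is only an illustration under stated assumptions: `SingleTypeIndexSketch` is a hypothetical class, the version map is reduced to a per-`_id` counter, and the real engine and mapping code work differently. It shows why letting a delete fix the index's type makes the two request sequences distinguishable even though versions are keyed by `_id` alone.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of a single-type index; not Elasticsearch code.
final class SingleTypeIndexSketch {
    // Set by the first operation on the index, whether it is an index or a delete.
    private String type;
    // Versions are keyed by _id only, as in the description above.
    private final Map<String, Long> versions = new HashMap<>();

    private void ensureType(String requestedType) {
        if (type == null) {
            type = requestedType; // the first operation implicitly creates the type
        } else if (!type.equals(requestedType)) {
            throw new IllegalArgumentException(
                "index already has type [" + type + "], rejecting [" + requestedType + "]");
        }
    }

    // Both operations bump the version recorded for the _id and return it.
    long delete(String requestedType, String id) {
        ensureType(requestedType);
        return versions.merge(id, 1L, Long::sum);
    }

    long index(String requestedType, String id) {
        ensureType(requestedType);
        return versions.merge(id, 1L, Long::sum);
    }
}
```

In this model, `delete("foo", "1")` followed by `index("foo", "1")` continues the version history of `_id` 1, while `delete("bar", "1")` followed by `index("foo", "1")` is rejected instead of silently sharing that history.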
s1monw
left a comment
LGTM still.. you got a sysout in your last commit that you might wanna remove...
    }

    public static void main(String[] args) {
        System.out.println(precisionFromThreshold(50));
leftover I think
totally!
This was introduced in elastic#24460: the constructor of `Translog.Delete` that takes a `StreamInput` does not set the type and id. To make it a bit more robust, I made fields final so that forgetting to set them would make the compiler complain.
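A minimal sketch of the "final fields" safeguard described here, under clearly stated assumptions: `DeleteOp` is a hypothetical stand-in for `Translog.Delete`, and `java.io.DataInputStream` / `DataOutputStream` stand in for Elasticsearch's `StreamInput` / `StreamOutput`. Because every field is final, a deserializing constructor that forgets to assign `type` or `id` would not compile.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a translog delete operation; not the real Translog.Delete.
final class DeleteOp {
    // Final fields: any constructor that skips one of them is a compile error.
    private final String type;
    private final String id;
    private final long version;

    DeleteOp(String type, String id, long version) {
        this.type = type;
        this.id = id;
        this.version = version;
    }

    // Deserializing constructor; it must read every field back in write order.
    DeleteOp(DataInputStream in) throws IOException {
        this.type = in.readUTF();
        this.id = in.readUTF();
        this.version = in.readLong();
    }

    void writeTo(DataOutputStream out) throws IOException {
        out.writeUTF(type);
        out.writeUTF(id);
        out.writeLong(version);
    }

    String type() { return type; }
    String id() { return id; }
    long version() { return version; }

    // Serializes and deserializes an operation, as a round-trip test would.
    static DeleteOp roundTrip(DeleteOp op) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                op.writeTo(out);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return new DeleteOp(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A round-trip check such as `DeleteOp.roundTrip(new DeleteOp("foo", "1", 3L))` would catch fields that are assigned but read in the wrong order, which the compiler cannot detect.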