When parsing JSON fields, also create tokens prefixed with the field key. #34207
Conversation
Pinging @elastic/es-search-aggs
The null character \0 seemed like a reasonable choice for a separator, as (1) it shouldn’t show up too often in field keys, and (2) there is already precedent for it, as we use it when storing percolator queries (PercolatorFieldMapper#FIELD_VALUE_SEPARATOR).
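For concreteness, a minimal sketch of the token shape this separator produces; the `keyedToken` helper name is made up for illustration and is not the PR's actual code:

```java
// Keyed tokens take the form "<key>\0<value>"; '\0' is unlikely to occur in
// field keys and mirrors the precedent of PercolatorFieldMapper#FIELD_VALUE_SEPARATOR.
static String keyedToken(String key, String value) {
    return key + '\0' + value;
}
```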
For prefixed values, the alternative option here would be to check the whole length of the prefixed token, as opposed to just the value. I think that this behavior is more intuitive (and I also don't think we're as concerned about field keys being really long, as opposed to values?)
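A tiny sketch contrasting the two options being discussed, assuming an `ignore_above`-style length limit; the class and method names are made up for illustration:

```java
public class IgnoreAboveSketch {

    // Option A: apply the ignore_above limit to the value alone.
    static boolean skipByValueLength(String key, String value, int ignoreAbove) {
        return value.length() > ignoreAbove;
    }

    // Option B (preferred in this comment): apply the limit to the whole
    // "<key>\0<value>" token, so very long keys also count against the limit.
    static boolean skipByKeyedTokenLength(String key, String value, int ignoreAbove) {
        return key.length() + 1 + value.length() > ignoreAbove;
    }
}
```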
I agree, we should probably put in some kind of soft limit on the depth of these objects at some point; `ignore_above` plus that soft limit will give us an upper bound on the term lengths here anyway.
Makes sense to me, I'll make a note on the meta-issue to add a limit.
colings86 left a comment
I left some comments but they are very minor
Maybe we should call this `rootFieldName` or something, so it's clear everywhere exactly which field is being used?
👍
Should we throw an exception in an else block here? If we encounter something like an array of arrays we should probably reject the document rather than silently ignoring it? The same probably applies for parseObject above?
Oops, I think this is actually a bug, since an array of arrays is valid JSON and should be accepted. Will fix.
I agree it's a good idea to add an else with an exception, so we'll fail fast when encountering something unexpected rather than attempting to proceed in a potentially wrong state.
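A self-contained sketch of the fail-fast idea, using Jackson's `JsonParser` directly to keep it runnable; the PR itself works against Elasticsearch's own parsing abstractions, and `FailFastParsingSketch`, `parseObject`, and `parseArray` are illustrative names rather than the PR's code:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class FailFastParsingSketch {

    // Collects leaf values from an object, throwing on any token we don't handle.
    static void parseObject(JsonParser parser, List<String> values) throws IOException {
        for (JsonToken token = parser.nextToken(); token != JsonToken.END_OBJECT; token = parser.nextToken()) {
            if (token == JsonToken.FIELD_NAME) {
                continue;                              // the value arrives on the next token
            } else if (token == JsonToken.START_OBJECT) {
                parseObject(parser, values);           // nested object
            } else if (token == JsonToken.START_ARRAY) {
                parseArray(parser, values);            // nested array
            } else if (token.isScalarValue()) {
                values.add(parser.getText());          // leaf value
            } else {
                // Reject the document rather than silently ignoring the token.
                throw new IllegalStateException("Unexpected token in object: " + token);
            }
        }
    }

    // Same idea for arrays: an array of arrays is valid JSON and simply recurses.
    static void parseArray(JsonParser parser, List<String> values) throws IOException {
        for (JsonToken token = parser.nextToken(); token != JsonToken.END_ARRAY; token = parser.nextToken()) {
            if (token == JsonToken.START_OBJECT) {
                parseObject(parser, values);
            } else if (token == JsonToken.START_ARRAY) {
                parseArray(parser, values);
            } else if (token.isScalarValue()) {
                values.add(parser.getText());
            } else {
                throw new IllegalStateException("Unexpected token in array: " + token);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String json = "{\"key\": \"some value\", \"list\": [[1, 2], {\"key3\": true}]}";
        List<String> values = new ArrayList<>();
        try (JsonParser parser = new JsonFactory().createParser(json)) {
            parser.nextToken();                        // advance onto START_OBJECT
            parseObject(parser, values);
        }
        System.out.println(values);                    // [some value, 1, 2, true]
    }
}
```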
romseygeek left a comment
I think this looks good. My only nit is that I don't like the name 'prefixed', as this seems to me to be an implementation detail. Something like 'keyed' might be better?
Thanks both of you for taking a look. I also like the 'keyed' naming better, will update to that.
colings86 left a comment
LGTM thanks for making the changes
romseygeek left a comment
LGTM too, thanks Julie!
This PR updates the parsing for `json` fields to emit tokens prefixed by the field key, in addition to 'root' tokens containing the unprefixed value.

As an example, given a `json` field called `json_field` and an input like `{ "key": "some value", "key2": { "key3": true } }`, the mapper will produce untokenized string fields with the values `some value`, `key\0some value`, `true`, and `key2.key3\0true`.

An important note about this change: the behavior we want is for searches on the 'root' JSON field to only consider raw values, and not prefixed tokens. For example, given the JSON field `"headers": { "content-type": "application/json" }`, it should not match the query `{ "prefix": { "headers": "content" } }` just because one of the indexed tokens is `content-type\0application/json`. I think this behavior would be confusing, and that we should try to keep the token prefixing as an internal implementation detail. To avoid this issue, I chose to add the prefixed tokens to a new Lucene field called `<field name>._prefixed`. This won't affect how searches are done (the user will still query using the key `headers`, or `headers.content-type`).

Finally, this PR just updates the indexing process -- these prefixed tokens can't be searched yet. The search side will be implemented in a subsequent PR (I'm just a fan of keeping PRs small!)
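A rough sketch of how the indexed fields for the example above could be laid out, using plain Lucene `StringField`s to illustrate the split between the root field and the `._prefixed` field; the `PrefixedFieldSketch` class and `addLeaf` helper are illustrative and not the mapper's actual code:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;

public class PrefixedFieldSketch {

    // Same non-printing separator discussed above.
    private static final char SEPARATOR = '\0';

    // Adds both the root (unprefixed) token and the keyed token for one leaf value.
    static void addLeaf(Document doc, String rootFieldName, String key, String value) {
        // Root token: raw value only, so queries on the root field never see key prefixes.
        doc.add(new StringField(rootFieldName, value, Field.Store.NO));
        // Keyed token: "<key>\0<value>", placed in a separate Lucene field.
        doc.add(new StringField(rootFieldName + "._prefixed", key + SEPARATOR + value, Field.Store.NO));
    }

    public static void main(String[] args) {
        Document doc = new Document();
        // Mirrors the example input {"key": "some value", "key2": {"key3": true}}.
        addLeaf(doc, "json_field", "key", "some value");
        addLeaf(doc, "json_field", "key2.key3", "true");
        // Note: the '\0' separator is a non-printing character in the output.
        doc.getFields().forEach(f -> System.out.println(f.name() + " -> " + f.stringValue()));
    }
}
```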
"headers": { "content-type": "application/json”"}, it should not match the query"prefix": { "headers": "content" } }just because one of the indexed tokens iscontent-type\0application/json. I think this behavior would be confusing, and that we should try to keep the token prefixing as an internal implementation detail. To avoid this issue, I chose to add the prefixed tokens to a new lucene field called<field name>._prefixed. This won't affect how searches are done (the user will still query using the keyheaders, orheaders.content-type).Finally, this PR just updates the indexing process -- these prefixed tokens can’t be searched yet. The search side will be implemented in a subsequent PR (I’m just a fan of keeping PRs small!)