Flattened object fields design + implementation #33003

@jtibshirani

Main issue: #25312
Feature branch: https://github.com/elastic/elasticsearch/tree/object-fields

Note: this field type was previously called embedded_json, so many PRs + comments will refer to that name.

Motivation

Documents sometimes contain large objects, where only a small number of the fields are frequently used in searches. By default, we create dynamic mappings for all key-value pairs in the object, and index each one as a separate field. This has a number of downsides:

  • We’re creating a large number of distinct fields in Lucene.
  • Each field becomes its own entry in the mappings, which can lead to a large cluster state.
  • From a UX perspective, the list of fields can appear quite cluttered, and it can be difficult to understand which fields are most critical.

In some cases, the number of field keys is not just large but unbounded. Here, it can be difficult to successfully model the data at all.

Feature Summary

This feature will allow an entire JSON object to be indexed into a field, and provide limited search functionality over the field's contents. Given an object field header of the form {"content-type": "text/html", "referer": "https://google.com"}, its content will be analyzed into the individual tokens content-type\0text/html, referer\0https://google.com (where \0 is some suitable delimiter). Additionally, tokens are created for each value alone: text/html, https://google.com. Each leaf value in the object becomes its own token, and no further analysis is applied to the individual values.
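The tokenization described above can be illustrated with a minimal Python sketch. It is not the Elasticsearch implementation: the names `DELIMITER`, `iter_leaves`, and `flatten_tokens` are hypothetical, and the handling of nested objects (dotted key paths), arrays (one token pair per element), and non-string leaves (converted with `str`) is an assumption made for illustration.

```python
from typing import Any, Iterator

DELIMITER = "\0"  # assumption: stands in for "some suitable delimiter"


def iter_leaves(obj: Any, path: str = "") -> Iterator[tuple[str, str]]:
    """Yield (key path, leaf value) pairs for every leaf in a JSON-like object."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            # Assumption: nested keys are joined with dots, e.g. "a.b".
            child = f"{path}.{key}" if path else key
            yield from iter_leaves(value, child)
    elif isinstance(obj, list):
        # Assumption: each array element is treated as a value of the same key.
        for item in obj:
            yield from iter_leaves(item, path)
    elif obj is not None:
        # Assumption: leaves are stringified as-is; no further analysis is applied.
        yield path, str(obj)


def flatten_tokens(obj: dict) -> list[str]:
    """Emit a keyed token and a value-only token for each leaf."""
    tokens = []
    for key, value in iter_leaves(obj):
        tokens.append(f"{key}{DELIMITER}{value}")  # e.g. content-type\0text/html
        tokens.append(value)                       # e.g. text/html
    return tokens


header = {"content-type": "text/html", "referer": "https://google.com"}
tokens = flatten_tokens(header)
```

For the `header` example, `tokens` holds the two keyed tokens and the two value-only tokens listed in the paragraph above.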

In addition to being able to retrieve the JSON blob (through fetching source, or as a stored field), we plan to support queries of the following forms:

  • key: header, value: application/json, for example {"term": {"header": "application/json"}}
  • key: header.content-type, value: application/json, for example {"term": {"header.content-type": "application/json"}}

Note that it is not possible to search the prefixed tokens directly, i.e. the following query will not return results: {"term": {"header": "content-type\0application/json"}}.
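A rough sketch of how the two query forms might resolve to tokens follows. It assumes that value-only tokens and keyed tokens are kept in separate sets, which is one way the behavior in the note above could arise; the issue text here does not say how this is achieved. The function names and the flat `key -> value` input are hypothetical.

```python
DELIMITER = "\0"  # assumption: stands in for "some suitable delimiter"


def index_tokens(flat: dict[str, str]) -> tuple[set[str], set[str]]:
    """Build separate token sets for values alone and for key-prefixed values."""
    value_tokens = set(flat.values())
    keyed_tokens = {f"{key}{DELIMITER}{value}" for key, value in flat.items()}
    return value_tokens, keyed_tokens


def term_matches(root_field: str, query_field: str, query_value: str,
                 flat: dict[str, str]) -> bool:
    """Return True if a term query on query_field would match this object."""
    value_tokens, keyed_tokens = index_tokens(flat)
    if query_field == root_field:
        # Query on the root field: only value-only tokens are consulted.
        return query_value in value_tokens
    prefix = root_field + "."
    if query_field.startswith(prefix):
        # Query on a key: the key is prepended to the value before lookup.
        key = query_field[len(prefix):]
        return f"{key}{DELIMITER}{query_value}" in keyed_tokens
    return False


header = {"content-type": "application/json"}
by_value = term_matches("header", "header", "application/json", header)
by_key = term_matches("header", "header.content-type", "application/json", header)
raw_prefixed = term_matches("header", "header",
                            "content-type\0application/json", header)
```

Under these assumptions, `by_value` and `by_key` are true, while `raw_prefixed` is false because the root-field lookup never consults the keyed tokens.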

As a first pass, the following query types will be allowed: term, terms, terms_set, range (without special support for numerics), prefix, match family (insofar as they work for keyword fields), query_string, simple_query_string, exists.

In this first version, it will not be possible to refer to field keys using wildcards, as in {"header.content-*": "application/json"}. Under the proposed API and implementation, supporting field wildcards would add significant complexity and uncertainty around performance.

Potential Extensions

  • Collect more feedback on the importance of numeric fields, and explore adding more targeted support for them. As an example, users may want to perform true range queries on numeric fields within the object.
  • Introduce a way for certain JSON keys to be 'promoted' into individual fields. One approach we're considering is to extend copy_to to work on entire objects, so that the same JSON blob could be added both as a 'queryable object' field and as a normal object with explicit subfield definitions.
  • Add support for additional query types.
    • By performing proper escaping and taking advantage of prefix_length, we could likely support wildcard, regexp, and fuzzy queries.
    • As mentioned in the original issue, we could consider tokenizing values on whitespace. This could allow for better support of positional queries like match_phrase.
  • Explore adding support for aggregations + sorting. This idea needs a lot more research, but could maybe be accomplished by creating additional 'doc value fields', then adding a filtering layer when fetching doc values that checks for the field prefix. Update: we've decided to include this in the first version.
  • Explore adding support for highlighting, since with large JSON blobs it can be difficult to tell which key-value pairs matched the query.
  • Potentially allow for the field contents to be specified as a JSON string, in addition to accepting an object embedded in the document source.
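The filtering layer mentioned in the aggregations and sorting item above could look roughly like the following. This is only a sketch of the idea, not how Lucene doc values work: it assumes a document's keyed values are available as a sorted list of `key\0value` strings, and `values_for_key` is a hypothetical helper.

```python
from bisect import bisect_left

DELIMITER = "\0"  # assumption: stands in for "some suitable delimiter"


def values_for_key(sorted_keyed_values: list[str], key: str) -> list[str]:
    """Return the values stored under `key`, with the key prefix stripped.

    Because the input is sorted, all entries sharing the prefix are
    contiguous, so we can seek to the first one and stop at the first miss.
    """
    prefix = key + DELIMITER
    start = bisect_left(sorted_keyed_values, prefix)
    result = []
    for entry in sorted_keyed_values[start:]:
        if not entry.startswith(prefix):
            break
        result.append(entry[len(prefix):])
    return result


doc_values = sorted([
    "content-type\0text/html",
    "referer\0https://google.com",
    "content-type\0application/json",
])
content_types = values_for_key(doc_values, "content-type")
```

Including the delimiter in the prefix keeps a key such as `content` from matching entries stored under `content-type`.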

Implementation Plan
Core items:
