Flattened object fields design + implementation

Main issue: #25312
Feature branch: https://github.com/elastic/elasticsearch/tree/object-fields

**Note:** this field type was previously called `embedded_json`, so many PRs + comments will refer to that name.

**Motivation**

Documents sometimes contain large objects, where only a small number of the fields are frequently used in searches. By default, we create dynamic mappings for all key-value pairs in the object, and index each one as a separate field. This has a number of downsides:
- We’re creating a large number of distinct fields in Lucene.
- Each field becomes its own entry in the mappings, which can lead to a large cluster state.
- From a UX perspective, the list of fields can appear quite cluttered, and it can be difficult to understand which fields are most critical.

In some cases, the number of field keys is not just large but unbounded. Here, it can be difficult to successfully model the data at all.

**Feature Summary**

This feature will allow an entire JSON object to be indexed into a field, and provide limited search functionality over the field's contents. Given an object field `header` of the form `{"content-type": "text/html", "referer": "https://google.com"}`, its content will be analyzed into the individual tokens `content-type\0text/html`, `referer\0https://google.com` (where `\0` is some suitable delimiter). Additionally, tokens are created for each value alone: `text/html`, `https://google.com`. Each leaf value in the object becomes its own token, and no further analysis is applied to the individual values.
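To make the token scheme concrete, here is a minimal Python sketch of the flattening step. It is an illustration only, not the actual mapper code: the function names, the choice of `\0` as delimiter, the dotted path for nested keys, and the handling of arrays, `null`, and non-string leaves are all assumptions.

```python
import json
from typing import Any, Iterator, List, Tuple

SEPARATOR = "\0"  # assumed delimiter; the text only says "some suitable delimiter"


def leaf_values(obj: Any, path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, str]]:
    """Yield (key path, leaf value) pairs for every leaf in a JSON-like object."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from leaf_values(value, path + (key,))
    elif isinstance(obj, list):
        # Assumption: array elements share the key of the enclosing field.
        for item in obj:
            yield from leaf_values(item, path)
    elif obj is not None:
        # Assumption: nested keys are joined with dots, null leaves are skipped,
        # and non-string leaves are rendered with their JSON representation.
        text = obj if isinstance(obj, str) else json.dumps(obj)
        yield ".".join(path), text


def flatten_tokens(obj: Any) -> Tuple[List[str], List[str]]:
    """Return (value-only tokens, key-prefixed tokens) with no further analysis."""
    root_tokens: List[str] = []
    keyed_tokens: List[str] = []
    for key, value in leaf_values(obj):
        root_tokens.append(value)
        keyed_tokens.append(key + SEPARATOR + value)
    return root_tokens, keyed_tokens


header = {"content-type": "text/html", "referer": "https://google.com"}
root_tokens, keyed_tokens = flatten_tokens(header)
```

For the `header` example, `root_tokens` holds `text/html` and `https://google.com`, and `keyed_tokens` holds the same values prefixed by their key and the delimiter.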

In addition to being able to retrieve the JSON blob (through fetching source, or as a stored field), we plan to support queries of the following forms:
- key: `header`, value: `application/json`, for example `{"term": {"header": "application/json"}}`
- key: `header.content-type`, value: `application/json`, for example `{"term": {"header.content-type": "application/json"}}`

Note that it is not possible to search the prefixed tokens directly, i.e. the following query will not return results: `{"term": {"header": "content-type\0application/json"}}`.
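The two query forms, and the reason the prefixed tokens are not reachable through the root field, can be sketched with a small in-memory model. This assumes that value-only tokens and key-prefixed tokens are kept in separate sets and that a keyed query is rewritten by prepending the key and delimiter; the function and variable names are hypothetical.

```python
from typing import Set

SEPARATOR = "\0"  # assumed delimiter


def term_matches(root_tokens: Set[str], keyed_tokens: Set[str],
                 root_field: str, query_field: str, value: str) -> bool:
    """Evaluate a term query against one document's tokens."""
    if query_field == root_field:
        # Query on the object field itself: only value-only tokens are consulted.
        return value in root_tokens
    prefix = root_field + "."
    if not query_field.startswith(prefix):
        return False
    # Query on a key: rewrite to a prefixed token and look it up.
    key = query_field[len(prefix):]
    return key + SEPARATOR + value in keyed_tokens


root_tokens = {"application/json"}
keyed_tokens = {"content-type" + SEPARATOR + "application/json"}

by_value = term_matches(root_tokens, keyed_tokens, "header", "header", "application/json")
by_key = term_matches(root_tokens, keyed_tokens, "header", "header.content-type", "application/json")
raw_prefixed = term_matches(root_tokens, keyed_tokens, "header", "header",
                            "content-type" + SEPARATOR + "application/json")
```

Here `by_value` and `by_key` are both `True`, while `raw_prefixed` is `False`, because a query on `header` never looks at the prefixed tokens.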

As a first pass, the following query types will be allowed: `term`, `terms`, `terms_set`, `range` (without special support for numerics), `prefix`, `match` family (insofar as they work for keyword fields), `query_string`, `simple_query_string`, `exists`.
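One consequence of `range` having no special support for numerics is that bounds are compared as strings. The sketch below illustrates this with plain Python string comparison, which is assumed here to stand in for the term ordering on ASCII input; the function name is hypothetical.

```python
from typing import List


def string_range(values: List[str], lower: str, upper: str) -> List[str]:
    """Inclusive range over keyword-style values, compared lexicographically."""
    return [v for v in values if lower <= v <= upper]


# Values that look numeric are still compared character by character.
matched = string_range(["2", "10", "9"], "1", "5")
```

`matched` is `["2", "10"]`: the value `10` falls between `1` and `5` as a string, which is not what a true numeric range would return.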

In this first version, it will not be possible to refer to field keys using wildcards, as in `{"header.content-*": "application/json"}`. Under the proposed API and implementation, supporting field wildcards would add significant complexity and uncertainty around performance.

**Potential Extensions**

- Collect more feedback on the importance of numeric fields, and explore adding more targeted support for them. As an example, users may want to perform true range queries on numeric fields within the object.
- Introduce a way for certain JSON keys to be 'promoted' into individual fields. One approach we're considering is to extend `copy_to` to work on entire objects, so that the same JSON blob could be added both as a 'queryable object' field, and also as a normal object with explicit subfield definitions.
- Add support for additional query types.
  - By performing proper escaping and taking advantage of `prefix_length`, we could likely support `wildcard`, `regexp`, and `fuzzy` queries.
  - As mentioned in the original issue, we could consider tokenizing values on whitespace. This could allow for better support of positional queries like `match_phrase`.
- Explore adding support for aggregations + sorting. This idea needs a lot more research, but could maybe be accomplished by creating additional 'doc value fields', then adding a filtering layer when fetching doc values that checks for the field prefix. **Update**: we've decided to include this in the first version.
- Explore adding support for highlighting, since with large JSON blobs it can be difficult to tell which key-value pairs matched the query.
- Potentially allow for the field contents to be specified as a JSON string, in addition to accepting an object embedded in the document source.
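
The doc values idea above (a filtering layer that checks for the field prefix) could look roughly like the following. This is a sketch under assumptions: it treats a document's doc values as a plain list of key-prefixed strings with `\0` as delimiter, and the function name is hypothetical.

```python
from typing import List

SEPARATOR = "\0"  # assumed delimiter


def keyed_doc_values(doc_values: List[str], key: str) -> List[str]:
    """Keep only the values stored under `key`, with the prefix stripped."""
    prefix = key + SEPARATOR
    return [v[len(prefix):] for v in doc_values if v.startswith(prefix)]


doc_values = [
    "content-type" + SEPARATOR + "text/html",
    "referer" + SEPARATOR + "https://google.com",
]
referers = keyed_doc_values(doc_values, "referer")
```

Including the delimiter in the prefix keeps a key such as `content` from matching values stored under `content-type`.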

**Implementation Plan**
Core items:
- [x] Create a new field type that accepts an object and indexes its leaf values. Verify that the object field can be used in queries of the form `{"header": "application/json"}`. #33923
- [x] Index prefixed tokens, and support searching for values based on key as in `{"header.content-type": "application/json"}`. #34207 #34621
- [x] Add support for storing the field by adding a single stored field containing the whole JSON blob. #34942
- [x] Add a limit to the depth of the objects that will be indexed. #35063
- [x] Add documentation. #35281
- [x] Add tests for the supported query types. #35319
- [x] Revisit the field lookup logic with potential optimizations. #39872
- [x] Rename field type to `embedded_json`. #40712
- [x] Explore adding support for doc values, to allow for aggregations + sorting. #40069
- [x] Address issues in doc values implementation (https://github.com/elastic/elasticsearch/pull/40069#issuecomment-477797499). #41282 #41319
- [x] Perform benchmarks.

Flattened object fields design + implementation #33003
