Alternatives to disabling or filtering the _source field at index time #11116

@clintongormley

In #10915 we removed the ability to disable the _source field, and in #10814 we removed the ability to use includes and excludes to remove selected fields from the _source field that is stored with each document.

The reason for this is that a number of important existing and future features rely on having the complete original _source field available in Elasticsearch, such as:

  • the update API
  • on-the-fly highlighting
  • reindexing (either to change mappings/analysis or in to upgrade an index over major versions)
  • automated repair of index corruption
  • the ability to debug problems by viewing the original source used for indexing

In our experience, many new users disable _source just to save disk space, or because it seems like a nice optimisation. Almost all of them later regret it, and find themselves unable to move forward because rebuilding the index from the original data store is too costly.

Instead, we have the ability to:

The above changes are good for the most common use cases, probably 90% of our user base. However, there are two use cases in particular where controlling how or whether the source is indexed would be beneficial to the more expert user:

No source needed

High-volume indexing of documents used almost exclusively for analytics. The source field is not required in the search results, indices can be rebuilt from fast primary data stores, and minimising disk usage and maximising write performance matter. In this case, we can provide an index setting to completely disable the storage of the _source field, giving up all of the benefits that come with having the original source.

Why an index setting?

Previously, users could do this just by setting _source.enabled: false, so why switch this to an index setting? Doing this in the mapping was too convenient, so users who didn't understand the consequences used the option and ended up suffering for it. By making it an index setting, it (1) invalidates the behaviour that has been recommended in blog posts, making users go back to read the documentation and (2) allows us to use a scary enough name (with accompanying docs) that will make users think twice.

Reading a large _source is slow and unnecessary

Users who are indexing a large field (like the contents of a PDF) plus several small fields (eg title, creation date, tags, etc) are likely to want to return just the small fields plus highlighted snippets. However, returning just the title field necessitates reading (and then filtering out) the large contents field as well.
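To make the scenario concrete, here is a minimal sketch of the kind of search request body involved: source filtering for the small fields plus a highlight request on the large one. The field names (title, created, tags, content) and the helper name small_fields_search are made up for illustration; the body uses the standard _source filtering and highlight request options, and nothing is sent to a cluster.

```python
import json


def small_fields_search(query_text, small_fields, large_field):
    """Build a search body returning only small fields plus highlights.

    Hypothetical helper: it only assembles a request body as a dict.
    """
    return {
        "query": {"match": {large_field: query_text}},
        # Query-time source filtering: only these keys come back in _source.
        "_source": list(small_fields),
        # Highlighted snippets are taken from the large field.
        "highlight": {"fields": {large_field: {}}},
    }


body = small_fields_search(
    "quarterly report", ["title", "created", "tags"], "content"
)
print(json.dumps(body, indent=2))
```

Even with a request like this, the whole stored _source (including content) still has to be read and parsed before the filter is applied, which is the cost this section is about.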

Previously, users used the _source.includes and _source.excludes filters to remove these large fields from the _source, but as a consequence this disabled all of the features mentioned above. As an alternative, the user can still disable the _source field and set individual fields to store: true.
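A rough sketch of what that alternative looks like as a mapping body, assuming a version (or setting) under which disabling _source is still accepted. The field names and types are placeholders (older versions would use string rather than text/keyword), and stored_fields_mapping is a hypothetical helper that only builds a dict.

```python
def stored_fields_mapping(field_types, stored):
    """Build a mapping with _source disabled and selected fields stored.

    field_types: dict of field name -> type (illustrative values only).
    stored: set of field names that should get store: true.
    """
    properties = {}
    for name, field_type in field_types.items():
        spec = {"type": field_type}
        if name in stored:
            # Stored fields can be fetched individually without _source.
            spec["store"] = True
        properties[name] = spec
    return {"_source": {"enabled": False}, "properties": properties}


mapping = stored_fields_mapping(
    {"title": "text", "created": "date", "tags": "keyword", "content": "text"},
    stored={"title", "created", "tags"},
)
```

Note that the large content field is indexed but neither stored nor kept in _source here, so it cannot be highlighted on the fly or reindexed from Elasticsearch itself.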

It would be nice to do better though: to keep the original _source but make search responses requiring just a few fields faster than they are today. Two proposals:

Add a _response_source field

The original _source would still be stored, but the _response_source would be a second stored field with a filtered list of fields (behaving like the old includes/excludes). The user could choose which field should be returned with their search requests. Compression would minimise the amount of extra storage required because the fields in the _response_source would be a subset of those in the _source.
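As a simplified illustration of the idea (not how Elasticsearch would implement it), the filtered copy could be derived at index time roughly like this. The sketch only matches top-level keys with shell-style wildcards; the function name response_source and the sample document are assumptions.

```python
from fnmatch import fnmatchcase


def response_source(source, includes=(), excludes=()):
    """Return a filtered copy of a document, leaving the original intact.

    Simplification: patterns are matched against top-level keys only.
    """
    def keep(key):
        included = not includes or any(fnmatchcase(key, p) for p in includes)
        excluded = any(fnmatchcase(key, p) for p in excludes)
        return included and not excluded

    return {key: value for key, value in source.items() if keep(key)}


doc = {
    "title": "Annual report",
    "created": "2015-05-12",
    "tags": ["finance"],
    "content": "long extracted PDF text",
}
# Both versions would be stored: doc as _source, filtered as _response_source.
filtered = response_source(doc, excludes=["content"])
```

Because filtered is a strict subset of doc, the two stored copies share most of their bytes, which is why compression should keep the overhead small.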

Store top-level fields as separate stored fields

As suggested in #9034, the _source field would be stored as separate stored fields, one for each top-level field in the JSON document. This would allow Elasticsearch to efficiently skip over filtered-out fields to return just the required subset, yet it preserves the original JSON so that values such as [1,null,1] or [] etc can be returned correctly.

An advantage of this solution is that the decision about which fields to return is made at query time, while the _response_source option is set at index time.

This also opens up the possibility to enable more efficient compression techniques for individual fields, depending on the type of data contained in each field.
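The separate-stored-fields proposal can be sketched in a few lines, under the assumption that each top-level value is kept as its own raw JSON fragment. This is a toy model of the idea, not the Lucene stored-fields format; split_source and rebuild_source are hypothetical names.

```python
import json


def split_source(raw_json):
    """Index time: keep one JSON fragment per top-level field."""
    doc = json.loads(raw_json)
    return {key: json.dumps(value) for key, value in doc.items()}


def rebuild_source(stored, wanted):
    """Query time: decode only the requested fragments, skipping the rest."""
    return {key: json.loads(stored[key]) for key in wanted if key in stored}


raw = '{"title": "Report", "counts": [1, null, 1], "tags": [], "content": "long text"}'
stored = split_source(raw)

# The large "content" fragment is never decoded for this response.
partial = rebuild_source(stored, ["title", "counts", "tags"])
```

Because each fragment is still JSON, nulls inside arrays and empty arrays survive the round trip, and the full original document can be rebuilt by requesting every key.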

Thoughts?
