In #10915 we removed the ability to disable the _source field, and in #10814 we removed the ability to use includes and excludes to remove selected fields from the _source field that is stored with each document.
The reason for this is that a number of important existing and future features rely on having the complete original _source field available in Elasticsearch, such as:
- the `update` API
- on-the-fly highlighting
- reindexing (either to change mappings/analysis, or to upgrade an index over major versions)
- automated repair of index corruption
- the ability to debug problems by viewing the original source used for indexing
In our experience, many new users disable _source just to save disk space, or because it seems like a harmless optimisation. Almost all of them later regret it, finding themselves unable to move forward because rebuilding the index from the original data store is too costly.
Instead, we have the ability to:
- filter the contents of the `_source` field that is returned to the user (Added source fetching and filtering parameters to search, get, multi-get, get-source and explain requests #3302)
- enable a higher compression ratio (Add `best_compression` option for indices #8863)
- filter down the entire search response with the `filter_path` parameter (API: Add response filtering with `filter_path` parameter #10980)
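To make the alternatives above concrete, here is a sketch of how each one is used today. The index name and field names are placeholders; the APIs themselves (`_source` filtering in a search request, `index.codec: best_compression` at index creation, and the `filter_path` query parameter) are the ones referenced in the linked PRs.

```
# Return only selected parts of _source, and trim the response envelope:
GET /my-index/_search?filter_path=hits.hits._source
{
  "query": { "match": { "title": "report" } },
  "_source": {
    "includes": [ "title", "date" ],
    "excludes": [ "content" ]
  }
}

# Opt in to higher compression of stored fields when creating the index:
PUT /my-index
{
  "settings": { "index.codec": "best_compression" }
}
```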
The above changes are good for the most common use cases, probably 90% of our user base. However, there are two use cases in particular where controlling how or whether the source is indexed would be beneficial to the more expert user:
No source needed
High volume indexing of documents used almost exclusively for analytics. The _source field is not required in the search results, indices can be rebuilt from fast primary data stores, disk usage must be minimised, and write performance matters. In this case, we can provide an index setting to completely disable the storage of the _source field, and with it all of the benefits that come with having the original source.
Why an index setting?
Previously, users could do this just by setting _source.enabled: false, so why switch this to an index setting? Doing this in the mapping was too convenient, so users who didn't understand the consequences used the option and ended up suffering for it. By making it an index setting, it (1) invalidates the behaviour that has been recommended in blog posts, making users go back to read the documentation and (2) allows us to use a scary enough name (with accompanying docs) that will make users think twice.
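For illustration, such a setting might look like the following. The setting name here is purely hypothetical; the point is that it lives in the index settings rather than the mapping, and that its name (and docs) should make the trade-off hard to miss.

```
PUT /analytics-2015.06.01
{
  "settings": {
    "index.source.enabled": false
  }
}
```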
Reading a large _source is slow and unnecessary
Users who are indexing one large field (like the contents of a PDF) plus several small fields (e.g. title, creation date, tags) are likely to want to return just the small fields plus highlighted snippets. However, returning just the title field necessitates reading (and then filtering out) the large contents field as well.
Previously, users used the `_source.includes` and `_source.excludes` filters to remove these large fields from the _source, but as a consequence, this disables all of the features mentioned above. As an alternative, the user can still disable the _source field and set individual fields to `store: true`.
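The stored-fields workaround looks roughly like this (index and field names are placeholders; the `_source.enabled` and `store` mapping options and the stored-fields search parameter are existing Elasticsearch features, though the exact mapping shape varies between versions):

```
PUT /docs
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "title":   { "type": "text", "store": true },
      "created": { "type": "date", "store": true },
      "content": { "type": "text" }
    }
  }
}

GET /docs/_search
{
  "stored_fields": [ "title", "created" ],
  "query": { "match": { "content": "quarterly results" } }
}
```

This avoids reading the large `content` field at fetch time, but at the cost of losing the original _source entirely.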
It would be nice to do better though: to keep the original _source but make search responses requiring just a few fields faster than they are today. Two proposals:
Add a _response_source field
The original _source would still be stored, but the _response_source would be a second stored field with a filtered list of fields (behaving like the old includes/excludes). The user could choose which field should be returned with their search requests. Compression would minimise the amount of extra storage required, because the fields in the `_response_source` would be a subset of those in the `_source`.
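A possible mapping for this proposal might look like the following. Everything here is hypothetical, since `_response_source` does not exist; it simply mirrors the old includes/excludes syntax:

```
PUT /docs
{
  "mappings": {
    "_response_source": {
      "excludes": [ "content" ]
    }
  }
}
```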
Store top-level fields as separate stored fields
As suggested in #9034, the _source field would be stored as separate stored fields, one per top-level field in the JSON document. This would allow Elasticsearch to efficiently skip over filtered-out fields to return just the required subset, yet it preserves the original JSON so that values such as [1,null,1] or [] etc can be returned correctly.
An advantage of this solution is that the decision about which fields to return is made at query time, while the _response_source option is fixed at index time.
This also opens up the possibility to enable more efficient compression techniques for individual fields, depending on the type of data contained in each field.
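The split-and-reassemble idea can be sketched in a few lines of Python. This is only an illustration of the technique, not Elasticsearch code: each top-level field is kept as raw JSON so that values like [1,null,1] or [] round-trip exactly, and a filtered response only parses the fields it needs.

```python
import json

def split_top_level(source_json):
    """Split a JSON document into one stored value per top-level key,
    keeping each value as raw JSON so it round-trips byte-for-byte."""
    doc = json.loads(source_json)
    return {key: json.dumps(value) for key, value in doc.items()}

def load_fields(stored, wanted):
    """Rebuild a filtered source, reading only the requested fields
    and skipping (never parsing) everything else."""
    return {key: json.loads(stored[key]) for key in wanted if key in stored}

source = '{"title": "report", "tags": [1, null, 1], "attachments": [], "content": "..."}'
stored = split_top_level(source)
subset = load_fields(stored, ["title", "tags"])
# subset == {"title": "report", "tags": [1, None, 1]}
```

In a real implementation each per-field value would be a separate Lucene stored field, so the large `content` value is never read from disk when only `title` and `tags` are requested.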
Thoughts?