In #10915 we removed the ability to disable the _source field, and in #10814 we removed the ability to use includes and excludes to remove selected fields from the _source field that is stored with each document.
The reason for this is that a number of important existing and future features rely on having the complete original _source field available in Elasticsearch, such as:
- the `update` API
- on-the-fly highlighting
- reindexing (either to change mappings/analysis, or to upgrade an index over major versions)
- automated repair of index corruption
- the ability to debug problems by viewing the original source used for indexing
In our experience, many new users disable _source just to save disk space, or because it seems like a harmless optimisation. Almost all of them later regret it, finding themselves unable to move forward because rebuilding the index from the original data store is too costly.
Instead, we have the ability to:
- filter the contents of the `_source` field that is returned to the user (Added source fetching and filtering parameters to search, get, multi-get, get-source and explain requests #3302)
- enable a higher compression ratio (Add `best_compression` option for indices #8863)
- filter down the entire search response with the `filter_path` parameter (API: Add response filtering with `filter_path` parameter #10980)
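To make the alternatives above concrete, here is a sketch of how each one is used today. The index name and field names are placeholders; the APIs themselves (`_source` filtering in a search request, `index.codec: best_compression` at index creation, and the `filter_path` query parameter) are the ones referenced in the linked PRs.

```
# Return only selected parts of _source, and trim the response envelope:
GET /my-index/_search?filter_path=hits.hits._source
{
  "query": { "match": { "title": "report" } },
  "_source": {
    "includes": [ "title", "date" ],
    "excludes": [ "content" ]
  }
}

# Opt in to higher compression of stored fields when creating the index:
PUT /my-index
{
  "settings": { "index.codec": "best_compression" }
}
```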
The above changes are good for the most common use cases, probably 90% of our user base. However, there are two use cases in particular where controlling how or whether the source is indexed would be beneficial to the more expert user:
No source needed
High volume indexing of documents used almost exclusively for analytics. The _source field is not required in the search results, indices can be rebuilt from fast primary data stores, disk usage must be minimised, and write performance matters. In this case, we can provide an index setting to completely disable the storage of the _source field, and with it all of the benefits that come with having the original source.
Why an index setting?
Previously, users could do this just by setting _source.enabled: false, so why switch this to an index setting? Doing this in the mapping was too convenient, so users who didn't understand the consequences used the option and ended up suffering for it. By making it an index setting, it (1) invalidates the behaviour that has been recommended in blog posts, making users go back to read the documentation and (2) allows us to use a scary enough name (with accompanying docs) that will make users think twice.
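For illustration, such a setting might look like the following. The setting name here is purely hypothetical; the point is that it lives in the index settings rather than the mapping, and that its name (and docs) should make the trade-off hard to miss.

```
PUT /analytics-2015.06.01
{
  "settings": {
    "index.source.enabled": false
  }
}
```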
Reading a large _source is slow and unnecessary
Users who are indexing one large field (like the contents of a PDF) plus several small fields (e.g. title, creation date, tags) are likely to want to return just the small fields plus highlighted snippets. However, returning just the title field necessitates reading (and then filtering out) the large contents field as well.
Previously, users used the `_source.includes` and `_source.excludes` filters to remove these large fields from the _source, but as a consequence, this disables all of the features mentioned above. As an alternative, the user can still disable the _source field and set individual fields to `store: true`.
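The stored-fields workaround looks roughly like this (index and field names are placeholders; the `_source.enabled` and `store` mapping options and the stored-fields search parameter are existing Elasticsearch features, though the exact mapping shape varies between versions):

```
PUT /docs
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "title":   { "type": "text", "store": true },
      "created": { "type": "date", "store": true },
      "content": { "type": "text" }
    }
  }
}

GET /docs/_search
{
  "stored_fields": [ "title", "created" ],
  "query": { "match": { "content": "quarterly results" } }
}
```

This avoids reading the large `content` field at fetch time, but at the cost of losing the original _source entirely.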
It would be nice to do better though: to keep the original _source but make search responses requiring just a few fields faster than they are today. Two proposals:
Add a _response_source field
The original _source would still be stored, but the _response_source would be a second stored field with a filtered list of fields (behaving like the old includes/excludes). The user could choose which field should be returned with their search requests. Compression would minimise the amount of extra storage required, because the fields in the `_response_source` would be a subset of those in the `_source`.
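A possible mapping for this proposal might look like the following. Everything here is hypothetical, since `_response_source` does not exist; it simply mirrors the old includes/excludes syntax:

```
PUT /docs
{
  "mappings": {
    "_response_source": {
      "excludes": [ "content" ]
    }
  }
}
```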
Store top-level fields as separate stored fields
As suggested in #9034, the _source field would be stored as separate stored fields, one per top-level field in the JSON document. This would allow Elasticsearch to efficiently skip over filtered-out fields to return just the required subset, yet it preserves the original JSON so that values such as [1,null,1] or [] etc can be returned correctly.
An advantage of this solution is that the decision about which fields to return is made at query time, while the _response_source option is fixed at index time.
This also opens up the possibility to enable more efficient compression techniques for individual fields, depending on the type of data contained in each field.
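The split-and-reassemble idea can be sketched in a few lines of Python. This is only an illustration of the technique, not Elasticsearch code: each top-level field is kept as raw JSON so that values like [1,null,1] or [] round-trip exactly, and a filtered response only parses the fields it needs.

```python
import json

def split_top_level(source_json):
    """Split a JSON document into one stored value per top-level key,
    keeping each value as raw JSON so it round-trips byte-for-byte."""
    doc = json.loads(source_json)
    return {key: json.dumps(value) for key, value in doc.items()}

def load_fields(stored, wanted):
    """Rebuild a filtered source, reading only the requested fields
    and skipping (never parsing) everything else."""
    return {key: json.loads(stored[key]) for key in wanted if key in stored}

source = '{"title": "report", "tags": [1, null, 1], "attachments": [], "content": "..."}'
stored = split_top_level(source)
subset = load_fields(stored, ["title", "tags"])
# subset == {"title": "report", "tags": [1, None, 1]}
```

In a real implementation each per-field value would be a separate Lucene stored field, so the large `content` value is never read from disk when only `title` and `tags` are requested.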
Thoughts?