Description
Currently we don’t limit the number of fields that can be retrieved using the “fields” API.
The original reasoning was that field values are retrieved from an already loaded “source”, so the actual lookup from the source map should come at a relatively small cost.
To add the ability to include unmapped fields, we use automata to match field patterns that have the “include_unmapped” option set. We do this because we don’t know which unmapped leaf values the source contains, and we want to match wildcard field paths efficiently while traversing the source. These automata come with a limit on the number of states they can have (10000 by default) in order to prevent unexpected memory consumption. This is quite sufficient when we have a small number of “fields” patterns with “include_unmapped” set, as should normally be the case.
However, it is possible to exceed this size limit when using the “fields” API with hundreds or thousands of field patterns that all have the “include_unmapped” option turned on, in which case the request will fail.
This led us to think about whether we should put a limit on the number of fields (or field patterns?) that the API can retrieve. This could be a dynamic index setting like the one we have for doc value fields (index.max_docvalue_fields_search).
There are some questions here:
- apart from the request and response size, retrieving lots of fields should not be considered as costly as e.g. a doc_values lookup. This discussion was triggered only by the question of how to size the automaton used for field patterns with the “include_unmapped” option
- would we want to limit the number of field patterns the user sends in the API request, or the number of fields the patterns resolve to? “*” is just one pattern that can return thousands of fields, but that should not be the problem
- even with a limit on the number of field patterns, an estimate of the worst-case automaton size needed would still be a rough guess based on the estimated length of field names
- we definitely want some limit on the automaton size because it’s better to error than to OOM
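To make the "rough guess" in the third point concrete, here is a minimal sketch that treats a set of concrete field paths as a character trie and compares its real node count with the simple bound of 1 plus the total number of characters. The class and method names are hypothetical, and this is not how Lucene accounts for automaton states; wildcard patterns and determinization can behave quite differently.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Elasticsearch or Lucene code.
final class TrieStateEstimate {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
    }

    // Builds a character trie from the paths and returns its node count.
    // Shared prefixes such as "user." are only counted once.
    static int trieStates(List<String> paths) {
        Node root = new Node();
        int states = 1; // the root node
        for (String path : paths) {
            Node current = root;
            for (char c : path.toCharArray()) {
                Node child = current.next.get(c);
                if (child == null) {
                    child = new Node();
                    current.next.put(c, child);
                    states++;
                }
                current = child;
            }
        }
        return states;
    }

    // Pessimistic bound that ignores prefix sharing: 1 + sum of path lengths.
    static int upperBound(List<String> paths) {
        int total = 1;
        for (String path : paths) {
            total += path.length();
        }
        return total;
    }
}
```

Under this simplified model, 1000 paths of 20 characters each give a bound of 20001, which is above the 10000 default mentioned earlier, while the real count depends on how many prefixes the field names share.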
If the main motivation for introducing any limit here is the potential danger of reaching the size limit of the automaton used for “include_unmapped” fields, I think we can lower that risk even more by limiting its use to field patterns using wildcards. Concrete field paths (as in the case when enumerating known field names) can be directly looked up from source without using the automaton. I wonder if this leaves many non-esoteric use cases in which we would run into a size limit.
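A minimal sketch of that idea, assuming "*" is the only wildcard character: concrete paths go into a plain set lookup, and only patterns that contain a wildcard are handed to a pattern matcher. The class name and methods are hypothetical, and `java.util.regex` stands in for the automaton purely to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration only; not Elasticsearch code.
final class UnmappedFieldMatcher {

    private final Set<String> concretePaths = new HashSet<>();
    private final List<Pattern> wildcardPatterns = new ArrayList<>();

    UnmappedFieldMatcher(List<String> fieldPatterns) {
        for (String pattern : fieldPatterns) {
            if (pattern.indexOf('*') >= 0) {
                // Only wildcard patterns need the (size-limited) matcher.
                wildcardPatterns.add(toRegex(pattern));
            } else {
                // Concrete paths are resolved by a direct lookup.
                concretePaths.add(pattern);
            }
        }
    }

    // Translates "*" into ".*" and quotes everything else literally.
    private static Pattern toRegex(String wildcard) {
        String[] parts = wildcard.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(regex.toString());
    }

    // Number of patterns that contribute to the matcher's size.
    int wildcardCount() {
        return wildcardPatterns.size();
    }

    boolean matches(String sourcePath) {
        if (concretePaths.contains(sourcePath)) {
            return true;
        }
        for (Pattern pattern : wildcardPatterns) {
            if (pattern.matcher(sourcePath).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

With this split, a request that enumerates thousands of known field names adds nothing to the matcher; only the wildcard patterns count toward any size limit.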
Relates to #60985