This is a summary of memory accounting shortcomings in the aggregation framework, found while working on #83055
Related issues: #65019, #67476
The problem statement is complex and requires some explanation. The sub-problems are marked P.{N} below.
Intro
Lifecycle of an aggregation
Every aggregation runs through multiple stages: the collection phase (mapping), executed on every shard; post collection, where an InternalAggregation is built (Aggregator::buildAggregations); and the reduce phase, which is either partial (combine) or final (reduce).
Circuit breakers / memory accounting
To avoid running out of memory, Elasticsearch uses so-called circuit breakers. In a nutshell, a circuit breaker "measures" the "current" - our memory consumption - and interrupts the flow (with an exception) if it notices that consumption goes over budget. To make this work, every allocation must be registered and every free must be deducted again. This accounting happens globally - not per request - and must be accurate in all situations, meaning it must survive every abnormal execution flow.
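For illustration, a minimal sketch of that contract (the method, label and byte count here are illustrative; real code obtains the breaker from the CircuitBreakerService or the request context):

```java
import org.elasticsearch.common.breaker.CircuitBreaker;

// Sketch only: register an allocation, deduct it again on every path.
static void withTrackedMemory(CircuitBreaker breaker, long bytes) {
    breaker.addEstimateBytesAndMaybeBreak(bytes, "my_agg_structure"); // may trip the breaker
    try {
        // ... allocate and use a structure of roughly `bytes` size ...
    } finally {
        // the deduction must happen on every path, including abnormal execution flows
        breaker.addWithoutBreaking(-bytes);
    }
}
```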
Circuit breakers and aggregations
The aggregation framework provides access to the circuit breaker as part of the AggregationContext and, with some syntactic sugar, via AggregatorBase::addRequestCircuitBreakerBytes, which takes care of releasing the bytes in close(). In addition, the AggregationContext provides access to BigArrays, an abstraction for large arrays in memory. BigArrays uses the circuit breaker from the context.
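A rough sketch of how an aggregator uses these two facilities. It assumes the snippet sits inside a subclass of AggregatorBase (constructor, collector and remaining methods omitted), that `context` is the AggregationContext handed to the aggregator, and `expectedBuckets` is a made-up value:

```java
// bytes accounted via this helper are released again automatically in AggregatorBase#close()
addRequestCircuitBreakerBytes(64 * 1024);

// BigArrays obtained from the AggregationContext charge the same request circuit breaker
long expectedBuckets = 1024; // illustrative
LongArray counts = context.bigArrays().newLongArray(expectedBuckets, /* clearOnResize */ true);
```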
Problem statements
Problem: AggregationContext only covers 1/3 of the lifecycle
The AggregationContext is only available in the collection phase. Although it is still available when InternalAggregations are built, it gets released shortly after, while the InternalAggregations stay in memory or get serialized and sent over the wire. Because of that ordering, these data structures must not use memory allocated from the BigArrays instance of the AggregationContext.
P.1 Building the InternalAggregation lacks memory accounting / Big array support
(Existing workarounds: Cardinality deep-copies the HLL++ structure (see CardinalityAggregator); frequent_items disconnects the agg context circuit breaker, which avoids the copy and with it the risk of running out of memory during the deep copy.)
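For illustration, a generic sketch of that deep-copy pattern (not the actual CardinalityAggregator code): state that lives in the context's breaker-accounted BigArrays is copied into plain, unaccounted heap memory before the InternalAggregation is built, because the context and its BigArrays are released shortly after.

```java
// copy breaker-accounted BigArrays state into plain heap memory before building the
// InternalAggregation - the copy is invisible to the circuit breaker, which is the P.1 gap
static long[] copyOutOfBigArrays(LongArray accounted) {
    long[] plain = new long[Math.toIntExact(accounted.size())];
    for (int i = 0; i < plain.length; i++) {
        plain[i] = accounted.get(i);
    }
    return plain;
}
```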
The InternalAggregation might be sent over the wire, meaning it gets serialized and deserialized. Deserialization happens before reduce and lacks access to memory accounting structures, e.g. a BigArrays instance.
P.2 Deserialization of InternalAggregation lacks access to CB/BigArrays
(Existing workaround: Cardinality (see InternalCardinality) uses BigArrays.NON_RECYCLING_INSTANCE. Note that NON_RECYCLING is a misleading name: the instance is not only non-recycling but also non-accounting.)
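A sketch of what such a read path looks like for a hypothetical aggregation (MyState is a made-up holder type, not InternalCardinality's actual code):

```java
static MyState readState(StreamInput in) throws IOException {
    long size = in.readVLong();
    // NON_RECYCLING_INSTANCE neither recycles pages nor charges any circuit breaker,
    // so everything allocated here is unaccounted
    LongArray values = BigArrays.NON_RECYCLING_INSTANCE.newLongArray(size, true);
    for (long i = 0; i < size; i++) {
        values.set(i, in.readVLong());
    }
    return new MyState(values); // MyState is hypothetical
}
```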
Instead of accounting for memory, the reduce phase works differently: it takes 1.5 times the serialized size of all internal aggregations and assumes that this budget is sufficient for the lifetime of the request (see QueryPhaseResultConsumer::estimateRamBytesUsedForReduce). Several problems arise from this educated guess:
P.3 Reduce must not use more than 1.5 times the sum of serialized sizes of all internal aggregations
Note: The serialized size is a bad indicator, because serialization uses compression, e.g. variable-length encodings. The size in memory can also differ from the serialized size because helper structures like hash tables require extra memory.
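To make the mechanism concrete, this is roughly the idea behind the estimate (a sketch, not the actual implementation of QueryPhaseResultConsumer::estimateRamBytesUsedForReduce):

```java
// charged against the request circuit breaker once, up front, and kept for the
// whole lifetime of the request
static long estimateRamBytesUsedForReduce(long serializedSizeOfShardResults) {
    return (long) (serializedSizeOfShardResults * 1.5);
}
```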
P.4 BigArrays as part of AggregationReduceContext leads to double accounting of memory usage
AggregationReduceContext provides access to a BigArrays instance; however, if this instance is used, the memory is accounted for a second time. That's why it can only be used with care: otherwise memory is accounted for once via the serialized-size estimate described above and again for the actual allocation. A solution might be a hook to release the guess-based accounting and return to proper memory management; a sketch of what such a hook could look like follows below.
(Existing workaround: InternalCardinality does not use the bigarray instance but uses BigArrays.NON_RECYCLING_INSTANCE in reduce).
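Purely hypothetical sketch of such a hook; `releaseEstimatedReduceBytes`, `totalSize` and `buildResult` do not exist, the reduce signature is approximate, and the snippet only illustrates trading the guess-based reservation for real accounting through the context's BigArrays:

```java
public InternalAggregation reduce(List<InternalAggregation> aggregations, AggregationReduceContext ctx) {
    ctx.releaseEstimatedReduceBytes();   // hypothetical: give back the 1.5x reservation
    // real allocations now go through the accounted BigArrays exactly once
    LongArray merged = ctx.bigArrays().newLongArray(totalSize(aggregations), true);
    // ... merge the shard results into `merged` ...
    return buildResult(merged);
}
```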
By nature, a reduce usually frees more memory than it allocates, because it merges shard results. However, this is currently not taken into account.
P.5 The reduce phase blocks memory due to overestimation
The magic 1.5x reservation stays for the whole lifetime of the request, meaning it blocks memory even though parts of it might have been freed in the meantime. Parallel incoming requests might trip the circuit breaker although there is no actual memory shortage. This problem gets more severe the longer the aggregation executes during final reduce: the frequent_items aggregation can take seconds and therefore blocks memory for longer.
Summary
The problems are complex. The most concerning are P.3-P.5, because they can lead to circuit breakers kicking in without need. For complex aggregations like frequent_items, where the main part of the logic happens on final reduce, we must ensure that the aggregation does not use more than 1.5 times the sum of serialized sizes (P.3), even though this budget is not known in code. A first step might be to make this information accessible, e.g. as part of the reduce context.
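A hypothetical sketch of that first step - neither accessor exists today; the names only illustrate making the P.3 budget visible to the aggregation:

```java
// inside a reduce implementation
long budget = reduceContext.estimatedReduceBudgetInBytes(); // hypothetical: the 1.5x figure
if (estimatedMemoryNeededForReduce() > budget) {            // hypothetical estimate
    // the aggregation could degrade gracefully here (prune, spill, or fail with a clear error)
    // instead of silently overshooting and tripping the parent breaker
}
```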