This is a summary of memory accounting shortcomings in the aggregation framework, found while working on #83055
Related issues: #65019, #67476
The problem statement is complex and requires some explanation. The sub-problems are marked P.{N} below.
Intro
Lifecycle of an aggregation
Every aggregation runs through multiple stages: the collection phase (mapping), executed on every shard; post collection, where an InternalAggregation is built (Aggregator::buildAggregations); and the reduce phase, which is either partial (combine) or final (reduce).
Circuit breakers / memory accounting
To avoid running out of memory, Elasticsearch uses so-called circuit breakers. In a nutshell, a circuit breaker "measures" the "current" - our memory consumption - and interrupts the flow (with an exception) if it notices that consumption goes over budget. To make this work, every allocation must be registered and every free must be deducted again. This accounting happens globally - not per request - and must be accurate in all situations, meaning it must survive every abnormal execution flow.
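For illustration, a minimal sketch of that contract (the method, label and byte count here are illustrative; real code obtains the breaker from the CircuitBreakerService or the request context):

```java
import org.elasticsearch.common.breaker.CircuitBreaker;

// Sketch only: register an allocation, deduct it again on every path.
static void withTrackedMemory(CircuitBreaker breaker, long bytes) {
    breaker.addEstimateBytesAndMaybeBreak(bytes, "my_agg_structure"); // may trip the breaker
    try {
        // ... allocate and use a structure of roughly `bytes` size ...
    } finally {
        // the deduction must happen on every path, including abnormal execution flows
        breaker.addWithoutBreaking(-bytes);
    }
}
```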
Circuit breakers and aggregations
The aggregation framework provides access to the circuit breaker as part of the AggregationContext and, with some syntactic sugar, via AggregatorBase::addRequestCircuitBreakerBytes, which takes care of releasing the bytes in close(). In addition, the AggregationContext provides access to BigArrays, an abstraction for large arrays in memory. BigArrays uses the circuit breaker from the context.
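A rough sketch of how an aggregator uses these two facilities. It assumes the snippet sits inside a subclass of AggregatorBase (constructor, collector and remaining methods omitted), that `context` is the AggregationContext handed to the aggregator, and `expectedBuckets` is a made-up value:

```java
// bytes accounted via this helper are released again automatically in AggregatorBase#close()
addRequestCircuitBreakerBytes(64 * 1024);

// BigArrays obtained from the AggregationContext charge the same request circuit breaker
long expectedBuckets = 1024; // illustrative
LongArray counts = context.bigArrays().newLongArray(expectedBuckets, /* clearOnResize */ true);
```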
Problem statements
Problem: AggregationContext only covers 1/3 of the lifecycle
The AggregationContext is only available in the collection phase. Although it is still available when InternalAggregations are built, it gets released shortly after, while the InternalAggregations stay in memory or get serialized and sent over the wire. Because of that ordering, these data structures must not use memory allocated from the BigArrays instance of the AggregationContext.
P.1 Building the InternalAggregation lacks memory accounting / Big array support
(Existing workarounds: Cardinality deep-copies the HLL++ structure (see CardinalityAggregator); frequent_items disconnects the agg context circuit breaker, which avoids the copy and with it the risk of running out of memory during the deep copy.)
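For illustration, a generic sketch of that deep-copy pattern (not the actual CardinalityAggregator code): state that lives in the context's breaker-accounted BigArrays is copied into plain, unaccounted heap memory before the InternalAggregation is built, because the context and its BigArrays are released shortly after.

```java
// copy breaker-accounted BigArrays state into plain heap memory before building the
// InternalAggregation - the copy is invisible to the circuit breaker, which is the P.1 gap
static long[] copyOutOfBigArrays(LongArray accounted) {
    long[] plain = new long[Math.toIntExact(accounted.size())];
    for (int i = 0; i < plain.length; i++) {
        plain[i] = accounted.get(i);
    }
    return plain;
}
```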
The InternalAggregation might be sent over the wire, meaning it gets serialized and deserialized. Deserialization happens before reduce and lacks access to memory accounting structures, e.g. a BigArrays instance.
P.2 Deserialization of InternalAggregation lacks access to CB/BigArrays
(Existing workaround: Cardinality (see InternalCardinality) uses BigArrays.NON_RECYCLING_INSTANCE. Note that NON_RECYCLING is a misleading name: the instance is not only non-recycling but also non-accounting.)
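A sketch of what such a read path looks like for a hypothetical aggregation (MyState is a made-up holder type, not InternalCardinality's actual code):

```java
static MyState readState(StreamInput in) throws IOException {
    long size = in.readVLong();
    // NON_RECYCLING_INSTANCE neither recycles pages nor charges any circuit breaker,
    // so everything allocated here is unaccounted
    LongArray values = BigArrays.NON_RECYCLING_INSTANCE.newLongArray(size, true);
    for (long i = 0; i < size; i++) {
        values.set(i, in.readVLong());
    }
    return new MyState(values); // MyState is hypothetical
}
```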
Instead of accounting for memory, the reduce phase works differently: it takes 1.5 times the serialized size of all internal aggregations and assumes that this budget is sufficient for the lifetime of the request (see QueryPhaseResultConsumer::estimateRamBytesUsedForReduce). Several problems arise from this educated guess:
P.3 Reduce must not use more than 1.5 times the sum of serialized sizes of all internal aggregations
Note: The serialized size is a bad indicator, because serialization uses compression, e.g. variable-length encodings. The size in memory can also differ from the serialized size because helper structures like hash tables require extra memory.
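To make the mechanism concrete, this is roughly the idea behind the estimate (a sketch, not the actual implementation of QueryPhaseResultConsumer::estimateRamBytesUsedForReduce):

```java
// charged against the request circuit breaker once, up front, and kept for the
// whole lifetime of the request
static long estimateRamBytesUsedForReduce(long serializedSizeOfShardResults) {
    return (long) (serializedSizeOfShardResults * 1.5);
}
```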
P.4 BigArrays as part of AggregationReduceContext leads to double accounting of memory usage
AggregationReduceContext provides access to a BigArrays instance; however, if this instance is used, the memory is accounted for a second time. That's why it can only be used with care: otherwise memory is accounted for once via the serialized-size estimate described above and again for the actual allocation. A solution might be a hook to release the guess-based accounting and return to proper memory management; a sketch of what such a hook could look like follows below.
(Existing workaround: InternalCardinality does not use the bigarray instance but uses BigArrays.NON_RECYCLING_INSTANCE in reduce).
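Purely hypothetical sketch of such a hook; `releaseEstimatedReduceBytes`, `totalSize` and `buildResult` do not exist, the reduce signature is approximate, and the snippet only illustrates trading the guess-based reservation for real accounting through the context's BigArrays:

```java
public InternalAggregation reduce(List<InternalAggregation> aggregations, AggregationReduceContext ctx) {
    ctx.releaseEstimatedReduceBytes();   // hypothetical: give back the 1.5x reservation
    // real allocations now go through the accounted BigArrays exactly once
    LongArray merged = ctx.bigArrays().newLongArray(totalSize(aggregations), true);
    // ... merge the shard results into `merged` ...
    return buildResult(merged);
}
```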
By nature, a reduce usually frees more memory than it allocates, because it merges shard results. However, this is currently not taken into account.
P.5 The reduce phase blocks memory due to overestimation
The magic 1.5x reservation stays for the whole lifetime of the request, meaning it blocks memory even though parts of it might have been freed in the meantime. Parallel incoming requests might trip the circuit breaker although there is no actual memory shortage. This problem gets more severe the longer the aggregation executes during final reduce: the frequent_items aggregation can take seconds and therefore blocks memory for longer.
Summary
The problems are complex. The most concerning are P.3-P.5, because they can lead to circuit breakers kicking in without need. For complex aggregations like frequent_items, where the main part of the logic happens on final reduce, we must ensure that the aggregation does not use more than 1.5 times the sum of serialized sizes (P.3), even though this budget is not known in code. A first step might be to make this information accessible, e.g. as part of the reduce context.
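A hypothetical sketch of that first step - neither accessor exists today; the names only illustrate making the P.3 budget visible to the aggregation:

```java
// inside a reduce implementation
long budget = reduceContext.estimatedReduceBudgetInBytes(); // hypothetical: the 1.5x figure
if (estimatedMemoryNeededForReduce() > budget) {            // hypothetical estimate
    // the aggregation could degrade gracefully here (prune, spill, or fail with a clear error)
    // instead of silently overshooting and tripping the parent breaker
}
```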