Description
The Problem
Currently, when a node experiences heavy load, we continue to read from connections as fast as possible in most cases (recovery, snapshots, etc. have throughput limits, but essentially all other requests don't). This creates two issues:
Higher than Necessary Memory Use
- We're reading and partially deserializing messages that we know we can't process
Since we never throttle reads, connections that keep writing to a node that can't handle the throughput can eventually cause the node to run out of heap memory (without the real-memory circuit breaker in place). This is a result of requests that run on a thread pool without a size limit (e.g. the generic pool) queuing up in deserialized form without bound.
Also, without throttling reads, the network buffer memory used by a node (currently that is on-heap as well) is, in the ideal case, proportional to the product of the average message rate per connection, the average message size, and the number of connections. Under heavy load, with the circuit breaker interrupting requests, even this relationship might not hold if connections retry the interrupted requests. So currently, under heavy load, heap pressure can result in additional heap pressure from network buffers and thus sub-optimal scaling.
Under heavy load, every message that is read and then failed by the circuit breaker costs buffer-pool memory to hold it, as well as some heap memory for deserializing the message's type/action and the like.
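To make the buffer-memory scaling described above concrete, here is a rough back-of-envelope estimate. All numbers below are made up purely for illustration; only the shape of the formula matters.

```java
public class BufferEstimate {
    public static void main(String[] args) {
        // Hypothetical figures for the "ideal case" described above; none of these
        // numbers come from a real cluster.
        long connections = 500;                  // open transport connections
        long messagesPerSecPerConnection = 200;  // average message rate per connection
        long avgMessageSizeBytes = 16 * 1024;    // average message size
        double avgBufferedSeconds = 0.05;        // time a message sits in buffers before it is handled

        long bufferedBytes = (long) (connections * messagesPerSecPerConnection
                * avgMessageSizeBytes * avgBufferedSeconds);
        // ~78 MB with these numbers; the figure grows with every factor and is unbounded
        // once the node falls behind and avgBufferedSeconds grows, i.e. exactly under heavy load.
        System.out.println(bufferedBytes / (1024 * 1024) + " MB of on-heap network buffers");
    }
}
```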
Liveness Issues due to Real Memory Circuit Breaker
- Heavy load from one type of request can prevent other requests from executing reliably
Under heavy load, requests currently get interrupted by the real-memory circuit breaker, and all connections share a single memory pool. This introduces liveness issues such as:
- Heavy search load on a data node can lead to the circuit breaker killing a replication request, which in turn fails the replica.
- You could technically argue that a node limited by available memory and under sustained heavy search load can't hold a stable replica
- Search phases interfering with each other (see "Prioritize FetchPhase over QueryPhase Under High Search Workload" #39891)
Under heavy load, the system does not distribute resources fairly across connections, so a single misbehaving connection can monopolize a node. This behavior also makes the system less efficient in its memory use under heavy load than it is without memory pressure, because allocations are wasted on interrupted and retried requests.
Suggested Solution
To improve behavior under heavy load we should globally limit network buffer memory use by not reading from connections when no memory is available to deal with a message anyway. We can't just introduce a single global limit on network buffer memory use, though, as that would merely move the liveness issues mentioned above to a different layer.
Instead, we could do roughly the following (implementation details in the sub-points are still rough):
- Limit the amount of network buffer memory available per connection and stop reading from a connection once no memory is available (a rough sketch follows this list)
- For this to be efficient, we should move the full deserialization of messages off the IO thread (for messages that fork to another pool). This saves a lot of heap memory because we no longer have to store serialized requests/responses on-heap as part of tasks waiting to be executed, and deserialized messages only live on-heap for as long as they are actually needed. Connections that keep sending messages below the per-connection memory limit and whose messages are handled on the IO thread would always stay live. Also, we could make sure to only release the memory associated with a message once its corresponding task has completed, rather than when the message was deserialized, to get smoother behavior for e.g. heavy search requests on coordinating nodes.
- Have a global reserve pool that all connections can allocate from when they need memory beyond the per-connection limit, with fair waiting for allocations from it (see the second sketch after this list)
- Larger messages that exceed the per-connection limit would be throttled while waiting for memory from the reserve pool, without breaking liveness for other connections.
- Interrupt messages (like we currently do when the circuit breaker kicks in) when allocating enough memory from the global pool takes too long, in order to communicate back-pressure to the connection.
- The timeout for this interruption could depend on the action of the message. E.g. search requests could be interrupted aggressively after a short wait because the cost of retrying is low for the client, while search responses or replica requests could get a longer timeout due to the cost/waste associated with interrupting them.
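As a rough illustration of the first point, here is a minimal sketch assuming the Netty transport: a hypothetical inbound handler (not existing Elasticsearch code) tracks how many unprocessed message bytes a connection has outstanding and toggles `autoRead` so we stop reading from the socket once the per-connection budget is exhausted, letting TCP flow control push back on the sender.

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Hypothetical sketch: pause reads on a connection once it has more than
// `limitBytes` of unprocessed message bytes outstanding.
public class PerConnectionReadThrottle extends ChannelInboundHandlerAdapter {

    private final long limitBytes;
    private long outstandingBytes; // only touched on the channel's event loop

    public PerConnectionReadThrottle(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        if (msg instanceof ByteBuf) {
            outstandingBytes += ((ByteBuf) msg).readableBytes();
            if (outstandingBytes >= limitBytes) {
                // Stop scheduling reads; the kernel socket buffer fills up and
                // TCP flow control throttles the sender.
                ctx.channel().config().setAutoRead(false);
            }
        }
        ctx.fireChannelRead(msg);
    }

    // Called when a message's memory is released, ideally only once its task has
    // completed. Runs on the event loop so the accounting stays single-threaded.
    public void release(ChannelHandlerContext ctx, long bytes) {
        ctx.channel().eventLoop().execute(() -> {
            outstandingBytes -= bytes;
            if (outstandingBytes < limitBytes && ctx.channel().config().isAutoRead() == false) {
                ctx.channel().config().setAutoRead(true);
            }
        });
    }
}
```

Releasing the budget only when the task completes (rather than when the message is deserialized) is what gives the smoother behavior for heavy requests mentioned above.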
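And a minimal sketch of the reserve-pool idea: a fair semaphore over bytes, so that waiting connections are served in FIFO order, combined with an action-dependent timeout that decides when to give up and interrupt the message. Class and method names are hypothetical, not existing Elasticsearch code.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the shared reserve pool that connections fall back to
// when a message exceeds their per-connection budget.
public class ReservePool {

    private final Semaphore reserveBytes;

    public ReservePool(int reserveSizeBytes) {
        // `true` = fair: waiters are served FIFO, so a single misbehaving
        // connection cannot starve the others indefinitely.
        this.reserveBytes = new Semaphore(reserveSizeBytes, true);
    }

    // Try to reserve memory for a message, waiting up to an action-dependent timeout:
    // short for cheap-to-retry requests (e.g. search requests), longer for messages
    // that are expensive to waste (e.g. replica requests, search responses).
    public boolean acquire(int bytes, long timeoutMillis) throws InterruptedException {
        return reserveBytes.tryAcquire(bytes, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    // Return the memory once the message's corresponding task has completed.
    public void release(int bytes) {
        reserveBytes.release(bytes);
    }
}
```

Usage would roughly be: try to acquire the message's size with the timeout for its action, interrupt/reject the message (as the circuit breaker does today) if the acquire times out, and release the bytes once the corresponding task has completed.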