Description
If the cluster shuts down while updating the root repository data blob then it will set BlobStoreRepository#uncleanStart on startup, which causes Elasticsearch to skip the caching of RepositoryData in favour of reading the blob afresh from the repository each time it's needed.
If on startup ILM finds indices waiting to move to the searchable snapshot phase then it will attempt to create snapshots of each such index. Each create-snapshot task holds a reference to the RepositoryData it captured when the task was submitted.
The trouble is that each RepositoryData instance could be tens of MBs in size, and while uncleanStart is set there is no sharing between these instances. In the case I saw, RepositoryData was ~58MiB and there were 17 create-snapshot tasks in the queue, so these tasks alone consumed almost 1GiB of heap. There were also 6 snapshot_meta threads all busy loading more copies of RepositoryData, with a total of 530MiB of local state.
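For reference, the arithmetic behind those figures can be sketched as follows. This is only a back-of-the-envelope estimate using the numbers reported above; the function name `estimate_heap_mib` is made up for illustration and is not part of Elasticsearch.

```python
MIB_PER_GIB = 1024


def estimate_heap_mib(repo_data_mib, queued_tasks, loader_state_mib):
    """Estimate heap held by unshared RepositoryData copies, in MiB.

    Assumes one full copy per queued create-snapshot task, plus the
    local state of the threads that are loading further copies.
    """
    task_copies_mib = repo_data_mib * queued_tasks
    return task_copies_mib, task_copies_mib + loader_state_mib


# Figures from this report: ~58MiB per copy, 17 queued tasks,
# 530MiB of local state across the snapshot_meta threads.
task_mib, total_mib = estimate_heap_mib(58, 17, 530)
print(f"queued tasks: {task_mib} MiB ({task_mib / MIB_PER_GIB:.2f} GiB)")
print(f"with loaders: {total_mib} MiB ({total_mib / MIB_PER_GIB:.2f} GiB)")
```

With caching enabled the queued tasks would share a single instance, so the first figure would be closer to 58MiB.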
Relates #77466
Workaround
Clearing the uncleanStart flag should restore the caching (and hence sharing) of RepositoryData:
- Disable ILM (needs to happen immediately after startup before it triggers any snapshots).
- Take a single snapshot manually to complete the pending write of the root metadata blob. The content of the snapshot doesn't matter, so you may as well restrict it to just a single small index.
- When that snapshot completes, it is safe to enable ILM again.
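The steps above might map onto REST calls roughly as sketched below. This is a sketch under assumptions, not a tested procedure: it assumes "disable ILM" means stopping it via `POST /_ilm/stop`, and the repository, snapshot and index names are placeholders. The helper `workaround_requests` is hypothetical and only builds the requests; it does not send anything.

```python
import json


def workaround_requests(repository, snapshot, small_index):
    """Return the (method, path, body) triples for the workaround, in order."""
    return [
        # 1. Stop ILM so it does not trigger any more snapshots.
        ("POST", "/_ilm/stop", None),
        # 2. Take one small snapshot and wait for it to complete, which
        #    finishes the pending write of the root metadata blob.
        (
            "PUT",
            f"/_snapshot/{repository}/{snapshot}?wait_for_completion=true",
            {"indices": small_index, "include_global_state": False},
        ),
        # 3. Start ILM again once the snapshot has completed.
        ("POST", "/_ilm/start", None),
    ]


# Placeholder names, chosen only for illustration.
for method, path, body in workaround_requests("my-repo", "clear-unclean-start", "small-index"):
    print(method, path)
    if body is not None:
        print(json.dumps(body))
```

Note that stopping ILM is asynchronous, so it may be worth checking `GET /_ilm/status` before taking the snapshot.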