
Conversation


@joshua-adams-1 joshua-adams-1 commented Sep 10, 2025

Skip unnecessary loading of IndexMetadata during snapshot deletion

During snapshot deletion we load the metadata for an index into heap purely to calculate the shard count. On nodes with small heaps, loading multiple such objects concurrently can cause the node to run out of memory (OOM).

Rather than recomputing the shard count for the same index across multiple snapshots, this change stores the shard count in a map, so we no longer load multiple IndexMetadata objects into heap for the same index.

This means that for commands such as:

assertAcked(client().admin().cluster().prepareDeleteSnapshot(timeout, repoName, snapshotName1, snapshotName2).get())

assuming both snapshots include Index1, we only load Index1's metadata once.

For commands such as:

assertAcked(client().admin().cluster().prepareDeleteSnapshot(timeout, repoName, snapshotName1).get())
assertAcked(client().admin().cluster().prepareDeleteSnapshot(timeout, repoName, snapshotName2).get())

since these are different delete-snapshot requests, we load Index1's metadata once for each request.
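
A minimal sketch of the deduplication idea (loadIndexMetadata is a hypothetical stand-in for the real metadata blob read, not the actual code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Cache the shard count per index UUID so the (potentially large)
// IndexMetadata blob is read at most once per delete operation.
final Map<String, Integer> indexUUIDToShardCountMap = new ConcurrentHashMap<>();

int shardCountFor(String indexUUID) {
    // The blob read only happens on the first lookup for a given index UUID;
    // loadIndexMetadata is hypothetical here.
    return indexUUIDToShardCountMap.computeIfAbsent(
        indexUUID,
        uuid -> loadIndexMetadata(uuid).getNumberOfShards()
    );
}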

Relates to: #131822

Closes ES-12539
Closes ES-12538
Closes elastic#100569

elasticsearchmachine and others added 3 commits September 10, 2025 10:47

-     + indexMetaData.getMappingVersion()
-     + "-"
-     + indexMetaData.getAliasesVersion();
+ return indexMetaData.getIndexUUID() + DELIMITER + indexMetaData.getSettings()
Contributor Author

I prefer the other format, but Spotless keeps reformatting it to this.

@joshua-adams-1 joshua-adams-1 added >non-issue :Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. labels Sep 10, 2025
Comment on lines 177 to 178
// Cannot use - as a delimiter because getIndexUUID() produces UUIDs with - in them
private static final String DELIMITER = "/";
Contributor

For better or for worse I don't think we can change this. There are repositories out there in production that use - as their delimiter; we can't just ignore them like this.

Contributor Author

AFAICT this identifier is used as a UUID for the index metadata object, not for a repository.

Contributor Author

That being said, when I think about it, if this code is deployed to a cluster already with a repository, and already with index metadata classes instantiated (with - inside their UUIDs) this code would break, right?

Contributor

Yes exactly.

/**
* Maps the Index UUID to its shard count
*/
private final ConcurrentMap<String, Integer> indexUUIDToShardCountMap = new ConcurrentHashMap<>();
Contributor

This is going to grow to enormous size. I'd much prefer to scope this tracking down to just a single IndexSnapshotsDeletion which gets discarded as soon as we've finished processing that index.

Comment on lines 1294 to 1295
for (SnapshotId snapshotId : snapshotIds.stream().filter(snapshotsWithIndex::contains).collect(Collectors.toSet())) {
snapshotExecutor.execute(ActionRunnable.run(listeners.acquire(), () -> {
Contributor

All the IDs in snapshotIds are already unique so there's no need to build a separate Set if we're doing it like this. But I don't think we should remove the .map() call here, we should process each indexMetadataId once, rather than spawning a task for every single snapshot.
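
A sketch of the suggested shape (getOneShardCount comes from the diff quoted later in this review; the stream wiring here is illustrative, not the exact code):

// Map each snapshot to its index metadata blob ID, de-duplicate, and fork
// one task per distinct blob rather than one task per snapshot.
snapshotIds.stream()
    .filter(snapshotsWithIndex::contains)
    .map(snapshotId -> indexMetaDataGenerations.indexMetaBlobId(snapshotId, indexId))
    .distinct()
    .forEach(blobId -> snapshotExecutor.execute(
        ActionRunnable.run(listeners.acquire(), () -> getOneShardCount(blobId))
    ));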

Contributor Author

I dropped the map since I need to use the snapshotId to generate two subsequent IDs. I'm not sure how I can (cleanly) write the for loop to include the map, but also generate the required indexMetadataId from the snapshotId

Contributor

Are you sure you need to call snapshotIndexMetadataIdentifier yourself? I think calling indexMetaDataGenerations().indexMetaBlobId() is sufficient. If the ID that this returns doesn't contain an index UUID and all the other junk (i.e. it's just the plain snapshot UUID) then you have to load it.

@DaveCTurner
Contributor

> would not load any index metadata into heap, assuming the repository was not deleted.

I think the safest way to do this would be to record the (max) shard count for each index in the RepositoryData stored in the repository. We can't reasonably keep it in memory on the master, updating it as we finalize snapshots, because then it'd be lost on a master failover.

@joshua-adams-1 joshua-adams-1 marked this pull request as ready for review October 22, 2025 08:26
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Oct 22, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Contributor

@DaveCTurner DaveCTurner left a comment

Some comments on the production code. I skimmed through the tests and they look to be doing the right sort of thing, will come back to them in a later cycle.

// unnecessary to read multiple metadata blobs corresponding to the same index UUID.
// TODO Skip this unnecessary work? Maybe track the shard count in RepositoryData?
snapshotExecutor.execute(ActionRunnable.run(listeners.acquire(), () -> getOneShardCount(indexMetaGeneration)));
snapshotExecutor.execute(ActionRunnable.run(listeners.acquire(), () -> {
Contributor

I'd rather we didn't fork the no-op tasks - can we check the index UUID first?

/**
* Maps the Index UUID to its shard count
*/
private final ConcurrentMap<String, Integer> indexUUIDToShardCountMap = new ConcurrentHashMap<>();
Contributor

No need for a map here, we're looking for the max so we can just update the shardCount field as needed.
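
Since the shard-count tasks run concurrently (see the next thread), a single field would need atomic max semantics; a minimal sketch, reusing the updateShardCount name from the diff below:

import java.util.concurrent.atomic.AtomicInteger;

// Track only the running maximum shard count instead of a whole map.
final AtomicInteger shardCount = new AtomicInteger(0);

void updateShardCount(int newCount) {
    // Atomically keep the larger of the current and new values.
    shardCount.accumulateAndGet(newCount, Math::max);
}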

Comment on lines 1325 to 1327
// indexUUIDToShardCountMap is shared across all threads. Therefore, while there may be an entry for this
// UUID in the map, there is no guarantee that we've encountered it in this thread.
updateShardCount(indexUUIDToShardCountMap.get(indexUUID));
Contributor

This isn't necessary, we don't read the shardCount field until all these tasks have finished.

// index UUID; the shard count is going to be the same for all metadata with the same index UUID, so it is
// unnecessary to read multiple metadata blobs corresponding to the same index UUID.
String indexUUID = originalRepositoryData.indexMetaDataGenerations().convertBlobIdToIndexUUID(blobId);
if (indexUUIDToShardCountMap.containsKey(indexUUID) == false) {
Contributor

This still risks duplicates since we call containsKey separately from put. Once we're just tracking the set of the index UUIDs we can just call add here which returns a flag indicating whether the item was newly added.
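
A sketch of the set-based approach being suggested (names assumed from the surrounding diff):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// A concurrent set: add() returns false if the UUID was already present,
// so check-and-insert is a single atomic step with no containsKey/put race.
final Set<String> seenIndexUUIDs = ConcurrentHashMap.newKeySet();

if (seenIndexUUIDs.add(indexUUID)) {
    // First time we've seen this index UUID: read its metadata blob.
}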

// NB since 7.9.0 we deduplicate index metadata blobs, and one of the components of the deduplication key is the
// index UUID; the shard count is going to be the same for all metadata with the same index UUID, so it is
// unnecessary to read multiple metadata blobs corresponding to the same index UUID.
String indexUUID = originalRepositoryData.indexMetaDataGenerations().convertBlobIdToIndexUUID(blobId);
Contributor

This will be null if the index metadata blob is in the pre-7.9.0 format, but we have to load all such blobs. Not sure if the map accepts a null key, but even if it does it won't accept more than one of them.

Comment on lines 209 to 210
for (Map.Entry<String, String> entry : this.identifiers.entrySet()) {
if (Objects.equals(entry.getValue(), blobId)) {
Contributor

Rather than a linear scan for each blob ID, could we compute a Map<String,String> which inverts identifiers and use that for all the lookups?
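
A minimal sketch of the inversion (this assumes blob IDs are unique values in identifiers, since Collectors.toMap throws on duplicate keys):

import java.util.Map;
import java.util.stream.Collectors;

// Build the reverse map (blobId -> identifier) once, then do O(1) lookups
// instead of a linear scan per blob ID.
Map<String, String> blobIdToIdentifier = identifiers.entrySet()
    .stream()
    .collect(Collectors.toMap(Map.Entry::getValue, Map.Entry::getKey));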

// The uniqueIdentifier was built in buildUniqueIdentifier, is of the format IndexUUID-String-long-long-long,
// and uses '-' as a delimiter.
// The below regex accounts for the fact that the IndexUUID can also contain the '-' character
Pattern pattern = Pattern.compile("^(.*?)-[^-]+-\\d+-\\d+-\\d+$");
Contributor

The index UUID (and the history UUID) are fixed-length. Can we just split the string at the appropriate index?

Contributor Author

@joshua-adams-1 joshua-adams-1 Oct 23, 2025

AFAICT the length is set to 22, or set to _na_ if unknown. Will simplify this accordingly.

- Remove unreferenced method
- Rename convertBlobIdToIndexUUID to getIndexUUIDFromBlobId
- Tweak comments
Contributor

@DaveCTurner DaveCTurner left a comment

Sorry got to dash but I saw a couple more things. Will give it a more thorough look next week.

/**
* Map of blob uuid to index metadata identifier. This is a reverse lookup of the identifiers map.
*/
final Map<String, String> blobUuidToIndexMetadataMap;
Contributor

We only need this for a snapshot deletion, and it could be pretty big, so I'd suggest we don't pre-compute it for every IndexMetaDataGenerations instance. Instead, let's have a method that computes and returns this map on demand which the deletion calls once up-front.
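
Something like the following shape, perhaps (indexUuidFromIdentifier is a hypothetical helper; per the later comment in this review, storing just the index UUID as the value also shrinks the map):

import java.util.Map;
import java.util.stream.Collectors;

// Computed on demand, not stored as a field, so only a deletion pays the
// cost of building this map, and pays it once up-front.
public Map<String, String> getBlobIdToIndexUuidMap() {
    return identifiers.entrySet()
        .stream()
        .collect(Collectors.toMap(
            Map.Entry::getValue,                      // blob ID
            e -> indexUuidFromIdentifier(e.getKey())  // hypothetical: extract the UUID prefix
        ));
}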

Comment on lines 234 to 235
if (na) {
return ClusterState.UNKNOWN_UUID;
Contributor

I don't think this is possible for a blob that actually ended up in the repository - we use _na_ for the index UUID only if the index hasn't been created yet, but in that case we couldn't possibly have snapshotted it.

return ClusterState.UNKNOWN_UUID;
}
assert uniqueIdentifier.length() >= 22;
return uniqueIdentifier.substring(0, 22);
Contributor

Rather than doing this parsing on demand I'd suggest we put the substring in the blobUuidToIndexMetadataMap right away. That'll reduce the size of all the values in this map by more than half.

Comment on lines 207 to 210
// Map of blob id to index uuid. This is a reverse lookup of the identifiers map.
final Map<String, String> blobUuidToIndexUUIDMap = identifiers.entrySet()
.stream()
.collect(
Contributor

This is going to walk identifiers and construct a new complete map for every single call to getIndexUUIDFromBlobId (of which there may be thousands++). We need to construct this map once at the start of the snapshot deletion process and then use it for each lookup.

Comment on lines 414 to 415
// Wait for the cluster to be green after deletion
ensureGreen();
Contributor

Hmm this seems like a surprising side-effect of this method: if the cluster was green before the delete then it'll be green immediately after (no shards will have become unassigned) and conversely if the cluster wasn't green first then maybe we shouldn't wait for it to become green here.

// NB if the index metadata blob is in the pre-7.9.0 format then this will return null
String indexUUID = originalRepositoryData.indexMetaDataGenerations().getIndexUUIDFromBlobId(blobId);

// Without an index UUID, we don't know if we've encountered this index before and must read it's IndexMetadata
Contributor

grammar nit

Suggested change
// Without an index UUID, we don't know if we've encountered this index before and must read it's IndexMetadata
// Without an index UUID, we don't know if we've encountered this index before and must read its IndexMetadata

Contributor

@DaveCTurner DaveCTurner left a comment

Given the massive savings we've been able to find in #137210 I wonder if we should defer this for now. Computing blobIdToIndexUuidMap is kinda expensive, even if we do it just once, and its expense scales with the size of the repository rather than the number of indices being deleted; if we're only deleting a few indices then most of that expense is wasted. We can find ways to refine this for sure, and it's kinda nice that this might reduce the number of blobs we read during a large delete, but I don't think this is critical once #137210 lands and I worry that it might end up causing more problems than it solves.

private void determineShardCount(ActionListener<Void> listener) {
try (var listeners = new RefCountingListener(listener)) {
for (final var indexMetaGeneration : snapshotIds.stream()
Map<String, String> blobIdToIndexUuidMap = originalRepositoryData.indexMetaDataGenerations().getBlobIdToIndexUuidMap();
Contributor

This gets called once per index, which could still be (hundreds-of-)thousands of times, but it returns the same value each time. It really needs to be done once for the entire snapshot deletion process.

Contributor Author

I've made the above suggestion irrespective of whether we merge this change 👍

@joshua-adams-1
Contributor Author

> I wonder if we should defer this for now

@DaveCTurner I agree that there is a computation cost in generating the blobIdToIndexUuidMap, even if this computation is moved into the SnapshotsDeletion class and therefore only computed once per snapshot deletion. However, for large deletes, there will be a benefit in reducing the number of reads into heap, especially given that #137210 only reduces allocations and not the amount of time/work during parsing (since we still have to read to the end of the object).

Therefore, I don't want to introduce "magic numbers", but is it worth considering logic of "if this is a big delete then compute this map so we don't have to keep reading from heap memory, else set this map to null so that we're forced to read from the heap every time, but we don't care because we're only deleting 3 indices and it's quicker that way"? We could use a setting for this value, whatever we decide that to be.

/**
* A map of blob id to index UUID
*/
private final Map<String, String> blobIdToIndexUuidMap;
Contributor

As a field here we will be retaining this map until the very end of the post-deletion cleanup when the SnapshotsDeletion instance becomes unreachable, but that will overlap with the next snapshot operation (which may be another deletion, generating another such map, etc).

However, we only need this map for the determineShardCount calls at the start of each IndexSnapshotsDeletion, all of which happen before we update the RepositoryData root blob. I'd rather we dropped this potentially-large map as soon as possible, and definitely before allowing the next snapshot operation to proceed. It'd be best if it were a local variable computed in writeUpdatedShardMetadataAndComputeDeletes and passed as an argument to each determineShardCount call via IndexSnapshotsDeletion#run.

@DaveCTurner
Contributor

> is it worth considering logic of "if this is a big delete then compute this map so we don't have to keep reading from heap memory, else set this map to null so that we're forced to read from the heap every time, but we don't care because we're only deleting 3 indices and it's quicker that way"?

Right yeah if we really wanted to do this that would be something worth considering. The time taken to generate this map is less of a concern to me than the failures it might cause: with this change a tiny master may now go OOM during every snapshot delete if the repository has so many indices that this map's size becomes overwhelming. We could make it conditional (maybe even a function of the heap space available on the master and the size of the repository) and that might have been worthwhile when we started but the extra complexity carries its own costs and #137210 reduces the marginal benefits of this change.

@ywangd
Member

ywangd commented Nov 12, 2025

> #137210 reduces the marginal benefits of this change.

I was wondering whether we still need this PR. My hunch is no. It might be helpful in certain theoretical cases. But still probably better to address them when actually observed?

@joshua-adams-1
Contributor Author

Closing as per the above conversation

Labels

:Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. >non-issue Team:Distributed Coordination Meta label for Distributed Coordination team v9.3.0
