Skip unnecessary loading of IndexMetadata during snapshot deletion
#134441
Conversation
During snapshot deletion we load the metadata for an index into heap purely to calculate the shard count. On nodes with small heaps, loading multiple objects concurrently causes them to OOMe. Rather than recomputing the shard count value for the same index across multiple snapshots, this change stores the shard count in a map. This stops loading multiple IndexMetaData objects for the same index into memory. Closes ES-12539
```java
    + indexMetaData.getMappingVersion()
    + "-"
    + indexMetaData.getAliasesVersion();
return indexMetaData.getIndexUUID() + DELIMITER + indexMetaData.getSettings()
```
I prefer the other format but spotless keeps reformatting it to this
```java
// Cannot use - as a delimiter because getIndexUUID() produces UUIDs with - in them
private static final String DELIMITER = "/";
```
For better or for worse I don't think we can change this. There are repositories out there in production that use - as their delimiter, we can't just ignore them like this.
AFAICT this identifier is used as a UUID for the index meta data object, not for a repository
That being said, when I think about it, if this code is deployed to a cluster already with a repository, and already with index metadata classes instantiated (with - inside their UUIDs) this code would break, right?
Yes exactly.
```java
/**
 * Maps the Index UUID to its shard count
 */
private final ConcurrentMap<String, Integer> indexUUIDToShardCountMap = new ConcurrentHashMap<>();
```
This is going to grow to enormous size. I'd much prefer to scope this tracking down to just a single IndexSnapshotsDeletion which gets discarded as soon as we've finished processing that index.
```java
for (SnapshotId snapshotId : snapshotIds.stream().filter(snapshotsWithIndex::contains).collect(Collectors.toSet())) {
    snapshotExecutor.execute(ActionRunnable.run(listeners.acquire(), () -> {
```
All the IDs in snapshotIds are already unique so there's no need to build a separate Set if we're doing it like this. But I don't think we should remove the .map() call here, we should process each indexMetadataId once, rather than spawning a task for every single snapshot.
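A minimal, self-contained sketch of this suggestion (the `indexMetaBlobId` helper below is a hypothetical stand-in for the real repository lookup): mapping each snapshot ID to its metadata blob ID and then deduplicating means one task is spawned per distinct blob, not per snapshot.

```java
import java.util.List;
import java.util.stream.Collectors;

public class DistinctBlobIdsSketch {
    // Hypothetical stand-in: in the real code, several snapshots of an index
    // can share one deduplicated metadata blob.
    static String indexMetaBlobId(String snapshotId) {
        return snapshotId.startsWith("snap") ? "meta-blob-1" : "meta-blob-2";
    }

    // Map each snapshot ID to its blob ID, then deduplicate, so downstream
    // work runs once per distinct metadata blob.
    static List<String> distinctBlobIds(List<String> snapshotIds) {
        return snapshotIds.stream()
            .map(DistinctBlobIdsSketch::indexMetaBlobId)
            .distinct()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(distinctBlobIds(List.of("snap-a", "snap-b", "other")));
        // [meta-blob-1, meta-blob-2]
    }
}
```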
I dropped the map since I need to use the snapshotId to generate two subsequent IDs. I'm not sure how I can (cleanly) write the for loop to include the map, but also generate the required indexMetadataId from the snapshotId
Are you sure you need to call snapshotIndexMetadataIdentifier yourself? I think calling indexMetaDataGenerations().indexMetaBlobId() is sufficient. If the ID that this returns doesn't contain an index UUID and all the other junk (i.e. it's just the plain snapshot UUID) then you have to load it.
I think the safest way to do this would be to record the (max) shard count for each index in the RepositoryData.
Adds extra test capability to handle deleted indices
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
DaveCTurner left a comment:
Some comments on the production code. I skimmed through the tests and they look to be doing the right sort of thing, will come back to them in a later cycle.
```java
// unnecessary to read multiple metadata blobs corresponding to the same index UUID.
// TODO Skip this unnecessary work? Maybe track the shard count in RepositoryData?
snapshotExecutor.execute(ActionRunnable.run(listeners.acquire(), () -> getOneShardCount(indexMetaGeneration)));
snapshotExecutor.execute(ActionRunnable.run(listeners.acquire(), () -> {
```
I'd rather we didn't fork the no-op tasks - can we check the index UUID first?
```java
/**
 * Maps the Index UUID to its shard count
 */
private final ConcurrentMap<String, Integer> indexUUIDToShardCountMap = new ConcurrentHashMap<>();
```
No need for a map here, we're looking for the max so we can just update the shardCount field as needed.
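A sketch of this simplification, assuming counts may be reported from several snapshot-pool threads (class and method names are hypothetical): track only the running maximum rather than a per-UUID map.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class MaxShardCountSketch {
    // A single running maximum; no per-UUID map is needed when only the max
    // is read after all tasks have completed.
    private static final AtomicInteger shardCount = new AtomicInteger(0);

    static void updateShardCount(int newCount) {
        // Atomically keep the larger of the current and the reported count
        shardCount.accumulateAndGet(newCount, Math::max);
    }

    static int getShardCount() {
        return shardCount.get();
    }

    public static void main(String[] args) {
        updateShardCount(3);
        updateShardCount(5);
        updateShardCount(2);
        System.out.println(getShardCount()); // 5
    }
}
```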
```java
// indexUUIDToShardCountMap is shared across all threads. Therefore, while there may be an entry for this
// UUID in the map, there is no guarantee that we've encountered it in this thread.
updateShardCount(indexUUIDToShardCountMap.get(indexUUID));
```
This isn't necessary, we don't read the shardCount field until all these tasks have finished.
```java
// index UUID; the shard count is going to be the same for all metadata with the same index UUID, so it is
// unnecessary to read multiple metadata blobs corresponding to the same index UUID.
String indexUUID = originalRepositoryData.indexMetaDataGenerations().convertBlobIdToIndexUUID(blobId);
if (indexUUIDToShardCountMap.containsKey(indexUUID) == false) {
```
This still risks duplicates since we call containsKey separately from put. Once we're just tracking the set of the index UUIDs we can just call add here which returns a flag indicating whether the item was newly added.
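The check-then-put race described here disappears with `Set#add`, which returns `true` only on first insertion. A minimal sketch (names hypothetical):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DedupSketch {
    private static final Set<String> seenIndexUuids = ConcurrentHashMap.newKeySet();

    // Returns true exactly once per UUID, even when called from many threads,
    // because Set#add atomically reports whether the element was newly inserted.
    static boolean shouldLoadMetadata(String indexUUID) {
        return seenIndexUuids.add(indexUUID);
    }

    public static void main(String[] args) {
        System.out.println(shouldLoadMetadata("uuid-1")); // true
        System.out.println(shouldLoadMetadata("uuid-1")); // false
    }
}
```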
```java
// NB since 7.9.0 we deduplicate index metadata blobs, and one of the components of the deduplication key is the
// index UUID; the shard count is going to be the same for all metadata with the same index UUID, so it is
// unnecessary to read multiple metadata blobs corresponding to the same index UUID.
String indexUUID = originalRepositoryData.indexMetaDataGenerations().convertBlobIdToIndexUUID(blobId);
```
This will be null if the index metadata blob is in the pre-7.9.0 format, but we have to load all such blobs. Not sure if the map accepts a null key, but even if it does it won't accept more than one of them.
```java
for (Map.Entry<String, String> entry : this.identifiers.entrySet()) {
    if (Objects.equals(entry.getValue(), blobId)) {
```
Rather than a linear scan for each blob ID, could we compute a Map<String,String> which inverts identifiers and use that for all the lookups?
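A sketch of the inversion, assuming `identifiers` maps identifier to blob ID as in the snippet above: build the reverse map once, after which every lookup is O(1) instead of a linear scan.

```java
import java.util.Map;
import java.util.stream.Collectors;

public class InvertMapSketch {
    // Invert identifier -> blobId into blobId -> identifier in one pass.
    // Assumes blob IDs are unique values, so no merge function is needed.
    static Map<String, String> invert(Map<String, String> identifiers) {
        return identifiers.entrySet()
            .stream()
            .collect(Collectors.toMap(Map.Entry::getValue, Map.Entry::getKey));
    }

    public static void main(String[] args) {
        Map<String, String> identifiers = Map.of(
            "indexUuidA/settingsHash", "blob-1",
            "indexUuidB/settingsHash", "blob-2"
        );
        System.out.println(invert(identifiers).get("blob-1")); // indexUuidA/settingsHash
    }
}
```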
```java
// The uniqueIdentifier was built in buildUniqueIdentifier, is of the format IndexUUID-String-long-long-long,
// and uses '-' as a delimiter.
// The below regex accounts for the fact that the IndexUUID can also contain the '-' character
Pattern pattern = Pattern.compile("^(.*?)-[^-]+-\\d+-\\d+-\\d+$");
```
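To see why the reluctant `(.*?)` group matters: the index UUID itself may contain `-`, so a plain split on the delimiter would cut the UUID short. A small demonstration with a hypothetical identifier:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimiterRegexSketch {
    // Format: IndexUUID-String-long-long-long, where IndexUUID may itself
    // contain '-'; the reluctant group plus the rigid tail pins the split point.
    static final Pattern PATTERN = Pattern.compile("^(.*?)-[^-]+-\\d+-\\d+-\\d+$");

    static String extractIndexUuid(String uniqueIdentifier) {
        Matcher m = PATTERN.matcher(uniqueIdentifier);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical identifier: UUID "abc-def-ghi", then a hash and three version longs
        System.out.println(extractIndexUuid("abc-def-ghi-mappingHash-3-7-1")); // abc-def-ghi
    }
}
```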
- Remove unreferenced method
- Rename convertBlobIdToIndexUUID to getIndexUUIDFromBlobId
- Tweak comments
DaveCTurner left a comment:
Sorry, got to dash, but I saw a couple more things. Will give it a more thorough look next week.
```java
/**
 * Map of blob uuid to index metadata identifier. This is a reverse lookup of the identifiers map.
 */
final Map<String, String> blobUuidToIndexMetadataMap;
```
We only need this for a snapshot deletion, and it could be pretty big, so I'd suggest we don't pre-compute it for every IndexMetaDataGenerations instance. Instead, let's have a method that computes and returns this map on demand which the deletion calls once up-front.
```java
if (na) {
    return ClusterState.UNKNOWN_UUID;
```
I don't think this is possible for a blob that actually ended up in the repository - we use _na_ for the index UUID only if the index hasn't been created yet, but in that case we couldn't possibly have snapshotted it.
```java
    return ClusterState.UNKNOWN_UUID;
}
assert uniqueIdentifier.length() >= 22;
return uniqueIdentifier.substring(0, 22);
```
Rather than doing this parsing on demand I'd suggest we put the substring in the blobUuidToIndexMetadataMap right away. That'll reduce the size of all the values in this map by more than half.
```java
// Map of blob id to index uuid. This is a reverse lookup of the identifiers map.
final Map<String, String> blobUuidToIndexUUIDMap = identifiers.entrySet()
    .stream()
    .collect(
```
This is going to walk identifiers and construct a new complete map for every single call to getIndexUUIDFromBlobId (of which there may be thousands++). We need to construct this map once at the start of the snapshot deletion process and then use it for each lookup.
```java
// Wait for the cluster to be green after deletion
ensureGreen();
```
Hmm this seems like a surprising side-effect of this method: if the cluster was green before the delete then it'll be green immediately after (no shards will have become unassigned) and conversely if the cluster wasn't green first then maybe we shouldn't wait for it to become green here.
```java
// NB if the index metadata blob is in the pre-7.9.0 format then this will return null
String indexUUID = originalRepositoryData.indexMetaDataGenerations().getIndexUUIDFromBlobId(blobId);

// Without an index UUID, we don't know if we've encountered this index before and must read it's IndexMetadata
```
grammar nit
```diff
- // Without an index UUID, we don't know if we've encountered this index before and must read it's IndexMetadata
+ // Without an index UUID, we don't know if we've encountered this index before and must read its IndexMetadata
```
DaveCTurner left a comment:
Given the massive savings we've been able to find in #137210 I wonder if we should defer this for now. Computing blobIdToIndexUuidMap is kinda expensive, even if we do it just once, and its expense scales with the size of the repository rather than the number of indices being deleted - if we're only deleting a few indices then most of that expense is wasted. We can find ways to refine this for sure, and it's kinda nice that this might reduce the number of blobs we read during a large delete, but I don't think this is critical once #137210 lands and I worry that it might end up causing more problems than it solves.
```java
private void determineShardCount(ActionListener<Void> listener) {
    try (var listeners = new RefCountingListener(listener)) {
        for (final var indexMetaGeneration : snapshotIds.stream()
Map<String, String> blobIdToIndexUuidMap = originalRepositoryData.indexMetaDataGenerations().getBlobIdToIndexUuidMap();
```
This gets called once per index, which could still be (hundreds-of-)thousands of times, but it returns the same value each time. It really needs to be done once for the entire snapshot deletion process.
I've made the above suggestion irrespective of whether we merge this change 👍
@DaveCTurner I agree that there is a computation cost in generating the blobIdToIndexUuidMap. Therefore, I don't want to introduce "magic numbers", but is it worth considering logic along the lines of "if this is a big delete then compute this map so we don't have to keep reading from heap memory, else set this map to null so that we're forced to read from the heap every time, but we don't care because we're only deleting 3 indices and it's quicker that way"? We could use a setting for this value, whatever we decide that to be.
```java
/**
 * A map of blob id to index UUID
 */
private final Map<String, String> blobIdToIndexUuidMap;
```
As a field here we will be retaining this map until the very end of the post-deletion cleanup when the SnapshotsDeletion instance becomes unreachable, but that will overlap with the next snapshot operation (which may be another deletion, generating another such map, etc).
However, we only need this map for the determineShardCount calls at the start of each IndexSnapshotsDeletion, all of which happen before we update the RepositoryData root blob. I'd rather we dropped this potentially-large map as soon as possible, and definitely before allowing the next snapshot operation to proceed. It'd be best if it were a local variable computed in writeUpdatedShardMetadataAndComputeDeletes and passed as an argument to each determineShardCount call via IndexSnapshotsDeletion#run.
Right yeah if we really wanted to do this that would be something worth considering. The time taken to generate this map is less of a concern to me than the failures it might cause: with this change a tiny master may now go OOM during every snapshot delete if the repository has so many indices that this map's size becomes overwhelming. We could make it conditional (maybe even a function of the heap space available on the master and the size of the repository) and that might have been worthwhile when we started, but the extra complexity carries its own costs and #137210 reduces the marginal benefits of this change.
I was wondering whether we still need this PR. My hunch is no. It might be helpful in certain theoretical cases. But still probably better to address them when actually observed?
Closing as per the above conversation.

Skip unnecessary loading of IndexMetadata during snapshot deletion

During snapshot deletion we load the metadata for an index into heap purely to calculate the shard count. On nodes with small heaps, loading multiple objects concurrently can cause them to OOMe.

Rather than recomputing the shard count value for the same index across multiple snapshots, this change stores the shard count in a map. This stops loading multiple IndexMetaData objects into heap for the same index.
This means that for commands such as:
assuming both snapshots include Index1, we only load Index1's metadata once.
For commands such as:
since these are different delete snapshot requests, we load Index1's metadata once for each request.
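The original command snippets did not survive extraction. As an illustration of the two cases above, the contrast is presumably between a single multi-snapshot delete and separate requests (repository and snapshot names hypothetical):

```console
# One request deleting two snapshots: Index1's metadata is loaded once
DELETE /_snapshot/my_repo/snapshot_1,snapshot_2

# Two separate requests: Index1's metadata is loaded once per request
DELETE /_snapshot/my_repo/snapshot_1
DELETE /_snapshot/my_repo/snapshot_2
```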
Relates to: #131822
Closes ES-12539