@original-brownbear original-brownbear commented Aug 25, 2022

This is very important for #77466. Profiling showed that serializing snapshots-in-progress while a few snapshots with high shard counts are running takes a significant amount of CPU and heap, because the full data structure is sent over and over.
This PR adds diffing in the simplest way I could think of on top of the existing data structure.

closes #88732

@original-brownbear original-brownbear added WIP :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Aug 25, 2022
*/
private record ByRepoDiff(
DiffableUtils.MapDiff<String, Entry, Map<String, Entry>> diffBySnapshotUUID,
DiffableUtils.MapDiff<String, Integer, Map<String, Integer>> positionDiff
Contributor Author

Not the nicest solution ever for diffing a list whose entries can both change and move around in the list, but it's the best I could come up with without refactoring things in this class.

Contributor

Likewise, could we do that prep work first and avoid having to do this? If not, could we at least sprinkle a bunch of assertions around here to validate our assumptions about the diff (no gaps, no duplicates, etc)?

Contributor Author

I think a big refactoring that avoids the current structure of a map of lists will be hard to pull off in the short term. It would require a massive change set, and I'd much rather do it in increments.
The beauty of the code here is that, at least for the positions, we need no such assertions since List.of(...) will be unforgiving to gaps. So if we take the size of the array from the uuid->snapshot map and then use the positions for the iteration, we can't really run into trouble, can we?
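
To make the argument above concrete, here is a minimal sketch of rebuilding an ordered list from a uuid->entry map and a uuid->position map. It is not the code from this PR: the class name PositionRebuildSketch, the method rebuild, and the use of plain String values in place of Entry are assumptions made for illustration. It relies only on the documented behavior that List.of rejects null elements.

```java
import java.util.List;
import java.util.Map;

final class PositionRebuildSketch {

    // Rebuilds the ordered list from two maps keyed by snapshot UUID.
    // String stands in for the real Entry type (an assumption for this sketch).
    static List<String> rebuild(Map<String, String> entryByUuid, Map<String, Integer> positionByUuid) {
        // The array is sized from the uuid->entry map, as described in the comment above.
        String[] ordered = new String[entryByUuid.size()];
        for (Map.Entry<String, Integer> e : positionByUuid.entrySet()) {
            // A position outside the array fails here with ArrayIndexOutOfBoundsException.
            ordered[e.getValue()] = entryByUuid.get(e.getKey());
        }
        // List.of throws NullPointerException on a null element, so a gap
        // (or a duplicate position, which leaves another slot empty) cannot go unnoticed.
        return List.of(ordered);
    }
}
```

With two entries and a duplicated position, one slot stays null and List.of fails fast instead of returning a list with a hole in it.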

Contributor

Ok - I'll suggest a couple of extra assertions in some following comments.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Sep 5, 2022
@elasticsearchmachine
Collaborator

Hi @original-brownbear, I've created a changelog YAML for you.


@Override
public Version getMinimalSupportedVersion() {
return Version.CURRENT.minimumCompatibilityVersion();
Contributor

@idegtiarenko idegtiarenko Sep 5, 2022

Should it be DIFFABLE_VERSION?

Contributor Author

I don't think so, because we were always allegedly "diffable" but sent a complete state instead of a real diff. That's why I had to add the BwC new SimpleDiffable.CompleteDiff<>(after).writeTo(out); below. So from the perspective of the checks on this, it's just a wire format change, I think.
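
The BwC branch described above can be pictured with a small sketch. It deliberately avoids Elasticsearch's own stream and version classes: the class BwcDiffWriteSketch, the constant FIRST_REAL_DIFF_VERSION, the integer version ids, and the String payloads are all hypothetical stand-ins, and only java.io is used. The idea shown is that a receiver older than the first version that understands real diffs keeps getting the complete state, while newer receivers get only the changed entries.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

final class BwcDiffWriteSketch {

    // Hypothetical numeric id of the first version that understands real diffs.
    static final int FIRST_REAL_DIFF_VERSION = 80500;

    static byte[] writeDiff(int receiverVersion, List<String> fullState, List<String> changedEntries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            // Older receivers expect the complete state in the diff slot, so keep sending it to them.
            List<String> payload = receiverVersion < FIRST_REAL_DIFF_VERSION ? fullState : changedEntries;
            out.writeInt(payload.size());
            for (String entry : payload) {
                out.writeUTF(entry);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Under these assumptions the format is gated on the receiver's version at write time, which matches the reading above that this is a wire format change rather than a new minimal supported version.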

Contributor

@DaveCTurner DaveCTurner left a comment

Tests look ok now, I left a couple of other ideas.

@original-brownbear
Contributor Author

@DaveCTurner ping in case you have a second, this would be quite nice to have in for our snapshot benchmarking. It's a surprisingly large speedup :)

Contributor

@DaveCTurner DaveCTurner left a comment

A few more small comments

}

@SuppressWarnings("unchecked")
EntryDiff(Entry before, Entry after) {
Contributor

I think we should be explicit about the fields that must not change between before and after here, at least asserting that they're the same but maybe also protecting against that kind of bug in production too.

Contributor Author

Makes sense, I added actual exception throwing in production. If we ever run into a bug here, it's still better to stall everything and (maybe) force a full cluster restart to fix things than to corrupt the repo IMO :)
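
A minimal sketch of that kind of production check, assuming a trimmed-down Entry: the record Entry below, its three fields, and the helper names are hypothetical and chosen only for illustration. Which fields the real EntryDiff treats as immutable is not stated in this thread.

```java
import java.util.Objects;

final class EntryDiffSketch {

    // Hypothetical, trimmed-down stand-in for the real Entry type.
    record Entry(String snapshotUuid, String repository, String state) {}

    // Throws instead of asserting, so the check also runs in production
    // where Java assertions are normally disabled.
    static void ensureSameIdentity(Entry before, Entry after) {
        requireUnchanged("snapshotUuid", before.snapshotUuid(), after.snapshotUuid());
        requireUnchanged("repository", before.repository(), after.repository());
        // state is deliberately not checked: it is the part a diff is allowed to change.
    }

    private static void requireUnchanged(String field, Object before, Object after) {
        if (!Objects.equals(before, after)) {
            throw new IllegalArgumentException(
                "field [" + field + "] must not change between diffed entries: [" + before + "] vs [" + after + "]");
        }
    }
}
```

Failing loudly here trades availability for safety, which is the preference stated above: a stalled update is recoverable, a corrupted repository may not be.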

entries.set(i, mutateEntryWithLegalChange(entry));
}
}
updatedInstance = updatedInstance.withUpdatedEntriesForRepo(perRepoEntries.get(0).repository(), entries);
Contributor

Can we also reorder the entries for one or more repositories? Really just to give the ByRepoDiff stuff a proper workout.

Contributor Author

Yup, added shuffling of the list. I had to change the way we generate the repo name for that: it was a random string per snapshot, so we'd effectively never see collisions there. Now we actually have multiple snapshots per repo here, and the position reordering is covered by tests.
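
A shuffle-based round trip of this kind could look roughly like the sketch below. It is not the test from this PR: ShuffleRoundTripSketch and its methods are hypothetical, snapshot UUID strings stand in for entries, and the sketch assumes the set of entries is unchanged between before and after (only the order differs).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class ShuffleRoundTripSketch {

    // Maps each UUID to its index in the list.
    static Map<String, Integer> positions(List<String> uuids) {
        Map<String, Integer> result = new HashMap<>();
        for (int i = 0; i < uuids.size(); i++) {
            result.put(uuids.get(i), i);
        }
        return result;
    }

    // Records only the UUIDs whose position differs between the two lists.
    static Map<String, Integer> positionDiff(List<String> before, List<String> after) {
        Map<String, Integer> old = positions(before);
        Map<String, Integer> changed = new HashMap<>();
        for (int i = 0; i < after.size(); i++) {
            if (!Integer.valueOf(i).equals(old.get(after.get(i)))) {
                changed.put(after.get(i), i);
            }
        }
        return changed;
    }

    // Applies a position diff to the old order and rebuilds the list.
    static List<String> apply(List<String> before, Map<String, Integer> diff) {
        Map<String, Integer> merged = positions(before);
        merged.putAll(diff);
        String[] ordered = new String[merged.size()];
        merged.forEach((uuid, position) -> ordered[position] = uuid);
        return List.of(ordered); // rejects gaps, since List.of does not accept nulls
    }

    // Shuffles a fixed list with the given seed and checks that diff + apply reproduces it.
    static boolean roundTrips(long seed) {
        List<String> before = List.of("s1", "s2", "s3", "s4");
        List<String> after = new ArrayList<>(before);
        Collections.shuffle(after, new Random(seed));
        return apply(before, positionDiff(before, after)).equals(after);
    }
}
```

Reordering only exercises this path when several snapshots share a repository, which is why the repo name generation had to change as described above.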

Contributor

👍 sounds good

@original-brownbear
Contributor Author

Thanks @DaveCTurner all points addressed I think :)

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

@original-brownbear
Contributor Author

Thanks David!

@original-brownbear original-brownbear merged commit b69d1bd into elastic:main Sep 13, 2022
@original-brownbear original-brownbear deleted the nicer-sn-in-progress branch September 13, 2022 11:34
@original-brownbear original-brownbear restored the nicer-sn-in-progress branch April 18, 2023 20:51

Labels

:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.5.0

Development

Successfully merging this pull request may close these issues.

Make SnapshotsInProgress diffable

4 participants