[ML] Distribute trained model allocations across availability zones #89822
Conversation
When a model deployment is started with 2 or more allocations and availability zones are present, we should distribute the allocations across the availability zones so that there is resilience. This commit adds a `ZoneAwareAssignmentPlanner` that attempts to evenly distribute the allocations of a deployment across the available zones.
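For intuition, here is a minimal, self-contained sketch of the even-split idea (illustrative only, not the actual `ZoneAwareAssignmentPlanner` code): each zone is given the ceiling of the remaining allocations divided by the remaining zones, so no zone ends up with more than one allocation above any other.

```java
// Minimal sketch of even zone distribution (illustrative only, not the
// planner's real implementation). Assumes totalAllocations >= 0 and zones >= 1.
import java.util.ArrayList;
import java.util.List;

class EvenZoneSplit {
    static List<Integer> split(int totalAllocations, int zones) {
        List<Integer> perZone = new ArrayList<>();
        int remaining = totalAllocations;
        for (int remainingZones = zones; remainingZones > 0; remainingZones--) {
            // Ceiling division: ceil(remaining / remainingZones)
            int target = remaining == 0 ? 0 : (remaining - 1) / remainingZones + 1;
            perZone.add(target);
            remaining -= target;
        }
        return perZone;
    }

    public static void main(String[] args) {
        System.out.println(split(5, 3)); // [2, 2, 1]
        System.out.println(split(3, 3)); // [1, 1, 1]
    }
}
```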
Pinging @elastic/ml-core (Team:ML)
Hi @dimitris-athanasiou, I've created a changelog YAML for you.
run elasticsearch-ci/part-1 |
droberts195 left a comment
LGTM
A couple of minor nits and one much more far-reaching observation. But I am happy to merge the PR for 8.5 if you could just adjust the release note to make clear it's referring to trained models.
...ava/org/elasticsearch/xpack/ml/inference/assignment/planning/ZoneAwareAssignmentPlanner.java
) {
    Map<String, Integer> modelIdToTargetAllocations = modelIdToRemainingAllocations.entrySet()
        .stream()
        .collect(Collectors.toMap(e -> e.getKey(), e -> (e.getValue() - 1) / remainingZones + 1));
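For reference, `(e.getValue() - 1) / remainingZones + 1` is integer ceiling division: for a positive value it equals ceil(value / remainingZones). A quick standalone check (illustrative, not part of the PR):

```java
// Standalone check of the ceiling-division expression used in the snippet above.
public class CeilDivCheck {
    public static void main(String[] args) {
        int remainingAllocations = 5;
        int remainingZones = 3;
        int target = (remainingAllocations - 1) / remainingZones + 1;
        int expected = (int) Math.ceil((double) remainingAllocations / remainingZones);
        System.out.println(target + " == " + expected); // prints "2 == 2"
    }
}
```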
There's an interesting effect here.
Suppose we have 3 AZs, some models that have been given 3 allocations and some models that have been given 1 allocation.
We'll try to give all the models 1 allocation in the first AZ. This likely won't fit, but it could be that we end up choosing the models that are going to have 1 allocation in total. In the worst case we could end up with all the 1-allocation models in the first AZ, and the models with 3 allocations having to take 1 and 2 allocations in the other two AZs. This would be highly unintuitive to a user, who would expect the models they've asked for 3 allocations of to go one per AZ, and the ones they've asked for 1 allocation of to be fitted in around that.
One possibility might be to exclude models with only one allocation requested from the per-zone plans, so they get fitted in at the very end by the overall plan. If you think this is too hard in the time available then I am happy to leave this as-is for 8.5. What's here at the moment is better than what we have now, so the PR in its current form is a step forwards, but we might need to revisit this in the future.
Right, yes, that is possible. However, if we excluded 1-allocation models from the per-zone planning, it is possible that they would not get any allocations at all if the multi-allocation models took all the resources. It would be like tweaking the objective function to prefer multi-allocation models. I'm not sure that is better.
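For illustration only, here is a rough sketch of the alternative being discussed (keeping single-allocation deployments out of the per-zone plans so they are only placed by the final overall plan). The class and method names are hypothetical, and, as noted in the reply above, this would effectively prioritise multi-allocation deployments:

```java
// Hypothetical sketch of the discussed alternative, not what this PR implements.
import java.util.Map;
import java.util.stream.Collectors;

class HypotheticalSingleAllocationSplit {
    record Partition(Map<String, Integer> multiAllocation, Map<String, Integer> singleAllocation) {}

    // Split deployments so that per-zone planning only sees multi-allocation
    // models; single-allocation models would be fitted in by the final overall plan.
    static Partition partition(Map<String, Integer> modelIdToAllocations) {
        Map<String, Integer> multi = modelIdToAllocations.entrySet().stream()
            .filter(e -> e.getValue() > 1)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
        Map<String, Integer> single = modelIdToAllocations.entrySet().stream()
            .filter(e -> e.getValue() <= 1)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
        return new Partition(multi, single);
    }
}
```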
…rence/assignment/planning/ZoneAwareAssignmentPlanner.java
Co-authored-by: David Roberts <[email protected]>
@elasticmachine update branch