[ML] Distribute trained model allocations across availability zones #89822
Conversation
When a model deployment is started with 2 or more allocations and availability zones are present, we should distribute the allocations across the availability zones so that there is resilience. This commit adds a `ZoneAwareAssignmentPlanner` that attempts to evenly distribute the allocations of a deployment across the available zones.
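For intuition, here is a minimal, self-contained sketch of the even-split idea (illustrative only, not the actual `ZoneAwareAssignmentPlanner` code): each zone is given the ceiling of the remaining allocations divided by the remaining zones, so no zone ends up with more than one allocation above any other.

```java
// Minimal sketch of even zone distribution (illustrative only, not the
// planner's real implementation). Assumes totalAllocations >= 0 and zones >= 1.
import java.util.ArrayList;
import java.util.List;

class EvenZoneSplit {
    static List<Integer> split(int totalAllocations, int zones) {
        List<Integer> perZone = new ArrayList<>();
        int remaining = totalAllocations;
        for (int remainingZones = zones; remainingZones > 0; remainingZones--) {
            // Ceiling division: ceil(remaining / remainingZones)
            int target = remaining == 0 ? 0 : (remaining - 1) / remainingZones + 1;
            perZone.add(target);
            remaining -= target;
        }
        return perZone;
    }

    public static void main(String[] args) {
        System.out.println(split(5, 3)); // [2, 2, 1]
        System.out.println(split(3, 3)); // [1, 1, 1]
    }
}
```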
Pinging @elastic/ml-core (Team:ML)
Hi @dimitris-athanasiou, I've created a changelog YAML for you.
run elasticsearch-ci/part-1 |
droberts195 left a comment
LGTM
A couple of minor nits and one much more far-reaching observation. But I am happy to merge the PR for 8.5 if you could just adjust the release note to make clear it's referring to trained models.
...ava/org/elasticsearch/xpack/ml/inference/assignment/planning/ZoneAwareAssignmentPlanner.java
) {
    Map<String, Integer> modelIdToTargetAllocations = modelIdToRemainingAllocations.entrySet()
        .stream()
        .collect(Collectors.toMap(e -> e.getKey(), e -> (e.getValue() - 1) / remainingZones + 1));
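For reference, `(e.getValue() - 1) / remainingZones + 1` is integer ceiling division: for a positive value it equals ceil(value / remainingZones). A quick standalone check (illustrative, not part of the PR):

```java
// Standalone check of the ceiling-division expression used in the snippet above.
public class CeilDivCheck {
    public static void main(String[] args) {
        int remainingAllocations = 5;
        int remainingZones = 3;
        int target = (remainingAllocations - 1) / remainingZones + 1;
        int expected = (int) Math.ceil((double) remainingAllocations / remainingZones);
        System.out.println(target + " == " + expected); // prints "2 == 2"
    }
}
```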
There's an interesting effect here.
Suppose we have 3 AZs, some models that have been given 3 allocations and some models that have been given 1 allocation.
We'll try to give all the models 1 allocation in the first AZ. This likely won't fit, but it could be that we end up choosing the models that are going to have 1 allocation in total. In the worst case we could end up with all the 1-allocation models in the first AZ, and the models with 3 allocations having to take 1 and 2 allocations in the other two AZs. This would be highly unintuitive to a user, who would expect the models they've asked for 3 allocations of to go one per AZ, and the ones they've asked for 1 allocation of to be fitted in around that.
One possibility might be to exclude models with only one allocation requested from the per-zone plans, so they get fitted in at the very end by the overall plan. If you think this is too hard in the time available then I am happy to leave this as-is for 8.5. What's here at the moment is better than what we have now, so the PR in its current form is a step forwards, but we might need to revisit this in the future.
Right, yes, that is possible. However, if we excluded 1-allocation models from the per-zone planning, it is possible that they would not get any allocations at all if the multi-allocation models took all the resources. It would be like tweaking the objective function to prefer multi-allocation models. I'm not sure that is better.
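For illustration only, here is a rough sketch of the alternative being discussed (keeping single-allocation deployments out of the per-zone plans so they are only placed by the final overall plan). The class and method names are hypothetical, and, as noted in the reply above, this would effectively prioritise multi-allocation deployments:

```java
// Hypothetical sketch of the discussed alternative, not what this PR implements.
import java.util.Map;
import java.util.stream.Collectors;

class HypotheticalSingleAllocationSplit {
    record Partition(Map<String, Integer> multiAllocation, Map<String, Integer> singleAllocation) {}

    // Split deployments so that per-zone planning only sees multi-allocation
    // models; single-allocation models would be fitted in by the final overall plan.
    static Partition partition(Map<String, Integer> modelIdToAllocations) {
        Map<String, Integer> multi = modelIdToAllocations.entrySet().stream()
            .filter(e -> e.getValue() > 1)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
        Map<String, Integer> single = modelIdToAllocations.entrySet().stream()
            .filter(e -> e.getValue() <= 1)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
        return new Partition(multi, single);
    }
}
```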
…rence/assignment/planning/ZoneAwareAssignmentPlanner.java
Co-authored-by: David Roberts <[email protected]>
@elasticmachine update branch