Skip to content

Conversation

@dimitris-athanasiou
Copy link
Contributor

When a model deployment is started with 2 or more allocations
and availability zones are present we should distribute the allocations
across availability zones so that there is resilience.

This commit adds a ZoneAwareAssignmentPlanner that attempts to evenly
distribute the allocations of a deployment across the available zones.

When a model deployment is started with 2 or more allocations
and availability zones are present we should distribute the allocations
across availability zones so that there is resilience.

This commit adds a `ZoneAwareAssignmentPlanner` that attempts to evenly
distribute the allocations of a deployment across the available zones.
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Sep 6, 2022
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Copy link
Collaborator

Hi @dimitris-athanasiou, I've created a changelog YAML for you.

@dimitris-athanasiou
Copy link
Contributor Author

run elasticsearch-ci/part-1

Copy link

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

A couple of minor nits and one much more far-reaching observation. But I am happy to merge the PR for 8.5 if you could just adjust the release note to make clear it's referring to trained models.

) {
Map<String, Integer> modelIdToTargetAllocations = modelIdToRemainingAllocations.entrySet()
.stream()
.collect(Collectors.toMap(e -> e.getKey(), e -> (e.getValue() - 1) / remainingZones + 1));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an interesting effect here.

Suppose we have 3 AZs, some models that have been given 3 allocations and some jobs that have been given 1 allocation.

We'll try to give all the jobs 1 allocation in the first AZ. This likely won't fit, but it could be that we end up choosing the models that are going to have 1 allocation in total. In the worst case we could end up with all the models with 1 allocation in the first AZ, and the models with 3 allocations having to have 1 and 2 allocations in the other two AZs. This would be highly unintuitive to a user who would expect that the models they've asked for 3 allocations of to go one per AZ, and the ones they've asked for 1 allocation of to be fitted in around that.

One possibility might be to exclude models with only one allocation requested from the per-zone plans, so they get fitted in at the very end by the overall plan. If you think this is too hard for the time available for 8.5 then I am happy to leave this as-is for 8.5. I think what's here at the moment is better than what we have now, so the PR in its current form is a step forwards. But we might need to revisit this again in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, yes, that is possible. However, if we excluded 1-allocation models from the per-zone planning, then it is possible that they do not get any allocations if the multi-allocation models took all resources. It would be like tweaking the objective function to prefer multi-allocation models. Not sure if that is better.

@dimitris-athanasiou dimitris-athanasiou changed the title [ML] Distribute allocations across availability zones [ML] Distribute trained model allocations across availability zones Sep 7, 2022
@dimitris-athanasiou dimitris-athanasiou added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label Sep 7, 2022
@dimitris-athanasiou
Copy link
Contributor Author

@elasticmachine update branch

@dimitris-athanasiou dimitris-athanasiou merged commit c733eb8 into elastic:main Sep 7, 2022
@dimitris-athanasiou dimitris-athanasiou deleted the distribute-model-allocations-across-availability-zones branch September 7, 2022 12:50
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 8, 2022
* main: (175 commits)
  Fix integration test on Windows (elastic#89894)
  Avoiding the use of dynamic map keys in the cluster_formation results of the stable master health indicator (elastic#89842)
  Mute org.elasticsearch.tracing.apm.ApmIT.testCapturesTracesForHttpTraffic (elastic#89891)
  Fix typos in audit event types (elastic#89886)
  Synthetic _source: support histogram field (elastic#89833)
  [TSDB] Rename rollup public API to downsample  (elastic#89809)
  Format script values access (elastic#89780)
  [DOCS] Simplifies composite aggregation recommendation (elastic#89878)
  [DOCS] Update CCS compatibility matrix for 8.3 (elastic#88906)
  Fix memory leak when double invoking RestChannel.sendResponse (elastic#89873)
  [ML] Add processor autoscaling decider (elastic#89645)
  Update disk-usage.asciidoc (elastic#89709) (elastic#89874)
  Add allocation deciders in createComponents (elastic#89836)
  Mute flaky H3LatLonGeometryTest.testIndexPoints (elastic#89870)
  Fix typo in get-snapshot-status-api doc (elastic#89865)
  Picking master eligible node at random in the master stability health indicator (elastic#89841)
  Do not reuse the client after a disruption elastic#89815 (elastic#89866)
  [ML] Distribute trained model allocations across availability zones (elastic#89822)
  Increment clientCalledCount before onResponse (elastic#89858)
  AwaitsFix for elastic#89867
  ...
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 8, 2022
* main: (175 commits)
  Fix integration test on Windows (elastic#89894)
  Avoiding the use of dynamic map keys in the cluster_formation results of the stable master health indicator (elastic#89842)
  Mute org.elasticsearch.tracing.apm.ApmIT.testCapturesTracesForHttpTraffic (elastic#89891)
  Fix typos in audit event types (elastic#89886)
  Synthetic _source: support histogram field (elastic#89833)
  [TSDB] Rename rollup public API to downsample  (elastic#89809)
  Format script values access (elastic#89780)
  [DOCS] Simplifies composite aggregation recommendation (elastic#89878)
  [DOCS] Update CCS compatibility matrix for 8.3 (elastic#88906)
  Fix memory leak when double invoking RestChannel.sendResponse (elastic#89873)
  [ML] Add processor autoscaling decider (elastic#89645)
  Update disk-usage.asciidoc (elastic#89709) (elastic#89874)
  Add allocation deciders in createComponents (elastic#89836)
  Mute flaky H3LatLonGeometryTest.testIndexPoints (elastic#89870)
  Fix typo in get-snapshot-status-api doc (elastic#89865)
  Picking master eligible node at random in the master stability health indicator (elastic#89841)
  Do not reuse the client after a disruption elastic#89815 (elastic#89866)
  [ML] Distribute trained model allocations across availability zones (elastic#89822)
  Increment clientCalledCount before onResponse (elastic#89858)
  AwaitsFix for elastic#89867
  ...

# Conflicts:
#	x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/downsample/RollupShardIndexer.java
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 8, 2022
* main: (283 commits)
  Fix integration test on Windows (elastic#89894)
  Avoiding the use of dynamic map keys in the cluster_formation results of the stable master health indicator (elastic#89842)
  Mute org.elasticsearch.tracing.apm.ApmIT.testCapturesTracesForHttpTraffic (elastic#89891)
  Fix typos in audit event types (elastic#89886)
  Synthetic _source: support histogram field (elastic#89833)
  [TSDB] Rename rollup public API to downsample  (elastic#89809)
  Format script values access (elastic#89780)
  [DOCS] Simplifies composite aggregation recommendation (elastic#89878)
  [DOCS] Update CCS compatibility matrix for 8.3 (elastic#88906)
  Fix memory leak when double invoking RestChannel.sendResponse (elastic#89873)
  [ML] Add processor autoscaling decider (elastic#89645)
  Update disk-usage.asciidoc (elastic#89709) (elastic#89874)
  Add allocation deciders in createComponents (elastic#89836)
  Mute flaky H3LatLonGeometryTest.testIndexPoints (elastic#89870)
  Fix typo in get-snapshot-status-api doc (elastic#89865)
  Picking master eligible node at random in the master stability health indicator (elastic#89841)
  Do not reuse the client after a disruption elastic#89815 (elastic#89866)
  [ML] Distribute trained model allocations across availability zones (elastic#89822)
  Increment clientCalledCount before onResponse (elastic#89858)
  AwaitsFix for elastic#89867
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cloud-deploy Publish cloud docker image for Cloud-First-Testing >enhancement :ml Machine learning Team:ML Meta label for the ML team v8.5.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants