
Conversation

@droberts195 commented Apr 21, 2020

The ML info endpoint returns the max_model_memory_limit setting
if one is configured. However, it is still possible to create
a job that cannot run anywhere in the current cluster because
no node in the cluster has enough memory to accommodate it.

This change adds an extra piece of information,
limits.effective_max_model_memory_limit, to the ML info
response that returns the biggest model memory limit that could
be run in the current cluster assuming no other jobs were
running.

The idea is that the ML UI will be able to warn users who try to
create jobs with higher model memory limits that their jobs will
not be able to start unless they add a bigger ML node to their
cluster.
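
For illustration, here is a minimal Java sketch of the idea behind the new field, assuming a hypothetical node attribute name, an illustrative memory percentage and native-code overhead (this is not the actual Elasticsearch implementation): the effective limit is the largest amount of native memory any single ML node could give to one job's model if nothing else were running.

    import java.util.List;
    import java.util.Map;

    // Sketch only: effective_max_model_memory_limit is, conceptually, the largest model
    // memory limit that any single ML node in the current cluster could accommodate if
    // no other jobs were running. The attribute name, percentage and overhead constant
    // below are illustrative assumptions, not the exact values or logic used by
    // Elasticsearch.
    public class EffectiveLimitSketch {

        // Illustrative placeholder for memory reserved for the ML native code itself.
        static final long NATIVE_CODE_OVERHEAD_BYTES = 30L * 1024 * 1024;

        static long effectiveMaxModelMemoryBytes(List<Map<String, String>> mlNodeAttributes,
                                                 int maxMachineMemoryPercent) {
            long max = 0;
            for (Map<String, String> attrs : mlNodeAttributes) {
                long machineMemory = Long.parseLong(attrs.getOrDefault("ml.machine_memory", "0"));
                // Only a percentage of each machine's memory is usable by ML native
                // processes, and part of that is reserved for the native code itself.
                long usable = machineMemory * maxMachineMemoryPercent / 100 - NATIVE_CODE_OVERHEAD_BYTES;
                max = Math.max(max, usable);
            }
            return Math.max(max, 0);
        }

        public static void main(String[] args) {
            // Example: two ML nodes with 16GB and 64GB of machine memory, 30% usable by ML.
            List<Map<String, String>> nodes = List.of(
                Map.of("ml.machine_memory", String.valueOf(16L * 1024 * 1024 * 1024)),
                Map.of("ml.machine_memory", String.valueOf(64L * 1024 * 1024 * 1024))
            );
            System.out.println(effectiveMaxModelMemoryBytes(nodes, 30) + " bytes");
        }
    }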

Relates elastic/kibana#63942

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

    if (maxModelMemoryLimit != null && maxModelMemoryLimit.getBytes() > 0) {
-       limits.put("max_model_memory_limit", maxModelMemoryLimit);
+       limits.put("max_model_memory_limit", maxModelMemoryLimit.getStringRep());
        if (currentEffectiveMaxModelMemoryLimit == null || currentEffectiveMaxModelMemoryLimit.compareTo(maxModelMemoryLimit) > 0) {
Member commented on this diff:
It might be nice to indicate that there is room available for larger jobs if they increased their MAX_MODEL_MEMORY_LIMIT setting.

But, in the scenarios where the user could take action, it seems to me that they SHOULD already know the native memory available.

@droberts195 (Author) replied:

The main scenario where MAX_MODEL_MEMORY_LIMIT is used is in Cloud, where it's controlled by the Cloud environment.

The other scenario where we envisage it being used is when an administrator wants to stop users from consuming all the resources with a single job.

In both cases, the user seeing the effect of the restriction wouldn't have the power to increase the limit. It's extremely unlikely there would be a scenario where the user being affected by the limit had the power to change it. Superusers who are using ML and have complete control of their hardware probably don't have the setting set at all.

In the event that both the hard maximum and the effective maximum constrain the size of a job, the UI should report the hard maximum.

For Elastic Cloud there is a desire for the UI to suggest upgrading to more powerful nodes if limits are hit, as that's just a case of a few clicks in the Cloud console (and paying more). But I think this endpoint still provides enough information to facilitate that, because within the Cloud environment we're already setting a hard maximum limit.
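
For illustration, a minimal sketch of that warning choice, assuming hypothetical names and limits expressed in bytes (the real check would live in the Kibana UI; this only shows the precedence described above):

    // Hypothetical helper, not actual Kibana or Elasticsearch code. When both limits
    // would constrain the job, the configured hard maximum is the one to report.
    static String modelMemoryWarning(long requestedBytes, Long hardMaxBytes, Long effectiveMaxBytes) {
        if (hardMaxBytes != null && requestedBytes > hardMaxBytes) {
            return "model_memory_limit is greater than the xpack.ml.max_model_memory_limit setting";
        }
        if (effectiveMaxBytes != null && requestedBytes > effectiveMaxBytes) {
            return "no ML node in the cluster currently has enough memory for this model_memory_limit; "
                + "add a bigger ML node or lower the limit";
        }
        return null; // no warning needed
    }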

@droberts195
Author

Jenkins test this please

@droberts195
Author

Jenkins test this please

@droberts195
Author

Jenkins run elasticsearch-ci/packaging-sample-unix-docker

@droberts195 changed the title from "[ML] Add effective current max model memory limit to ML info" to "[ML] Add effective max model memory limit to ML info" on Apr 22, 2020
We decided that using two words was overly verbose
@droberts195 merged commit d1a9b3a into elastic:master on Apr 22, 2020
@droberts195 deleted the add_current_mem_limit_to_info branch on Apr 22, 2020 at 10:37
droberts195 pushed a commit to droberts195/elasticsearch that referenced this pull request Apr 22, 2020
Backport of elastic#55529
droberts195 pushed a commit that referenced this pull request Apr 22, 2020
Backport of #55529
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jun 24, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jun 29, 2020
github-actions bot pushed a commit to elastic/elasticsearch-net that referenced this pull request Jun 29, 2020
github-actions bot pushed a commit to elastic/elasticsearch-net that referenced this pull request Jun 29, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jun 29, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jun 29, 2020