[ML] Add effective max model memory limit to ML info #55529
Conversation
The ML info endpoint returns the max_model_memory_limit setting if one is configured. However, it is still possible to create a job that cannot run anywhere in the current cluster because no node in the cluster has enough memory to accommodate it. This change adds an extra piece of information, limits.current_effective_max_model_memory_limit, to the ML info response that returns the biggest model memory limit that could be run in the current cluster assuming no other jobs were running. The idea is that the ML UI will be able to warn users who try to create jobs with higher model memory limits that their jobs will not be able to start unless they add a bigger ML node to their cluster. Relates elastic/kibana#63942
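As a rough illustration of how a client might consume this (not part of the PR itself: the host, port, and use of filter_path are assumptions, the example response values are invented, and the field name shown is the final effective_max_model_memory_limit, which the PR later shortens from current_effective_max_model_memory_limit), the new limit can be read straight from the limits section of GET _ml/info:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch only: fetch the "limits" section of the ML info response from a
// local cluster. After this change it can contain both max_model_memory_limit
// (if configured) and the new effective max model memory limit field.
public class MlInfoLimits {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:9200/_ml/info?filter_path=limits"))
            .GET()
            .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // Example body (values invented):
        // {"limits":{"max_model_memory_limit":"10gb","effective_max_model_memory_limit":"4gb"}}
        System.out.println(response.body());
    }
}
```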
Pinging @elastic/ml-core (:ml)
```diff
 if (maxModelMemoryLimit != null && maxModelMemoryLimit.getBytes() > 0) {
-    limits.put("max_model_memory_limit", maxModelMemoryLimit);
+    limits.put("max_model_memory_limit", maxModelMemoryLimit.getStringRep());
     if (currentEffectiveMaxModelMemoryLimit == null || currentEffectiveMaxModelMemoryLimit.compareTo(maxModelMemoryLimit) > 0) {
```
It might be nice to indicate that there would be room available for larger jobs if they increased their MAX_MODEL_MEMORY_LIMIT setting.
But, in the scenarios where the user could take action, it seems to me that they SHOULD already know the native memory available.
The main scenario where MAX_MODEL_MEMORY_LIMIT is used is in Cloud, where it's controlled by the Cloud environment.
The other scenario where we envisage it being used is when an administrator wants to stop less powerful users from using all the resources with a single job.
In both cases, the user seeing the effect of the restriction wouldn't have the power to increase the limit. It's extremely unlikely there would be a scenario where the user being affected by the limit had the power to change it. Superusers who are using ML and have complete control of their hardware probably don't have the setting set at all.
In the event that both the hard maximum and the effective maximum constrain the size of a job, the UI should report the hard maximum.
For Elastic Cloud there is the desire for the UI to suggest upgrading to more powerful nodes if limits are hit, as that's just a case of a few clicks in the Cloud console (and paying more). But I think this endpoint still provides enough information to facilitate that because within the Cloud environment we're already setting a hard maximum limit.
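To make that precedence concrete, here is a minimal sketch (not code from this PR or from Kibana; the class, method, and parameter names are invented) of how a UI could decide which limit to report when validating a requested model_memory_limit:

```java
import java.util.OptionalLong;

// Illustrative only: precedence between the configured hard limit
// (limits.max_model_memory_limit) and the cluster-derived
// limits.effective_max_model_memory_limit when validating a job.
public class ModelMemoryLimitCheck {

    static String validate(long requestedBytes,
                           OptionalLong hardMaxBytes,       // max_model_memory_limit, if configured
                           OptionalLong effectiveMaxBytes)  // effective_max_model_memory_limit, if known
    {
        if (hardMaxBytes.isPresent() && requestedBytes > hardMaxBytes.getAsLong()) {
            // The hard maximum wins: the job cannot be created at all.
            return "model_memory_limit is greater than max_model_memory_limit";
        }
        if (effectiveMaxBytes.isPresent() && requestedBytes > effectiveMaxBytes.getAsLong()) {
            // The job could be created, but no current ML node can run it.
            return "job will not start unless a bigger ML node is added";
        }
        return "ok";
    }

    public static void main(String[] args) {
        // 8 GB requested, 10 GB hard max, 4 GB effective max -> warn about node size.
        System.out.println(validate(8L << 30, OptionalLong.of(10L << 30), OptionalLong.of(4L << 30)));
    }
}
```

When both limits would be exceeded, the hard maximum is checked first, matching the behaviour described above.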
Jenkins test this please
Jenkins test this please
Jenkins run elasticsearch-ci/packaging-sample-unix-docker |
We decided that using two words was overly verbose, so the field is named effective_max_model_memory_limit rather than current_effective_max_model_memory_limit.
The ML info endpoint returns the max_model_memory_limit setting if one is configured. However, it is still possible to create a job that cannot run anywhere in the current cluster because no node in the cluster has enough memory to accommodate it. This change adds an extra piece of information, limits.effective_max_model_memory_limit, to the ML info response that returns the biggest model memory limit that could be run in the current cluster assuming no other jobs were running. The idea is that the ML UI will be able to warn users who try to create jobs with higher model memory limits that their jobs will not be able to start unless they add a bigger ML node to their cluster. Backport of #55529
Relates: elastic/elasticsearch#55529, #4803 Co-authored-by: Russ Cam <[email protected]>
The ML info endpoint returns the max_model_memory_limit setting if one is configured. However, it is still possible to create a job that cannot run anywhere in the current cluster because no node in the cluster has enough memory to accommodate it.

This change adds an extra piece of information, limits.effective_max_model_memory_limit, to the ML info response that returns the biggest model memory limit that could be run in the current cluster assuming no other jobs were running.

The idea is that the ML UI will be able to warn users who try to create jobs with higher model memory limits that their jobs will not be able to start unless they add a bigger ML node to their cluster.

Relates elastic/kibana#63942