@benwtrent
Member

When a model is starting, it has occasionally been observed to lock up while restoring the model objects to the native process.

This manifests as a trained model stuck in "starting" even though it is assigned to a node. A native process has been started and a task is available on the assigned nodes, but the model state never leaves "starting".
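To make the failure mode concrete, here is a minimal sketch of how a blocking restore step can be kept from leaving a state machine in "starting" forever. It is an illustration only and is not the change made in this pull request: the class `GuardedModelStart`, the `State` enum, the `start` method, and the timeout values are all hypothetical and do not come from Elasticsearch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; none of these names exist in Elasticsearch.
final class GuardedModelStart {

    enum State { STARTING, STARTED, FAILED }

    /**
     * Runs a potentially blocking restore step on a separate thread and waits
     * at most timeoutMillis for it. The returned state is never STARTING.
     */
    static State start(Runnable restoreStep, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> restore = executor.submit(restoreStep);
        try {
            restore.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return State.STARTED;
        } catch (TimeoutException e) {
            // The restore step is presumed locked up: interrupt it and report failure.
            restore.cancel(true);
            return State.FAILED;
        } catch (ExecutionException e) {
            // The restore step threw an exception.
            return State.FAILED;
        } catch (InterruptedException e) {
            // The waiting thread was interrupted: restore the flag and give up.
            Thread.currentThread().interrupt();
            restore.cancel(true);
            return State.FAILED;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A restore step that completes immediately.
        System.out.println(start(() -> {}, 1_000));

        // A restore step that blocks until interrupted, standing in for the lock-up.
        CountDownLatch neverReleased = new CountDownLatch(1);
        System.out.println(start(() -> {
            try {
                neverReleased.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 100));
    }
}
```

The point of the sketch is only that a caller waiting on a blocking step needs some path out of "starting" when that step never returns. Whether a timeout, a different thread pool, or another mechanism is appropriate depends on the real code in `DeploymentManager.java`.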

@benwtrent added the >bug, :ml (Machine learning), cloud-deploy (Publish cloud docker image for Cloud-First-Testing), v8.4.0, and v8.5.0 labels on Jul 29, 2022
@elasticsearchmachine added the Team:ML (Meta label for the ML team) label on Jul 29, 2022
@elasticsearchmachine
Collaborator

Hi @benwtrent, I've created a changelog YAML for you.

@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent
Member Author

@elasticmachine update branch

Contributor

@dimitris-athanasiou left a comment

LGTM. Just a suggestion for an error message.

…rence/deployment/DeploymentManager.java

Co-authored-by: Dimitris Athanasiou <[email protected]>
@benwtrent
Member Author

@elasticmachine update branch

@benwtrent merged commit 83136ef into elastic:main on Aug 1, 2022
@benwtrent deleted the bugfix/ml-address-potential-model-start-thread-lock branch on August 1, 2022 14:06
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Aug 1, 2022
… after being allocated to node (elastic#88945)

@elasticsearchmachine
Collaborator

💚 Backport successful

Branch: 8.4

elasticsearchmachine pushed a commit that referenced this pull request Aug 1, 2022
… after being allocated to node (#88945) (#88992)
