[ML] Abort starting process if kill request is received #74415

dimitris-athanasiou · 2021-06-22T11:58:42Z

While a job is opening, it is possible for the kill process action to be called.
If the kill request is received before the job process has started,
we currently start the process anyway. The process will eventually time out
trying to connect and will exit. In the meantime, however, reopening the job
may fail unexpectedly, because a new process cannot be launched while
one already exists.

This commit ensures that JobTask.isClosing() reports true once
the kill process action has been called, so that opening the
process is aborted.
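
To illustrate the idea, here is a minimal sketch of "kill marks the task as closing, and open checks that flag before starting the process". It is not the actual Elasticsearch code: the `JobTask` below is a simplified, hypothetical stand-in, and `requestClose()`, `requestKill()` and `openJob()` are names invented for this example. Only `isClosing()` is taken from the description above.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified stand-ins; not the real Elasticsearch classes.
class KillBeforeStartSketch {

    /** Simplified job task that remembers whether a close or kill was requested. */
    static class JobTask {
        private final AtomicBoolean closing = new AtomicBoolean(false);

        /** Assumed entry point for a normal close request. */
        void requestClose() {
            closing.set(true);
        }

        /** Assumed entry point for the kill process action: it also marks the task as closing. */
        void requestKill() {
            closing.set(true);
        }

        boolean isClosing() {
            return closing.get();
        }
    }

    /**
     * Starts the process only if no close or kill has been requested so far.
     * Returns true if the process was started, false if opening was aborted.
     */
    static boolean openJob(JobTask task, Runnable startProcess) {
        if (task.isClosing()) {
            return false; // abort: a kill or close arrived before the process started
        }
        startProcess.run();
        return true;
    }
}
```

In this sketch a kill that arrives between the `isClosing()` check and `startProcess.run()` is still not caught; the sketch only covers the case described above, where the kill arrives before the process is started.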

Closes #74141

This commit checks if the job has been requested to close after the reset action completes as part of allocating the job to a new node. This ensures we do not proceed to start the job process even though the job had been requested to close. Closes elastic#74141

elasticmachine · 2021-06-22T11:58:45Z

Pinging @elastic/ml-core (Team:ML)

dimitris-athanasiou · 2021-06-22T15:17:15Z

run elasticsearch-ci/part-1

droberts195

LGTM

…#74441) While the job is opening it is possible that the kill process action is called. If the kill process action is received before the job process has started, we currently start the process anyway. The process will eventually timeout to connect to anything and will exit. However, it may cause an unexpected failure if the job is opened again as it won't be able to launch a process as one would already exist. This commit ensures the JobTask.isClosing() reports true when the kill process action has been called in order to abort opening the process. Closes #74141 Backport of #74415

The changes of elastic#74415 made some of the changes of elastic#71656 redundant. This commit deletes code from elastic#71656 that would now never execute.

This is a followup to elastic#74976.

The changes of elastic#74976 reverted many of the changes of elastic#71656 because elastic#74415 made them redundant. elastic#74415 did this by marking killed jobs as closing so that the standard "job closed immediately after open" functionality was used instead of reissuing the kill immediately after opening.

However, it turns out that this "job closed immediately after open" functionality is not perfect for the case of a job that is killed while it is opening. It causes AutodetectCommunicator.close() to be called instead of AutodetectCommunicator.killProcess(). Both do a lot of the same things, but AutodetectCommunicator.close() finalizes the job, and this can cause problems if the job is being killed as part of a feature reset.

This change reinstates some of the functionality of elastic#71656 but in a different place that hopefully won't reintroduce the problems that led to elastic#74415. We can detect that a kill has happened early on during an open or close operation by checking whether the task's allocation ID has been removed from the map after ProcessContext.setDying() returns true. If ProcessContext.setDying() returns true, the job has not previously been closed, so it must have been killed. We can then call AutodetectCommunicator.killProcess() instead of AutodetectCommunicator.close() during the cleanup that happens when we detect that a recently started process is no longer wanted.

Relates elastic#75069
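
The detection logic described in this commit message can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from the PR: the `ProcessContext` and `Communicator` types below are hypothetical stand-ins for `ProcessContext` and `AutodetectCommunicator`, and the `cleanUp()` method, the `Action` enum and the shape of the allocation ID map are invented for this example. The behaviour when `setDying()` returns false (do nothing, because another caller is already shutting the process down) is also an assumption.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified stand-ins; not the real Elasticsearch classes.
class KillDuringOpenSketch {

    enum State { NOT_RUNNING, RUNNING, DYING }

    enum Action { KILLED, CLOSED, NONE }

    /** Stand-in for the two shutdown paths of the communicator. */
    interface Communicator {
        void close();       // graceful shutdown that also finalizes the job
        void killProcess(); // shutdown without finalizing the job
    }

    static class ProcessContext {
        private final AtomicReference<State> state = new AtomicReference<>(State.NOT_RUNNING);

        /** Returns true only for the first caller that moves the context to DYING. */
        boolean setDying() {
            return state.getAndSet(State.DYING) != State.DYING;
        }
    }

    /**
     * Cleanup for a recently started process that is no longer wanted.
     * If we are the first to mark the context as dying and the allocation ID
     * is already gone from the map, we treat that as a kill rather than a close.
     */
    static Action cleanUp(long allocationId,
                          Map<Long, ProcessContext> contextsByAllocationId,
                          ProcessContext context,
                          Communicator communicator) {
        if (context.setDying() == false) {
            return Action.NONE; // assumption: someone else is already shutting it down
        }
        if (contextsByAllocationId.containsKey(allocationId) == false) {
            communicator.killProcess(); // killed while opening: do not finalize the job
            return Action.KILLED;
        }
        communicator.close();
        return Action.CLOSED;
    }
}
```

The point of the split is the one made above: both paths stop the process, but only the close path finalizes the job, which is what causes problems during a feature reset.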

…5116) This is a followup to #74976. The changes of #74976 reverted many of the changes of #71656 because #74415 made them redundant. #74415 did this by making killed jobs as closing so that the standard "job closed immediately after open" functionality was used instead of reissuing the kill immediately after opening. However, it turns out that this "job closed immediately after open" functionality is not perfect for the case of a job that is killed while it is opening. It causes AutodetectCommunicator.close() to be called instead of AutodetectCommunicator.killProcess(). Both do a lot of the same things, but AutodetectCommunicator.close() finalizes the job, and this can cause problems if the job is being killed as part of a feature reset. This change reinstates some of the functionality of #71656 but in a different place that hopefully won't reintroduce the problems that led to #74415. We can detect that a kill has happened early on during an open or close operation by checking if the task's allocation ID has been removed from the map after ProcessContext.setDying() returns true. If ProcessContext.setDying() returns true this means the job has not been previously closed, so it must have been killed. Then we can call AutodetectCommunicator.killProcess() instead of AutodetectCommunicator.close() during the cleanup that happens when we detect that a recently started process is no longer wanted. Relates #75069 Co-authored-by: David Roberts <[email protected]>

…5117) This is a followup to #74976. The changes of #74976 reverted many of the changes of #71656 because #74415 made them redundant. #74415 did this by making killed jobs as closing so that the standard "job closed immediately after open" functionality was used instead of reissuing the kill immediately after opening. However, it turns out that this "job closed immediately after open" functionality is not perfect for the case of a job that is killed while it is opening. It causes AutodetectCommunicator.close() to be called instead of AutodetectCommunicator.killProcess(). Both do a lot of the same things, but AutodetectCommunicator.close() finalizes the job, and this can cause problems if the job is being killed as part of a feature reset. This change reinstates some of the functionality of #71656 but in a different place that hopefully won't reintroduce the problems that led to #74415. We can detect that a kill has happened early on during an open or close operation by checking if the task's allocation ID has been removed from the map after ProcessContext.setDying() returns true. If ProcessContext.setDying() returns true this means the job has not been previously closed, so it must have been killed. Then we can call AutodetectCommunicator.killProcess() instead of AutodetectCommunicator.close() during the cleanup that happens when we detect that a recently started process is no longer wanted. Relates #75069 Co-authored-by: David Roberts <[email protected]>

dimitris-athanasiou added the >non-issue, :ml, v8.0.0 and v7.14.0 labels Jun 22, 2021

elasticmachine added the Team:ML label Jun 22, 2021

Kill process should set job task closing

629dba2

dimitris-athanasiou changed the title ~~[ML] Abort opening job if close is requested during reset~~ [ML] Abort opening job if kill process is called before the process starts Jun 22, 2021

dimitris-athanasiou changed the title ~~[ML] Abort opening job if kill process is called before the process starts~~ [ML] Abort starting process if kill request is received Jun 22, 2021

Remove unnecessary this qualification

c654a4a

droberts195 approved these changes Jun 22, 2021

dimitris-athanasiou merged commit 9326f7b into elastic:master Jun 22, 2021

dimitris-athanasiou deleted the abort-opening-job-if-close-requested-during-reset branch June 22, 2021 16:11

dimitris-athanasiou mentioned this pull request Jun 22, 2021

[7.x][ML] Abort starting process if kill request is received (#74415) #74441

Merged

droberts195 pushed a commit to droberts195/elasticsearch that referenced this pull request Jul 7, 2021

Simplification

23d704d

droberts195 mentioned this pull request Jul 8, 2021

[ML] Fix race condition between job open, close and kill #75113

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
