@dimitris-athanasiou
Contributor

@dimitris-athanasiou dimitris-athanasiou commented Jun 22, 2021

While a job is opening, the kill process action may be called.
If the kill process action is received before the job process has started,
we currently start the process anyway. The process will eventually time out
trying to connect and will exit. However, it may cause an unexpected
failure if the job is opened again, as the job cannot launch a new process
while one already exists.

This commit ensures that JobTask.isClosing() reports true once
the kill process action has been called, so that opening the
process is aborted.
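
A minimal sketch of the idea, not the actual Elasticsearch code: `JobTaskSketch`, `OpenJobSketch`, `markClosing()`, `markKilled()` and `openJob()` are hypothetical names used only for illustration. It assumes the open path consults `isClosing()` immediately before launching the process, and that a kill request sets a flag that `isClosing()` also reports.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for JobTask; only the state relevant here is modelled.
final class JobTaskSketch {
    private final AtomicBoolean closeRequested = new AtomicBoolean(false);
    private final AtomicBoolean killRequested = new AtomicBoolean(false);

    // Assumed hook invoked by a normal close request.
    void markClosing() {
        closeRequested.set(true);
    }

    // Assumed hook invoked by the kill process action.
    void markKilled() {
        killRequested.set(true);
    }

    // A kill also counts as "closing", so the open path backs off.
    boolean isClosing() {
        return closeRequested.get() || killRequested.get();
    }
}

final class OpenJobSketch {
    // Returns true if the process was started, false if opening was aborted.
    static boolean openJob(JobTaskSketch task, Runnable startProcess) {
        if (task.isClosing()) {
            // A kill or close arrived first, so the process is never launched.
            return false;
        }
        startProcess.run();
        return true;
    }
}
```

With this shape, a kill that arrives before the start leaves no orphaned process behind, so a later open of the same job is not blocked by one.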

Closes #74141

This commit checks whether the job has been requested to close after
the reset action completes as part of allocating the job to a new node.
This ensures we do not proceed to start the job process when
the job has already been requested to close.
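
A rough sketch of re-checking after an asynchronous step, under stated assumptions: the real code is built on Elasticsearch listeners, whereas this uses `CompletableFuture` from the JDK purely for illustration. `ResetThenOpenSketch`, `closeRequested`, `processStarted` and `resetThenOpen()` are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

final class ResetThenOpenSketch {
    // Hypothetical flag standing in for "the job has been requested to close".
    final AtomicBoolean closeRequested = new AtomicBoolean(false);
    // Hypothetical flag recording whether the job process was started.
    final AtomicBoolean processStarted = new AtomicBoolean(false);

    // `reset` stands in for the asynchronous reset action.
    // The returned future yields true if the process was started.
    CompletableFuture<Boolean> resetThenOpen(CompletableFuture<Void> reset) {
        return reset.thenApply(ignored -> {
            // Re-check only now: a close request may have arrived
            // while the reset was still in flight.
            if (closeRequested.get()) {
                return false;
            }
            processStarted.set(true);
            return true;
        });
    }
}
```

The point of the sketch is the placement of the check: it runs inside the completion callback rather than before the reset is issued.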

Closes elastic#74141
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jun 22, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@dimitris-athanasiou dimitris-athanasiou changed the title [ML] Abort opening job if close is requested during reset [ML] Abort opening job if kill process is called before the process starts Jun 22, 2021
@dimitris-athanasiou dimitris-athanasiou changed the title [ML] Abort opening job if kill process is called before the process starts [ML] Abort starting process if kill request is received Jun 22, 2021
@dimitris-athanasiou
Contributor Author

run elasticsearch-ci/part-1

@droberts195 droberts195 left a comment

LGTM

@dimitris-athanasiou dimitris-athanasiou merged commit 9326f7b into elastic:master Jun 22, 2021
@dimitris-athanasiou dimitris-athanasiou deleted the abort-opening-job-if-close-requested-during-reset branch June 22, 2021 16:11
dimitris-athanasiou added a commit that referenced this pull request Jun 22, 2021
…#74441)

Backport of #74415
droberts195 pushed a commit to droberts195/elasticsearch that referenced this pull request Jul 7, 2021
The changes of elastic#74415 made some of the changes of elastic#71656
redundant. This commit deletes code from elastic#71656 that
would now never execute.
droberts195 pushed a commit to droberts195/elasticsearch that referenced this pull request Jul 8, 2021
This is a followup to elastic#74976.

The changes of elastic#74976 reverted many of the changes of elastic#71656
because elastic#74415 made them redundant. elastic#74415 did this by marking
killed jobs as closing, so that the standard "job closed immediately
after open" functionality was used instead of reissuing the kill
immediately after opening. However, it turns out that this
"job closed immediately after open" functionality is not
a perfect fit for a job that is killed while it is opening.
It causes AutodetectCommunicator.close() to be called instead
of AutodetectCommunicator.killProcess(). Both do many of the
same things, but AutodetectCommunicator.close() finalizes
the job, and this can cause problems if the job is being killed
as part of a feature reset.

This change reinstates some of the functionality of elastic#71656
but in a different place that hopefully won't reintroduce the
problems that led to elastic#74415.

We can detect that a kill has happened early on during an
open or close operation by checking if the task's allocation
ID has been removed from the map after ProcessContext.setDying()
returns true. If ProcessContext.setDying() returns true this
means the job has not been previously closed, so it must have
been killed. Then we can call AutodetectCommunicator.killProcess()
instead of AutodetectCommunicator.close() during the cleanup
that happens when we detect that a recently started process is
no longer wanted.
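
The decision described above can be sketched roughly as follows. This is an illustration only: `KillDetectionSketch`, its nested `ProcessContext`, the `Cleanup` enum, `contextsByAllocationId` and `chooseCleanup()` are hypothetical simplifications, and the real ProcessContext and AutodetectCommunicator carry much more state. The sketch assumes `setDying()` succeeds only for the first caller and that the kill action removes the allocation ID from the map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

final class KillDetectionSketch {
    enum State { RUNNING, DYING }

    // Which cleanup the caller should perform on the unwanted process.
    enum Cleanup { NONE, KILL_PROCESS, CLOSE }

    // Hypothetical simplification of ProcessContext.
    static final class ProcessContext {
        private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);

        // True only for the first caller, i.e. the job was not already closed.
        boolean setDying() {
            return state.compareAndSet(State.RUNNING, State.DYING);
        }
    }

    // Assumed map from task allocation ID to process context.
    final Map<Long, ProcessContext> contextsByAllocationId = new ConcurrentHashMap<>();

    // Called when a recently started process turns out to be unwanted.
    Cleanup chooseCleanup(long allocationId, ProcessContext context) {
        if (context.setDying() == false) {
            // Already dying: a previous close is handling the cleanup.
            return Cleanup.NONE;
        }
        if (contextsByAllocationId.containsKey(allocationId) == false) {
            // Not previously closed, yet the entry is gone: it was killed,
            // so the job must not be finalized.
            return Cleanup.KILL_PROCESS;
        }
        return Cleanup.CLOSE;
    }
}
```

A caller would map `KILL_PROCESS` to AutodetectCommunicator.killProcess() and `CLOSE` to AutodetectCommunicator.close().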

Relates elastic#75069
elasticsearchmachine pushed a commit that referenced this pull request Jul 8, 2021
elasticsearchmachine pushed a commit to elasticsearchmachine/elasticsearch that referenced this pull request Jul 8, 2021
elasticsearchmachine pushed a commit to elasticsearchmachine/elasticsearch that referenced this pull request Jul 8, 2021
elasticsearchmachine added a commit that referenced this pull request Jul 8, 2021
…5116)

Co-authored-by: David Roberts <[email protected]>
elasticsearchmachine added a commit that referenced this pull request Jul 8, 2021
…5117)

Co-authored-by: David Roberts <[email protected]>
Labels

:ml Machine learning >non-issue Team:ML Meta label for the ML team v7.14.0 v8.0.0-alpha1

Development

Successfully merging this pull request may close these issues.

[CI] XPackRestIT test {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled} failing
