
Conversation

@droberts195

This change fixes a race condition that would result in an
in-memory data structure becoming out-of-sync with persistent
tasks in cluster state.

If repeated often enough, this could make it impossible to open
any ML jobs on the affected node: the master node would think the
node had capacity to open another job, but the chosen node would
error during the open sequence because its in-memory data
structure was full.

The race could be triggered by opening a job and then closing
it a tiny fraction of a second later. It is unlikely a user
of the UI could open and close the job that fast, but a script
or program calling the REST API could.

The nasty thing is that, from the externally observable states and
stats, everything would appear to be fine: the fast open-then-close
sequence would appear to leave the job in the closed state. It's
only later that the leftovers in the in-memory data structure might
build up and cause a problem.
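
To make the failure mode concrete, here is a minimal sketch of the kind of node-local bookkeeping involved. The class and field names below are simplified stand-ins for illustration, not the actual Elasticsearch code: the master assigns jobs based on persistent tasks in cluster state, while the node gates the open on an in-memory map, so leaked entries in that map eventually make the node reject jobs the master believes it can take.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for the node-local tracking of running ML processes.
class NodeJobRegistry {
    private final int maxOpenJobs;
    private final Map<Long, String> processByAllocation = new ConcurrentHashMap<>();

    NodeJobRegistry(int maxOpenJobs) {
        this.maxOpenJobs = maxOpenJobs;
    }

    // Called when the master assigns a job to this node. The master's decision
    // is based on cluster state, so if this map has accumulated entries for jobs
    // that are really closed, the open fails here even though cluster state says
    // the node has spare capacity.
    void registerJob(long allocationId, String jobId) {
        if (processByAllocation.size() >= maxOpenJobs) {
            throw new IllegalStateException(
                "max number of jobs [" + maxOpenJobs + "] already open on this node");
        }
        processByAllocation.put(allocationId, jobId);
    }

    // Every close path must call this; a path that skips it leaks one slot.
    void unregisterJob(long allocationId) {
        processByAllocation.remove(allocationId);
    }
}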

@elasticmachine (Collaborator)

Pinging @elastic/ml-core

        logger.debug("Aborted opening job [{}] as it has been closed", jobId);
        return;
    }
    if (processContext.getState() != ProcessContext.ProcessStateName.NOT_RUNNING) {
@droberts195 (Author)

This check needs to be done under lock to prevent a close sneaking in after this check but before the open actions are performed.
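
As a rough sketch of that pattern (assuming a per-job ReentrantLock here; the real ProcessContext may guard its state differently), the state check and the open actions share one critical section, and the close path takes the same lock, so a close can no longer slip in between the check and the start:

import java.util.concurrent.locks.ReentrantLock;

// Hypothetical per-job context, for illustration only.
class JobProcessContext {
    enum State { NOT_RUNNING, RUNNING, DYING }

    private final ReentrantLock lock = new ReentrantLock();
    private State state = State.NOT_RUNNING;

    void open(Runnable startProcess) {
        lock.lock();
        try {
            // Check and act under the same lock: a concurrent close either runs
            // to completion before this block or waits until the process is started.
            if (state != State.NOT_RUNNING) {
                return;
            }
            startProcess.run();
            state = State.RUNNING;
        } finally {
            lock.unlock();
        }
    }

    void close(Runnable stopProcess) {
        lock.lock();
        try {
            if (state == State.RUNNING) {
                stopProcess.run();
            }
            state = State.DYING;
        } finally {
            lock.unlock();
        }
    }
}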

    if (communicator == null) {
        logger.debug("Job [{}] is being closed before its process is started", jobId);
        jobTask.markAsCompleted();
        return;
@droberts195 (Author)

Returning here without removing the entry from processByAllocation is what eventually fills processByAllocation to capacity. It's nasty because jobTask.markAsCompleted() makes all the externally visible effects of closing the job look as expected; it's only the internal data structure that's left with a spurious entry.
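
For illustration, here is a self-contained sketch of a close path that does not leak the entry; the types and method shapes are placeholders and not necessarily how the real fix is structured. The point is simply that the early-return branch also removes the map entry before the task is marked completed.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Placeholder types for illustration; not the real Elasticsearch classes.
class CloseSketch {
    interface JobTask { void markAsCompleted(); }

    private final Map<Long, Object> processByAllocation = new ConcurrentHashMap<>();

    void closeJob(long allocationId, Object communicator, JobTask jobTask) {
        if (communicator == null) {
            // Closing before the process ever started: remove the node-local entry
            // too, otherwise it lingers after the task looks completed externally.
            processByAllocation.remove(allocationId);
            jobTask.markAsCompleted();
            return;
        }
        // Normal close path: stop the native process, then clean up the entry.
        stopProcess(communicator);
        processByAllocation.remove(allocationId);
        jobTask.markAsCompleted();
    }

    private void stopProcess(Object communicator) {
        // Omitted: shut the process down via the communicator.
    }
}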

@davidkyle (Member) left a comment

LGTM

@dimitris-athanasiou (Contributor) left a comment

LGTM. That was tricky to find! Thanks for fixing this!

@droberts195 (Author)

Jenkins run elasticsearch-ci/1

3 similar comments

@droberts195 droberts195 merged commit 5eb38ec into elastic:master May 24, 2019
@droberts195 droberts195 deleted the fix_race_on_close_during_open branch May 24, 2019 19:04
droberts195 pushed a commit that referenced this pull request May 24, 2019
droberts195 pushed a commit that referenced this pull request May 24, 2019
droberts195 pushed a commit that referenced this pull request May 24, 2019
droberts195 pushed a commit that referenced this pull request May 24, 2019
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
…#42506)
