-
Notifications
You must be signed in to change notification settings - Fork 59
[support bundle] Monitor for cancellation without accidental task-blocking #9268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this fix looks correct and I'm happy to merge it as-is. I had a few small notes, hopefully some of them are useful!
// Returns if the bundle has been cancelled explicitly, or if we | ||
// cannot successfully check the bundle state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, this makes me wonder if we actually want to have the cancellation checking task bail out if it encounters a DB error. might we want to just try again in another yield_interval
seconds, in that case?
i suppose we really would want to handle different error types there, so that it bails out on e.g. NotFound
s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in aac4dc7
// Returns if the bundle has been cancelled explicitly, or if we | ||
// cannot successfully check the bundle state. | ||
why = &mut check_for_cancellation => { | ||
let why = why.expect("Should not cancel the bundle-checking task without returning"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
turbo nit: can we wrap this string at the 80th column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (actually removed, but, done)
}, | ||
// Otherwise, keep making progress on the collection itself. | ||
report = &mut collection => { | ||
check_for_cancellation.abort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, take it or leave it: might we consider wrapping check_for_cancellation
in an AbortOnDropHandle
, rather than doing this explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I ended up taking a slightly different path here. Looking at this again, I kinda asked myself:
Why is this work in w tokio task, but the collection work isn't?
I restructured this code slightly:
- Both cancellation and collection are "just futures" now
- When we select on them, we don't select on "&mut future" anymore (which is good - if we take one branch, we'll cancel the other immediately)
- Now there's no need to "abort" the other task, nor pin either future
return why; | ||
}, | ||
// Otherwise, keep making progress on the collection itself. | ||
report = &mut collection => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: since there's now only one select!
, rather than a select!
in a loop, we don't need the &mut
here (though we do need the &mut check_for_cancellation
in the other select arm, so that we can cancel it here, but using AbortOnDropHandle
would obviate that...)
report = &mut collection => { | |
report = collection => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done - for both futures, now!
// We run this task to "check for cancellation" as a whole tokio task | ||
// for a critical, but subtle reason: After the tick timer yields, | ||
// we may then try to "await" a database function. | ||
// | ||
// This, at a surface-level glance seems innocent enough. However, there | ||
// is something potentially insidious here: if calling a datastore | ||
// function - such as "support_bundle_get" - blocks on acquiring access | ||
// to a connection from the connection pool, while creating the | ||
// collection ALSO potentially blocks on acquiring access to the | ||
// connection pool, it is possible for: | ||
// | ||
// 1. The "&mut collection" arm to have created a future, currently | ||
// yielded, which wants access to this underlying resource. | ||
// 2. The current task of execution, in "support_bundle_get", to be | ||
// blocked "await-ing" for this same underlying resource. | ||
// | ||
// In this specific case, the connection pool would be attempting to | ||
// yield to the "&mut collection" arm, which cannot run, if we were | ||
// blocking on the body of a different async select arm. This would | ||
// result in a deadlock. | ||
// | ||
// In the future, we may attempt to make access to the connection pool | ||
// safer from concurrent asynchronous access - it is unsettling that | ||
// multiple concurrent ".claim()" functions can cause this behavior - | ||
// but in the meantime, we spawn this cancellation check in an entirely | ||
// new tokio task. Because of this separation, each task (the one | ||
// checking for cancellation, and the main thread attempting to collect | ||
// the bundle) do not risk preventing the other from being polled | ||
// indefinitely. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for writing this up, this is excellent. I have a couple very small nits:
- I know we often colloquially refer to async tasks that are waiting for something as 'blocking" on that thing, but I might avoid that here to make sure the reader doesn't confuse this with the notion of actually blocking the worker thread
- I changed a use of the word "task", which I think refers to the general concept of "a thing we are doing", to "operation", to make it clear that we are not referring to a Tokio task there
- It might be worth including a link to Nexus node timing out on API requests #9259?
// We run this task to "check for cancellation" as a whole tokio task | |
// for a critical, but subtle reason: After the tick timer yields, | |
// we may then try to "await" a database function. | |
// | |
// This, at a surface-level glance seems innocent enough. However, there | |
// is something potentially insidious here: if calling a datastore | |
// function - such as "support_bundle_get" - blocks on acquiring access | |
// to a connection from the connection pool, while creating the | |
// collection ALSO potentially blocks on acquiring access to the | |
// connection pool, it is possible for: | |
// | |
// 1. The "&mut collection" arm to have created a future, currently | |
// yielded, which wants access to this underlying resource. | |
// 2. The current task of execution, in "support_bundle_get", to be | |
// blocked "await-ing" for this same underlying resource. | |
// | |
// In this specific case, the connection pool would be attempting to | |
// yield to the "&mut collection" arm, which cannot run, if we were | |
// blocking on the body of a different async select arm. This would | |
// result in a deadlock. | |
// | |
// In the future, we may attempt to make access to the connection pool | |
// safer from concurrent asynchronous access - it is unsettling that | |
// multiple concurrent ".claim()" functions can cause this behavior - | |
// but in the meantime, we spawn this cancellation check in an entirely | |
// new tokio task. Because of this separation, each task (the one | |
// checking for cancellation, and the main thread attempting to collect | |
// the bundle) do not risk preventing the other from being polled | |
// indefinitely. | |
// We run this task to "check for cancellation" as a whole tokio task | |
// for a critical, but subtle reason: After the tick timer yields, | |
// we may then try to `await` a database function. | |
// | |
// This, at a surface-level glance seems innocent enough. However, there | |
// is something potentially insidious here: if calling a datastore | |
// function - such as "support_bundle_get" - awaits acquiring access | |
// to a connection from the connection pool, while creating the | |
// collection ALSO potentially awaits acquiring access to the | |
// connection pool, it is possible for: | |
// | |
// 1. The `&mut collection` arm to have created a future, currently | |
// yielded, which wants access to this underlying resource. | |
// 2. The current operation executing in `support_bundle_get` to | |
// be awaiting access to this same underlying resource. | |
// | |
// In this specific case, the connection pool would be attempting to | |
// yield to the `&mut collection` arm, which cannot run, if we were | |
// awaiting in the body of a different async select arm. This would | |
// result in a deadlock. | |
// | |
// In the future, we may attempt to make access to the connection pool | |
// safer from concurrent asynchronous access - it is unsettling that | |
// multiple concurrent `.claim()` functions can cause this behavior - | |
// but in the meantime, we spawn this cancellation check in an entirely | |
// new tokio task. Because of this separation, each task (the one | |
// checking for cancellation, and the main thread attempting to collect | |
// the bundle) do not risk preventing the other from being polled | |
// indefinitely. | |
// | |
// For more details, see: | |
// https://github.com/oxidecomputer/omicron/issues/9259 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, i took a revised version of this, revised because we are not spawning tasks.
}, | ||
} | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally optional style nit - this could maybe move to its own method, which lets this method be entirely focused on "we're selecting on two futures and this is why it's written this way"; e.g.
let collection = self.collect_bundle_as_file(&dir);
let check_for_cancellation = self.wait_for_cancellation();
// ... many comments ...
tokio::select! { .. }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 4281d27
Fixes #9259