
Conversation

@smklein (Collaborator) commented Oct 22, 2025

Fixes #9259

@hawkw self-requested a review October 22, 2025 01:40
@smklein marked this pull request as ready for review October 22, 2025 17:38
@hawkw (Member) left a comment

Overall, this fix looks correct and I'm happy to merge it as-is. I had a few small notes, hopefully some of them are useful!

Comment on lines 569 to 570
// Returns if the bundle has been cancelled explicitly, or if we
// cannot successfully check the bundle state.
Member:

hmm, this makes me wonder if we actually want to have the cancellation checking task bail out if it encounters a DB error. might we want to just try again in another yield_interval seconds, in that case?

i suppose we really would want to handle different error types there, so that it bails out on e.g. NotFounds.
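
A rough sketch of that idea, with stand-in types (the real code would use the Nexus datastore and its error types; every name below is an illustrative placeholder, not the actual API):

    use std::time::Duration;

    // Hypothetical stand-ins for the result of looking up the bundle.
    enum BundleLookup {
        Active,
        Cancelled,
        NotFound,
        TransientDbError,
    }

    enum CancelReason {
        Cancelled,
        NotFound,
    }

    // Placeholder for a datastore call such as "support_bundle_get".
    async fn lookup_bundle_state() -> BundleLookup {
        BundleLookup::Active
    }

    async fn check_for_cancellation(yield_interval: Duration) -> CancelReason {
        let mut ticker = tokio::time::interval(yield_interval);
        loop {
            ticker.tick().await;
            match lookup_bundle_state().await {
                // Explicitly cancelled: stop the collection.
                BundleLookup::Cancelled => return CancelReason::Cancelled,
                // The bundle is gone entirely: nothing left to collect.
                BundleLookup::NotFound => return CancelReason::NotFound,
                // Still active, or a transient DB error: check again on the
                // next tick instead of bailing out of the whole collection.
                BundleLookup::Active | BundleLookup::TransientDbError => continue,
            }
        }
    }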

Collaborator Author:

Updated in aac4dc7

// Returns if the bundle has been cancelled explicitly, or if we
// cannot successfully check the bundle state.
why = &mut check_for_cancellation => {
let why = why.expect("Should not cancel the bundle-checking task without returning");
Member:

turbo nit: can we wrap this string at the 80th column?

Collaborator Author:

Done (actually removed, but, done)

},
// Otherwise, keep making progress on the collection itself.
report = &mut collection => {
check_for_cancellation.abort();
Member:

nit, take it or leave it: might we consider wrapping check_for_cancellation in an AbortOnDropHandle, rather than doing this explicitly?
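
For reference, a minimal sketch of that suggestion, assuming tokio_util's AbortOnDropHandle (which aborts the wrapped task when the handle is dropped). The futures here are illustrative placeholders, not the PR's actual code:

    use std::time::Duration;
    use tokio_util::task::AbortOnDropHandle;

    async fn collect_with_cancellation() {
        // Placeholder for the real collection work.
        let collection = async {
            tokio::time::sleep(Duration::from_secs(5)).await;
            "report"
        };
        // Wrap the JoinHandle so the spawned task is aborted whenever the
        // handle is dropped, instead of calling `.abort()` by hand.
        let check_for_cancellation =
            AbortOnDropHandle::new(tokio::task::spawn(async {
                tokio::time::sleep(Duration::from_secs(1)).await;
                "cancelled"
            }));

        tokio::select! {
            why = check_for_cancellation => {
                // `why` is a Result: Err if the task panicked or was aborted.
                println!("cancellation check finished first: {why:?}");
            }
            report = collection => {
                // The losing branch (the AbortOnDropHandle) is dropped by
                // select!, which aborts the spawned task automatically.
                println!("collection finished first: {report}");
            }
        }
    }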

Collaborator Author:

So, I ended up taking a slightly different path here. Looking at this again, I kinda asked myself:

Why is this work in a tokio task, but the collection work isn't?

I restructured this code slightly (rough sketch below):

  • Both cancellation and collection are "just futures" now
  • When we select on them, we don't select on "&mut future" anymore (which is good - if we take one branch, we'll cancel the other immediately)
  • Now there's no need to "abort" the other task, nor pin either future
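
Roughly, the shape that description implies (with illustrative stand-in futures rather than the actual collection code):

    use std::time::Duration;

    // Illustrative stand-ins for the real collection / cancellation futures.
    async fn collect_bundle() -> &'static str {
        tokio::time::sleep(Duration::from_secs(5)).await;
        "report"
    }

    async fn wait_for_cancellation() -> &'static str {
        tokio::time::sleep(Duration::from_secs(60)).await;
        "cancelled"
    }

    async fn run() -> &'static str {
        // Both sides are plain, unspawned futures...
        let collection = collect_bundle();
        let check_for_cancellation = wait_for_cancellation();

        // ...selected by value, exactly once. Whichever branch completes
        // first, the losing future is dropped by select!, which cancels its
        // work: no `&mut`, no pinning, and no task to abort.
        tokio::select! {
            why = check_for_cancellation => why,
            report = collection => report,
        }
    }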

return why;
},
// Otherwise, keep making progress on the collection itself.
report = &mut collection => {
Member:

Nit: since there's now only one select!, rather than a select! in a loop, we don't need the &mut here (though we do need the &mut check_for_cancellation in the other select arm, so that we can cancel it here, but using AbortOnDropHandle would obviate that...)

Suggested change:
- report = &mut collection => {
+ report = collection => {

Collaborator Author:

Done - for both futures, now!

Comment on lines 516 to 544
// We run this task to "check for cancellation" as a whole tokio task
// for a critical, but subtle reason: After the tick timer yields,
// we may then try to "await" a database function.
//
// This, at a surface-level glance seems innocent enough. However, there
// is something potentially insidious here: if calling a datastore
// function - such as "support_bundle_get" - blocks on acquiring access
// to a connection from the connection pool, while creating the
// collection ALSO potentially blocks on acquiring access to the
// connection pool, it is possible for:
//
// 1. The "&mut collection" arm to have created a future, currently
// yielded, which wants access to this underlying resource.
// 2. The current task of execution, in "support_bundle_get", to be
// blocked "await-ing" for this same underlying resource.
//
// In this specific case, the connection pool would be attempting to
// yield to the "&mut collection" arm, which cannot run, if we were
// blocking on the body of a different async select arm. This would
// result in a deadlock.
//
// In the future, we may attempt to make access to the connection pool
// safer from concurrent asynchronous access - it is unsettling that
// multiple concurrent ".claim()" functions can cause this behavior -
// but in the meantime, we spawn this cancellation check in an entirely
// new tokio task. Because of this separation, each task (the one
// checking for cancellation, and the main thread attempting to collect
// the bundle) do not risk preventing the other from being polled
// indefinitely.
Member:

Thanks for writing this up, this is excellent. I have a couple very small nits:

  • I know we often colloquially refer to async tasks that are waiting for something as "blocking" on that thing, but I might avoid that here to make sure the reader doesn't confuse this with the notion of actually blocking the worker thread
  • I changed a use of the word "task", which I think refers to the general concept of "a thing we are doing", to "operation", to make it clear that we are not referring to a Tokio task there
  • It might be worth including a link to #9259 (Nexus node timing out on API requests)?
Suggested change (the original comment, followed by the revised version):
// We run this task to "check for cancellation" as a whole tokio task
// for a critical, but subtle reason: After the tick timer yields,
// we may then try to "await" a database function.
//
// This, at a surface-level glance seems innocent enough. However, there
// is something potentially insidious here: if calling a datastore
// function - such as "support_bundle_get" - blocks on acquiring access
// to a connection from the connection pool, while creating the
// collection ALSO potentially blocks on acquiring access to the
// connection pool, it is possible for:
//
// 1. The "&mut collection" arm to have created a future, currently
// yielded, which wants access to this underlying resource.
// 2. The current task of execution, in "support_bundle_get", to be
// blocked "await-ing" for this same underlying resource.
//
// In this specific case, the connection pool would be attempting to
// yield to the "&mut collection" arm, which cannot run, if we were
// blocking on the body of a different async select arm. This would
// result in a deadlock.
//
// In the future, we may attempt to make access to the connection pool
// safer from concurrent asynchronous access - it is unsettling that
// multiple concurrent ".claim()" functions can cause this behavior -
// but in the meantime, we spawn this cancellation check in an entirely
// new tokio task. Because of this separation, each task (the one
// checking for cancellation, and the main thread attempting to collect
// the bundle) do not risk preventing the other from being polled
// indefinitely.
// We run this task to "check for cancellation" as a whole tokio task
// for a critical, but subtle reason: After the tick timer yields,
// we may then try to `await` a database function.
//
// This, at a surface-level glance seems innocent enough. However, there
// is something potentially insidious here: if calling a datastore
// function - such as "support_bundle_get" - awaits acquiring access
// to a connection from the connection pool, while creating the
// collection ALSO potentially awaits acquiring access to the
// connection pool, it is possible for:
//
// 1. The `&mut collection` arm to have created a future, currently
// yielded, which wants access to this underlying resource.
// 2. The current operation executing in `support_bundle_get` to
// be awaiting access to this same underlying resource.
//
// In this specific case, the connection pool would be attempting to
// yield to the `&mut collection` arm, which cannot run, if we were
// awaiting in the body of a different async select arm. This would
// result in a deadlock.
//
// In the future, we may attempt to make access to the connection pool
// safer from concurrent asynchronous access - it is unsettling that
// multiple concurrent `.claim()` functions can cause this behavior -
// but in the meantime, we spawn this cancellation check in an entirely
// new tokio task. Because of this separation, each task (the one
// checking for cancellation, and the main thread attempting to collect
// the bundle) do not risk preventing the other from being polled
// indefinitely.
//
// For more details, see:
// https://github.com/oxidecomputer/omicron/issues/9259

Collaborator Author:

Sounds good, I took a revised version of this (revised because we are no longer spawning tasks).
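
For anyone skimming later, a toy reproduction of the hazard the comment describes, with a one-permit Semaphore standing in for the connection pool (illustrative only; the real code talks to the Nexus datastore, not a Semaphore):

    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::Semaphore;

    async fn deadlock_shape() {
        let pool = Arc::new(Semaphore::new(1));

        // "Collection": grabs the only connection, works a while, releases it.
        let collection = {
            let pool = Arc::clone(&pool);
            async move {
                let _conn = pool.acquire().await.unwrap();
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
        };
        tokio::pin!(collection);

        let mut ticker = tokio::time::interval(Duration::from_millis(10));
        loop {
            tokio::select! {
                _ = ticker.tick() => {
                    // Danger: this arm body awaits the same pool. While we sit
                    // here, `collection` (a future on this very task) cannot be
                    // polled, so if it already holds (or has been handed) the
                    // only permit, neither side ever makes progress.
                    let _conn = pool.acquire().await.unwrap();
                    // ... check the bundle state using the connection ...
                }
                _ = &mut collection => break,
            }
        }
    }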

@smklein changed the title from "[support bundle] Create a new task to monitor for cancellation" to "[support bundle] Monitor for cancellation without accidental task-blocking" Oct 23, 2025
},
}
}
};
Contributor:

Totally optional style nit - this could maybe move to its own method, which lets this method be entirely focused on "we're selecting on two futures and this is why it's written this way"; e.g.

let collection = self.collect_bundle_as_file(&dir);
let check_for_cancellation = self.wait_for_cancellation();

// ... many comments ...

tokio::select! { .. }

Collaborator Author:

Done in 4281d27
