
Conversation

@smklein (Collaborator) commented Oct 22, 2025

Fixes #9259

@hawkw self-requested a review October 22, 2025 01:40
@smklein marked this pull request as ready for review October 22, 2025 17:38
@hawkw (Member) left a comment

Overall, this fix looks correct and I'm happy to merge it as-is. I had a few small notes, hopefully some of them are useful!

Comment on lines 569 to 570
// Returns if the bundle has been cancelled explicitly, or if we
// cannot successfully check the bundle state.
Member:

hmm, this makes me wonder if we actually want to have the cancellation checking task bail out if it encounters a DB error. might we want to just try again in another yield_interval seconds, in that case?

i suppose we really would want to handle different error types there, so that it bails out on e.g. NotFounds.
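
A rough sketch of that idea, with stand-in types (the real code would use the Nexus datastore and its error types; every name below is an illustrative placeholder, not the actual API):

    use std::time::Duration;

    // Hypothetical stand-ins for the result of looking up the bundle.
    enum BundleLookup {
        Active,
        Cancelled,
        NotFound,
        TransientDbError,
    }

    enum CancelReason {
        Cancelled,
        NotFound,
    }

    // Placeholder for a datastore call such as "support_bundle_get".
    async fn lookup_bundle_state() -> BundleLookup {
        BundleLookup::Active
    }

    async fn check_for_cancellation(yield_interval: Duration) -> CancelReason {
        let mut ticker = tokio::time::interval(yield_interval);
        loop {
            ticker.tick().await;
            match lookup_bundle_state().await {
                // Explicitly cancelled: stop the collection.
                BundleLookup::Cancelled => return CancelReason::Cancelled,
                // The bundle is gone entirely: nothing left to collect.
                BundleLookup::NotFound => return CancelReason::NotFound,
                // Still active, or a transient DB error: check again on the
                // next tick instead of bailing out of the whole collection.
                BundleLookup::Active | BundleLookup::TransientDbError => continue,
            }
        }
    }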

Collaborator Author:

Updated in aac4dc7

// Returns if the bundle has been cancelled explicitly, or if we
// cannot successfully check the bundle state.
why = &mut check_for_cancellation => {
let why = why.expect("Should not cancel the bundle-checking task without returning");
Member:

turbo nit: can we wrap this string at the 80th column?

Collaborator Author:

Done (actually removed, but, done)

},
// Otherwise, keep making progress on the collection itself.
report = &mut collection => {
check_for_cancellation.abort();
Member:

nit, take it or leave it: might we consider wrapping check_for_cancellation in an AbortOnDropHandle, rather than doing this explicitly?
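
For reference, a minimal sketch of that suggestion, assuming tokio_util's AbortOnDropHandle (which aborts the wrapped task when the handle is dropped). The futures here are illustrative placeholders, not the PR's actual code:

    use std::time::Duration;
    use tokio_util::task::AbortOnDropHandle;

    async fn collect_with_cancellation() {
        // Placeholder for the real collection work.
        let collection = async {
            tokio::time::sleep(Duration::from_secs(5)).await;
            "report"
        };
        // Wrap the JoinHandle so the spawned task is aborted whenever the
        // handle is dropped, instead of calling `.abort()` by hand.
        let check_for_cancellation =
            AbortOnDropHandle::new(tokio::task::spawn(async {
                tokio::time::sleep(Duration::from_secs(1)).await;
                "cancelled"
            }));

        tokio::select! {
            why = check_for_cancellation => {
                // `why` is a Result: Err if the task panicked or was aborted.
                println!("cancellation check finished first: {why:?}");
            }
            report = collection => {
                // The losing branch (the AbortOnDropHandle) is dropped by
                // select!, which aborts the spawned task automatically.
                println!("collection finished first: {report}");
            }
        }
    }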

Collaborator Author:

So, I ended up taking a slightly different path here. Looking at this again, I kinda asked myself:

Why is this work in a tokio task, but the collection work isn't?

I restructured this code slightly (rough sketch below):

  • Both cancellation and collection are "just futures" now
  • When we select on them, we don't select on "&mut future" anymore (which is good - if we take one branch, we'll cancel the other immediately)
  • Now there's no need to "abort" the other task, nor pin either future
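
Roughly, the shape that description implies (with illustrative stand-in futures rather than the actual collection code):

    use std::time::Duration;

    // Illustrative stand-ins for the real collection / cancellation futures.
    async fn collect_bundle() -> &'static str {
        tokio::time::sleep(Duration::from_secs(5)).await;
        "report"
    }

    async fn wait_for_cancellation() -> &'static str {
        tokio::time::sleep(Duration::from_secs(60)).await;
        "cancelled"
    }

    async fn run() -> &'static str {
        // Both sides are plain, unspawned futures...
        let collection = collect_bundle();
        let check_for_cancellation = wait_for_cancellation();

        // ...selected by value, exactly once. Whichever branch completes
        // first, the losing future is dropped by select!, which cancels its
        // work: no `&mut`, no pinning, and no task to abort.
        tokio::select! {
            why = check_for_cancellation => why,
            report = collection => report,
        }
    }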

return why;
},
// Otherwise, keep making progress on the collection itself.
report = &mut collection => {
Member:

Nit: since there's now only one select!, rather than a select! in a loop, we don't need the &mut here (though we do need the &mut check_for_cancellation in the other select arm, so that we can cancel it here, but using AbortOnDropHandle would obviate that...)

Suggested change:
- report = &mut collection => {
+ report = collection => {

Collaborator Author:

Done - for both futures, now!

Comment on lines 516 to 544
// We run this task to "check for cancellation" as a whole tokio task
// for a critical, but subtle reason: After the tick timer yields,
// we may then try to "await" a database function.
//
// This, at a surface-level glance seems innocent enough. However, there
// is something potentially insidious here: if calling a datastore
// function - such as "support_bundle_get" - blocks on acquiring access
// to a connection from the connection pool, while creating the
// collection ALSO potentially blocks on acquiring access to the
// connection pool, it is possible for:
//
// 1. The "&mut collection" arm to have created a future, currently
// yielded, which wants access to this underlying resource.
// 2. The current task of execution, in "support_bundle_get", to be
// blocked "await-ing" for this same underlying resource.
//
// In this specific case, the connection pool would be attempting to
// yield to the "&mut collection" arm, which cannot run, if we were
// blocking on the body of a different async select arm. This would
// result in a deadlock.
//
// In the future, we may attempt to make access to the connection pool
// safer from concurrent asynchronous access - it is unsettling that
// multiple concurrent ".claim()" functions can cause this behavior -
// but in the meantime, we spawn this cancellation check in an entirely
// new tokio task. Because of this separation, each task (the one
// checking for cancellation, and the main thread attempting to collect
// the bundle) do not risk preventing the other from being polled
// indefinitely.
Member:

Thanks for writing this up, this is excellent. I have a couple very small nits:

  • I know we often colloquially refer to async tasks that are waiting for something as "blocking" on that thing, but I might avoid that here to make sure the reader doesn't confuse this with the notion of actually blocking the worker thread
  • I changed a use of the word "task", which I think refers to the general concept of "a thing we are doing", to "operation", to make it clear that we are not referring to a Tokio task there
  • It might be worth including a link to #9259 (Nexus node timing out on API requests)?
Suggested change (the original comment, followed by the revised version):
// We run this task to "check for cancellation" as a whole tokio task
// for a critical, but subtle reason: After the tick timer yields,
// we may then try to "await" a database function.
//
// This, at a surface-level glance seems innocent enough. However, there
// is something potentially insidious here: if calling a datastore
// function - such as "support_bundle_get" - blocks on acquiring access
// to a connection from the connection pool, while creating the
// collection ALSO potentially blocks on acquiring access to the
// connection pool, it is possible for:
//
// 1. The "&mut collection" arm to have created a future, currently
// yielded, which wants access to this underlying resource.
// 2. The current task of execution, in "support_bundle_get", to be
// blocked "await-ing" for this same underlying resource.
//
// In this specific case, the connection pool would be attempting to
// yield to the "&mut collection" arm, which cannot run, if we were
// blocking on the body of a different async select arm. This would
// result in a deadlock.
//
// In the future, we may attempt to make access to the connection pool
// safer from concurrent asynchronous access - it is unsettling that
// multiple concurrent ".claim()" functions can cause this behavior -
// but in the meantime, we spawn this cancellation check in an entirely
// new tokio task. Because of this separation, each task (the one
// checking for cancellation, and the main thread attempting to collect
// the bundle) do not risk preventing the other from being polled
// indefinitely.
// We run this task to "check for cancellation" as a whole tokio task
// for a critical, but subtle reason: After the tick timer yields,
// we may then try to `await` a database function.
//
// This, at a surface-level glance seems innocent enough. However, there
// is something potentially insidious here: if calling a datastore
// function - such as "support_bundle_get" - awaits acquiring access
// to a connection from the connection pool, while creating the
// collection ALSO potentially awaits acquiring access to the
// connection pool, it is possible for:
//
// 1. The `&mut collection` arm to have created a future, currently
// yielded, which wants access to this underlying resource.
// 2. The current operation executing in `support_bundle_get` to
// be awaiting access to this same underlying resource.
//
// In this specific case, the connection pool would be attempting to
// yield to the `&mut collection` arm, which cannot run, if we were
// awaiting in the body of a different async select arm. This would
// result in a deadlock.
//
// In the future, we may attempt to make access to the connection pool
// safer from concurrent asynchronous access - it is unsettling that
// multiple concurrent `.claim()` functions can cause this behavior -
// but in the meantime, we spawn this cancellation check in an entirely
// new tokio task. Because of this separation, each task (the one
// checking for cancellation, and the main thread attempting to collect
// the bundle) do not risk preventing the other from being polled
// indefinitely.
//
// For more details, see:
// https://github.com/oxidecomputer/omicron/issues/9259

Collaborator Author:

Sounds good, I took a revised version of this (revised because we are no longer spawning tasks).
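
For anyone skimming later, a toy reproduction of the hazard the comment describes, with a one-permit Semaphore standing in for the connection pool (illustrative only; the real code talks to the Nexus datastore, not a Semaphore):

    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::Semaphore;

    async fn deadlock_shape() {
        let pool = Arc::new(Semaphore::new(1));

        // "Collection": grabs the only connection, works a while, releases it.
        let collection = {
            let pool = Arc::clone(&pool);
            async move {
                let _conn = pool.acquire().await.unwrap();
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
        };
        tokio::pin!(collection);

        let mut ticker = tokio::time::interval(Duration::from_millis(10));
        loop {
            tokio::select! {
                _ = ticker.tick() => {
                    // Danger: this arm body awaits the same pool. While we sit
                    // here, `collection` (a future on this very task) cannot be
                    // polled, so if it already holds (or has been handed) the
                    // only permit, neither side ever makes progress.
                    let _conn = pool.acquire().await.unwrap();
                    // ... check the bundle state using the connection ...
                }
                _ = &mut collection => break,
            }
        }
    }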

@smklein changed the title from "[support bundle] Create a new task to monitor for cancellation" to "[support bundle] Monitor for cancellation without accidental task-blocking" Oct 23, 2025
},
}
}
};
Contributor:

Totally optional style nit - this could maybe move to its own method, which lets this method be entirely focused on "we're selecting on two futures and this is why it's written this way"; e.g.

let collection = self.collect_bundle_as_file(&dir);
let check_for_cancellation = self.wait_for_cancellation();

// ... many comments ...

tokio::select! { .. }

Collaborator Author:

Done in 4281d27
