Conversation

@pfmooney (Contributor) commented Oct 3, 2025

This addresses considerable lock contention and thundering herd issues present in the block IO abstractions. Devices with multiple queues (read: NVMe), serviced by multiple workers, should be in much better shape when it comes to efficiently farming out work.

@pfmooney (Contributor, Author) commented Oct 3, 2025

This should not be merged before the R17 release branch is cut; we'll want soak time, considering the scope of the work. I wanted to get it up and available for folks to review.

@pfmooney pfmooney requested a review from iximeow October 3, 2025 02:48
@iximeow (Member) left a comment

I've mostly not looked at propolis/src/block yet, but figured this first round of thoughts would be useful. On the whole this seems pretty reasonable! But I see that transmute in attachment.rs, which I look forward to in... round 2.

@iximeow (Member) left a comment

somewhat fewer comments on the back half (plus what we've been talking about re. NoneProcessing)

@pfmooney (Contributor, Author) left a comment

Thanks for looking through so far. I think I've at least touched on all the points you've raised.

size: u32,
nvme: &PciNvme,
mem: &MemCtx,
) -> Result<Arc<SubQueue>, NvmeError> {
@pfmooney (Contributor, Author):

I'm not sold on returning the QueueId at the moment, but we definitely need some stronger checks here. I've included some additional logic around the admin queues, and fixed the part where we were not doing proper association during state import.

@leftwo (Contributor) left a comment

I have no further comments, thanks.

@iximeow (Member) left a comment

just getting back to earlier conversations and resolving where appropriate. if only there was some kind of collaborative forge where the review interface was amenable to conversations...

@iximeow (Member) left a comment

I was wondering about the migration::from_base tests failing: that's because, with this as the destination, we don't have #958. So we try to initiate migration from a Propolis that wants a version header, but we don't send one, because #953 has a base from before that change.

on a lark I merged master into this locally and ran them here: no surprise migration issue, those pass too.

@pfmooney (Contributor, Author):

> on a lark I merged master into this locally and ran them here: no surprise migration issue, those pass too.

Yeah, exactly. I wasn't going to worry about it until I squashed/rebased.

@pfmooney (Contributor, Author):

Although I need to redo testing after the squash, review feedback, and import/export changes to the doorbell elision stuff, I figured I'd get the other two commits posted for review.

iximeow added a commit that referenced this pull request Oct 28, 2025
This tries doing a bunch of random operations against an NVMe device and
checks the operations against a limited model of what the results of
those operations should be.

As-is this doesn't catch anything new, but it is sufficient to tickle a
bug in an intermediate state of #953 (which other phd tests did notice
anyway). This fuzzing would probably be best with actual I/O operations
mixed in, and I think that *should* be relatively straightforward to add
from here...

This would probably be best phrased as a `cargo-fuzz` test to at least
get coverage-guided fuzzing. Because of the statefulness of NVMe I
think either way we'd want the model of expected device state and a
pick-actions-then-run execution to further guide `cargo-fuzz` into
useful parts of the device state.

The initial approach at this allowed for device reset and migration at
arbitrary times via a separate thread. When that required synchronizing
the model of device state it was effectively interleaved with "guest"
operations on the device, and in practice admin commands are serialized
by the `NvmeCtrl` state lock anyway. It may be more interesting to
revisit with concurrent I/O operations on submission/completion queues.
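The pick-actions-then-run shape the commit message describes can be sketched as a toy model check. Everything here is hypothetical for illustration (this Action set and Model are stand-ins, not the actual phd test): actions are chosen up front, then replayed against a model of expected device state, and in the real test the same results would be compared against the device.

```rust
use std::collections::HashSet;

// Hypothetical guest-visible operations; a real harness would cover the
// admin command set (queue create/delete, reset, migration, ...).
#[derive(Clone, Copy, Debug)]
enum Action {
    CreateQueue(u16),
    DeleteQueue(u16),
    Reset,
}

// A minimal model of expected device state: just which queue IDs exist.
#[derive(Default)]
struct Model {
    queues: HashSet<u16>,
}

impl Model {
    // Returns whether the action is expected to succeed.
    fn apply(&mut self, a: Action) -> bool {
        match a {
            Action::CreateQueue(id) => self.queues.insert(id),
            Action::DeleteQueue(id) => self.queues.remove(&id),
            Action::Reset => {
                self.queues.clear();
                true
            }
        }
    }
}

fn main() {
    // Pick the whole action sequence first (fixed here; a fuzzer would
    // derive it from its input), then run it against the model.  The
    // real test would issue each action to the device too and assert
    // the device's result matches the model's expectation.
    let actions = [
        Action::CreateQueue(1),
        Action::CreateQueue(1), // duplicate ID: expected to fail
        Action::Reset,
        Action::DeleteQueue(1), // queue gone after reset: expected to fail
    ];
    let mut model = Model::default();
    let results: Vec<bool> = actions.iter().map(|&a| model.apply(a)).collect();
    assert_eq!(results, vec![true, false, true, false]);
}
```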
iximeow added a commit that referenced this pull request Oct 28, 2025
This tries doing a bunch of random operations against an NVMe device and
checks the operations against a limited model of what the results of
those operations should be.

The initial stab at this is what caught
#965, and it caught a bug
in an intermediate state of #953 (which other phd tests did notice
anyway). This fuzzing would probably be best with actual I/O operations
mixed in, and I think that *should* be relatively straightforward to add
from here, but as-is it's useful!

This would probably be best phrased as a `cargo-fuzz` test to at least
get coverage-guided fuzzing. Because of the statefulness of NVMe I
think either way we'd want the model of expected device state and a
pick-actions-then-run execution to further guide `cargo-fuzz` into
useful parts of the device state.

The initial approach at this allowed for device reset and migration at
arbitrary times via a separate thread. When that required synchronizing
the model of device state it was effectively interleaved with "guest"
operations on the device, and in practice admin commands are serialized
by the `NvmeCtrl` state lock anyway. It may be more interesting to
revisit with concurrent I/O operations on submission/completion queues.
@iximeow (Member) left a comment

Re-upping that the additional patches here seem good, though I have a few comments on the accessor changes upon second look.

Comment on lines +293 to +294
// Swap out the existing root resource
let old = std::mem::replace(&mut self.res_root, new_root);
@iximeow (Member):

After this point the root of the tree has the new value in res_root, while a Node may still have res_leaf: Some(old_root_resource), and an Accessor for that node can .access() concurrently regardless of the tree itself being locked. So we've effectively torn the update of that resource.

It's wildly gross, but should we lock all the nodes first, update res_root, do all the res_leaf.take() calls, then drop the node locks?

Either way we can still make the statement that when set_root_resource() returns there are no references through this tree to the old root resource, so I think this is really just an awkward moment if you were debugging and printing out parts of the tree. If you want to just note that res_leaf may differ from res_root given this specific race, I think that would be OK too.
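For readers following along, the window being described can be sketched roughly like this. The Tree and Node shapes and the u32 resource are stand-ins (the real propolis types differ); only the swap-then-clear ordering matters:

```rust
use std::sync::{Arc, Mutex};

// Hypothetical shape of the accessor tree under discussion.  Each Node
// caches an Arc clone of the root resource in `res_leaf` so Accessors
// can read it without taking a tree-wide lock.
struct Node {
    res_leaf: Mutex<Option<Arc<u32>>>,
}

struct Tree {
    res_root: Arc<u32>,
    nodes: Vec<Arc<Node>>,
}

impl Tree {
    // After the mem::replace below, res_root already holds the new
    // resource while each res_leaf may still hold the old one until its
    // take() runs -- the "torn" window.  The function still guarantees
    // that, once it returns, no reference to the old root remains
    // anywhere in the tree.
    fn set_root_resource(&mut self, new_root: Arc<u32>) -> Arc<u32> {
        // Swap out the existing root resource
        let old = std::mem::replace(&mut self.res_root, new_root);
        for node in &self.nodes {
            // Drop the stale cached copy.  A concurrent .access()
            // between the replace above and this take() can still
            // observe the old resource through res_leaf.
            node.res_leaf.lock().unwrap().take();
        }
        old
    }
}

fn main() {
    let old = Arc::new(1u32);
    let node = Arc::new(Node { res_leaf: Mutex::new(Some(Arc::clone(&old))) });
    let mut tree = Tree { res_root: Arc::clone(&old), nodes: vec![Arc::clone(&node)] };

    let returned = tree.set_root_resource(Arc::new(2));
    assert_eq!(*returned, 1);
    assert_eq!(*tree.res_root, 2);
    // On return, no node still references the old root resource.
    assert!(node.res_leaf.lock().unwrap().is_none());
}
```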

@pfmooney (Contributor, Author):

For now, I think the tearing is acceptable. A structure like that would be necessary to handle cases like a PIO access through PCI cfg space causing the PIO mappings themselves to be updated (and invalidated through an Accessor-held tree) due to a modified BAR mapping.

The Doorbell Buffer feature permits consumer drivers to detect
conditions when it is appropriate to elide doorbells for submission and
completion queues.  This is valuable primarily for virtualized NVMe
devices, where such doorbells incur real overhead costs from the
associated VM exit(s).
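For context, the elision this feature enables usually comes down to the EventIdx comparison from the NVMe Doorbell Buffer Config feature: after updating the shadow doorbell, the driver only rings the real MMIO doorbell (and takes the VM exit) if the update passed the device-published EventIdx. A rough sketch of that well-known check, not propolis's actual implementation:

```rust
// Classic EventIdx check for NVMe Doorbell Buffer Config (this mirrors
// the widely used form of the check, e.g. in the Linux nvme driver; it
// is a sketch, not propolis's code).  `old` and `new` are the queue
// index before and after the shadow-doorbell update; all arithmetic
// wraps, matching the 16-bit queue index wrap-around.
fn need_event(event_idx: u16, new: u16, old: u16) -> bool {
    new.wrapping_sub(event_idx).wrapping_sub(1) < new.wrapping_sub(old)
}

fn main() {
    // EventIdx at 5: moving the index from 3 to 4 does not pass it, so
    // the MMIO doorbell (and its VM exit) can be elided...
    assert!(!need_event(5, 4, 3));
    // ...but moving from 4 to 6 passes it, so the doorbell must be rung.
    assert!(need_event(5, 6, 4));
}
```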
@iximeow (Member) commented Nov 4, 2025

time to pull these in and bump Omicron!

@iximeow iximeow merged commit 5c39de3 into oxidecomputer:master Nov 4, 2025
10 of 11 checks passed
@leftwo (Contributor) commented Nov 4, 2025

> time to pull these in and bump Omicron!

Could you give me a sec and I'll bump crucible, then take both all the way to Omicron?

@pfmooney pfmooney deleted the block-dispatch branch November 5, 2025 13:26