Conversation

@pfmooney (Contributor) commented Oct 3, 2025

This addresses considerable lock contention and thundering herd issues present in the block IO abstractions. Devices with multiple queues (read: NVMe), serviced by multiple workers, should be in much better shape when it comes to efficiently farming out work.

@pfmooney (Contributor, Author) commented Oct 3, 2025

This should not be merged before the R17 release branch is cut; we'll want soak time, considering the scope of the work. I wanted to get it up and available for folks to review.

@pfmooney pfmooney requested a review from iximeow October 3, 2025 02:48
@iximeow (Member) left a comment

I've mostly not looked at propolis/src/block yet, but figured this first round of thoughts would be useful. On the whole this seems pretty reasonable! But I see that transmute in attachment.rs, which I look forward to in... round 2.

@iximeow (Member) left a comment

somewhat fewer comments on the back half (plus what we've been talking about re. NoneProcessing)

@pfmooney (Contributor, Author) left a comment

Thanks for looking through so far. I think I've at least touched on all the points you've raised.

size: u32,
nvme: &PciNvme,
mem: &MemCtx,
) -> Result<Arc<SubQueue>, NvmeError> {
@pfmooney (Contributor, Author):

I'm not sold on returning the QueueId at the moment, but we definitely need some stronger checks here. I've included some additional logic around the admin queues, and fixed the part where we were not doing proper association during state import.

@leftwo (Contributor) left a comment

I have no further comments, thanks.

@iximeow (Member) left a comment

just getting back to earlier conversations and resolving where appropriate. if only there was some kind of collaborative forge where the review interface was amenable to conversations...

@iximeow (Member) left a comment

I was wondering about the migration::from_base tests failing: that's because, with this as the destination, we don't have #958. So we try to initiate migration from a Propolis that wants a version header, but we don't send one, because #953 has a base from before that change.

on a lark I merged master into this locally and ran them here: no surprise migration issue, those pass too.

@pfmooney (Contributor, Author):

> on a lark I merged master into this locally and ran them here: no surprise migration issue, those pass too.

Yeah, exactly. I wasn't going to worry about it until I squashed/rebased.

@pfmooney (Contributor, Author):

Although I need to redo testing after the squash, review feedback, and import/export changes to the doorbell elision stuff, I figured I'd get the other two commits posted for review.

iximeow added a commit that referenced this pull request Oct 28, 2025
This tries doing a bunch of random operations against an NVMe device and
checks the operations against a limited model of what the results of
those operations should be.

As-is this doesn't catch anything new, but it is sufficient to tickle a
bug in an intermediate state of #953 (which other phd tests did notice
anyway). This fuzzing would probably be best with actual I/O operations
mixed in, and I think that *should* be relatively straightforward to add
from here...

This would probably be best phrased as a `cargo-fuzz` test to at least
get coverage-guided fuzzing. Because of the statefulness of NVMe I
think either way we'd want the model of expected device state and a
pick-actions-then-run execution to further guide `cargo-fuzz` into
useful parts of the device state.

The initial approach at this allowed for device reset and migration at
arbitrary times via a separate thread. When that required synchronizing
the model of device state it was effectively interleaved with "guest"
operations on the device, and in practice admin commands are serialized
by the `NvmeCtrl` state lock anyway. It may be more interesting to
revisit with concurrent I/O operations on submission/completion queues.
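The pick-actions-then-run shape the commit message describes can be sketched as a toy model check. Everything here is hypothetical for illustration (this Action set and Model are stand-ins, not the actual phd test): actions are chosen up front, then replayed against a model of expected device state, and in the real test the same results would be compared against the device.

```rust
use std::collections::HashSet;

// Hypothetical guest-visible operations; a real harness would cover the
// admin command set (queue create/delete, reset, migration, ...).
#[derive(Clone, Copy, Debug)]
enum Action {
    CreateQueue(u16),
    DeleteQueue(u16),
    Reset,
}

// A minimal model of expected device state: just which queue IDs exist.
#[derive(Default)]
struct Model {
    queues: HashSet<u16>,
}

impl Model {
    // Returns whether the action is expected to succeed.
    fn apply(&mut self, a: Action) -> bool {
        match a {
            Action::CreateQueue(id) => self.queues.insert(id),
            Action::DeleteQueue(id) => self.queues.remove(&id),
            Action::Reset => {
                self.queues.clear();
                true
            }
        }
    }
}

fn main() {
    // Pick the whole action sequence first (fixed here; a fuzzer would
    // derive it from its input), then run it against the model.  The
    // real test would issue each action to the device too and assert
    // the device's result matches the model's expectation.
    let actions = [
        Action::CreateQueue(1),
        Action::CreateQueue(1), // duplicate ID: expected to fail
        Action::Reset,
        Action::DeleteQueue(1), // queue gone after reset: expected to fail
    ];
    let mut model = Model::default();
    let results: Vec<bool> = actions.iter().map(|&a| model.apply(a)).collect();
    assert_eq!(results, vec![true, false, true, false]);
}
```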
iximeow added a commit that referenced this pull request Oct 28, 2025
This tries doing a bunch of random operations against an NVMe device and
checks the operations against a limited model of what the results of
those operations should be.

The initial stab at this is what caught
#965, and it caught a bug
in an intermediate state of #953 (which other phd tests did notice
anyway). This fuzzing would probably be best with actual I/O operations
mixed in, and I think that *should* be relatively straightforward to add
from here, but as-is it's useful!

This would probably be best phrased as a `cargo-fuzz` test to at least
get coverage-guided fuzzing. Because of the statefulness of NVMe I
think either way we'd want the model of expected device state and a
pick-actions-then-run execution to further guide `cargo-fuzz` into
useful parts of the device state.

The initial approach at this allowed for device reset and migration at
arbitrary times via a separate thread. When that required synchronizing
the model of device state it was effectively interleaved with "guest"
operations on the device, and in practice admin commands are serialized
by the `NvmeCtrl` state lock anyway. It may be more interesting to
revisit with concurrent I/O operations on submission/completion queues.
@iximeow (Member) left a comment

Re-upping that the additional patches here seem good, though I have a few comments on the accessor changes upon second look.

Comment on lines +293 to +294
// Swap out the existing root resource
let old = std::mem::replace(&mut self.res_root, new_root);
@iximeow (Member):

After this point the root of the tree has the new value in res_root, while a Node may still have res_leaf: Some(old_root_resource), and an Accessor for that node can .access() concurrently regardless of the tree itself being locked. So we've effectively torn the update of that resource.

It's wildly gross, but should we lock all the nodes first, update res_root, do all the res_leaf.take() calls, then drop the node locks?

Either way we can still make the statement that when set_root_resource() returns there are no references through this tree to the old root resource, so I think this is really just an awkward moment if you were debugging and printing out parts of the tree. If you want to just note that res_leaf may differ from res_root given this specific race, I think that would be OK too.
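For readers following along, the window being described can be sketched roughly like this. The Tree and Node shapes and the u32 resource are stand-ins (the real propolis types differ); only the swap-then-clear ordering matters:

```rust
use std::sync::{Arc, Mutex};

// Hypothetical shape of the accessor tree under discussion.  Each Node
// caches an Arc clone of the root resource in `res_leaf` so Accessors
// can read it without taking a tree-wide lock.
struct Node {
    res_leaf: Mutex<Option<Arc<u32>>>,
}

struct Tree {
    res_root: Arc<u32>,
    nodes: Vec<Arc<Node>>,
}

impl Tree {
    // After the mem::replace below, res_root already holds the new
    // resource while each res_leaf may still hold the old one until its
    // take() runs -- the "torn" window.  The function still guarantees
    // that, once it returns, no reference to the old root remains
    // anywhere in the tree.
    fn set_root_resource(&mut self, new_root: Arc<u32>) -> Arc<u32> {
        // Swap out the existing root resource
        let old = std::mem::replace(&mut self.res_root, new_root);
        for node in &self.nodes {
            // Drop the stale cached copy.  A concurrent .access()
            // between the replace above and this take() can still
            // observe the old resource through res_leaf.
            node.res_leaf.lock().unwrap().take();
        }
        old
    }
}

fn main() {
    let old = Arc::new(1u32);
    let node = Arc::new(Node { res_leaf: Mutex::new(Some(Arc::clone(&old))) });
    let mut tree = Tree { res_root: Arc::clone(&old), nodes: vec![Arc::clone(&node)] };

    let returned = tree.set_root_resource(Arc::new(2));
    assert_eq!(*returned, 1);
    assert_eq!(*tree.res_root, 2);
    // On return, no node still references the old root resource.
    assert!(node.res_leaf.lock().unwrap().is_none());
}
```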

@pfmooney (Contributor, Author):

For now, I think the tearing is acceptable. A structure like that would be necessary to handle cases like a PIO access through PCI cfg space causing the PIO mappings themselves to be updated (and invalidated through an Accessor-held tree) due to a modified BAR mapping.

The Doorbell Buffer feature permits consumer drivers to detect
conditions when it is appropriate to elide doorbells for submission and
completion queues.  This is valuable primarily for virtualized NVMe
devices, where such doorbells incur real overhead costs from the
associated VM exit(s).
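For context, the elision this feature enables usually comes down to the EventIdx comparison from the NVMe Doorbell Buffer Config feature: after updating the shadow doorbell, the driver only rings the real MMIO doorbell (and takes the VM exit) if the update passed the device-published EventIdx. A rough sketch of that well-known check, not propolis's actual implementation:

```rust
// Classic EventIdx check for NVMe Doorbell Buffer Config (this mirrors
// the widely used form of the check, e.g. in the Linux nvme driver; it
// is a sketch, not propolis's code).  `old` and `new` are the queue
// index before and after the shadow-doorbell update; all arithmetic
// wraps, matching the 16-bit queue index wrap-around.
fn need_event(event_idx: u16, new: u16, old: u16) -> bool {
    new.wrapping_sub(event_idx).wrapping_sub(1) < new.wrapping_sub(old)
}

fn main() {
    // EventIdx at 5: moving the index from 3 to 4 does not pass it, so
    // the MMIO doorbell (and its VM exit) can be elided...
    assert!(!need_event(5, 4, 3));
    // ...but moving from 4 to 6 passes it, so the doorbell must be rung.
    assert!(need_event(5, 6, 4));
}
```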
@iximeow (Member) commented Nov 4, 2025

time to pull these in and bump Omicron!

@iximeow iximeow merged commit 5c39de3 into oxidecomputer:master Nov 4, 2025
10 of 11 checks passed
@leftwo (Contributor) commented Nov 4, 2025

> time to pull these in and bump Omicron!

Could you give me a sec and I'll bump crucible, then take both all the way to Omicron?

@pfmooney pfmooney deleted the block-dispatch branch November 5, 2025 13:26