Overhaul block attachment and request dispatch #953
Conversation
This should not be merged before the R17 release branch is cut; we'll want soak time, considering the scope of work. I wanted to get it up and available for folks to review.
iximeow left a comment:
i've mostly not looked at propolis/src/block yet, but figured this first round of thoughts would be useful. on the whole this seems pretty reasonable! but I see that transmute in attachment.rs, which I look forward to in... round 2
iximeow left a comment:
somewhat fewer comments on the back half (plus what we've been talking about re. NoneProcessing)
pfmooney left a comment:
Thanks for looking through so far. I think I've at least touched on all the points you've raised.
```rust
    size: u32,
    nvme: &PciNvme,
    mem: &MemCtx,
) -> Result<Arc<SubQueue>, NvmeError> {
```
I'm not sold on returning the QueueId at the moment, but we definitely need some stronger checks here. I've included some additional logic around the admin queues, and fixed the part where we were not doing proper association during state import.
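For context, the kind of stronger checks under discussion might look roughly like the following. This is a minimal sketch with hypothetical constants, error variants, and function name, not the actual propolis API:

```rust
// A hypothetical sketch of queue-creation validation; the constants,
// error variants, and function are illustrative, not the propolis API.
const MAX_NUM_QUEUES: u16 = 16;
const MAX_QUEUE_SIZE: u32 = 1024;

#[derive(Debug)]
enum NvmeError {
    InvalidSubQueue(u16),
    InvalidSubQueueSize(u32),
}

fn validate_sq_params(qid: u16, size: u32) -> Result<(), NvmeError> {
    // Queue ID 0 is reserved for the admin submission queue and may not
    // be created (or deleted) by the I/O queue management commands.
    if qid == 0 || qid >= MAX_NUM_QUEUES {
        return Err(NvmeError::InvalidSubQueue(qid));
    }
    // A queue must hold at least two entries and may not exceed the
    // maximum the controller advertises (MQES).
    if size < 2 || size > MAX_QUEUE_SIZE {
        return Err(NvmeError::InvalidSubQueueSize(size));
    }
    Ok(())
}
```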
leftwo left a comment:
I have no further comments, thanks.
iximeow left a comment:
just getting back to earlier conversations and resolving where appropriate. if only there was some kind of collaborative forge where the review interface was amenable to conversations...
iximeow left a comment:
I was wondering about the migration::from_base tests failing: that's because with this as the destination, we don't have #958. So we try to initiate migration from a Propolis that wants a version header, but we don't send one, because #953's base predates that change.
on a lark I merged master into this locally and ran the tests here: no surprise migration issues, and those pass too.
Yeah, exactly. I wasn't going to worry about it until I squashed/rebased.
Force-pushed from af91230 to 943e808.
Although I need to redo testing after the squash, review feedback, and import/export changes to the doorbell elision stuff, I figured I'd get the other two commits posted for review.
Force-pushed from 943e808 to 5b07180.
This tries doing a bunch of random operations against an NVMe device and checks the operations against a limited model of what the results of those operations should be. The initial stab at this is what caught #965, and it caught a bug in an intermediate state of #953 (which other phd tests did notice anyway).

This fuzzing would probably be best with actual I/O operations mixed in, and I think that *should* be relatively straightforward to add from here, but as-is it's useful! This would probably be best phrased as a `cargo-fuzz` test to at least get coverage-guided fuzzing. Because of the statefulness of NVMe, I think either way we'd want the model of expected device state and a pick-actions-then-run execution to further guide `cargo-fuzz` into useful parts of the device state.

The initial approach at this allowed for device reset and migration at arbitrary times via a separate thread. When that required synchronizing the model of device state, it was effectively interleaved with "guest" operations on the device, and in practice admin commands are serialized by the `NvmeCtrl` state lock anyway. It may be more interesting to revisit with concurrent I/O operations on submission/completion queues.
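To make the pick-actions-then-run shape concrete, here is a minimal sketch of that loop. `Action`, `DeviceModel`, and the commented-out `run_action` call are illustrative stand-ins, not the actual phd test types:

```rust
// Model-based random-action testing: choose an action, predict its
// outcome against a simple model, then (hypothetically) run it against
// the emulated device and compare. Types here are illustrative.
use rand::Rng;
use std::collections::HashSet;

#[derive(Debug, Clone, Copy)]
enum Action {
    CreateQueue { qid: u16 },
    DeleteQueue { qid: u16 },
    Reset,
}

/// Tracks what the test believes the device state should be.
#[derive(Default)]
struct DeviceModel {
    queues: HashSet<u16>,
}

impl DeviceModel {
    /// Apply an action to the model, returning the expected outcome.
    fn expect(&mut self, action: Action) -> bool {
        match action {
            Action::CreateQueue { qid } => self.queues.insert(qid),
            Action::DeleteQueue { qid } => self.queues.remove(&qid),
            Action::Reset => {
                self.queues.clear();
                true
            }
        }
    }
}

fn fuzz_round(rng: &mut impl Rng, model: &mut DeviceModel) {
    // Pick the action first, then run it against both model and device.
    let action = match rng.gen_range(0..3) {
        0 => Action::CreateQueue { qid: rng.gen_range(1..16) },
        1 => Action::DeleteQueue { qid: rng.gen_range(1..16) },
        _ => Action::Reset,
    };
    let expected = model.expect(action);
    // Hypothetical: issue the corresponding admin command against the
    // emulated device, then compare its result with the model's.
    // let actual = run_action(&device, action);
    // assert_eq!(actual, expected, "model diverged on {action:?}");
    let _ = expected;
}
```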
iximeow left a comment:
re-upping that the additional patches here seem good, though i have a few comments on the accessor changes upon second look.
```rust
// Swap out the existing root resource
let old = std::mem::replace(&mut self.res_root, new_root);
```
after this point the root of the tree has the new value in res_root, but a Node may still have a res_leaf: Some(old_root_resource), and an Accessor for that node can .access() concurrently regardless of the tree itself being locked. so we've kind of torn the update of whatever resource is being swapped.
it's wildly gross, but should we lock all the nodes first, update res_root, do all the res_leaf.take(), then drop the node locks?
either way we can still make the statement that when set_root_resource() returns there are no references through this tree to the old root resource, so i think this is really just an awkward moment if you were debugging and printing out parts of the tree. if you want to just note that res_leaf may differ from res_root given this specific race, i think that would be OK too.
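A rough sketch of the "lock all the nodes first" variant suggested above, with illustrative types standing in for the actual accessor tree:

```rust
// Hypothetical shape of an untorn root swap: take every node's
// res_leaf lock before replacing res_root, so no Accessor can observe
// a res_leaf that disagrees with the new root mid-update. The types
// mirror the names in the discussion but are illustrative only.
use std::sync::{Arc, Mutex, MutexGuard};

struct Node<T> {
    res_leaf: Mutex<Option<Arc<T>>>,
}

struct Tree<T> {
    res_root: Arc<T>,
    nodes: Vec<Arc<Node<T>>>,
}

impl<T> Tree<T> {
    fn set_root_resource(&mut self, new_root: Arc<T>) {
        // Lock every node first; access() through any node now blocks.
        let mut guards: Vec<MutexGuard<'_, Option<Arc<T>>>> = self
            .nodes
            .iter()
            .map(|n| n.res_leaf.lock().unwrap())
            .collect();
        // Swap the root while all leaves are held.
        let _old = std::mem::replace(&mut self.res_root, new_root);
        // Clear every cached leaf so the next access() repopulates it
        // from the new root.
        for guard in guards.iter_mut() {
            guard.take();
        }
        // Node locks drop here, unblocking concurrent accessors.
    }
}
```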
For now, I think the tearing is acceptable. A structure like that would be necessary to handle cases like a PIO access through PCI cfg space causing the PIO mappings themselves to be updated (and invalidated through an Accessor-held tree) due to a modified BAR mapping.
Force-pushed from 989d1e6 to 31f6045.
This addresses considerable lock contention and thundering herd issues present in the block IO abstractions. Devices with multiple queues (read: NVMe), serviced by multiple workers, should be in much better shape when it comes to efficiently farming out work.
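The commit message doesn't spell out the mechanism, but as a general illustration of the thundering-herd concern: with a shared work queue, waking exactly one worker per submitted request avoids stampeding every worker on each notification. A minimal sketch with illustrative types, not the actual propolis block abstractions:

```rust
// Illustrative shared queue: push() wakes a single idle worker rather
// than all of them, so only one thread races for each request.
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

struct Queue<R> {
    inner: Mutex<VecDeque<R>>,
    cv: Condvar,
}

impl<R> Queue<R> {
    fn push(&self, req: R) {
        self.inner.lock().unwrap().push_back(req);
        // notify_all() here would wake every worker to contend for a
        // single request; notify_one() wakes exactly one.
        self.cv.notify_one();
    }

    fn pop(&self) -> R {
        let mut guard = self.inner.lock().unwrap();
        loop {
            if let Some(req) = guard.pop_front() {
                return req;
            }
            // Sleep until push() signals new work.
            guard = self.cv.wait(guard).unwrap();
        }
    }
}
```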
The Doorbell Buffer feature permits consumer drivers to detect conditions when it is appropriate to elide doorbells for submission and completion queues. This is valuable primarily for virtualized NVMe devices, where such doorbells incur real overhead costs from the associated VM exit(s).
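For reference, doorbell elision with a shadow-doorbell/event-index pair boils down to a single modular comparison; this sketch (with a hypothetical function name) mirrors the check described for the NVMe Doorbell Buffer Config feature:

```rust
// A doorbell write is only needed when the new ring index has passed
// the device-published event index since the last written value.
// All arithmetic is modular in u16, matching the ring indices.
fn need_doorbell(event_idx: u16, new: u16, old: u16) -> bool {
    new.wrapping_sub(event_idx).wrapping_sub(1)
        < new.wrapping_sub(old).wrapping_sub(1)
}
```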
Force-pushed from 31f6045 to c34539f.
time to pull these in and bump Omicron!
Could you give me a sec and I'll bump crucible, then take both all the way to Omicron?