Conversation

@iximeow (Member) commented Oct 28, 2025

This tries running a bunch of random operations against an NVMe device and checks their results against a limited model of what those results should be.
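
As a rough sketch of the shape of the test (`Action`, `ModelState`, and `device` here are hypothetical stand-ins, not the actual types in this change; the `Pcg64` seeding matches the test below):

```rust
// illustrative sketch only: pick random actions, predict their outcomes
// with the model, then check the real emulated device against it
let mut rng = Pcg64::seed_from_u64(seed);
let mut model = ModelState::default();

for _ in 0..1_000 {
    // pick an operation to attempt given the current modeled state
    let action = Action::pick(&mut rng, &model);
    // what does (our limited model of) the spec say should happen?
    let expected = model.apply(&action);
    // do it for real against the emulated device
    let actual = device.run(&action);
    assert_eq!(actual, expected, "device diverged from model (seed={seed})");
}
```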

The initial stab at this is what caught #965, and it caught a bug in an intermediate state of #953 (which other phd tests did notice anyway). This fuzzing would probably be best with actual I/O operations mixed in, and I think that *should* be relatively straightforward to add from here, but as-is it's useful!

This would probably be best phrased as a `cargo-fuzz` test to at least get coverage-guided fuzzing. Because of the statefulness of NVMe, I think either way we'd want the model of expected device state and a pick-actions-then-run execution to further guide `cargo-fuzz` into useful parts of the device state.
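
A `cargo-fuzz` version might look something like this, decoding the action sequence from the fuzzer's input bytes so coverage feedback steers which actions get tried (a hypothetical sketch; the `Action` variants and `run_actions_against_model` helper are stand-ins, not anything in this change):

```rust
// fuzz/fuzz_targets/nvme.rs -- hypothetical sketch, not part of this change
#![no_main]
use arbitrary::Arbitrary;
use libfuzzer_sys::fuzz_target;

// deriving `Arbitrary` lets libFuzzer decode a sequence of typed actions
// out of its coverage-guided input bytes
#[derive(Debug, Arbitrary)]
enum Action {
    CreateIoSubmissionQueue { qid: u16, size: u16 },
    DeleteIoSubmissionQueue { qid: u16 },
    ControllerReset,
    // ...
}

fuzz_target!(|actions: Vec<Action>| {
    // replay the decoded actions against the device and the model,
    // checking each outcome as in the test above (hypothetical helper)
    run_actions_against_model(&actions);
});
```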

The initial approach allowed for device reset and migration at arbitrary times via a separate thread. Once that required synchronizing the model of device state, though, it was effectively interleaved with "guest" operations on the device, and in practice admin commands are serialized by the `NvmeCtrl` state lock anyway. It may be more interesting to revisit this with concurrent I/O operations on submission/completion queues.

@iximeow iximeow requested review from hawkw and pfmooney October 28, 2025 18:20

let mut rng = Pcg64::seed_from_u64(seed);

for _ in 0..1_000 {
@iximeow (Member Author) commented:

`cargo test` gets through this in about 4 seconds, but `cargo test --release` is almost immediate. I'd been running this at 100k iterations there instead. Even a thousand seems like an OK place to be for CI?
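
One way to square a cheap CI run with deeper local runs would be an environment-variable knob (hypothetical, not in this change; `NVME_FUZZ_ITERS` is an invented name):

```rust
// hypothetical knob: default to 1_000 iterations for CI, let a local run
// crank it up with e.g. `NVME_FUZZ_ITERS=100000 cargo test --release`
let iters: u64 = std::env::var("NVME_FUZZ_ITERS")
    .ok()
    .and_then(|v| v.parse().ok())
    .unwrap_or(1_000);

for _ in 0..iters {
    // ... same loop body as above ...
}
```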

@iximeow iximeow requested review from jordanhendricks and removed request for pfmooney November 13, 2025 21:38
/// any point as `PciNvme` technically does. (In practice, reset immediately
/// locks the inner `NvmeCtrl` to do the reset, so for administrative
/// operations it is effectively serialized anyway.)
struct FuzzCtx {
@iximeow (Member Author) commented:

I'd talked with Patrick just a bit about the idea of having a more composable fuzzing framework as part of Propolis (the library). the implementation in this file is simultaneously:

- an NVMe driver, which is driven ~randomly to exercise `propolis/src/hw/nvme`
- a model of the NVMe controller state, based on the driver's operations (what queues are created, is the device initialized, etc)
- a model of the synthesis of the NVMe spec and the controller model: given the current state, which NVMe operations should have which outcomes, and which state transitions are permissible?

you could swap out the word "NVMe" above for any other device (the chipset would be interesting!). a reasonable thing to want would be fuzzing a pair of NVMe devices concurrently on the same PCI bridge. or poking a disk and NIC concurrently, or configuring a pair of NVMe devices to write on each other's queues, or ...

in the limit this seems to me like assembling an increasingly chaotic VM and configuring how chaotic it should be. I don't plan on adjusting this in that direction right now, but it seems like an interesting future direction this could take.
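
entirely hypothetically, the three roles above could be split behind a device-agnostic trait, something like (nothing in this change defines this; `FuzzTarget` and its methods are invented for illustration):

```rust
// hypothetical: a device-agnostic split of the three roles above
trait FuzzTarget {
    /// operations the fuzzer can attempt (the driver's vocabulary)
    type Action;
    /// the observable result of an operation (completion status,
    /// register contents, ...)
    type Outcome: PartialEq + std::fmt::Debug;

    /// driver role: perform the action against the real emulated device
    fn run(&mut self, action: &Self::Action) -> Self::Outcome;

    /// model roles: advance the modeled device state and return the
    /// outcome the spec says this action should have from that state
    fn expect(&mut self, action: &Self::Action) -> Self::Outcome;
}
```

fuzzing a pair of devices on one bridge would then be another `FuzzTarget` whose `Action` picks which device to poke.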
