@hawkw hawkw commented Jul 31, 2025

PR #8269 added CRDB tables for storing ereports received from both
service processors and the sled host OS. These ereports are generated to
indicate a fault or other important event, so they contain information
that's probably worth including in service bundles. So we should do
that.

This branch adds code to the `SupportBundleCollector` background task
for querying the database for ereports and putting them in the bundle.
This, in turn, required adding code for querying ereports over a
specified time range. The `BundleRequest` can be constructed with a set
of filters for ereports, including the time window and a list of serial
numbers to collect ereports from. Presently, we always just use the
default: we collect ereports from all serial numbers over the 7 days
prior to bundle collection. But I anticipate that these filters will see
more use once we add a notion of targeted support bundles: for instance,
if we generate a support bundle for a particular sled, we would probably
only grab ereports from that sled.
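To illustrate, here's a minimal sketch of what such a filter could look
like; the type and field names are hypothetical, not the actual
`BundleRequest` API:

```
use chrono::{DateTime, Duration, Utc};

/// Hypothetical filter type, for illustration only.
struct EreportFilters {
    /// Only collect ereports received within this time window.
    start: DateTime<Utc>,
    end: DateTime<Utc>,
    /// Serial numbers to collect from; `None` means all of them.
    serials: Option<Vec<String>>,
}

impl EreportFilters {
    /// The default described above: all serials, the 7 days prior to
    /// bundle collection.
    fn default_for_bundle(now: DateTime<Utc>) -> Self {
        Self { start: now - Duration::days(7), end: now, serials: None }
    }
}
```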

Ereports are stored in an `ereports` directory in the bundle, with
subdirectories for each serial number that emitted an ereport. Each
serial number directory has a subdirectory for each ereport restart ID
of that serial, and the individual ereports are stored within the
restart ID directory as JSON files. The path to an individual ereport
will be `ereports/${SERIAL_NUMBER}/${RESTART_ID}/${ENA}.json`. I'm open
to changing this organization scheme if others think there's a better
approach --- for example, we could place the restart ID in the filename
rather than in a subdirectory if that would be more useful.
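For concreteness, computing that path is just a few `PathBuf` joins (a
sketch, not the actual collector code):

```
use std::path::PathBuf;

/// Sketch only: the in-bundle path for a single ereport.
fn ereport_path(serial: &str, restart_id: &str, ena: &str) -> PathBuf {
    PathBuf::from("ereports")
        .join(serial)
        .join(restart_id)
        .join(format!("{ena}.json"))
}
```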

Ereport collection is done in parallel with the rest of the support
bundle collection by spawning Tokio tasks to collect host OS and service
processor ereports. `tokio_util::task::AbortOnDropHandle` is used to
wrap the `JoinHandle`s for these tasks to ensure they're aborted if the
ereport collection future is dropped, so that we stop collecting
ereports if the support bundle is cancelled.
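Roughly, the pattern is the following (a sketch assuming `tokio` and
`tokio-util` as dependencies, not the actual collector code):

```
use tokio_util::task::AbortOnDropHandle;

async fn collect_all_ereports() {
    // Wrapping each JoinHandle in AbortOnDropHandle means that if this
    // future is dropped (e.g. the bundle is cancelled), the spawned
    // tasks are aborted rather than left running.
    let sp = AbortOnDropHandle::new(tokio::task::spawn(async {
        // ... collect service processor ereports ...
    }));
    let host = AbortOnDropHandle::new(tokio::task::spawn(async {
        // ... collect host OS ereports ...
    }));
    // AbortOnDropHandle implements Future just like JoinHandle does.
    let (_sp_result, _host_result) = tokio::join!(sp, host);
}
```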

Fixes #8649

@hawkw hawkw requested a review from smklein July 31, 2025 18:25

hawkw commented Jul 31, 2025

Huh, the CI failure is due to a checksum mismatch downloading the console --- that seems almost certainly not my doing. I'm hoping merging main makes that go away...

david-crespo commented

Yep, that was fixed by #8669

@hawkw hawkw requested a review from smklein August 1, 2025 20:43
@hawkw hawkw enabled auto-merge (squash) August 1, 2025 21:39
@hawkw hawkw merged commit d581075 into main Aug 1, 2025
17 checks passed
@hawkw hawkw deleted the eliza/bundle-snarf-ereports branch August 1, 2025 22:16
hawkw added a commit that referenced this pull request Aug 5, 2025
In #8739, I added code for collecting ereports into support bundles
which stores the ereport JSON in directories for each sled/switch/PSC
serial number from which an ereport was received. Unfortunately, I
failed to consider that the version 1 Oxide serial numbers are
only unique within the namespace of a particular part, and not globally
--- so (for example) a switch and a compute sled may have colliding
serials. This means that the current code could incorrectly group
ereports reported by two totally different devices. While the ereport
JSON files _do_ contain additional information that disambiguates this
(they include the part number, as well as MGS metadata with the
SP type for SP ereports), and restart IDs are additionally capable of
distinguishing between reporters, putting ereports from two different
systems within the same directory still has the potential to be quite
misleading.

Thus, this branch changes the paths for ereports to include the part
number as well as the serial number, in the format:

```
{part_number}-{serial_number}/{restart_id}/{ENA}.json
```

In order to include part numbers for host OS ereports, I decided to add
a part number column to the `host_ereport` table as well. Initially, I
had opted not to do this, as I was thinking that, since `host_ereport`
includes a sled UUID, we could just join with the `sled` table to get
the part number. However, it occurred to me that ereports may be
received from a sled that's later expunged from the rack: we might
retain such ereports past the lifetime of the sled in the rack, while
the `sled` record for that sled may eventually be deleted, so such a
join would fail. So, I thought it was better to always include the part
number in
the ereport record.

I've added a migration that attempts to backfill the
`host_ereport.part_number` column from the `sled` table for existing
host OS ereport records. In practice, this won't do anything, since
we're not collecting them yet, but it seemed nice to have. Sadly, the
column had to be left nullable, since we may theoretically encounter an
ereport with a sled UUID that points to an already-deleted sled record,
but...whatever. Since there aren't currently any host OS ereport records
anyway, this shouldn't happen, and we'll just handle the nullability;
this isn't terrible as we must already do so for SP ereport records.
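As a sketch of how a collector might name the per-reporter directory
while tolerating that nullability (the helper and the `unknown` fallback
are hypothetical, not the actual code):

```
/// Hypothetical helper: build the `{part_number}-{serial_number}`
/// directory name, falling back when the part number is NULL.
fn reporter_dir(part_number: Option<&str>, serial_number: &str) -> String {
    match part_number {
        Some(part) => format!("{part}-{serial_number}"),
        // Hypothetical fallback for records whose `sled` row was
        // deleted before the part number could be backfilled.
        None => format!("unknown-{serial_number}"),
    }
}
```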

Fixes #8765
hawkw added a commit that referenced this pull request Aug 10, 2025
@hawkw hawkw added the fault-management Everything related to the fault-management initiative (RFD480 and others) label Nov 11, 2025