@hawkw hawkw commented Jul 31, 2025

PR #8269 added CRDB tables for storing ereports received from both
service processors and the sled host OS. These ereports are generated to
indicate a fault or other important event, so they contain information
that's probably worth including in service bundles. So we should do
that.

This branch adds code to the `SupportBundleCollector` background task
for querying the database for ereports and putting them in the bundle.
This, in turn, required adding code for querying ereports over a
specified time range. The `BundleRequest` can be constructed with a set
of filters for ereports, including the time window and a list of serial
numbers to collect ereports from. Presently, we always just use the
default: we collect ereports from all serial numbers over the 7 days
prior to bundle collection. But I anticipate that these filters will see
more use once we add a notion of targeted support bundles: for instance,
if we generate a support bundle for a particular sled, we would probably
only grab ereports from that sled.
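To illustrate, here's a minimal sketch of what such a filter could look
like; the type and field names are hypothetical, not the actual
`BundleRequest` API:

```
use chrono::{DateTime, Duration, Utc};

/// Hypothetical filter type, for illustration only.
struct EreportFilters {
    /// Only collect ereports received within this time window.
    start: DateTime<Utc>,
    end: DateTime<Utc>,
    /// Serial numbers to collect from; `None` means all of them.
    serials: Option<Vec<String>>,
}

impl EreportFilters {
    /// The default described above: all serials, the 7 days prior to
    /// bundle collection.
    fn default_for_bundle(now: DateTime<Utc>) -> Self {
        Self { start: now - Duration::days(7), end: now, serials: None }
    }
}
```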

Ereports are stored in an `ereports` directory in the bundle, with
subdirectories for each serial number that emitted an ereport. Each
serial number directory has a subdirectory for each ereport restart ID
of that serial, and the individual ereports are stored within the
restart ID directory as JSON files. The path to an individual ereport
will be `ereports/${SERIAL_NUMBER}/${RESTART_ID}/${ENA}.json`. I'm open
to changing this organization scheme if others think there's a better
approach --- for example, we could place the restart ID in the filename
rather than in a subdirectory if that would be more useful.
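For concreteness, computing that path is just a few `PathBuf` joins (a
sketch, not the actual collector code):

```
use std::path::PathBuf;

/// Sketch only: the in-bundle path for a single ereport.
fn ereport_path(serial: &str, restart_id: &str, ena: &str) -> PathBuf {
    PathBuf::from("ereports")
        .join(serial)
        .join(restart_id)
        .join(format!("{ena}.json"))
}
```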

Ereport collection is done in parallel with the rest of the support
bundle collection by spawning Tokio tasks to collect host OS and service
processor ereports. `tokio_util::task::AbortOnDropHandle` is used to
wrap the `JoinHandle`s for these tasks to ensure they're aborted if the
ereport collection future is dropped, so that we stop collecting
ereports if the support bundle is cancelled.
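Roughly, the pattern is the following (a sketch assuming `tokio` and
`tokio-util` as dependencies, not the actual collector code):

```
use tokio_util::task::AbortOnDropHandle;

async fn collect_all_ereports() {
    // Wrapping each JoinHandle in AbortOnDropHandle means that if this
    // future is dropped (e.g. the bundle is cancelled), the spawned
    // tasks are aborted rather than left running.
    let sp = AbortOnDropHandle::new(tokio::task::spawn(async {
        // ... collect service processor ereports ...
    }));
    let host = AbortOnDropHandle::new(tokio::task::spawn(async {
        // ... collect host OS ereports ...
    }));
    // AbortOnDropHandle implements Future just like JoinHandle does.
    let (_sp_result, _host_result) = tokio::join!(sp, host);
}
```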

Fixes #8649

@hawkw hawkw requested a review from smklein July 31, 2025 18:25

hawkw commented Jul 31, 2025

Huh, the CI failure is due to a checksum mismatch downloading the console --- that seems almost certainly not my doing. I'm hoping merging main makes that go away...

david-crespo commented

Yep, that was fixed by #8669

@hawkw hawkw requested a review from smklein August 1, 2025 20:43
@hawkw hawkw enabled auto-merge (squash) August 1, 2025 21:39
@hawkw hawkw merged commit d581075 into main Aug 1, 2025
17 checks passed
@hawkw hawkw deleted the eliza/bundle-snarf-ereports branch August 1, 2025 22:16
hawkw added a commit that referenced this pull request Aug 5, 2025
In #8739, I added code for collecting ereports into support bundles
which stores the ereport JSON in directories for each sled/switch/PSC
serial number from which an ereport was received. Unfortunately, I
failed to consider that the version 1 Oxide serial numbers are
only unique within the namespace of a particular part, and not globally
--- so (for example) a switch and a compute sled may have colliding
serials. This means that the current code could incorrectly group
ereports reported by two totally different devices. While the ereport
JSON files _do_ contain additional information that disambiguates this
(they include the part number, as well as MGS metadata with the
SP type for SP ereports), and restart IDs are additionally capable of
distinguishing between reporters, putting ereports from two different
systems within the same directory still has the potential to be quite
misleading.

Thus, this branch changes the paths for ereports to include the part
number as well as the serial number, in the format:

```
{part_number}-{serial_number}/{restart_id}/{ENA}.json
```

In order to include part numbers for host OS ereports, I decided to add
a part number column to the `host_ereport` table as well. Initially, I
had opted not to do this, as I was thinking that, since `host_ereport`
includes a sled UUID, we could just join with the `sled` table to get
the part number. However, it occurred to me that ereports may be
received from a sled that's later expunged from the rack: we might
retain such ereports past the lifetime of the sled in the rack, while
the `sled` record for that sled may eventually be deleted, so such a
join would fail. So, I thought it was better to always include the part
number in
the ereport record.

I've added a migration that attempts to backfill the
`host_ereport.part_number` column from the `sled` table for existing
host OS ereport records. In practice, this won't do anything, since
we're not collecting them yet, but it seemed nice to have. Sadly, the
column had to be left nullable, since we may theoretically encounter an
ereport with a sled UUID that points to an already-deleted sled record,
but...whatever. Since there aren't currently any host OS ereport records
anyway, this shouldn't happen, and we'll just handle the nullability;
this isn't terrible as we must already do so for SP ereport records.
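As a sketch of how a collector might name the per-reporter directory
while tolerating that nullability (the helper and the `unknown` fallback
are hypothetical, not the actual code):

```
/// Hypothetical helper: build the `{part_number}-{serial_number}`
/// directory name, falling back when the part number is NULL.
fn reporter_dir(part_number: Option<&str>, serial_number: &str) -> String {
    match part_number {
        Some(part) => format!("{part}-{serial_number}"),
        // Hypothetical fallback for records whose `sled` row was
        // deleted before the part number could be backfilled.
        None => format!("unknown-{serial_number}"),
    }
}
```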

Fixes #8765
hawkw added a commit that referenced this pull request Aug 10, 2025
@hawkw hawkw added the fault-management Everything related to the fault-management initiative (RFD480 and others) label Nov 11, 2025