@hawkw hawkw commented Aug 5, 2025

In #8739, I added code for collecting ereports into support bundles, which stores the ereport JSON in a directory for each sled/switch/PSC serial number from which an ereport was received. Unfortunately, I failed to consider that version 1 Oxide serial numbers are only unique within the namespace of a particular part, and not globally, so (for example) a switch and a compute sled may have colliding serials. This means that the current code could incorrectly group ereports reported by two totally different devices. While the ereport JSON files do contain additional information that disambiguates this (they include the part number, as well as MGS metadata with the SP type for SP ereports), and restart IDs are additionally capable of distinguishing between reporters, putting ereports from two different systems in the same directory still has the potential to be quite misleading.

Thus, this branch changes the paths for ereports to include the part number as well as the serial number, in the format:

```
{part_number}-{serial_number}/{restart_id}/{ENA}.json
```
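As a rough illustration of the layout above, here is a minimal sketch of building such a path. The function name, parameter names, and the use of plain strings for the restart ID and ENA are all assumptions made for this example; this is not the code from this branch.

```rust
use std::path::PathBuf;

/// Builds the relative path of one ereport inside a support bundle, following
/// the `{part_number}-{serial_number}/{restart_id}/{ENA}.json` layout.
///
/// Hypothetical helper: the name and signature are illustrative only.
fn ereport_path(
    part_number: &str,
    serial_number: &str,
    restart_id: &str,
    ena: &str,
) -> PathBuf {
    let mut path = PathBuf::new();
    // The directory name combines part number and serial number, so two
    // devices that share a serial but are different parts stay separate.
    path.push(format!("{part_number}-{serial_number}"));
    path.push(restart_id);
    path.push(format!("{ena}.json"));
    path
}
```

With made-up values, a switch and a sled that share the serial `SERIAL0001` but have part numbers `PART-A` and `PART-B` end up under `PART-A-SERIAL0001/` and `PART-B-SERIAL0001/` respectively, rather than in one shared directory.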

In order to include part numbers for host OS ereports, I decided to add a part number column to the `host_ereport` table as well. Initially, I had opted not to do this, as I was thinking that, since `host_ereport` includes a sled UUID, we could just join with the `sled` table to get the part number. However, it occurred to me that ereports may be received from a sled that's later expunged from the rack, and the `sled` record for that sled may eventually be deleted, so such a join would fail. We might retain such ereports past the lifetime of the sled in the rack. So, I thought it was better to always include the part number in the ereport record.

I've added a migration that attempts to backfill the `host_ereport.part_number` column from the `sled` table for existing host OS ereport records. In practice, this won't do anything, since we're not collecting them yet, but it seemed nice to have. Sadly, the column had to be left nullable, since we may theoretically encounter an ereport with a sled UUID that points to an already-deleted sled record, but...whatever. Since there aren't currently any host OS ereport records anyway, this shouldn't happen, and we'll just handle the nullability; this isn't terrible, as we must already do so for SP ereport records.
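To make the nullability argument concrete, here is a small in-memory sketch of what the backfill does conceptually. It is not the actual migration: the struct, its field names, and the use of a `u32` in place of the sled UUID are stand-ins chosen for this example.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a `host_ereport` row. Field names are
/// hypothetical, and `sled_id` is a `u32` here rather than a UUID.
#[derive(Debug, Clone, PartialEq)]
struct HostEreport {
    sled_id: u32,
    part_number: Option<String>,
}

/// Fills in missing part numbers from the sled records that still exist.
/// `sleds` maps a sled ID to that sled's part number.
fn backfill_part_numbers(
    ereports: &mut [HostEreport],
    sleds: &HashMap<u32, String>,
) {
    for ereport in ereports.iter_mut() {
        if ereport.part_number.is_none() {
            // If the sled record has been deleted, the lookup returns `None`
            // and the part number stays unknown, which is why the column
            // cannot be made non-nullable.
            ereport.part_number = sleds.get(&ereport.sled_id).cloned();
        }
    }
}
```

An ereport whose sled is still present gets its part number filled in, while one that points at a deleted sled keeps `None`, so anything reading these records has to handle the missing case.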

Fixes #8765

@hawkw hawkw requested a review from smklein August 5, 2025 18:14
@hawkw hawkw enabled auto-merge (squash) August 5, 2025 19:35
@hawkw hawkw merged commit 6ab7e96 into main Aug 10, 2025
16 checks passed
@hawkw hawkw deleted the eliza/namespace-support-bundle-ereports branch August 10, 2025 01:41
@hawkw hawkw added the fault-management Everything related to the fault-management initiative (RFD480 and others) label Nov 11, 2025

Development

Successfully merging this pull request may close these issues.

support bundle ereport directories must be namespaced by part
