[nexus] Snarf ereports from CRDB into support bundles #8739
Merged
Conversation
hawkw (Member, Author):
Huh, the CI failure is due to a checksum mismatch downloading the console --- that seems almost certainly not my doing. I'm hoping merging …
Contributor:
Yep, that was fixed by #8669
smklein reviewed Jul 31, 2025
Co-authored-by: Sean Klein <[email protected]>
this shouldn't ever actually happen due to `panic="abort"`, but whatever
smklein approved these changes Aug 1, 2025
hawkw added a commit that referenced this pull request Aug 5, 2025
In #8739, I added code for collecting ereports into support bundles which stores the ereport JSON in directories for each sled/switch/PSC serial number from which an ereport was received. Unfortunately, I failed to consider that the version 1 Oxide serial numbers are only unique within the namespace of a particular part, and not globally --- so (for example) a switch and a compute sled may have colliding serials. This means that the current code could incorrectly group ereports reported by two totally different devices. While the ereport JSON files _do_ contain additional information that disambiguates this (they include the part number, as well as MGS metadata with the SP type for SP ereports), and restart IDs are additionally capable of distinguishing between reporters, putting ereports from two different systems within the same directory still has the potential to be quite misleading.

Thus, this branch changes the paths for ereports to include the part number as well as the serial number, in the format:

```
{part_number}-{serial_number}/{restart_id}/{ENA}.json
```

In order to include part numbers for host OS ereports, I decided to add a part number column to the `host_ereport` table as well. Initially, I had opted not to do this, as I was thinking that, since `host_ereport` includes a sled UUID, we could just join with the `sled` table to get the part number. However, it occurred to me that ereports may be received from a sled that's later expunged from the rack, and the `sled` record for that sled may eventually be deleted, so such a join would fail. We might retain such ereports past the lifetime of the sled in the rack. So, I thought it was better to always include the part number in the ereport record.

I've added a migration that attempts to backfill the `host_ereport.part_number` column from the `sled` table for existing host OS ereport records. In practice, this won't do anything, since we're not collecting them yet, but it seemed nice to have. Sadly, the column had to be left nullable, since we may theoretically encounter an ereport with a sled UUID that points to an already-deleted sled record, but...whatever. Since there aren't currently any host OS ereport records anyway, this shouldn't happen, and we'll just handle the nullability; this isn't terrible as we must already do so for SP ereport records.

Fixes #8765
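A minimal sketch of the revised reporter directory naming, assuming a fallback label for host ereports whose nullable `part_number` is missing; the function signature and the `"unknown"` placeholder are illustrative, not the actual omicron code:

```rust
use std::path::PathBuf;

/// Build `{part_number}-{serial_number}/{restart_id}/{ENA}.json` for one
/// ereport. `part_number` may be absent for host OS ereports, since the new
/// `host_ereport.part_number` column is nullable; "unknown" is an assumed
/// placeholder, not necessarily what the real collector writes.
fn ereport_bundle_path(
    part_number: Option<&str>,
    serial_number: &str,
    restart_id: &str,
    ena: u64,
) -> PathBuf {
    let reporter = format!("{}-{serial_number}", part_number.unwrap_or("unknown"));
    PathBuf::from(reporter)
        .join(restart_id)
        .join(format!("{ena}.json"))
}
```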
hawkw added a commit that referenced this pull request Aug 10, 2025
PR #8269 added CRDB tables for storing ereports received from both
service processors and the sled host OS. These ereports are generated to
indicate a fault or other important event, so they contain information
that's probably worth including in support bundles. So we should do
that.
This branch adds code to the `SupportBundleCollector` background task for querying the database for ereports and putting them in the bundle.
This, in turn, required adding code for querying ereports over a specified time range. The `BundleRequest` can be constructed with a set of filters for ereports, including the time window and a list of serial numbers to collect ereports from. Presently, we always just use the default: we collect ereports from all serial numbers from the 7 days prior to bundle collection. But I anticipate that this will be used more in the future when we add a notion of targeted support bundles: for instance, if we generate a support bundle for a particular sled, we would probably only grab ereports from that sled.
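As a rough sketch of the kind of filter described here (the type and field names are hypothetical, not the actual `BundleRequest` API):

```rust
use chrono::{DateTime, Duration, Utc};

/// Hypothetical ereport filters carried by a bundle request.
struct EreportFilters {
    /// Only collect ereports received within this time window.
    start: DateTime<Utc>,
    end: DateTime<Utc>,
    /// Only collect ereports from these serial numbers; `None` means all.
    serial_numbers: Option<Vec<String>>,
}

impl EreportFilters {
    /// The default described above: all serials, the 7 days before collection.
    fn default_for(collection_time: DateTime<Utc>) -> Self {
        Self {
            start: collection_time - Duration::days(7),
            end: collection_time,
            serial_numbers: None,
        }
    }
}
```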
Ereports are stored in an `ereports` directory in the bundle, with subdirectories for each serial number that emitted an ereport. Each serial number directory has a subdirectory for each ereport restart ID of that serial, and the individual ereports are stored within the restart ID directory as JSON files. The path to an individual ereport will be `ereports/${SERIAL_NUMBER}/${RESTART_ID}/${ENA}.json`. I'm open to changing this organization scheme if others think there's a better approach --- for example, we could place the restart ID in the filename rather than in a subdirectory if that would be more useful.
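For illustration, a minimal sketch of how such a path could be assembled (the function and argument names are assumptions, not the actual omicron code):

```rust
use std::path::PathBuf;

/// Build `ereports/${SERIAL_NUMBER}/${RESTART_ID}/${ENA}.json` for one ereport.
fn ereport_path(serial_number: &str, restart_id: &str, ena: u64) -> PathBuf {
    PathBuf::from("ereports")
        .join(serial_number)
        .join(restart_id)
        .join(format!("{ena}.json"))
}
```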
Ereport collection is done in parallel with the rest of the support bundle collection by spawning Tokio tasks to collect host OS and service processor ereports. `tokio_util::task::AbortOnDropHandle` is used to wrap the `JoinHandle`s for these tasks to ensure they're aborted if the ereport collection future is dropped, so that we stop collecting ereports if the support bundle is cancelled.
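A minimal sketch of that abort-on-drop pattern, assuming `tokio` and `tokio-util` are available; the `collect_*` functions are placeholders, not the actual collection code:

```rust
use tokio_util::task::AbortOnDropHandle;

async fn collect_all_ereports() {
    // Spawn the two collection tasks and wrap their JoinHandles so that
    // dropping this future aborts them.
    let host = AbortOnDropHandle::new(tokio::spawn(collect_host_os_ereports()));
    let sp = AbortOnDropHandle::new(tokio::spawn(collect_sp_ereports()));

    // Wait for both; if the support bundle is cancelled and this future is
    // dropped before completion, both spawned tasks are aborted.
    let (_host, _sp) = tokio::join!(host, sp);
}

async fn collect_host_os_ereports() { /* placeholder */ }
async fn collect_sp_ereports() { /* placeholder */ }
```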
Fixes #8649