@jgallagher
Contributor

@askfongjojo noticed that our uptimes on dogfood are all nonsense:

 8  BRM44220011        ok: 14:05:34    up 14066 day(s), 14:06,  1 user,  load average: 5.54, 5.60, 5.62
 9  BRM44220005        ok: 14:05:34    up 14066 day(s), 14:06,  1 user,  load average: 17.51, 17.81, 17.81
10  BRM42220009        ok: 14:05:35    up 14066 day(s), 14:06,  1 user,  load average: 15.00, 14.46, 14.01
11  BRM42220006        ok: 14:05:35    up 14066 day(s), 14:06,  0 users,  load average: 11.49, 11.21, 11.10
12  BRM42220057        ok: 14:05:35    up 14066 day(s), 14:06,  0 users,  load average: 2.88, 2.62, 2.03
...
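
As a rough sanity check on the figure above: 14066 days is about 38.5 years. Assuming this output was captured on July 2, 2025 (the date of this PR; the output itself shows only a time of day), subtracting 14066 days lands on December 28, 1986, which would be consistent with a boot timestamp recorded before the system clock had synchronized. The sketch below is only that date arithmetic, using the well-known days-from-civil conversion; the function names are hypothetical and nothing here is taken from sled-agent.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date (month is 1..=12).
fn days_from_civil(y: i64, m: u32, d: u32) -> i64 {
    // Treat March as the first month so the leap day falls at year end.
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400); // year of era, 0..=399
    let mp = (m as i64 + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + d as i64 - 1; // day of (shifted) year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Inverse of `days_from_civil`: (year, month, day) for a day count.
fn civil_from_days(z: i64) -> (i64, u32, u32) {
    let z = z + 719_468;
    let era = z.div_euclid(146_097);
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
    let mp = (5 * doy + 2) / 153; // March = 0
    let d = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let m = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    let y = yoe + era * 400;
    (if m <= 2 { y + 1 } else { y }, m, d)
}

/// Hypothetical helper: the calendar date implied by "up N day(s)" when
/// the report was produced on `today`.
fn implied_boot_date(today: (i64, u32, u32), uptime_days: i64) -> (i64, u32, u32) {
    let (y, m, d) = today;
    civil_from_days(days_from_civil(y, m, d) - uptime_days)
}
```

With the assumed capture date, `implied_boot_date((2025, 7, 2), 14066)` evaluates to `(1986, 12, 28)`.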

I think #8064 introduced this. It shuffled around how time sync is checked and added a callback that the config-reconciler is supposed to run when it detects that time has synchronized. That callback is responsible for rewriting uptime (among other things), but the config-reconciler never actually executes it. This PR fixes that.
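
To illustrate the class of bug described here, the following is a minimal sketch, not the actual sled-agent or config-reconciler code. It assumes a reconciler that stores a one-shot callback and is expected to invoke it the first time it observes that time has synchronized; `TimeSync`, `Reconciler`, and `observe` are hypothetical names. Storing the callback but omitting the marked call would reproduce the symptom: time syncs, yet the uptime rewrite never happens.

```rust
/// Hypothetical stand-in for the reconciler's view of time sync.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum TimeSync {
    NotYetSynced,
    Synced,
}

/// Hypothetical reconciler holding a one-shot callback that should run
/// the first time it observes that time has synchronized.
struct Reconciler<F: FnOnce()> {
    on_time_sync: Option<F>,
}

impl<F: FnOnce()> Reconciler<F> {
    fn new(on_time_sync: F) -> Self {
        Self { on_time_sync: Some(on_time_sync) }
    }

    /// Feed the latest time-sync status to the reconciler.
    /// Returns true if the callback ran during this call.
    fn observe(&mut self, status: TimeSync) -> bool {
        if status == TimeSync::Synced {
            // `take()` leaves `None` behind, so the callback fires at
            // most once even if `Synced` is observed repeatedly.
            if let Some(callback) = self.on_time_sync.take() {
                callback(); // the call that must not be forgotten
                return true;
            }
        }
        false
    }
}
```

Keeping the callback in an `Option` and consuming it with `take()` is one simple way to get run-once behavior without a separate "already fired" flag.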

However, some of our racklettes are running commits that include #8064 and still report reasonable uptimes. I'm not sure how that's possible - is there some other way uptime can be correct if sled-agent doesn't fix it?

@jgallagher requested review from leftwo and smklein July 2, 2025 14:20
@jgallagher
Contributor Author

Tested on berlin: uptimes are back to reasonable.

root@oxz_switch1:~# pilot host exec -c uptime 14-17
14  BRM42220023        ok: 17:46:57    up 49 min(s),  0 users,  load average: 1.96, 2.00, 2.01
15  BRM42220011        ok: 17:46:57    up 48 min(s),  0 users,  load average: 1.17, 1.11, 1.15
16  BRM42220082        ok: 17:46:57    up 51 min(s),  0 users,  load average: 2.22, 2.09, 2.03
17  BRM06240029        ok: 17:46:58    up 46 min(s),  0 users,  load average: 1.53, 1.43, 1.44

  "Notified metrics task that time is now synced",
  ),
- Err(e) => error!(
+ Err(e) => warn!(

Contributor

Why switch to a warn here?
Will we retry this? (I know that this is not part of the goal of the PR, just wondering)

Contributor Author

We will not retry this, but I don't think it's particularly detrimental to sled-agent itself, right? I can change it back if you think error! is better. I have a vague sense that those are pretty rare and usually near-fatal.

Contributor Author

(Merging to get the fix into main, but happy to open a PR switching this back if you disagree)

@jgallagher merged commit 84c9ffe into main Jul 2, 2025
16 checks passed
@jgallagher deleted the john/fix-on-time-sync branch July 2, 2025 19:34