sled-agent-config-reconciler: actually call `on_time_sync()` #8509

jgallagher · 2025-07-02T14:20:45Z

@askfongjojo noticed that our uptimes on dogfood are all nonsense:

 8  BRM44220011        ok: 14:05:34    up 14066 day(s), 14:06,  1 user,  load average: 5.54, 5.60, 5.62
 9  BRM44220005        ok: 14:05:34    up 14066 day(s), 14:06,  1 user,  load average: 17.51, 17.81, 17.81
10  BRM42220009        ok: 14:05:35    up 14066 day(s), 14:06,  1 user,  load average: 15.00, 14.46, 14.01
11  BRM42220006        ok: 14:05:35    up 14066 day(s), 14:06,  0 users,  load average: 11.49, 11.21, 11.10
12  BRM42220057        ok: 14:05:35    up 14066 day(s), 14:06,  0 users,  load average: 2.88, 2.62, 2.03
...
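As a rough sanity check on the `14066 day(s)` figure, the sketch below subtracts that many days from the date of the report (2025-07-02). This is an inference from the output above, not something stated in the thread, and the helper names are hypothetical. It uses only the standard library and the well-known days-from-civil / civil-from-days conversion for the proleptic Gregorian calendar.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date (hypothetical helper).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    // Treat January and February as months 11 and 12 of the previous year.
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400); // year of era, 0..=399
    let mp = (m + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + d - 1; // day of (shifted) year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Inverse of `days_from_civil`: returns (year, month, day).
fn civil_from_days(z: i64) -> (i64, i64, i64) {
    let z = z + 719_468;
    let era = z.div_euclid(146_097);
    let doe = z.rem_euclid(146_097);
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
    let mp = (5 * doy + 2) / 153;
    let d = doy - (153 * mp + 2) / 5 + 1;
    let m = if mp < 10 { mp + 3 } else { mp - 9 };
    let y = yoe + era * 400 + if m <= 2 { 1 } else { 0 };
    (y, m, d)
}

/// Date on which a host reporting `uptime_days` of uptime on the given
/// date would have booted, ignoring the hours-and-minutes part.
fn apparent_boot_date(today: (i64, i64, i64), uptime_days: i64) -> (i64, i64, i64) {
    let (y, m, d) = today;
    civil_from_days(days_from_civil(y, m, d) - uptime_days)
}
```

Calling `apparent_boot_date((2025, 7, 2), 14066)` yields a date in late December 1986, which would be consistent with a boot timestamp recorded before the clock was synchronized and never rewritten afterwards.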

I think #8064 introduced this. It changed how time sync is checked and added a callback that the config-reconciler is supposed to run once it detects that time has synchronized. That callback is responsible for rewriting uptime (among other things), but the reconciler never actually calls it. This PR fixes that.
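The shape of the bug can be illustrated with a minimal sketch. This is not the actual omicron code: `TimeSyncWatcher`, `observe`, and the run-once behavior are all assumptions made for illustration. The point is only that a stored callback does nothing unless some code path invokes it when the synced state is first observed.

```rust
/// Hypothetical stand-in for the reconciler's time-sync bookkeeping.
struct TimeSyncWatcher<F: FnMut()> {
    /// Callback to run when time is first seen as synchronized.
    on_time_sync: F,
    /// Whether the callback has already run (run-once is an assumption).
    already_notified: bool,
}

impl<F: FnMut()> TimeSyncWatcher<F> {
    fn new(on_time_sync: F) -> Self {
        Self { on_time_sync, already_notified: false }
    }

    /// Called on every reconciliation pass with the latest sync status.
    /// Returns true if the callback ran during this pass.
    fn observe(&mut self, is_synced: bool) -> bool {
        if is_synced && !self.already_notified {
            // Omitting this call is the kind of mistake described above:
            // the sync is detected, but nothing downstream is told.
            (self.on_time_sync)();
            self.already_notified = true;
            return true;
        }
        false
    }
}
```

In this sketch, repeated calls to `observe(true)` run the callback only on the first one; whether the real reconciler guards against repeated invocation in the same way is not something the thread says.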

However, some racklettes running commits that include #8064 have reasonable uptimes. I'm not sure how that's possible: is there some other way uptime can be correct if sled-agent doesn't fix it?

jgallagher · 2025-07-02T17:47:22Z

Tested on berlin: uptimes are back to reasonable.

root@oxz_switch1:~# pilot host exec -c uptime 14-17
14  BRM42220023        ok: 17:46:57    up 49 min(s),  0 users,  load average: 1.96, 2.00, 2.01
15  BRM42220011        ok: 17:46:57    up 48 min(s),  0 users,  load average: 1.17, 1.11, 1.15
16  BRM42220082        ok: 17:46:57    up 51 min(s),  0 users,  load average: 2.22, 2.09, 2.03
17  BRM06240029        ok: 17:46:58    up 46 min(s),  0 users,  load average: 1.53, 1.43, 1.44

leftwo · 2025-07-02T18:05:17Z

sled-agent/src/services.rs

                    "Notified metrics task that time is now synced",
                ),
-                Err(e) => error!(
+                Err(e) => warn!(


Why switch to a warn here?
Will we retry this? (I know that this is not part of the goal of the PR, just wondering)

jgallagher · 2025-07-02

We will not retry this, but I don't think it's particularly detrimental to sled-agent itself, right? I can change it back if you think error! is better. I have a vague sense that error! logs are pretty rare and usually near-fatal.

(Merging to get the fix into main, but happy to open a PR switching this back if you disagree)

sled-agent-config-reconciler: actually call on_time_sync()

17d2976

jgallagher requested review from leftwo and smklein July 2, 2025 14:20

leftwo reviewed Jul 2, 2025

leftwo approved these changes Jul 2, 2025

jgallagher merged commit 84c9ffe into main Jul 2, 2025
16 checks passed

jgallagher deleted the john/fix-on-time-sync branch July 2, 2025 19:34
