Closed
Description
After a reboot, the Sled Agent attempts to:
- Delete all zones (fine for now, whether or not we should reconcile some of 'em is a separate discussion)
- Load local configs (e.g., underlay info, "what services should I launch", etc.)
- (If it's a scrimlet) Launch the switch zone
As a part of setting up the switch zone, however, we run into some problems:
17:19:03.713Z INFO SledAgent/BootstrapAgent: Configuring new Omicron zone: oxz_switch
17:19:03.735Z INFO SledAgent/BootstrapAgent: Installing Omicron zone: oxz_switch
17:19:04.469Z INFO SledAgent/BootstrapAgent: creating NAT entry for service
service: ServiceZoneService { id: 75203abd-3d8b-4928-9e29-5235f4ebcdcf, details: BoundaryNtp { ntp_servers: ["ntp.eng.oxide.computer"], dns_servers: ["1.1.1.1", "9.9.9.9"], domain: None, nic: NetworkInterface { id: 50e49355-0dfe-4b4c-9fd4-efd68c5894bc, kind: Service { id: 75203abd-3d8b-4928-9e29-5235f4ebcdcf }, name: Name("ntp-75203abd-3d8b-4928-9e29-5235f4ebcdcf"), ip: 172.30.3.5, mac: MacAddr(MacAddr6([168, 64, 37, 255, 139, 60])), subnet: V4(Ipv4Net(Ipv4Network { addr: 172.30.3.0, prefix: 24 })), vni: Vni(100), primary: true, slot: 0 }, snat_cfg: SourceNatConfig { ip: 192.168.0.222, first_port: 0, last_port: 16383 } } }
17:19:07.793Z INFO SledAgent/BootstrapAgent: Zone booting (zone=oxz_switch)
17:19:12.865Z INFO SledAgent/BootstrapAgent: Ensuring bootstrap address fdb0:18c0:4d0c:f4e5::2 exists in switch zone
17:19:12.865Z INFO SledAgent/BootstrapAgent: Adding bootstrap address (zone=oxz_switch)
17:19:23.445Z INFO SledAgent/BootstrapAgent: Forwarding bootstrap traffic via oxBootstrap38 to fe80::8:20ff:fe5b:f28
17:19:23.452Z INFO SledAgent/BootstrapAgent: GZ addresses: []
17:19:23.452Z INFO SledAgent/BootstrapAgent: Zone using sled underlay as gateway
17:19:23.458Z WARN SledAgent/BootstrapAgent:
Failed to initialize switch: Failed to do 'Adding Route' by running command in zone: Error running command in zone 'oxz_switch': Command [/usr/sbin/route add -inet6 default -inet6 fd00:1122:3344:101::1] executed and failed with status: exit status: 128 stdout: add net default: gateway fd00:1122:3344:101::1: Network is unreachable
stderr:
17:19:23.480Z WARN SledAgent/BootstrapAgent: failed to create NAT entry for service
service: ServiceZoneService { id: 75203abd-3d8b-4928-9e29-5235f4ebcdcf, details: BoundaryNtp { ntp_servers: ["ntp.eng.oxide.computer"], dns_servers: ["1.1.1.1", "9.9.9.9"], domain: None, nic: NetworkInterface { id: 50e49355-0dfe-4b4c-9fd4-efd68c5894bc, kind: Service { id: 75203abd-3d8b-4928-9e29-5235f4ebcdcf }, name: Name("ntp-75203abd-3d8b-4928-9e29-5235f4ebcdcf"), ip: 172.30.3.5, mac: MacAddr(MacAddr6([168, 64, 37, 255, 139, 60])), subnet: V4(Ipv4Net(Ipv4Network { addr: 172.30.3.0, prefix: 24 })), vni: Vni(100), primary: true, slot: 0 }, snat_cfg: SourceNatConfig { ip: 192.168.0.222, first_port: 0, last_port: 16383 } } }
--
error: Communication Error: error sending request for url (http://[fd00:1122:3344:101::2]:12224/nat/ipv4/192.168.0.222/0): operation timed out
What's happening here?
- I believe the Sled Agent is trying to re-load all services. It can reload the internal DNS zone, but it struggles to reload the NTP zone. I believe this is because part of the NTP zone setup involves making calls to the switch zone, which isn't up. This stops all other services from loading, because we're loading them sequentially.
- So what's the deal with the switch zone failing to come up? The "Adding Route" command fails. I believe this is because the switch zone attempts to route through the sled's underlay address, even though the switch zone itself doesn't have an address on the underlay.
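To make the first point concrete, here is a minimal sketch of how sequential loading lets one failing service block everything queued after it. All names (`Service`, `start`, `load_all`) are hypothetical stand-ins, not the actual Sled Agent types, and the failure condition is reduced to a single boolean.

```rust
// Hypothetical, simplified model of the services the Sled Agent reloads.
#[derive(Debug, PartialEq)]
enum Service {
    InternalDns,
    BoundaryNtp,
    Other(&'static str),
}

// Assumption: only the NTP zone needs the switch zone (for its NAT entry).
fn start(service: &Service, switch_zone_up: bool) -> Result<(), String> {
    match service {
        Service::BoundaryNtp if !switch_zone_up => {
            Err("failed to create NAT entry: switch zone unreachable".to_string())
        }
        _ => Ok(()),
    }
}

// Starts services in order and stops at the first error, so every service
// listed after the failing one is never attempted.
fn load_all<'a>(
    services: &'a [Service],
    switch_zone_up: bool,
) -> (Vec<&'a Service>, Option<String>) {
    let mut started = Vec::new();
    for service in services {
        if let Err(e) = start(service, switch_zone_up) {
            return (started, Some(e));
        }
        started.push(service);
    }
    (started, None)
}
```

With the switch zone down, a list of `[InternalDns, BoundaryNtp, Other(..)]` starts only the DNS service and returns the NTP error; the trailing service is never reached.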
TL;DR:
- Switch zone starts without underlay address:
omicron/sled-agent/src/bootstrap/hardware.rs
Lines 71 to 80 in 7d50fd4

    sled_hardware::HardwareUpdate::TofinoLoaded => {
        let baseboard = self.hardware.baseboard();
        let switch_zone_ip = None;
        if let Err(e) = self.services.activate_switch(
            switch_zone_ip,
            baseboard,
        ).await {
            warn!(self.log, "Failed to activate switch: {e}");
        }
    }

- Switch zone setup code peeks at sled's underlay address, and adds a route to it if it exists:
omicron/sled-agent/src/services.rs
Lines 1299 to 1322 in 7d50fd4

    let maybe_gateway = if !request.zone.gz_addresses.is_empty() {
        // If this service supplies its own GZ address, add a route.
        //
        // This is currently being used for the DNS service.
        //
        // TODO: consider limiting the number of GZ addresses which
        // can be supplied - now that we're actively using it, we
        // aren't really handling the "many GZ addresses" case, and it
        // doesn't seem necessary now.
        Some(request.zone.gz_addresses[0])
    } else if let Some(info) = self.inner.sled_info.get() {
        // If the service has not supplied a GZ address, simply add
        // a route to the sled's underlay address.
        Some(info.underlay_address)
    } else {
        // If the underlay doesn't exist, no routing occurs.
        None
    };

    if let Some(gateway) = maybe_gateway {
        running_zone.add_default_route(gateway).map_err(|err| {
            Error::ZoneCommand { intent: "Adding Route".to_string(), err }
        })?;
    }

- These two should both be in-sync. After reboot, they are not.
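
The mismatch can be reproduced in isolation. The sketch below reduces the gateway choice from `services.rs` to a pure function and pairs it with one possible guard that keeps the two decisions in sync. `choose_gateway` and `choose_gateway_guarded` are hypothetical names introduced for illustration, and the guard is an assumption about what a fix could look like, not necessarily the change made in omicron.

```rust
use std::net::Ipv6Addr;

/// Simplified mirror of the quoted logic: prefer the first GZ address,
/// otherwise fall back to the sled's underlay address if it is known.
fn choose_gateway(
    gz_addresses: &[Ipv6Addr],
    sled_underlay: Option<Ipv6Addr>,
) -> Option<Ipv6Addr> {
    gz_addresses.first().copied().or(sled_underlay)
}

/// Hypothetical guard: only route via the sled's underlay address when the
/// zone itself was given an underlay address.
fn choose_gateway_guarded(
    zone_underlay: Option<Ipv6Addr>,
    gz_addresses: &[Ipv6Addr],
    sled_underlay: Option<Ipv6Addr>,
) -> Option<Ipv6Addr> {
    if let Some(gz) = gz_addresses.first() {
        return Some(*gz);
    }
    // `and` yields None when the zone has no underlay address.
    zone_underlay.and(sled_underlay)
}
```

After a reboot the sled already knows its underlay address, but the switch zone is activated with `switch_zone_ip = None`. In that state `choose_gateway(&[], Some(sled))` still returns the sled's address, which matches the failing `route add` in the log, while the guarded variant returns `None` and skips the route.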