Reboot: Switch Zone unhappy with route - missing underlay address #3461

@smklein

After a reboot, the Sled Agent attempts to:

  • Delete all zones (fine for now, whether or not we should reconcile some of 'em is a separate discussion)
  • Load local configs (e.g., underlay info, "what services should I launch", etc.)
  • (If it's a scrimlet) Launch the switch zone

As a part of setting up the switch zone, however, we run into some problems:

17:19:03.713Z  INFO SledAgent/BootstrapAgent: Configuring new Omicron zone: oxz_switch                                 
17:19:03.735Z  INFO SledAgent/BootstrapAgent: Installing Omicron zone: oxz_switch                                      
17:19:04.469Z  INFO SledAgent/BootstrapAgent: creating NAT entry for service                                           
    service: ServiceZoneService { id: 75203abd-3d8b-4928-9e29-5235f4ebcdcf, details: BoundaryNtp { ntp_servers: ["ntp.eng.oxide.computer"], dns_servers: ["1.1.1.1", "9.9.9.9"], domain: None, nic: NetworkInterface { id: 50e49355-0dfe-4b4c-9fd4-efd68c5894bc, kind: Service { id: 75203abd-3d8b-4928-9e29-5235f4ebcdcf }, name: Name("ntp-75203abd-3d8b-4928-9e29-5235f4ebcdcf"), ip: 172.30.3.5, mac: MacAddr(MacAddr6([168, 64, 37, 255, 139, 60])), subnet: V4(Ipv4Net(Ipv4Network { addr: 172.30.3.0, prefix: 24 })), vni: Vni(100), primary: true, slot: 0 }, snat_cfg: SourceNatConfig { ip: 192.168.0.222, first_port: 0, last_port: 16383 } } }
17:19:07.793Z  INFO SledAgent/BootstrapAgent: Zone booting (zone=oxz_switch)                                           
17:19:12.865Z  INFO SledAgent/BootstrapAgent: Ensuring bootstrap address fdb0:18c0:4d0c:f4e5::2 exists in switch zone  
17:19:12.865Z  INFO SledAgent/BootstrapAgent: Adding bootstrap address (zone=oxz_switch)                               
17:19:23.445Z  INFO SledAgent/BootstrapAgent: Forwarding bootstrap traffic via oxBootstrap38 to fe80::8:20ff:fe5b:f28  
17:19:23.452Z  INFO SledAgent/BootstrapAgent: GZ addresses: []                                                         
17:19:23.452Z  INFO SledAgent/BootstrapAgent: Zone using sled underlay as gateway                                      
17:19:23.458Z  WARN SledAgent/BootstrapAgent:                                                                          
    Failed to initialize switch: Failed to do 'Adding Route' by running command in zone: Error running command in zone 'oxz_switch': Command [/usr/sbin/route add -inet6 default -inet6 fd00:1122:3344:101::1] executed and failed with status: exit status: 128  stdout: add net default: gateway fd00:1122:3344:101::1: Network is unreachable
      stderr:                                                                                                          
17:19:23.480Z  WARN SledAgent/BootstrapAgent: failed to create NAT entry for service                                   
    service: ServiceZoneService { id: 75203abd-3d8b-4928-9e29-5235f4ebcdcf, details: BoundaryNtp { ntp_servers: ["ntp.eng.oxide.computer"], dns_servers: ["1.1.1.1", "9.9.9.9"], domain: None, nic: NetworkInterface { id: 50e49355-0dfe-4b4c-9fd4-efd68c5894bc, kind: Service { id: 75203abd-3d8b-4928-9e29-5235f4ebcdcf }, name: Name("ntp-75203abd-3d8b-4928-9e29-5235f4ebcdcf"), ip: 172.30.3.5, mac: MacAddr(MacAddr6([168, 64, 37, 255, 139, 60])), subnet: V4(Ipv4Net(Ipv4Network { addr: 172.30.3.0, prefix: 24 })), vni: Vni(100), primary: true, slot: 0 }, snat_cfg: SourceNatConfig { ip: 192.168.0.222, first_port: 0, last_port: 16383 } } }
    --                                                                                                                 
    error: Communication Error: error sending request for url (http://[fd00:1122:3344:101::2]:12224/nat/ipv4/192.168.0.222/0): operation timed out

What's happening here?

  • I believe the Sled Agent is trying to reload all services. It can reload the internal DNS zone, but it struggles to reload the NTP zone. I believe this is because part of the NTP zone setup involves making calls to the switch zone, which isn't up (the NAT entry request in the log above times out). Because we load services sequentially, this one failure stops all other services from loading.
  • So what's the deal with the switch zone failing to come up? The "Adding Route" command fails with "Network is unreachable". I believe this is because the switch zone attempts to route through the sled's underlay address, even though the switch zone itself doesn't have an address on the underlay.
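To illustrate why one failing zone blocks everything behind it, here is a minimal sketch of sequential loading with early return, next to one possible alternative that keeps going and collects failures. All names here (`Service`, `start`, `load_sequential`, `load_all`) are hypothetical stand-ins, not the Sled Agent's actual types or functions, and the alternative is not a proposed fix.

```rust
/// Hypothetical stand-in for a service zone (not a Sled Agent type).
#[derive(Debug)]
struct Service {
    name: &'static str,
    /// True if zone setup must call into the switch zone (e.g. to create a NAT entry).
    needs_switch: bool,
}

/// Pretend to start a service; fails if it needs the switch zone and the switch is down.
fn start(service: &Service, switch_up: bool) -> Result<(), String> {
    if service.needs_switch && !switch_up {
        Err(format!("{}: switch zone unreachable", service.name))
    } else {
        Ok(())
    }
}

/// Sequential loading with early return: the first failure aborts the loop,
/// so every service after the failing one is never attempted.
fn load_sequential(services: &[Service], switch_up: bool) -> Result<Vec<&'static str>, String> {
    let mut started = Vec::new();
    for service in services {
        start(service, switch_up)?;
        started.push(service.name);
    }
    Ok(started)
}

/// Alternative sketch: attempt every service and report failures separately.
fn load_all(services: &[Service], switch_up: bool) -> (Vec<&'static str>, Vec<String>) {
    let mut started = Vec::new();
    let mut failed = Vec::new();
    for service in services {
        match start(service, switch_up) {
            Ok(()) => started.push(service.name),
            Err(e) => failed.push(e),
        }
    }
    (started, failed)
}
```

With the switch zone down and services ordered internal DNS, NTP, then anything else, `load_sequential` returns the NTP error and never reaches the later services, while `load_all` starts everything that does not depend on the switch.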

TL;DR:

  • Switch zone starts without underlay address:
    sled_hardware::HardwareUpdate::TofinoLoaded => {
        let baseboard = self.hardware.baseboard();
        let switch_zone_ip = None;
        if let Err(e) = self.services.activate_switch(
            switch_zone_ip,
            baseboard,
        ).await {
            warn!(self.log, "Failed to activate switch: {e}");
        }
    }
  • Switch zone setup code peeks at sled's underlay address, and adds a route to it if it exists:
    let maybe_gateway = if !request.zone.gz_addresses.is_empty() {
        // If this service supplies its own GZ address, add a route.
        //
        // This is currently being used for the DNS service.
        //
        // TODO: consider limiting the number of GZ addresses which
        // can be supplied - now that we're actively using it, we
        // aren't really handling the "many GZ addresses" case, and it
        // doesn't seem necessary now.
        Some(request.zone.gz_addresses[0])
    } else if let Some(info) = self.inner.sled_info.get() {
        // If the service has not supplied a GZ address, simply add
        // a route to the sled's underlay address.
        Some(info.underlay_address)
    } else {
        // If the underlay doesn't exist, no routing occurs.
        None
    };
    if let Some(gateway) = maybe_gateway {
        running_zone.add_default_route(gateway).map_err(|err| {
            Error::ZoneCommand { intent: "Adding Route".to_string(), err }
        })?;
    }
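  • These two should be in sync: the zone should only get a route via the sled's underlay address if the zone itself has an underlay address. After reboot, they are not: the switch zone is started with no underlay address, but the sled's underlay info has already been loaded, so the route is added anyway and fails.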
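One way to keep the two decisions in sync is to make the gateway choice depend on whether the zone itself has an underlay address. The sketch below is only an illustration under that assumption: `ZoneRequest`, its `zone_underlay_address` field, and `choose_gateway` are hypothetical simplifications, not the actual Sled Agent types, and this is not necessarily how the issue should be fixed.

```rust
use std::net::Ipv6Addr;

/// Hypothetical, simplified view of a zone request (not the real type).
struct ZoneRequest {
    /// GZ addresses supplied by the service itself, if any.
    gz_addresses: Vec<Ipv6Addr>,
    /// Address the zone holds on the underlay, if any. This field is an
    /// assumption made for illustration.
    zone_underlay_address: Option<Ipv6Addr>,
}

/// Pick the default-route gateway for a zone, or `None` if no route should be added.
fn choose_gateway(request: &ZoneRequest, sled_underlay: Option<Ipv6Addr>) -> Option<Ipv6Addr> {
    // A service-supplied GZ address takes priority, as in the existing code.
    if let Some(gz) = request.gz_addresses.first() {
        return Some(*gz);
    }
    // Route via the sled's underlay address only when the zone can reach it,
    // i.e. when the zone itself has an address on the underlay.
    match (request.zone_underlay_address, sled_underlay) {
        (Some(_), Some(gateway)) => Some(gateway),
        _ => None,
    }
}
```

For a switch zone started with no underlay address, as happens after the reboot described above, `choose_gateway` returns `None` even when the sled's underlay address is known, so no `route add` would be attempted.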
