
DNS propagation broken in newly-deployed systems #6951

@davepacheco

Description


tl;dr: It looks like #6794 changed RSS such that the initial internal DNS configuration has the wrong port number in the SRV records for the internal DNS servers. Since DNS propagation relies on these records, it doesn't work at all, which unfortunately makes it hard to fix a system that has this problem. I do not believe this can ever affect systems deployed prior to #6794, which includes dogfood, colo, and all existing customer systems.


@andrewjstone reported in chat that while testing #6950 (which, for the purposes of this ticket, has only minimal changes from "main"), he had code trying to look up the newly-added clickhouse admin DNS records, but they weren't found, even though Reconfigurator had written them to DNS. He was seeing:

22:15:09.033Z WARN 80018545-6637-4afa-aaec-bc1f4e7a00f9 (ServerContext): Failed to lookup ClickhouseAdminKeeper in internal DNS: no record found for Query { name: Name("_clickhouse-admin-keeper._tcp.control-plane.oxide.internal."), query_type: SRV, query_class: IN }. Is it enabled via policy?
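
For reference, the failing lookup is just an ordinary SRV query against one of the internal DNS servers. Here's a minimal standalone sketch of such a query -- using the hickory-resolver crate (version 0.24 assumed) and tokio rather than omicron's internal resolver, so the crate choice and setup are illustrative, not the code that logged the warning above:

use hickory_resolver::config::{NameServerConfigGroup, ResolverConfig, ResolverOpts};
use hickory_resolver::TokioAsyncResolver;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Point the resolver at one internal DNS server (address taken from the dig below).
    let nameservers = NameServerConfigGroup::from_ips_clear(
        &["fd00:1122:3344:1::1".parse()?],
        53,
        true,
    );
    let resolver = TokioAsyncResolver::tokio(
        ResolverConfig::from_parts(None, vec![], nameservers),
        ResolverOpts::default(),
    );

    // The same name the warning above was querying.
    let srv = resolver
        .srv_lookup("_clickhouse-admin-keeper._tcp.control-plane.oxide.internal.")
        .await?;
    for record in srv.iter() {
        println!("{} port {}", record.target(), record.port());
    }
    Ok(())
}

On a healthy system this would print the five targets on port 8888 shown in the database output further down; here it fails with NXDOMAIN, just like dig does below.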

Querying the DNS servers directly, we were not seeing the records:

root@oxz_switch:~# dig -t SRV _clickhouse-admin-keeper._tcp.control-plane.oxide.internal @fd00:1122:3344:1::1

; <<>> DiG 9.18.14 <<>> -t SRV _clickhouse-admin-keeper._tcp.control-plane.oxide.internal @fd00:1122:3344:1::1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 49813
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;_clickhouse-admin-keeper._tcp.control-plane.oxide.internal. IN SRV

;; Query time: 0 msec
;; SERVER: fd00:1122:3344:1::1#53(fd00:1122:3344:1::1) (UDP)
;; WHEN: Tue Oct 29 22:47:24 UTC 2024
;; MSG SIZE  rcvd: 76

but they were in the database, according to omdb db dns names internal 2 (where 2 is the latest internal DNS generation, found via omdb db dns show):

...
 _clickhouse-admin-keeper._tcp                      (records: 5)
      SRV  port  8888 21935905-6d28-4716-9619-a2f5e541e292.host.control-plane.oxide.internal
      SRV  port  8888 2ed19103-3b7e-4992-a99e-0152186e2546.host.control-plane.oxide.internal
      SRV  port  8888 4d87e4c7-d268-43c5-8f28-8fe6963bc509.host.control-plane.oxide.internal
      SRV  port  8888 77d707ee-1b07-400e-ac4a-aeb54d6250f1.host.control-plane.oxide.internal
      SRV  port  8888 943b471f-7790-43ae-9904-d536c5053781.host.control-plane.oxide.internal

so we looked at DNS propagation status:

root@oxz_switch:~# omdb nexus background-tasks show dns_internal
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:104::5]:12221
task: "dns_config_internal"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 71, triggered by a periodic timer firing
    started at 2024-10-29T22:51:55.341Z (12s ago) and ran for 398ms
    last generation found: 2

task: "dns_servers_internal"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 71, triggered by a periodic timer firing
    started at 2024-10-29T22:51:55.340Z (12s ago) and ran for 2ms
    servers found: 3

      DNS_SERVER_ADDR
      [fd00:1122:3344:1::1]:53
      [fd00:1122:3344:2::1]:53
      [fd00:1122:3344:3::1]:53

task: "dns_propagation_internal"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 73, triggered by a periodic timer firing
    started at 2024-10-29T22:51:55.241Z (12s ago) and ran for 364ms
    attempt to propagate generation: 2

      DNS_SERVER_ADDR          LAST_RESULT
      [fd00:1122:3344:1::1]:53 error (see below)
      [fd00:1122:3344:2::1]:53 error (see below)
      [fd00:1122:3344:3::1]:53 error (see below)

    error: server [fd00:1122:3344:1::1]:53: failed to propagate DNS generation 2 to server [fd00:1122:3344:1::1]:53: Communication Error: error sending request for url (http://[fd00:1122:3344:1::1]:53/config): error sending request for url (http://[fd00:1122:3344:1::1]:53/config): client error (Connect): tcp connect error: Connection refused (os error 146): Connection refused (os error 146)
    error: server [fd00:1122:3344:2::1]:53: failed to propagate DNS generation 2 to server [fd00:1122:3344:2::1]:53: Communication Error: error sending request for url (http://[fd00:1122:3344:2::1]:53/config): error sending request for url (http://[fd00:1122:3344:2::1]:53/config): client error (Connect): tcp connect error: Connection refused (os error 146): Connection refused (os error 146)
    error: server [fd00:1122:3344:3::1]:53: failed to propagate DNS generation 2 to server [fd00:1122:3344:3::1]:53: Communication Error: error sending request for url (http://[fd00:1122:3344:3::1]:53/config): error sending request for url (http://[fd00:1122:3344:3::1]:53/config): client error (Connect): tcp connect error: Connection refused (os error 146): Connection refused (os error 146)

That's interesting -- I've never seen that fail. It's weird to get "Connection refused" -- we might think the DNS server process crashed, except that we'd just successfully queried it. But the port number there is surprising: 53 is the DNS protocol port, not the port where the DNS server serves the HTTP config API that the propagation task is trying to reach. Indeed, in dogfood the addresses are:

task: "dns_servers_internal"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 8808, triggered by a periodic timer firing
    started at 2024-10-29T22:50:18.557Z (49s ago) and ran for 1ms
    servers found: 3

      DNS_SERVER_ADDR            
      [fd00:1122:3344:1::1]:5353 
      [fd00:1122:3344:2::1]:5353 
      [fd00:1122:3344:3::1]:5353 

which makes more sense. So where do these addresses come from? The main place I can think of is inside Reconfigurator:

) => (ServiceName::InternalDns, http_address),

That code uses http_address, which carries the HTTP port, so it should be correct. Are we somehow putting the wrong value into http_address? That value comes from RSS. I found that we weren't putting the wrong value there -- but we are specifying the wrong port in the initial DNS config that RSS generates. This changed in #6794:
https://github.com/oxidecomputer/omicron/pull/6794/files#diff-9ea2b79544fdd0a21914ea354fba0b3670258746b1350d900285445d399861e1R468

Prior to that change, RSS put the HTTP port in there; now it puts the DNS port in there. In retrospect, it makes sense that this is the relevant code path (not the Reconfigurator one): @andrewjstone's system was trying to get to generation 2, which means it must have been at generation 1, and generation 1 is the one generated by RSS.

I'm hopeful this was simply a typo and we can just change the port used in RSS (and that nothing else in that PR depends on this being the DNS port).
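
For what it's worth, here is the invariant I'd expect the fix to restore, as a minimal standalone sketch -- the structs, field names, and helper function are hypothetical stand-ins, not the actual RSS or Reconfigurator code: the SRV record that advertises an internal DNS server has to carry the port of that zone's HTTP config API (the thing dns_propagation_internal talks to), not the port it answers DNS queries on.

use std::net::SocketAddrV6;

// Hypothetical view of one internal DNS zone's two listeners.
struct InternalDnsZone {
    // Where the zone answers DNS queries (port 53 on this system).
    dns_address: SocketAddrV6,
    // Where the zone serves its HTTP config API (port 5353 on dogfood).
    http_address: SocketAddrV6,
}

// Hypothetical stand-in for one SRV record published for ServiceName::InternalDns.
struct SrvRecord {
    target: String,
    port: u16,
}

fn internal_dns_srv(zone_id: &str, zone: &InternalDnsZone) -> SrvRecord {
    SrvRecord {
        target: format!("{zone_id}.host.control-plane.oxide.internal"),
        // The regression described above is equivalent to using
        // zone.dns_address.port() here in the RSS-generated generation 1.
        port: zone.http_address.port(),
    }
}

fn main() {
    let zone = InternalDnsZone {
        dns_address: SocketAddrV6::new("fd00:1122:3344:1::1".parse().unwrap(), 53, 0, 0),
        http_address: SocketAddrV6::new("fd00:1122:3344:1::1".parse().unwrap(), 5353, 0, 0),
    };
    let srv = internal_dns_srv("21935905-6d28-4716-9619-a2f5e541e292", &zone);
    assert_eq!(srv.port, zone.http_address.port());
    println!("DNS queries answered on port {}", zone.dns_address.port());
    println!("SRV {} port {}", srv.target, srv.port);
}

If the fix really is the one-line port swap in RSS, generation 1 would again advertise the HTTP port and the propagation task should be able to reach the servers.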
