Recover from DNS outage on startup #10186

@spinscale

If you start an Elasticsearch node while DNS is failing, the node never recovers: it keeps logging exceptions even after the DNS problem is fixed. The reason is that the UnicastZenPing constructor contains the following code:

        for (String host : hosts) {
            try {
                TransportAddress[] addresses = transportService.addressesFromString(host);
                // we only limit to 1 addresses, makes no sense to ping 100 ports
                for (int i = 0; (i < addresses.length && i < LIMIT_PORTS_COUNT); i++) {
                    configuredTargetNodes.add(new DiscoveryNode(UNICAST_NODE_PREFIX + unicastNodeIdGenerator.incrementAndGet() + "#", addresses[i], version.minimumCompatibilityVersion()));
                }
            } catch (Exception e) {
                throw new ElasticsearchIllegalArgumentException("Failed to resolve address for [" + host + "]", e);
            }
        }
        this.configuredTargetNodes = configuredTargetNodes.toArray(new DiscoveryNode[configuredTargetNodes.size()]);

transportService.addressesFromString(host) creates an InetSocketAddress, which tries to resolve the given hostname at construction time. If that lookup fails, the address is marked as unresolved (InetSocketAddress.isUnresolved() returns true) and stays that way forever. Netty uses this check to decide whether connecting to the endpoint makes sense at all.
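
The sticky state can be seen with plain JDK classes. The following is a minimal sketch, not Elasticsearch code: the class name StickyUnresolvedDemo and the helper reResolve are made up for illustration, and createUnresolved with an IP literal stands in for a lookup that failed at startup so the demo does not depend on a working resolver.

```java
import java.net.InetSocketAddress;

public class StickyUnresolvedDemo {

    /**
     * Hypothetical helper: builds a new address from the host string and port
     * of an existing one. The constructor attempts a new lookup each time.
     */
    static InetSocketAddress reResolve(InetSocketAddress stale) {
        // getHostString() returns the stored host name without a reverse lookup
        return new InetSocketAddress(stale.getHostString(), stale.getPort());
    }

    public static void main(String[] args) {
        // Stand-in for an address whose lookup failed when the node started.
        InetSocketAddress stale = InetSocketAddress.createUnresolved("127.0.0.1", 9300);

        // The instance is immutable, so this never changes for "stale".
        System.out.println("stale unresolved: " + stale.isUnresolved());

        // Only a newly constructed instance gets another chance to resolve.
        InetSocketAddress fresh = reResolve(stale);
        System.out.println("fresh unresolved: " + fresh.isUnresolved());
    }
}
```

Because InetSocketAddress is immutable, keeping the instance created at startup means the unresolved flag is kept as well.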

How to reproduce locally

To reproduce, use this config and disable networking on your system (it works when the network is enabled, as localhost.spinscale.de resolves to 127.0.0.1):

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["localhost.spinscale.de:9300" ]

Fix proposal

  1. First, remove the exception output: catch UnresolvedAddressException in UnicastZenPing.sendPings() and log a single line that describes the problem and includes the hostname.
  2. Make sure the InetSocketAddress and its unresolved state are not cached. Not sure what the best approach is here: either create the InetSocketAddress object before each connect attempt, or maybe there are some configurable properties around this.
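
The two points above could be combined roughly as follows. This is only a sketch under assumptions, not the actual UnicastZenPing change: the class LazyUnicastResolver, its methods, and the wording of the log line are hypothetical, hosts are assumed to be given as "host:port" (no IPv6 literals, no port ranges), and warnings are collected in a list instead of going through a logger to keep the example self-contained.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LazyUnicastResolver {

    /** Parses "host:port" and attempts a new lookup on every call. */
    static InetSocketAddress resolve(String hostAndPort) {
        // Assumption: the string always contains a port after the last colon.
        int colon = hostAndPort.lastIndexOf(':');
        String host = hostAndPort.substring(0, colon);
        int port = Integer.parseInt(hostAndPort.substring(colon + 1));
        return new InetSocketAddress(host, port);
    }

    /**
     * Called once per ping round instead of once in the constructor.
     * Unresolved hosts produce one warning line each and are skipped,
     * so they are simply tried again in the next round.
     */
    static List<InetSocketAddress> targetsForThisRound(List<String> hosts, List<String> warnings) {
        List<InetSocketAddress> targets = new ArrayList<>();
        for (String host : hosts) {
            InetSocketAddress address = resolve(host);
            if (address.isUnresolved()) {
                warnings.add("failed to resolve unicast host [" + host + "], will retry on next ping");
            } else {
                targets.add(address);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();
        List<InetSocketAddress> targets =
                targetsForThisRound(Arrays.asList("127.0.0.1:9300"), warnings);
        System.out.println(targets.size() + " target(s), " + warnings.size() + " warning(s)");
    }
}
```

Regarding configurable properties: the JVM itself caches failed lookups for a short time, controlled by the networkaddress.cache.negative.ttl security property, so a retry right after DNS comes back may still fail once or twice before it succeeds.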
