Distinguish between unresponsive node and unreachable node #72968

@Leaf-Lin

Description

Today, the Elasticsearch logs emit very similar messages for an unresponsive node and an unreachable node.

As an end user, it is not easy to tell whether the problem lies in the network (platform) layer, when the destination is completely unreachable, or in Elasticsearch itself, when a node is overwhelmed with requests and becomes slow to respond.

Some of the relevant logs look like:

[2021-05-07T15:02:28,704][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [elastic-05] collector [cluster_stats] timed out when collecting data
[2021-05-07T15:02:57,757][ERROR][o.e.x.m.c.e.EnrichStatsCollector] [elastic-05] collector [enrich_coordinator_stats] timed out when collecting data
[2021-05-07T15:03:07,786][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [elastic-05] collector [index_recovery] timed out when collecting data
[2021-05-07T15:03:17,801][ERROR][o.e.x.m.c.i.IndexStatsCollector] [elastic-05] collector [index-stats] timed out when collecting data
[2021-05-07T15:16:43,101][WARN ][o.e.c.c.Coordinator      ] [elastic-05] failed to validate incoming join request from node [{elastic-04}{uaH7bAt2TgaLhcKCkxpu6Q}{r3IPD7o4SXarFHKmjlNy0Q}{xx.xx.xx.xx}{10.10.10.5:9300}{dim}{xpack.installed=true}] org.elasticsearch.transport.NodeDisconnectedException: [elastic-04][10.10.10.5:9300][internal:cluster/coordination/join/validate] disconnected
[2021-05-07T15:18:12,085][INFO ][o.e.c.c.C.CoordinatorPublication] [elastic-05] after [10s] publication of cluster state version [536546] is still waiting for {elastic-04}{uaH7bAt2TgaLhcKCkxpu6Q}{r3IPD7o4SXarFHKmjlNy0Q}{xx.xx.xx.xx}{10.10.10.5:9300}{dim}{xpack.installed=true} [SENT_APPLY_COMMIT]
[2021-05-07T15:18:28,006][WARN ][o.e.c.r.a.AllocationService] [elastic-05] failing shard [failed shard, shard [index_v1][0], node[uaH7bAt2TgaLhcKCkxpu6Q], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=eOGoLSllTG6UfYJPYg6cNg], unassigned_info[[reason=ALLOCATION_FAILED], at[2021-05-07T05:18:12.222Z], failed_attempts[4], failed_nodes[[uaH7bAt2TgaLhcKCkxpu6Q]], delayed=false, details[failed shard on node [uaH7bAt2TgaLhcKCkxpu6Q]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[index_v1][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_valid_shard_copy]], message [failed to create shard], failure [IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[index_v1][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], markAsStale [true]]
[2021-05-07T15:49:33,607][INFO ][o.e.c.s.MasterService    ] [elastic-05] node-left[{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{moK_JcYnS1efIYWdNrlBkA}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true} reason: disconnected], term: 219, version: 538374, delta: removed {{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{moK_JcYnS1efIYWdNrlBkA}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true}}
[2021-05-07T15:49:33,661][INFO ][o.e.c.s.ClusterApplierService] [elastic-05] removed {{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{moK_JcYnS1efIYWdNrlBkA}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true}}, term: 219, version: 538374, reason: Publication{term=219, version=538374}
[2021-05-07T15:50:41,662][INFO ][o.e.c.s.MasterService    ] [elastic-05] node-join[{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{l2U2E0UbR96U_8ykelwcDw}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true} join existing leader], term: 219, version: 538375, delta: added {{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{l2U2E0UbR96U_8ykelwcDw}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true}}
[2021-05-07T15:50:42,445][INFO ][o.e.c.s.ClusterApplierService] [elastic-05] added {{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{l2U2E0UbR96U_8ykelwcDw}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true}}, term: 219, version: 538375, reason: Publication{term=219, version=538375}

It would be great if Elasticsearch could intercept this early: stop running some of these checkup services, simply report that the node is unreachable via ping, and retry later.
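To illustrate the distinction being asked for, here is a minimal sketch (not Elasticsearch code — `NodeProbe` and its classification rules are hypothetical) of a TCP-level triage: a connection that is refused or unroutable suggests the node is unreachable at the network layer, while a connection attempt that times out, or succeeds but is then slow to answer, suggests the node may merely be unresponsive/overwhelmed.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class NodeProbe {
    public enum Status { REACHABLE, POSSIBLY_UNRESPONSIVE, UNREACHABLE }

    /**
     * Crude triage of a remote transport port.
     * UNREACHABLE            = connect refused / no route (network-layer problem)
     * POSSIBLY_UNRESPONSIVE  = connect timed out (host may be up but overwhelmed,
     *                          or packets are silently dropped -- ambiguous)
     * REACHABLE              = TCP connection established within the timeout
     */
    public static Status probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return Status.REACHABLE;
        } catch (SocketTimeoutException e) {
            // Caught before IOException: SocketTimeoutException is a subclass.
            return Status.POSSIBLY_UNRESPONSIVE;
        } catch (IOException e) {
            // e.g. ConnectException (refused) or NoRouteToHostException.
            return Status.UNREACHABLE;
        }
    }

    public static void main(String[] args) {
        // Port 1 on localhost is almost certainly closed, so the connection
        // is refused immediately rather than timing out.
        System.out.println(probe("127.0.0.1", 1, 500));
    }
}
```

A check along these lines, run before the periodic collectors, could let Elasticsearch log a single "node unreachable, retrying" message instead of a burst of per-collector timeout errors.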

Metadata


    Labels

    :Distributed Indexing/Distributed, >enhancement, Team:Distributed (Obsolete), team-discuss
