Labels: :Distributed Indexing/Distributed, >enhancement, Team:Distributed (Obsolete), team-discuss
Description
Today, Elasticsearch logs emit very similar messages for an unresponsive node and an unreachable node.
As an end user, it is not easy to tell whether the problem lies in the network (platform) layer, where the destination is completely unreachable, or in Elasticsearch itself, where a node is overwhelmed with requests and becomes slow to respond.
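For context, the two cases do surface differently at the transport layer: a hard connection failure arrives as a ConnectTransportException or NodeDisconnectedException (as in the join-validation log below), while an overloaded-but-reachable node typically shows up as a ReceiveTimeoutTransportException. A minimal sketch of a classifier along those lines follows; the exception classes are the real ones from org.elasticsearch.transport, but the helper itself is purely illustrative and not part of Elasticsearch.

// Hypothetical helper: classifies a transport-layer failure so a log message could state
// whether the destination node was unreachable or merely slow to respond.
// The exception types are from org.elasticsearch.transport; the classifier is illustrative only.
import org.elasticsearch.transport.ConnectTransportException;
import org.elasticsearch.transport.NodeDisconnectedException;
import org.elasticsearch.transport.ReceiveTimeoutTransportException;

final class NodeFailureClassifier {

    enum Cause { UNREACHABLE, SLOW_TO_RESPOND, UNKNOWN }

    static Cause classify(Throwable t) {
        if (t instanceof NodeDisconnectedException || t instanceof ConnectTransportException) {
            // Connection-level failure: the destination could not be reached at all.
            return Cause.UNREACHABLE;
        }
        if (t instanceof ReceiveTimeoutTransportException) {
            // The node accepted the request but did not answer in time, i.e. it is likely overloaded.
            return Cause.SLOW_TO_RESPOND;
        }
        return Cause.UNKNOWN;
    }
}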
Some of the relevant logs look like:
[2021-05-07T15:02:28,704][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [elastic-05] collector [cluster_stats] timed out when collecting data
[2021-05-07T15:02:57,757][ERROR][o.e.x.m.c.e.EnrichStatsCollector] [elastic-05] collector [enrich_coordinator_stats] timed out when collecting data
[2021-05-07T15:03:07,786][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [elastic-05] collector [index_recovery] timed out when collecting data
[2021-05-07T15:03:17,801][ERROR][o.e.x.m.c.i.IndexStatsCollector] [elastic-05] collector [index-stats] timed out when collecting data
[2021-05-07T15:16:43,101][WARN ][o.e.c.c.Coordinator ] [elastic-05] failed to validate incoming join request from node [{elastic-04}{uaH7bAt2TgaLhcKCkxpu6Q}{r3IPD7o4SXarFHKmjlNy0Q}{xx.xx.xx.xx}{10.10.10.5:9300}{dim}{xpack.installed=true}] org.elasticsearch.transport.NodeDisconnectedException: [elastic-04][10.10.10.5:9300][internal:cluster/coordination/join/validate] disconnected
[2021-05-07T15:18:12,085][INFO ][o.e.c.c.C.CoordinatorPublication] [elastic-05] after [10s] publication of cluster state version [536546] is still waiting for {elastic-04}{uaH7bAt2TgaLhcKCkxpu6Q}{r3IPD7o4SXarFHKmjlNy0Q}{xx.xx.xx.xx}{10.10.10.5:9300}{dim}{xpack.installed=true} [SENT_APPLY_COMMIT]
[2021-05-07T15:18:28,006][WARN ][o.e.c.r.a.AllocationService] [elastic-05] failing shard [failed shard, shard [index_v1][0], node[uaH7bAt2TgaLhcKCkxpu6Q], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=eOGoLSllTG6UfYJPYg6cNg], unassigned_info[[reason=ALLOCATION_FAILED], at[2021-05-07T05:18:12.222Z], failed_attempts[4], failed_nodes[[uaH7bAt2TgaLhcKCkxpu6Q]], delayed=false, details[failed shard on node [uaH7bAt2TgaLhcKCkxpu6Q]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[index_v1][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_valid_shard_copy]], message [failed to create shard], failure [IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[index_v1][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], markAsStale [true]]
[2021-05-07T15:49:33,607][INFO ][o.e.c.s.MasterService ] [elastic-05] node-left[{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{moK_JcYnS1efIYWdNrlBkA}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true} reason: disconnected], term: 219, version: 538374, delta: removed {{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{moK_JcYnS1efIYWdNrlBkA}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true}}
[2021-05-07T15:49:33,661][INFO ][o.e.c.s.ClusterApplierService] [elastic-05] removed {{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{moK_JcYnS1efIYWdNrlBkA}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true}}, term: 219, version: 538374, reason: Publication{term=219, version=538374}
[2021-05-07T15:50:41,662][INFO ][o.e.c.s.MasterService ] [elastic-05] node-join[{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{l2U2E0UbR96U_8ykelwcDw}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true} join existing leader], term: 219, version: 538375, delta: added {{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{l2U2E0UbR96U_8ykelwcDw}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true}}
[2021-05-07T15:50:42,445][INFO ][o.e.c.s.ClusterApplierService] [elastic-05] added {{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{l2U2E0UbR96U_8ykelwcDw}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true}}, term: 219, version: 538375, reason: Publication{term=219, version=538375}
It would be great if Elasticsearch could intercept this condition early: stop running some of these check-up services, report simply that the node is unreachable via ping, and retry later.
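A minimal sketch of that idea, assuming a hypothetical Collector, NodePinger, and RetryScheduler (the names and interfaces below are illustrative stand-ins, not Elasticsearch's real internal APIs): before kicking off the expensive collectors, issue one cheap reachability check against the target node; if it fails, log a single "node unreachable" message and schedule a retry, instead of letting every collector time out and log its own error.

// Illustrative sketch only: Collector, NodePinger and RetryScheduler are hypothetical
// stand-ins for whatever internal abstractions would carry this behaviour.
import java.time.Duration;
import java.util.List;

interface Collector {
    String name();
    void collect();              // may block until its own timeout expires
}

interface NodePinger {
    boolean isReachable(String nodeId, Duration timeout);  // cheap transport-level ping
}

interface RetryScheduler {
    void retryLater(Runnable task, Duration delay);
}

final class GuardedCollectionRunner {
    private final NodePinger pinger;
    private final RetryScheduler scheduler;

    GuardedCollectionRunner(NodePinger pinger, RetryScheduler scheduler) {
        this.pinger = pinger;
        this.scheduler = scheduler;
    }

    /** Run collectors only if the target node answers a quick ping; otherwise report once and retry later. */
    void runAll(String nodeId, List<Collector> collectors) {
        if (pinger.isReachable(nodeId, Duration.ofSeconds(1)) == false) {
            // One clear message instead of a separate "timed out when collecting data" error per collector.
            System.out.printf("node [%s] is unreachable at the transport layer; skipping %d collectors%n",
                    nodeId, collectors.size());
            scheduler.retryLater(() -> runAll(nodeId, collectors), Duration.ofSeconds(30));
            return;
        }
        for (Collector collector : collectors) {
            collector.collect();
        }
    }
}

The point of the sketch is only the ordering: one reachability check and one clear "unreachable" log line up front, rather than a stream of near-identical timeout errors from each collector while the node is down.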