Thread leak with autodiscover after adding / removing a master node #8057

@fdv

I've had an epic bug leading to a thread leak (and ES dying because it had no more RAM to create more threads). Whole story here.

Bug description

A 4-node ES cluster on EC2: 1 routing node and 3 data nodes. The routing node acts as master (configuration provided below).

After I added and then removed a master node, one of the data nodes kept sending auto discovery requests to the node that was gone. This caused the remaining routing node to create about 1 thread per second (sending auto discovery requests) and never close them.
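For anyone trying to confirm a similar leak, one option is to sample the process's thread count over time and estimate the growth rate. The following is only a minimal sketch, assuming a Linux host where `/proc/<pid>/status` exposes a `Threads:` line; the helper names (`parse_thread_count`, `growth_rate`, `sample_process`) are hypothetical and were not part of the original report.

```python
import re
import time
from pathlib import Path


def parse_thread_count(status_text):
    """Return the value of the 'Threads:' line from /proc/<pid>/status text."""
    match = re.search(r"^Threads:\s+(\d+)", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Threads:' line found")
    return int(match.group(1))


def growth_rate(samples):
    """Average threads created per second between the first and last sample.

    `samples` is a list of (timestamp_seconds, thread_count) tuples.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must increase")
    return (n1 - n0) / (t1 - t0)


def sample_process(pid, count=5, interval=10.0):
    """Poll /proc/<pid>/status `count` times, `interval` seconds apart (Linux only)."""
    samples = []
    for i in range(count):
        text = Path(f"/proc/{pid}/status").read_text()
        samples.append((time.time(), parse_thread_count(text)))
        if i < count - 1:
            time.sleep(interval)
    return samples
```

Calling `growth_rate(sample_process(pid))` with the PID of the ES process on the routing node should give a value that stays close to 1.0 if the behaviour matches the roughly one thread per second described above.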

Configuration

  • ES 1.0.1 (old, I know)
  • ES Transport Thrift: elasticsearch-transport-thrift-2.0.0
  • AWS cloud plugin: cloud-aws-2.0.0

Routing node

bootstrap:
  mlockall: true
cloud:
  aws:
    access_key: something
    region: us-east-1
    secret_key: something
cluster:
  name: robots
discovery:
  ec2:
    ping_timeout: 360
    tag:
      Cluster: production
  type: ec2
  zen:
    minimum_master_nodes: 1
gateway:
  expected_nodes: 4
  recover_after_nodes: 4
  recover_after_time: 5m
http:
  max_content_length: 100mb
index:
  query:
    bool:
      max_clause_count: 1000000
  refresh_interval: 300
  store:
    type: mmapfs
indices:
  fielddata:
    cache:
      expire: 10m
      size: 30%
  memory:
    index_buffer_size: 10%
network:
  host: 0.0.0.0
node:
  data: false
  master: true
  name: something
path:
  data: /mnt/elasticsearch
  logs: /var/log/elasticsearch

Data nodes

bootstrap:
  mlockall: true
cloud:
  aws:
    access_key: something
    region: us-east-1
    secret_key: something
cluster:
  name: robots
discovery:
  ec2:
    ping_timeout: 360
    tag:
      Cluster: production
  type: ec2
  zen:
    minimum_master_nodes: 1
gateway:
  expected_nodes: 4
  recover_after_nodes: 4
  recover_after_time: 5m
http:
  max_content_length: 100mb
index:
  query:
    bool:
      max_clause_count: 1000000
  refresh_interval: 300
  store:
    type: mmapfs
indices:
  fielddata:
    cache:
      expire: 10m
      size: 30%
  memory:
    index_buffer_size: 10%
network:
  host: 0.0.0.0
node:
  data: true
  master: false
  name: something
path:
  data: /mnt/elasticsearch
  logs: /var/log/elasticsearch
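
The two listings above differ only in the `node.data` and `node.master` settings. As a quick way to check this kind of thing on larger configs, here is a small sketch that flattens two nested settings dicts and reports the differing keys. The function names are hypothetical, and the dicts reproduce only a fragment of the configs shown above (loading the full YAML would need a parser, which is left out here).

```python
def flatten(config, prefix=""):
    """Flatten a nested dict into {'a.b.c': value} form."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_configs(left, right):
    """Return {path: (left_value, right_value)} for every differing setting."""
    a, b = flatten(left), flatten(right)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(a.keys() | b.keys())
        if a.get(path) != b.get(path)
    }


# Fragment of the routing node and data node settings from the listings above.
routing = {"cluster": {"name": "robots"}, "node": {"data": False, "master": True}}
data = {"cluster": {"name": "robots"}, "node": {"data": True, "master": False}}

print(diff_configs(routing, data))
# {'node.data': (False, True), 'node.master': (True, False)}
```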

JVM

  • Xmx and Xms both set to 4G
  • no fancy GC tuning

Tell me if you need anything else.