More graceful handling of failure in DescribeInstances (ec2-discovery) #28452

@azsolinsky

If the DescribeInstances call from the EC2 Discovery plugin fails for any reason, the code just returns an empty list of nodes. This is bad for two reasons: the code currently caches that list until the refresh interval expires, and it uses the empty list of nodes immediately and will try the call again on the next get, potentially without any retry back-off.

https://github.com/elastic/elasticsearch/blob/139deb535a58de87c602888a121b2791bcd22df2/plugins/discovery-ec2/src/main/java/org/elasticsearch/discovery/ec2/AwsEc2UnicastHostsProvider.java#L106:L119

~~With the default refresh of 10s this is sometimes not catastrophic; however, if throttling is happening a lot it can potentially cause the masters to not be able to communicate with one another and lead to cluster instability.~~

Also, with this bug, increasing the refresh interval is dangerous because the empty results list is cached until the refresh interval expires. The code should probably not return an empty list when it is being throttled; it could instead continue to use the list from the last successful call, or possibly retry with exponential back-off on throttling exceptions.
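
To make the suggestion concrete, here is a minimal sketch of the "keep the last successful list, retry with exponential back-off" idea. It is not the plugin's code: `LastGoodHostsCache`, `getHosts`, `maxAttempts` and `baseBackoffMillis` are hypothetical names, and a plain `Supplier` stands in for the real DescribeInstances call (assumed here to throw a `RuntimeException` on failure or throttling).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical wrapper for illustration only; not part of discovery-ec2. */
final class LastGoodHostsCache {
    // Stand-in for the DescribeInstances call; assumed to throw on failure.
    private final Supplier<List<String>> describeInstances;
    private final int maxAttempts;
    private final long baseBackoffMillis;
    // Result of the last successful call; empty until one call succeeds.
    private List<String> lastGood = Collections.emptyList();

    LastGoodHostsCache(Supplier<List<String>> describeInstances,
                       int maxAttempts, long baseBackoffMillis) {
        this.describeInstances = describeInstances;
        this.maxAttempts = maxAttempts;
        this.baseBackoffMillis = baseBackoffMillis;
    }

    synchronized List<String> getHosts() {
        long backoff = baseBackoffMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                List<String> fresh = describeInstances.get();
                // Only a successful call replaces the cached list.
                lastGood = Collections.unmodifiableList(new ArrayList<>(fresh));
                return lastGood;
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) {
                    break; // out of attempts, fall back to the last good list
                }
                try {
                    Thread.sleep(backoff);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
                backoff *= 2; // double the wait before the next attempt
            }
        }
        // Every attempt failed: return the previous result, not an empty list.
        return lastGood;
    }
}
```

With something like this, a throttled call would only produce an empty list if no call had ever succeeded. A real fix would presumably also want to retry only on throttling errors and add jitter to the back-off.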
