Hello.
I had an incident related to the APM Lambda layer in a production environment:
The Elastic instance failed because of low disk space and was unable to process incoming data. During this incident, the Lambda function seems to have started holding each execution open until the maximum allowed time (32s), which correlated with a number of MongoDB connection errors: `Client network socket disconnected before secure TLS connection was established`.
After further investigation, we concluded that a possible reason for this was an increase in open connections to the Mongo instance, which overloaded it.
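To illustrate the suspected mechanism, here is a rough sketch based on Little's law (average concurrency ≈ request rate × average duration). The request rate and the normal duration are hypothetical numbers chosen for illustration, not measurements from the incident; only the 32s ceiling comes from what we observed. It also assumes each concurrent execution environment holds its own MongoDB connection.

```typescript
// Little's law: average concurrent invocations ≈ arrival rate × average duration.
// Assumption: every concurrent execution environment keeps its own MongoDB connection.
function estimateConcurrency(requestsPerSecond: number, durationMs: number): number {
  return Math.ceil((requestsPerSecond * durationMs) / 1000);
}

// Hypothetical traffic and baseline duration, for illustration only.
const requestsPerSecond = 50;
const normalDurationMs = 200;

// Observed during the incident: invocations ran up to the maximum allowed time.
const stalledDurationMs = 32000;

const normal = estimateConcurrency(requestsPerSecond, normalDurationMs);
const stalled = estimateConcurrency(requestsPerSecond, stalledDurationMs);

console.log({ normal, stalled });
```

Under these assumed numbers, the same traffic would need far more concurrent execution environments once every invocation is held until the timeout, which would match a sudden jump in connections to the Mongo instance.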
As you can see, the issues started appearing around the same time the instance went down, and were only fixed after redeploying the Lambda function without APM enabled.
Is there any way the Lambda layer was keeping the execution environment alive while trying to send data to the Elastic instance, given that the `ELASTIC_APM_DATA_RECEIVER_TIMEOUT_SECONDS` variable defaulted to 15s and the Elastic server returned 504?