-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Description
Describe the bug
I ran into issues with TLS certs being regenerated due to these bugs:
Once the TLS certs changed, the MutatingWebhook for PodReadinessGate started failing and blocking the rollout of pods on services using this feature.
This was the error:
Error creating: Internal error occurred: failed calling webhook "mpod.elbv2.k8s.aws": Post "https://aws-lb-controller-webhook-service.kube-system.svc:443/mutate-v1-pod?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")
I think this exposes an availability concern because if all the pods backing a service get rescheduled while the mutatingwebhook is broken, the service will go down. My understanding is the PodReadinessGate is a bonus feature to make rollouts more smooth in Kubernetes and I think it'd be preferable for the feature to just not work rather than block rollouts all together.
Steps to reproduce
Break TLS certs on the LB controller while using PodReadinessGates, then reschedule pods backing an LB in that namespace.
Expected outcome
I'd like to either be able to configure the webhooks failure policy or set it to fail open.
Environment
- AWS Load Balancer controller version: 2.4.1
- Kubernetes version: 1.21
- Using EKS (yes/no), if so version? Yes, platform version 7
Additional Context: