diff --git a/.yamllint.yaml b/.yamllint.yaml index c199e193b1..f3306bcd95 100644 --- a/.yamllint.yaml +++ b/.yamllint.yaml @@ -41,6 +41,7 @@ rules: .github/workflows/ deploy/manifests/nginx-gateway.yaml deploy/manifests/crds + tests/longevity/manifests/cronjob.yaml new-line-at-end-of-file: enable new-lines: enable octal-values: disable diff --git a/tests/longevity/longevity.md b/tests/longevity/longevity.md new file mode 100644 index 0000000000..d1c9c2c8dc --- /dev/null +++ b/tests/longevity/longevity.md @@ -0,0 +1,149 @@ +# Longevity Test + +This document describes how we test NGF for longevity. + + + +- [Longevity Test](#longevity-test) + - [Goals](#goals) + - [Test Environment](#test-environment) + - [Steps](#steps) + - [Start](#start) + - [Check the Test is Running Correctly](#check-the-test-is-running-correctly) + - [End](#end) + - [Analyze](#analyze) + - [Results](#results) + + + +## Goals + +- Ensure that NGF successfully processes both control plane and data plane transactions over a period of time much + greater than in our other tests. +- Catch bugs that could only appear over a period of time (like resource leaks). + +## Test Environment + +- A Kubernetes cluster with 3 nodes on GKE + - Node: e2-medium (2 vCPU, 4GB memory) + - Enabled GKE logging. + - Enabled GKE Cloud Monitoring with the managed Prometheus service, with the following enabled: + - system. + - kube state - pods, deployments. +- Tester VMs on Google Cloud: + - Configuration: + - Debian + - Install packages: tmux, wrk + - Location - same zone as the Kubernetes cluster. + - First VM - for sending HTTP traffic + - Second VM - for sending HTTPS traffic +- NGF + - Deployment with 1 replica + - Exposed via a Service with type LoadBalancer, private IP + - Gateway, two listeners - HTTP and HTTPS + - Two apps: + - Coffee - 3 replicas + - Tea - 3 replicas + - Two HTTPRoutes + - Coffee (HTTP) + - Tea (HTTPS) + +## Steps + +### Start + +Test duration - 4 days. + +1. Create a Kubernetes cluster on GKE. +2. Deploy NGF. +3. Expose NGF via a LoadBalancer Service with the `"networking.gke.io/load-balancer-type":"Internal"` annotation to + allocate an internal load balancer. +4. Apply the manifests, which will: + 1. Deploy the coffee and tea backends. + 2. Configure HTTP and HTTPS listeners on the Gateway. + 3. Expose coffee via the HTTP listener and tea via the HTTPS listener. + 4. Create two CronJobs to re-roll out the backends: + 1. Coffee - every minute for an hour every 6 hours + 2. Tea - every minute for an hour every 6 hours, 3 hours apart from coffee. + 5. Configure Prometheus on GKE to pick up NGF metrics. + + ```shell + kubectl apply -f files + ``` + +5. In the Tester VMs, update `/etc/hosts` to have an entry with the External IP of the NGF Service (`10.128.0.10` in this + case): + + ```text + 10.128.0.10 cafe.example.com + ``` + +6. In the Tester VMs, start a tmux session (this is needed so that even if you disconnect from the VM, any launched command + will keep running): + + ```shell + tmux + ``` + +7. In the First VM, start wrk for 4 days for coffee via HTTP: + + ```shell + wrk -t2 -c100 -d96h http://cafe.example.com/coffee + ``` + +8. In the Second VM, start wrk for 4 days for tea via HTTPS: + + ```shell + wrk -t2 -c100 -d96h https://cafe.example.com/tea + ``` + +Notes: + +- The updated coffee and tea backends in cafe.yaml include extra configuration for zero-downtime upgrades, so that + wrk in the Tester VMs doesn't get 502s from NGF. Based on https://learnk8s.io/graceful-shutdown + +### Check the Test is Running Correctly + +Check that you don't see any errors: + +1. 
Check that GKE exports NGF pod logs to Google Cloud Operations Logging and Prometheus metrics to Google Cloud + Monitoring. +2. Check that traffic is flowing - look at the access logs of NGINX in Google Cloud Operations Logging. +3. Check that CronJob can run. + + ```shell + kubectl create job --from=cronjob/coffee-rollout-mgr coffee-test + kubectl create job --from=cronjob/tea-rollout-mgr tea-test + ``` + +In case of errors, double check if you prepared the environment and launched the test correctly. + +### End + +- Remove CronJobs. + +## Analyze + +- Traffic + - Tester VMs (clients) + - As wrk stop, they will print output upon termination. To connect to the tmux session with wrk, + run `tmux attach -t 0` + - Check for errors, latency, RPS +- Logs + - Check the logs for errors in Google Cloud Operations Logging. + - NGF + - NGINX +- Check metrics in Google Cloud Monitoring. + - NGF + - CPU usage + - NGINX + - NGF + - Memory usage + - NGINX + - NGF + - NGINX metrics + - Reloads + +## Results + +- [1.0.0](results/1.0.0/1.0.0.md) diff --git a/tests/longevity/manifests/cafe-routes.yaml b/tests/longevity/manifests/cafe-routes.yaml new file mode 100644 index 0000000000..e679756d6e --- /dev/null +++ b/tests/longevity/manifests/cafe-routes.yaml @@ -0,0 +1,37 @@ +apiVersion: gateway.networking.k8s.io/v1beta1 +kind: HTTPRoute +metadata: + name: coffee +spec: + parentRefs: + - name: gateway + sectionName: http + hostnames: + - "cafe.example.com" + rules: + - matches: + - path: + type: PathPrefix + value: /coffee + backendRefs: + - name: coffee + port: 80 +--- +apiVersion: gateway.networking.k8s.io/v1beta1 +kind: HTTPRoute +metadata: + name: tea +spec: + parentRefs: + - name: gateway + sectionName: https + hostnames: + - "cafe.example.com" + rules: + - matches: + - path: + type: PathPrefix + value: /tea + backendRefs: + - name: tea + port: 80 diff --git a/tests/longevity/manifests/cafe-secret.yaml b/tests/longevity/manifests/cafe-secret.yaml new file mode 100644 index 0000000000..4510460bba --- /dev/null +++ b/tests/longevity/manifests/cafe-secret.yaml @@ -0,0 +1,8 @@ +apiVersion: v1 +kind: Secret +metadata: + name: cafe-secret +type: kubernetes.io/tls +data: + tls.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUNzakNDQVpvQ0NRQzdCdVdXdWRtRkNEQU5CZ2txaGtpRzl3MEJBUXNGQURBYk1Sa3dGd1lEVlFRRERCQmoKWVdabExtVjRZVzF3YkdVdVkyOXRNQjRYRFRJeU1EY3hOREl4TlRJek9Wb1hEVEl6TURjeE5ESXhOVEl6T1ZvdwpHekVaTUJjR0ExVUVBd3dRWTJGbVpTNWxlR0Z0Y0d4bExtTnZiVENDQVNJd0RRWUpLb1pJaHZjTkFRRUJCUUFECmdnRVBBRENDQVFvQ2dnRUJBTHFZMnRHNFc5aStFYzJhdnV4Q2prb2tnUUx1ek10U1Rnc1RNaEhuK3ZRUmxIam8KVzFLRnMvQVdlS25UUStyTWVKVWNseis4M3QwRGtyRThwUisxR2NKSE50WlNMb0NEYUlRN0Nhck5nY1daS0o4Qgo1WDNnVS9YeVJHZjI2c1REd2xzU3NkSEQ1U2U3K2Vab3NPcTdHTVF3K25HR2NVZ0VtL1Q1UEMvY05PWE0zZWxGClRPL051MStoMzROVG9BbDNQdTF2QlpMcDNQVERtQ0thaEROV0NWbUJQUWpNNFI4VERsbFhhMHQ5Z1o1MTRSRzUKWHlZWTNtdzZpUzIrR1dYVXllMjFuWVV4UEhZbDV4RHY0c0FXaGRXbElweHlZQlNCRURjczN6QlI2bFF1OWkxZAp0R1k4dGJ3blVmcUVUR3NZdWxzc05qcU95V1VEcFdJelhibHhJZVVDQXdFQUFUQU5CZ2txaGtpRzl3MEJBUXNGCkFBT0NBUUVBcjkrZWJ0U1dzSnhLTGtLZlRkek1ISFhOd2Y5ZXFVbHNtTXZmMGdBdWVKTUpUR215dG1iWjlpbXQKL2RnWlpYVE9hTElHUG9oZ3BpS0l5eVVRZVdGQ2F0NHRxWkNPVWRhbUloOGk0Q1h6QVJYVHNvcUNOenNNLzZMRQphM25XbFZyS2lmZHYrWkxyRi8vblc0VVNvOEoxaCtQeDljY0tpRDZZU0RVUERDRGh1RUtFWXcvbHpoUDJVOXNmCnl6cEJKVGQ4enFyM3paTjNGWWlITmgzYlRhQS82di9jU2lyamNTK1EwQXg4RWpzQzYxRjRVMTc4QzdWNWRCKzQKcmtPTy9QNlA0UFlWNTRZZHMvRjE2WkZJTHFBNENCYnExRExuYWRxamxyN3NPbzl2ZzNnWFNMYXBVVkdtZ2todAp6VlZPWG1mU0Z4OS90MDBHUi95bUdPbERJbWlXMGc9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg== + tls.key: 
LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0tLS0tCk1JSUV2UUlCQURBTkJna3Foa2lHOXcwQkFRRUZBQVNDQktjd2dnU2pBZ0VBQW9JQkFRQzZtTnJSdUZ2WXZoSE4KbXI3c1FvNUtKSUVDN3N6TFVrNExFeklSNS9yMEVaUjQ2RnRTaGJQd0ZuaXAwMFBxekhpVkhKYy92TjdkQTVLeApQS1VmdFJuQ1J6YldVaTZBZzJpRU93bXF6WUhGbVNpZkFlVjk0RlAxOGtSbjl1ckV3OEpiRXJIUncrVW51L25tCmFMRHF1eGpFTVBweGhuRklCSnYwK1R3djNEVGx6TjNwUlV6dnpidGZvZCtEVTZBSmR6N3Rid1dTNmR6MHc1Z2kKbW9RelZnbFpnVDBJek9FZkV3NVpWMnRMZllHZWRlRVJ1VjhtR041c09va3R2aGxsMU1udHRaMkZNVHgySmVjUQo3K0xBRm9YVnBTS2NjbUFVZ1JBM0xOOHdVZXBVTHZZdFhiUm1QTFc4SjFINmhFeHJHTHBiTERZNmpzbGxBNlZpCk0xMjVjU0hsQWdNQkFBRUNnZ0VBQnpaRE50bmVTdWxGdk9HZlFYaHRFWGFKdWZoSzJBenRVVVpEcUNlRUxvekQKWlV6dHdxbkNRNlJLczUyandWNTN4cU9kUU94bTNMbjNvSHdNa2NZcEliWW82MjJ2dUczYnkwaVEzaFlsVHVMVgpqQmZCcS9UUXFlL2NMdngvSkczQWhFNmJxdFRjZFlXeGFmTmY2eUtpR1dzZk11WVVXTWs4MGVJVUxuRmZaZ1pOCklYNTlSOHlqdE9CVm9Sa3hjYTVoMW1ZTDFsSlJNM3ZqVHNHTHFybmpOTjNBdWZ3ZGRpK1VDbGZVL2l0K1EvZkUKV216aFFoTlRpNVFkRWJLVStOTnYvNnYvb2JvandNb25HVVBCdEFTUE05cmxFemIralQ1WHdWQjgvLzRGY3VoSwoyVzNpcjhtNHVlQ1JHSVlrbGxlLzhuQmZ0eVhiVkNocVRyZFBlaGlPM1FLQmdRRGlrR3JTOTc3cjg3Y1JPOCtQClpoeXltNXo4NVIzTHVVbFNTazJiOTI1QlhvakpZL2RRZDVTdFVsSWE4OUZKZnNWc1JRcEhHaTFCYzBMaTY1YjIKazR0cE5xcVFoUmZ1UVh0UG9GYXRuQzlPRnJVTXJXbDVJN0ZFejZnNkNQMVBXMEg5d2hPemFKZUdpZVpNYjlYTQoybDdSSFZOcC9jTDlYbmhNMnN0Q1lua2Iwd0tCZ1FEUzF4K0crakEyUVNtRVFWNXA1RnRONGcyamsyZEFjMEhNClRIQ2tTazFDRjhkR0Z2UWtsWm5ZbUt0dXFYeXNtekJGcnZKdmt2eUhqbUNYYTducXlpajBEdDZtODViN3BGcVAKQWxtajdtbXI3Z1pUeG1ZMXBhRWFLMXY4SDNINGtRNVl3MWdrTWRybVJHcVAvaTBGaDVpaGtSZS9DOUtGTFVkSQpDcnJjTzhkUVp3S0JnSHA1MzRXVWNCMVZibzFlYStIMUxXWlFRUmxsTWlwRFM2TzBqeWZWSmtFb1BZSEJESnp2ClIrdzZLREJ4eFoyWmJsZ05LblV0YlhHSVFZd3lGelhNcFB5SGxNVHpiZkJhYmJLcDFyR2JVT2RCMXpXM09PRkgKcmppb21TUm1YNmxhaDk0SjRHU0lFZ0drNGw1SHhxZ3JGRDZ2UDd4NGRjUktJWFpLZ0w2dVJSSUpBb0dCQU1CVApaL2p5WStRNTBLdEtEZHUrYU9ORW4zaGxUN3hrNXRKN3NBek5rbWdGMU10RXlQUk9Xd1pQVGFJbWpRbk9qbHdpCldCZ2JGcXg0M2ZlQ1Z4ZXJ6V3ZEM0txaWJVbWpCTkNMTGtYeGh3ZEVteFQwVit2NzZGYzgwaTNNYVdSNnZZR08KditwVVovL0F6UXdJcWZ6dlVmV2ZxdStrMHlhVXhQOGNlcFBIRyt0bEFvR0FmQUtVVWhqeFU0Ym5vVzVwVUhKegpwWWZXZXZ5TW54NWZyT2VsSmRmNzlvNGMvMHhVSjh1eFBFWDFkRmNrZW96dHNpaVFTNkN6MENRY09XVWxtSkRwCnVrdERvVzM3VmNSQU1BVjY3NlgxQVZlM0UwNm5aL2g2Tkd4Z28rT042Q3pwL0lkMkJPUm9IMFAxa2RjY1NLT3kKMUtFZlNnb1B0c1N1eEpBZXdUZmxDMXc9Ci0tLS0tRU5EIFBSSVZBVEUgS0VZLS0tLS0K diff --git a/tests/longevity/manifests/cafe.yaml b/tests/longevity/manifests/cafe.yaml new file mode 100644 index 0000000000..c95bcfb2d0 --- /dev/null +++ b/tests/longevity/manifests/cafe.yaml @@ -0,0 +1,81 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: coffee +spec: + replicas: 3 + selector: + matchLabels: + app: coffee + template: + metadata: + labels: + app: coffee + spec: + containers: + - name: coffee + image: nginxdemos/nginx-hello:plain-text + ports: + - containerPort: 8080 + readinessProbe: + httpGet: + path: / + port: 8080 + lifecycle: + preStop: + exec: + command: ["/bin/sleep", "15"] +--- +apiVersion: v1 +kind: Service +metadata: + name: coffee +spec: + ports: + - port: 80 + targetPort: 8080 + protocol: TCP + name: http + selector: + app: coffee +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: tea +spec: + replicas: 3 + selector: + matchLabels: + app: tea + template: + metadata: + labels: + app: tea + spec: + containers: + - name: tea + image: nginxdemos/nginx-hello:plain-text + ports: + - containerPort: 8080 + readinessProbe: + httpGet: + path: / + port: 8080 + lifecycle: + preStop: + exec: + command: ["/bin/sleep", "15"] +--- +apiVersion: v1 +kind: Service +metadata: + name: tea +spec: + ports: + - port: 80 + targetPort: 
8080 + protocol: TCP + name: http + selector: + app: tea diff --git a/tests/longevity/manifests/cronjob.yaml b/tests/longevity/manifests/cronjob.yaml new file mode 100644 index 0000000000..234ff903d8 --- /dev/null +++ b/tests/longevity/manifests/cronjob.yaml @@ -0,0 +1,92 @@ +apiVersion: v1 +kind: ServiceAccount +metadata: + name: rollout-mgr + namespace: default +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + name: rollout-mgr + namespace: default +rules: +- apiGroups: + - "apps" + resources: + - deployments + verbs: + - patch +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: rollout-mgr + namespace: default +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: Role + name: rollout-mgr +subjects: +- kind: ServiceAccount + name: rollout-mgr + namespace: default +--- +apiVersion: batch/v1 +kind: CronJob +metadata: + name: coffee-rollout-mgr + namespace: default +spec: + schedule: "* */6 * * *" # every minute every 6 hours + jobTemplate: + spec: + template: + spec: + serviceAccountName: rollout-mgr + containers: + - name: coffee-rollout-mgr + image: curlimages/curl:8.3.0 + imagePullPolicy: IfNotPresent + command: + - /bin/sh + - -c + args: + - | + TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) + RESTARTED_AT=$(date -u +"%Y-%m-%dT%H:%M:%SZ") + curl -X PATCH -s -k -v \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-type: application/merge-patch+json" \ + --data-raw "{\"spec\": {\"template\": {\"metadata\": {\"annotations\": {\"kubectl.kubernetes.io/restartedAt\": \"$RESTARTED_AT\"}}}}}" \ + "https://kubernetes/apis/apps/v1/namespaces/default/deployments/coffee?fieldManager=kubectl-rollout" 2>&1 + restartPolicy: OnFailure +--- +apiVersion: batch/v1 +kind: CronJob +metadata: + name: tea-rollout-mgr + namespace: default +spec: + schedule: "* 3,9,15,21 * * *" # every minute every 6 hours, 3 hours apart from coffee + jobTemplate: + spec: + template: + spec: + serviceAccountName: rollout-mgr + containers: + - name: coffee-rollout-mgr + image: curlimages/curl:8.3.0 + imagePullPolicy: IfNotPresent + command: + - /bin/sh + - -c + args: + - | + TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) + RESTARTED_AT=$(date -u +"%Y-%m-%dT%H:%M:%SZ") + curl -X PATCH -s -k -v \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-type: application/merge-patch+json" \ + --data-raw "{\"spec\": {\"template\": {\"metadata\": {\"annotations\": {\"kubectl.kubernetes.io/restartedAt\": \"$RESTARTED_AT\"}}}}}" \ + "https://kubernetes/apis/apps/v1/namespaces/default/deployments/tea?fieldManager=kubectl-rollout" 2>&1 + restartPolicy: OnFailure diff --git a/tests/longevity/manifests/gateway.yaml b/tests/longevity/manifests/gateway.yaml new file mode 100644 index 0000000000..593d17e496 --- /dev/null +++ b/tests/longevity/manifests/gateway.yaml @@ -0,0 +1,20 @@ +apiVersion: gateway.networking.k8s.io/v1beta1 +kind: Gateway +metadata: + name: gateway +spec: + gatewayClassName: nginx + listeners: + - name: http + port: 80 + protocol: HTTP + hostname: "*.example.com" + - name: https + port: 443 + protocol: HTTPS + hostname: "*.example.com" + tls: + mode: Terminate + certificateRefs: + - kind: Secret + name: cafe-secret diff --git a/tests/longevity/manifests/prom.yaml b/tests/longevity/manifests/prom.yaml new file mode 100644 index 0000000000..e5d35fae72 --- /dev/null +++ b/tests/longevity/manifests/prom.yaml @@ -0,0 +1,12 @@ +apiVersion: monitoring.googleapis.com/v1 +kind: PodMonitoring +metadata: + name: prom-example + namespace: 
nginx-gateway +spec: + selector: + matchLabels: + app.kubernetes.io/name: nginx-gateway + endpoints: + - port: metrics + interval: 30s diff --git a/tests/longevity/results/1.0.0/1.0.0.md b/tests/longevity/results/1.0.0/1.0.0.md new file mode 100644 index 0000000000..a0110a4d65 --- /dev/null +++ b/tests/longevity/results/1.0.0/1.0.0.md @@ -0,0 +1,234 @@ +# Results for v1.0.0 + + + +- [Results for v1.0.0](#results-for-v100) + - [Versions](#versions) + - [Traffic](#traffic) + - [NGF](#ngf) + - [Error Log](#error-log) + - [NGINX](#nginx) + - [Error Log](#error-log-1) + - [Access Log](#access-log) + - [Key Metrics](#key-metrics) + - [Containers memory](#containers-memory) + - [Containers CPU](#containers-cpu) + - [NGINX metrics](#nginx-metrics) + - [Reloads](#reloads) + - [Opened Issues](#opened-issues) + - [Future Improvements](#future-improvements) + + + +## Versions + +NGF version: + +```text +commit: "07d76315931501d878f3ed079142aa1899be1bd3" +date: "2023-09-28T16:49:51Z" +version: "edge" +``` + +with NGINX: + +```text +nginx/1.25.2 +built by gcc 12.2.1 20220924 (Alpine 12.2.1_git20220924-r10) +OS: Linux 5.15.109+ +``` + +Kubernetes: + +```text +Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3-gke.100", GitCommit:"6466b51b762a5c49ae3fb6c2c7233ffe1c96e48c", GitTreeState:"clean", BuildDate:"2023-06-23T09:27:28Z", GoVersion:"go1.20.5 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"} +``` + +## Traffic + +HTTP: + +```text +wrk -t2 -c100 -d96h http://cafe.example.com/coffee/long2 + +Running 5760m test @ http://cafe.example.com/coffee/long2 + 2 threads and 100 connections + Thread Stats Avg Stdev Max +/- Stdev + Latency 174.97ms 140.07ms 2.00s 83.57% + Req/Sec 319.15 212.21 2.21k 65.75% + 210108892 requests in 5615.74m, 74.42GB read + Socket errors: connect 0, read 356317, write 0, timeout 4299 +Requests/sec: 623.57 +Transfer/sec: 231.60KB +``` + +HTTPS: + +```text +wrk -t2 -c100 -d96h https://cafe.example.com/tea/long2 + +Running 5760m test @ https://cafe.example.com/tea/long2 + 2 threads and 100 connections + Thread Stats Avg Stdev Max +/- Stdev + Latency 165.13ms 113.14ms 1.99s 68.74% + Req/Sec 317.87 211.81 2.15k 65.17% + 209303259 requests in 5616.52m, 72.99GB read + Socket errors: connect 0, read 351351, write 0, timeout 3 +Requests/sec: 621.09 +Transfer/sec: 227.12KB +``` + +While there are socket errors in the output, there are no connection-related errors in NGINX logs. +Further investigation is out of scope of this test. + +### NGF + +#### Error Log + +```text +resource.type="k8s_container" +resource.labels.pod_name="nginx-gateway-b6cdb65cd-h8bgs" +resource.labels.namespace_name="nginx-gateway" +resource.labels.container_name="nginx-gateway" +severity=ERROR +SEARCH("error") +``` + +Found 104 entries, All entries are similar to: + +```json +{ + "stacktrace": "github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).HandleEventBatch\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:105\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events.(*EventLoop).Start.func1.1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/framework/events/loop.go:68", + "logger": "eventHandler", + "error": "failed to reload NGINX: reload unsuccessful: no new NGINX worker processes started for config version 11135. 
Please check the NGINX container logs for possible configuration issues: context deadline exceeded", + "level": "error", + "msg": "Failed to update NGINX configuration", + "ts": "2023-10-01T18:44:03Z" +} +``` + +See Key metrics Reloads further in this doc. + +During the first run of the longevity test, for a shorter period (1 day), the following appeared in NGF logs: + +```text +I0926 20:58:42.883382 6 leaderelection.go:250] attempting to acquire leader lease nginx-gateway/nginx-gateway-leader-election... +I0926 20:58:43.073317 6 leaderelection.go:260] successfully acquired lease nginx-gateway/nginx-gateway-leader-election +{"level":"info","ts":"2023-09-26T20:58:43Z","logger":"leaderElector","msg":"Started leading"} +E0927 08:09:20.830614 6 leaderelection.go:332] error retrieving resource lock nginx-gateway/nginx-gateway-leader-election: Get "https://10.64.0.1:443/apis/coordination.k8s.io/v1/namespaces/nginx-gateway/leases/nginx-gateway-leader-election?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers) +E0927 08:09:25.829736 6 leaderelection.go:332] error retrieving resource lock nginx-gateway/nginx-gateway-leader-election: Get "https://10.64.0.1:443/apis/coordination.k8s.io/v1/namespaces/nginx-gateway/leases/nginx-gateway-leader-election?timeout=5s": context deadline exceeded +I0927 08:09:25.830070 6 leaderelection.go:285] failed to renew lease nginx-gateway/nginx-gateway-leader-election: timed out waiting for the condition +{"level":"info","ts":"2023-09-27T08:09:25Z","logger":"leaderElector","msg":"Stopped leading"} +E0927 08:09:35.862628 6 event.go:289] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"nginx-gateway-leader-election.1788b315c7bd90e5", GenerateName:"", Namespace:"nginx-gateway", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Lease", Namespace:"nginx-gateway", Name:"nginx-gateway-leader-election", UID:"eb133a0d-7622-4b80-a0d1-d49755e52a1f", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"1044977", FieldPath:""}, Reason:"LeaderElection", Message:"nginx-gateway-b6cdb65cd-bt7zg stopped leading", Source:v1.EventSource{Component:"nginx-gateway-fabric-nginx", Host:""}, FirstTimestamp:time.Date(2023, time.September, 27, 8, 9, 25, 831766245, time.Local), LastTimestamp:time.Date(2023, time.September, 27, 8, 9, 25, 831766245, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"nginx-gateway-fabric-nginx", ReportingInstance:""}': 'Post "https://10.64.0.1:443/api/v1/namespaces/nginx-gateway/events?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)'(may retry after sleeping) +{"level":"info","ts":"2023-09-27T17:19:27Z","logger":"statusUpdater","msg":"Skipping updating Nginx Gateway status because not leader"} +{"level":"info","ts":"2023-09-27T19:54:13Z","logger":"statusUpdater","msg":"Skipping updating Gateway API status because not leader"} +{"level":"info","ts":"2023-09-27T19:54:15Z","logger":"statusUpdater","msg":"Skipping updating Gateway API 
status because not leader"} +{"level":"info","ts":"2023-09-27T19:54:24Z","logger":"statusUpdater","msg":"Skipping updating Gateway API status because not leader"} +{"level":"info","ts":"2023-09-27T19:54:25Z","logger":"statusUpdater","msg":"Skipping updating Gateway API status because not leader"} +``` + +There are two problems: + +- The NGF pod lost its leadership, even though no other NGF pods were + running -- https://github.com/nginxinc/nginx-gateway-fabric/issues/1100 +- The leader elector logs are not structured -- https://github.com/nginxinc/nginx-gateway-fabric/issues/1101 + +### NGINX + +#### Error Log + +Errors: + +```text +resource.type="k8s_container" +resource.labels.pod_name="nginx-gateway-b6cdb65cd-h8bgs" +resource.labels.namespace_name="nginx-gateway" +resource.labels.container_name="nginx" +severity=ERROR +SEARCH("`[warn]`") OR SEARCH("`[error]`") +``` + +No entries found. + +#### Access Log + +Non-200 response codes in NGINX access logs: + +```text +severity=INFO +"GET" "HTTP/1.1" -"200" +``` + +No such responses. + +## Key Metrics + +### Containers memory + +![memory.png](memory.png) + +No unexpected spikes or drops. + +### Containers CPU + +![cpu.png](cpu.png) + +No unexpected spikes or drops. + +### NGINX metrics + +![stub-status.png](stub-status.png) + +The drop in _requests_ on Sep 29 is not significant and has no correlated errors in the NGINX logs. + +### Reloads + +Rate of reloads - successful and with errors: + +![reloads.png](reloads.png) + +Reload spikes correspond to the 1-hour periods of backend re-rollouts. +However, small spikes, like at 1pm Sep 29, correspond to periodic reconciliation of Secrets, which (incorrectly) +triggers a reload -- https://github.com/nginxinc/nginx-gateway-fabric/issues/1112 + +A small percentage of reloads finished with an error. That happened because of a bug in NGF - it wouldn't wait long +enough for the NGINX master to start new worker +processes -- https://github.com/nginxinc/nginx-gateway-fabric/issues/1106 + +Reload time distribution with the 50th, 95th and 99th percentiles and the threshold: + +![reload-time.png](reload-time.png) + +Note - 60s is the threshold for waiting for NGINX to be reloaded. + +Reload-related metrics at the end of the test: + +```text +# HELP nginx_gateway_fabric_nginx_reloads_milliseconds Duration in milliseconds of NGINX reloads +# TYPE nginx_gateway_fabric_nginx_reloads_milliseconds histogram +nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="500"} 5608 +nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="1000"} 13926 +nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="5000"} 14842 +nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="10000"} 14842 +nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="30000"} 14842 +nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="+Inf"} 14842 +nginx_gateway_fabric_nginx_reloads_milliseconds_sum{class="nginx"} 8.645665e+06 +nginx_gateway_fabric_nginx_reloads_milliseconds_count{class="nginx"} 14842 +``` + +All successful reloads took less than 5 seconds. 
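For reference, a snapshot like the one above can also be collected at the end of a run by scraping the NGF metrics endpoint directly, in addition to viewing the data in Google Cloud Monitoring. A minimal sketch follows; the namespace and Deployment name are taken from the pod name in the log queries above, while the metrics port (9113) is an assumption about the default NGF configuration and may differ in your deployment.

```shell
# Port-forward the NGF metrics port (9113 assumed) and dump the reload metrics.
kubectl -n nginx-gateway port-forward deploy/nginx-gateway 9113:9113 &
PF_PID=$!
sleep 2
curl -s http://localhost:9113/metrics | grep '^nginx_gateway_fabric_nginx_reloads'
kill "$PF_PID"
```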
+ +## Opened Issues + +- NGF doesn't wait long enough for new NGINX workers to + start - https://github.com/nginxinc/nginx-gateway-fabric/issues/1106 +- NGF unnecessarily reloads NGINX when it reconciles + Secrets - https://github.com/nginxinc/nginx-gateway-fabric/issues/1112 +- Statuses not reported because no leader gets elected - https://github.com/nginxinc/nginx-gateway-fabric/issues/1100 +- Use NGF Logger in Client-Go Library - https://github.com/nginxinc/nginx-gateway-fabric/issues/1101 + +## Future Improvements + +- Control plane transactions weren't fully tested. While we tested that NGF processes EndpointSlice changes, we didn't + test any transactions that result in status updates of resources, like a change in an HTTPRoute (an example of such + a transaction is sketched after this diff). diff --git a/tests/longevity/results/1.0.0/cpu.png b/tests/longevity/results/1.0.0/cpu.png new file mode 100644 index 0000000000..45f93f96c7 Binary files /dev/null and b/tests/longevity/results/1.0.0/cpu.png differ diff --git a/tests/longevity/results/1.0.0/memory.png b/tests/longevity/results/1.0.0/memory.png new file mode 100644 index 0000000000..ae455df685 Binary files /dev/null and b/tests/longevity/results/1.0.0/memory.png differ diff --git a/tests/longevity/results/1.0.0/reload-time.png b/tests/longevity/results/1.0.0/reload-time.png new file mode 100644 index 0000000000..e2f9b9cbc0 Binary files /dev/null and b/tests/longevity/results/1.0.0/reload-time.png differ diff --git a/tests/longevity/results/1.0.0/reloads.png b/tests/longevity/results/1.0.0/reloads.png new file mode 100644 index 0000000000..706396357b Binary files /dev/null and b/tests/longevity/results/1.0.0/reloads.png differ diff --git a/tests/longevity/results/1.0.0/stub-status.png b/tests/longevity/results/1.0.0/stub-status.png new file mode 100644 index 0000000000..12d84d6d31 Binary files /dev/null and b/tests/longevity/results/1.0.0/stub-status.png differ
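A note on the control plane gap called out under Future Improvements: below is a minimal sketch of the kind of transaction a future run could exercise - modify an HTTPRoute and verify that NGF reports status for it. The HTTPRoute name and rule come from the manifests in this PR; the patched path value and the exact status check are illustrative assumptions, not part of the current test.

```shell
# Change the matched path of the coffee HTTPRoute (a JSON merge patch replaces the rules list),
# then confirm the route is still reported as Accepted once NGF processes the change.
kubectl patch httproute coffee --type merge -p '{
  "spec": {
    "rules": [
      {
        "matches": [{"path": {"type": "PathPrefix", "value": "/coffee-v2"}}],
        "backendRefs": [{"name": "coffee", "port": 80}]
      }
    ]
  }
}'
kubectl get httproute coffee -o jsonpath='{.status.parents[0].conditions[?(@.type=="Accepted")].status}'
```

Note that this changes live routing, so during a longevity run it would need to be reverted (or performed against a dedicated route) to keep the wrk traffic to /coffee flowing.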