5XX Errors During Rolling Updates in AKS with AGIC (App Gateway Ingress Controller)
We are experiencing transient 5XX errors (502/503) during rolling deployments of services in our AKS cluster. The internal rollout behaves correctly, but end users receive errors externally during pod recreation.
Our AKS cluster uses AGIC (Azure Application Gateway Ingress Controller) to expose services via Azure Application Gateway.
During deployments using the Kubernetes `RollingUpdate` strategy:

- Pods terminate and new ones start successfully.
- `kubectl rollout status` shows no issues.
- Readiness and liveness probes are correctly configured.
- However, external users receive 5XX errors (502/503) exactly when old pods are terminating and new ones are starting.

We suspect that Application Gateway is routing traffic to terminating or not-yet-ready pods, possibly because AGIC has not yet removed or updated the backend pool members.
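For context, here is a simplified version of the Deployment in question (the name, image, and probe path are placeholders). The `preStop` sleep is a candidate mitigation rather than something already in place: since AGIC updates the Application Gateway backend pool asynchronously, keeping a terminating pod serving for a short grace window should cover the gap between SIGTERM and backend pool removal.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service                # placeholder name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0           # keep full capacity during the rollout
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: app
        image: myregistry.azurecr.io/my-service:latest   # placeholder image
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz        # placeholder probe path
            port: 8080
          periodSeconds: 5
        lifecycle:
          preStop:
            exec:
              # Candidate mitigation: keep serving while AGIC removes this
              # pod from the Application Gateway backend pool, which happens
              # asynchronously after the pod enters Terminating.
              command: ["sleep", "30"]
```

AGIC also supports connection draining on the Ingress via the `appgw.ingress.kubernetes.io/connection-draining` and `appgw.ingress.kubernetes.io/connection-draining-timeout` annotations, which may be relevant here.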
Steps to Reproduce:

- Trigger a rolling update via `kubectl apply` or Helm.
- Monitor pod status and application logs.
- Observe 502/503 errors from external clients during pod transitions.
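The reproduction above can be scripted roughly as follows (the deployment name and frontend URL are placeholders; this needs a live cluster and the Application Gateway's public endpoint):

```shell
# Hit the public endpoint continuously during a rollout and log every
# 5xx (or connection failure, reported as 000) with a timestamp.
APP_URL="https://app.example.com/healthz"      # placeholder frontend URL

kubectl rollout restart deployment/my-service  # or: kubectl apply / helm upgrade

while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$APP_URL")
  if [ "$code" -ge 500 ] || [ "$code" = "000" ]; then
    echo "$(date '+%H:%M:%S') HTTP $code"
  fi
  sleep 0.2
done
```

Correlating these timestamps with `kubectl get pods -w` output shows the errors clustering exactly around pod terminations.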