AKS production upgrade strategies

Upgrade your production Azure Kubernetes Service (AKS) clusters safely by using these proven patterns. Each strategy is optimized for specific business constraints and risk tolerance.

What this article covers

This article provides tested upgrade patterns for production AKS clusters and focuses on:

  • Blue-green deployments for zero-downtime upgrades.
  • Staged fleet upgrades across multiple environments.
  • Safe Kubernetes version adoption with validation gates.
  • Emergency security patching for rapid common vulnerabilities and exposures (CVE) response.
  • Application resilience patterns for seamless upgrades.

These patterns are best for production environments, site reliability engineers, and platform teams that require minimal downtime and maximum safety.


Choose your strategy

| Your priority | Best pattern | Downtime | Time to complete |
|---|---|---|---|
| Zero downtime | Blue-green deployment | <2 minutes | 45-60 minutes |
| Multi-environment safety | Staged fleet upgrades | Planned windows | 2-4 hours |
| New version safety | Canary with validation | Low risk | 3-6 hours |
| Security patches | Automated patching | <4 hours | 30-90 minutes |
| Future-proof apps | Resilient architecture | Zero impact | Ongoing |

Role-based quick start

| Role | Start here |
|---|---|
| Site reliability engineer/Platform | Scenario 1, Scenario 2 |
| Database administrator/Data engineer | Stateful workload patterns |
| App development | Scenario 5 |
| Security | Scenario 4 |

Scenario 1: Minimal downtime production upgrades

Challenge: "I need to upgrade my production cluster with less than 2 minutes of downtime during business hours."

Strategy: Use blue-green deployment with intelligent traffic shifting.

To learn more, see Blue-green deployment patterns and Azure Traffic Manager configuration.

Quick implementation (15 minutes)

# 1. Create green cluster (parallel to blue)
az aks create --name myaks-green --resource-group myRG \
  --kubernetes-version 1.29.0 --enable-cluster-autoscaler \
  --min-count 3 --max-count 10

# 2. Deploy application to green cluster
kubectl config use-context myaks-green
kubectl apply -f ./production-manifests/

# 3. Validate green cluster
# Run your application-specific health checks here
# Examples: API endpoint tests, database connectivity, dependency checks

# 4. Switch traffic (<30-second downtime)
az network traffic-manager endpoint update \
  --resource-group myRG --profile-name prod-tm \
  --type azureEndpoints --name green-endpoint --weight 100
az network traffic-manager endpoint update \
  --resource-group myRG --profile-name prod-tm \
  --type azureEndpoints --name blue-endpoint --weight 0

Detailed step-by-step guide

Prerequisites

  • Secondary cluster capacity planned.
  • Application supports horizontal scaling.
  • Database connections use connection pooling.
  • Health checks configured (/health, /ready).
  • Rollback procedure tested in staging.

Step 1: Prepare the blue-green infrastructure

# Create resource group for green cluster
az group create --name myRG-green --location eastus2

# Create green cluster with same configuration as blue
az aks create \
  --resource-group myRG-green \
  --name myaks-green \
  --kubernetes-version 1.29.0 \
  --node-count 3 \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 10 \
  --enable-addons monitoring \
  --generate-ssh-keys

Step 2: Deploy and validate the green environment

# Get green cluster credentials
az aks get-credentials --resource-group myRG-green --name myaks-green

# Deploy application stack
# Apply your Kubernetes manifests in order:
kubectl apply -f ./your-manifests/namespace.yaml      # Create namespace
kubectl apply -f ./your-manifests/secrets/           # Deploy secrets
kubectl apply -f ./your-manifests/configmaps/        # Deploy config maps  
kubectl apply -f ./your-manifests/deployments/       # Deploy applications
kubectl apply -f ./your-manifests/services/          # Deploy services

# Wait for all pods to be ready
kubectl wait --for=condition=ready pod --all --timeout=300s

# Validate application health
kubectl get pods -A
kubectl logs -l app=my-app --tail=50

Step 3: Traffic switching (critical 30-second window)

# Pre-switch validation
curl -f https://myapp-green.eastus2.cloudapp.azure.com/health
if [ $? -ne 0 ]; then echo "Green health check failed!"; exit 1; fi

# Execute traffic switch
az network dns record-set cname set-record \
  --resource-group myRG-dns \
  --zone-name mycompany.com \
  --record-set-name api \
  --cname myapp-green.eastus2.cloudapp.azure.com

# Immediate validation
sleep 30
curl -f https://api.mycompany.com/health

Step 4: Monitor and validate

# Monitor traffic and errors for 15 minutes
kubectl top nodes
kubectl top pods
kubectl logs -l app=my-app --since=15m | grep ERROR

# Check application metrics
curl https://api.mycompany.com/metrics | grep http_requests_total

Common pitfalls and FAQs

Quick troubleshooting and tips:
  • Domain Name System (DNS) propagation is slow: Use low time-to-live (TTL) values before the upgrade, and validate the DNS cache flush. (A TTL example follows this list.)
  • Pods stuck terminating: Check for finalizers, long shutdown hooks, or pod disruption budgets (PDBs) with maxUnavailable: 0.
  • Traffic not shifting: Validate the Azure Load Balancer or Azure Traffic Manager configuration and health probes.
  • Rollback fails: Always keep the blue cluster ready until the green cluster is fully validated.
  • Q: Can I use open-source software tools for validation?
    • A: Yes. Tools such as kube-no-trouble (API deprecation scanning) and Trivy (security scanning), both used later in this article, work well as validation gates.
  • Q: What's unique to AKS?
    • A: Native integration with Traffic Manager, Azure Kubernetes Fleet Manager, and node image patching for zero-downtime upgrades.
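
For the DNS propagation tip, one way to lower the TTL ahead of the cutover is to re-issue the CNAME record with a short TTL. This is a sketch that reuses the record from step 3; 60 seconds is an illustrative value.

# Lower the TTL on the api CNAME well before the traffic switch
az network dns record-set cname set-record \
  --resource-group myRG-dns \
  --zone-name mycompany.com \
  --record-set-name api \
  --cname myapp-blue.eastus2.cloudapp.azure.com \
  --ttl 60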

Advanced configuration

For applications that require <30-second downtime:

# Use session affinity during transition
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 300

Success validation

To validate your progress, use the following checklist (a scripted sketch follows the list):

  • Application responds within two seconds.
  • No 5xx errors are in logs.
  • Database connections are stable.
  • User sessions are preserved.
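
A minimal scripted version of this checklist might look like the following. It's a sketch: the URL, app label, and log pattern are placeholders for your environment.

#!/bin/bash
APP_URL="https://api.mycompany.com"

# Application responds within two seconds
RESPONSE_TIME=$(curl -s -o /dev/null -w "%{time_total}" "$APP_URL/health")
echo "Health endpoint responded in ${RESPONSE_TIME}s"

# No 5xx errors are in recent application logs
kubectl logs -l app=my-app --since=15m | grep -E " 5[0-9]{2} " || echo "No 5xx errors found"

# Database connections are stable (watch the RESTARTS column)
kubectl get pods -l app=my-app -o wide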

Emergency rollback (if needed)

# Immediate rollback to blue cluster
az network dns record-set cname set-record \
  --resource-group myRG-dns \
  --zone-name mycompany.com \
  --record-set-name api \
  --cname myapp-blue.eastus2.cloudapp.azure.com

Expected outcome: Less than 2-minute total downtime, zero data loss, and full rollback capability.

Create the green cluster environment

az aks create \
  --resource-group production-rg \
  --name aks-green-cluster \
  --kubernetes-version 1.29.0 \
  --node-count 3 \
  --tier premium \
  --auto-upgrade-channel patch \
  --planned-maintenance-config ./maintenance-window.json

Verify cluster readiness

az aks get-credentials --resource-group production-rg --name aks-green-cluster
kubectl get nodes

Implementation steps

Step 1: Deploy the application to a green cluster

# Deploy application stack
kubectl apply -f ./k8s-manifests/
kubectl apply -f ./monitoring/

# Wait for all pods to be ready
kubectl wait --for=condition=ready pod --all --timeout=300s

# Validate application health
curl -f http://green-cluster-ingress/health

Step 2: Run traffic shift

# Update DNS or load balancer to point to green cluster
az network dns record-set a update \
  --resource-group dns-rg \
  --zone-name contoso.com \
  --name api \
  --set aRecords[0].ipv4Address="<green-cluster-ip>"

# Monitor traffic shift (should complete in 60-120 seconds)
watch kubectl top pods -n production

Step 3: Validate and clean up

# Verify zero errors in application logs
kubectl logs -l app=api --tail=100 | grep -i error

# Monitor key metrics for 15 minutes
kubectl get events --sort-by='.lastTimestamp' | head -20

# After validation, decommission blue cluster
az aks delete --resource-group production-rg --name aks-blue-cluster --yes

Success metrics

  • Downtime: <2 minutes of DNS propagation time (see the measurement sketch after this list)
  • Error rate: 0% during transition
  • Recovery time: <5 minutes if rollback needed
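
One way to measure the downtime figure during the DNS cutover is a simple probe loop. This is a sketch; the endpoint, probe interval, and duration are placeholders.

# Probe the public endpoint once per second during the cutover and count failures
FAILURES=0
for i in $(seq 1 300); do
  curl -sf --max-time 2 https://api.contoso.com/health > /dev/null || FAILURES=$((FAILURES+1))
  sleep 1
done
echo "Failed probes during cutover: $FAILURES (~${FAILURES}s of downtime)"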

Scenario 2: Stage upgrades across environments

Challenge: "I need to safely test upgrades through dev/test/production with proper validation gates."

Strategy: Use Azure Kubernetes Fleet Manager with staged rollouts.

To learn more, see the Azure Kubernetes Fleet Manager overview and Update orchestration.

Prerequisites

# Install Fleet extension
az extension add --name fleet
az extension update --name fleet

# Create Fleet resource
az fleet create \
  --resource-group fleet-rg \
  --name production-fleet \
  --location eastus

Implementation steps

Step 1: Define stage configuration

Create upgrade-stages.json:

{
  "stages": [
    {
      "name": "development",
      "groups": [{ "name": "dev-clusters" }],
      "afterStageWaitInSeconds": 1800
    },
    {
      "name": "testing", 
      "groups": [{ "name": "test-clusters" }],
      "afterStageWaitInSeconds": 3600
    },
    {
      "name": "production",
      "groups": [{ "name": "prod-clusters" }],
      "afterStageWaitInSeconds": 0
    }
  ]
}

Step 2: Add clusters to a fleet

# Add development clusters
az fleet member create \
  --resource-group fleet-rg \
  --fleet-name production-fleet \
  --name dev-east \
  --member-cluster-id "/subscriptions/.../clusters/aks-dev-east" \
  --group dev-clusters

# Add test clusters  
az fleet member create \
  --resource-group fleet-rg \
  --fleet-name production-fleet \
  --name test-east \
  --member-cluster-id "/subscriptions/.../clusters/aks-test-east" \
  --group test-clusters

# Add production clusters
az fleet member create \
  --resource-group fleet-rg \
  --fleet-name production-fleet \
  --name prod-east \
  --member-cluster-id "/subscriptions/.../clusters/aks-prod-east" \
  --group prod-clusters

Step 3: Create and run a staged update

# Create staged update run
az fleet updaterun create \
  --resource-group fleet-rg \
  --fleet-name production-fleet \
  --name k8s-1-29-upgrade \
  --upgrade-type Full \
  --kubernetes-version 1.29.0 \
  --node-image-selection Latest \
  --stages upgrade-stages.json

# Start the staged rollout
az fleet updaterun start \
  --resource-group fleet-rg \
  --fleet-name production-fleet \
  --name k8s-1-29-upgrade

Step 4: Validation gates between stages

After dev stage (30-minute soak):

# Run automated test suite
./scripts/run-e2e-tests.sh dev-cluster
./scripts/performance-baseline.sh dev-cluster

# Check for any regressions
kubectl get events --sort-by='.lastTimestamp' | grep -i warn

After test stage (60-minute soak):

# Extended testing with production-like load
./scripts/load-test.sh test-cluster 1000-users 15-minutes
./scripts/chaos-engineering.sh test-cluster

# Manual approval gate
echo "Approve production deployment? (y/n)"
read approval
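
To make the manual gate actionable, the approval can be wired to the update run itself. The following is a possible continuation, assuming the update run created earlier and the `az fleet updaterun stop` command from the Fleet extension.

# Abort the staged rollout if the change is rejected at the gate
if [ "$approval" != "y" ]; then
  az fleet updaterun stop \
    --resource-group fleet-rg \
    --fleet-name production-fleet \
    --name k8s-1-29-upgrade
  exit 1
fi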

Common pitfalls and FAQs

Quick troubleshooting and tips:
  • Stage fails because of quota: Precheck regional quotas for all clusters in the fleet. (A precheck example follows this list.)
  • Validation scripts fail: Ensure that test scripts are idempotent and have clear pass/fail output.
  • Manual approval delays: Automate approvals for nonproduction stages, and require manual approval only for production.
  • Q: Can I use open-source software tools for validation?
    • A: Yes. The validation gates between stages can run any tooling you already use, such as end-to-end test suites, load tests, or scanners like kube-no-trouble and Trivy.
  • Q: What's unique to AKS?
    • A: Azure Kubernetes Fleet Manager enables true staged rollouts and validation gates natively.
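
For the quota pitfall, a precheck can query regional compute usage before the run starts. This is a sketch; the region and VM family filter are placeholders for the sizes your node pools use.

# Check remaining vCPU quota in the target region before the staged upgrade
az vm list-usage --location eastus \
  --query "[?contains(name.value, 'standardDSv3Family')].{name:name.value, current:currentValue, limit:limit}" \
  -o table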

Scenario 3: Safe Kubernetes version adoption

Challenge: "I need to adopt Kubernetes 1.30 without breaking existing workloads or APIs."

Strategy: Use multiphase validation with canary deployment.

To learn more, see Canary deployments and API deprecation policies.

Implementation steps

Step 1: API deprecation analysis

# Install and run API deprecation scanner
kubectl apply -f https://github.com/doitintl/kube-no-trouble/releases/latest/download/knt-full.yaml

# Scan for deprecated APIs
kubectl run knt --image=doitintl/knt:latest --rm -it --restart=Never -- \
  -c /kubeconfig -o json > api-deprecation-report.json

# Review and remediate findings
cat api-deprecation-report.json | jq '.[] | select(.deprecated==true)'

To learn more, see the Kubernetes API deprecation guide and kube-no-trouble documentation.

Step 2: Create a canary environment

# Create canary cluster with target version
az aks create \
  --resource-group canary-rg \
  --name aks-canary-k8s130 \
  --kubernetes-version 1.30.0 \
  --node-count 2 \
  --tier premium \
  --enable-addons monitoring

# Deploy subset of workloads
kubectl apply -f ./canary-manifests/

Step 3: Progressive workload migration

# Phase 1: Stateless services (20% traffic)
kubectl patch service api-service -p '{"spec":{"selector":{"version":"canary"}}}'
./scripts/monitor-error-rate.sh 15-minutes

# Phase 2: Background jobs (50% traffic)  
kubectl scale deployment batch-processor --replicas=3
./scripts/validate-job-completion.sh

# Phase 3: Critical services (100% traffic)
kubectl patch deployment critical-api -p '{"spec":{"template":{"metadata":{"labels":{"cluster":"canary"}}}}}'
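
The ./scripts/monitor-error-rate.sh helper referenced above isn't defined in this article. A minimal sketch of what it might do, assuming the application exposes Prometheus-style counters at /metrics, is:

#!/bin/bash
# Usage: ./monitor-error-rate.sh 15-minutes
DURATION_MINUTES=${1%-minutes}
END=$((SECONDS + DURATION_MINUTES * 60))
while [ $SECONDS -lt $END ]; do
  # Sample the app's request counters and surface any 5xx counts
  curl -s https://api.contoso.com/metrics | grep 'http_requests_total' | grep 'status="5' || echo "No 5xx requests recorded"
  sleep 30
done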

Step 4: Feature gate validation

# Test new Kubernetes 1.30 features
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-validation
data:
  test-script: |
    # Test new security features
    kubectl auth can-i create pods --as=system:serviceaccount:default:test-sa
    
    # Validate performance improvements
    kubectl top nodes --use-protocol-buffers=true
    
    # Check new API versions
    kubectl api-versions | grep "v1.30"

Success metrics

  • API compatibility: 100% (zero breaking changes)
  • Performance: ≤5% regression in key metrics (see the comparison sketch after this list)
  • Feature adoption: New features validated in canary
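
To quantify the regression budget, one approach is to pull the same latency counters from the current and canary clusters and compare the averages. This is a sketch: the hostnames are placeholders, and it assumes a single unlabeled http_request_duration_seconds counter pair.

# Average request latency = cumulative sum / cumulative count from the Prometheus counters
for host in api.prod.contoso.com api.canary.contoso.com; do
  sum=$(curl -s "https://$host/metrics" | awk '/^http_request_duration_seconds_sum/ {print $2; exit}')
  count=$(curl -s "https://$host/metrics" | awk '/^http_request_duration_seconds_count/ {print $2; exit}')
  echo "$host average request latency: $(awk -v s="$sum" -v c="$count" 'BEGIN {print s/c}') seconds"
done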

Scenario 4: Fastest security patch deployment

Challenge: "A critical CVE was announced. I need patches deployed across all clusters within 4 hours."

Strategy: Use automated node image patching with minimal disruption.

To learn more, see Node image upgrade strategies, Auto-upgrade channels, and Security patching best practices.

Implementation steps

Step 1: Emergency response preparation

# Enable the node OS SecurityPatch channel so nodes receive security updates automatically
az aks update \
  --resource-group production-rg \
  --name aks-prod \
  --node-os-upgrade-channel SecurityPatch

# Configure maintenance window for emergency patches
az aks maintenance-configuration create \
  --resource-group production-rg \
  --cluster-name aks-prod \
  --config-name emergency-security \
  --week-index First,Second,Third,Fourth \
  --day-of-week Monday,Tuesday,Wednesday,Thursday,Friday \
  --start-hour 0 \
  --duration 4

To learn more, see Planned maintenance configuration and Auto-upgrade channels.

Step 2: Automated security scanning

# security-scan-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: security-scanner
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: scanner
            image: aquasec/trivy:latest
            command:
            - trivy
            - k8s
            - --report
            - summary
            - cluster
          restartPolicy: Never

Step 3: Rapid patch deployment

# Trigger immediate node image upgrade for security patches
az aks nodepool upgrade \
  --resource-group production-rg \
  --cluster-name aks-prod \
  --name nodepool1 \
  --node-image-only \
  --max-surge 50% \
  --drain-timeout 5

# Monitor patch deployment
watch az aks nodepool show \
  --resource-group production-rg \
  --cluster-name aks-prod \
  --name nodepool1 \
  --query "upgradeSettings"

Step 4: Compliance validation

# Verify patch installation
kubectl get nodes -o wide
kubectl describe node | grep "Kernel Version"

# Generate compliance report
./scripts/generate-security-report.sh > security-compliance-$(date +%Y%m%d).json

# Notify security team
curl -X POST "$SLACK_WEBHOOK" -d "{\"text\":\"Security patches deployed to production cluster. Compliance report attached.\"}"

Success metrics

  • Deployment time: <4 hours from CVE announcement
  • Coverage: 100% of nodes patched (a verification sketch follows this list)
  • Downtime: <5 minutes per node pool
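
One way to verify the coverage metric is to list the node image version reported by every node pool after the patch run. This sketch assumes the nodeImageVersion property in the current az aks nodepool output.

# Confirm every node pool reports the patched node image version
az aks nodepool list \
  --resource-group production-rg \
  --cluster-name aks-prod \
  --query "[].{pool:name, nodeImageVersion:nodeImageVersion}" \
  -o table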

Scenario 5: Application architecture for seamless upgrades

Challenge: "I want my applications to handle cluster upgrades gracefully without affecting users."

Strategy: Use resilient application patterns with graceful degradation.

To learn more, see Application reliability patterns, Pod disruption budgets, and Health check best practices.

Implementation steps

Step 1: Implement robust health checks

# robust-health-checks.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: resilient-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: resilient-api
  template:
    metadata:
      labels:
        app: resilient-api
    spec:
      containers:
      - name: api
        image: myapp:latest
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]

Step 2: Configure pod disruption budgets

# optimal-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  selector:
    matchLabels:
      app: api
  maxUnavailable: 1
  # Ensures at least 2 pods remain available during upgrades
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: database-pdb
spec:
  selector:
    matchLabels:
      app: database
  minAvailable: 2
  # Critical: Always keep majority of database pods running
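
To confirm the budgets leave room for node drains before an upgrade, check the ALLOWED DISRUPTIONS column that kubectl reports for each budget.

# A PDB that allows 0 disruptions will block node drains during the upgrade
kubectl get pdb api-pdb database-pdb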

Step 3: Implement a circuit breaker pattern

// circuit-breaker.js
const CircuitBreaker = require('opossum');

// The protected call: replace with your real downstream dependency
async function callExternalService() {
  const response = await fetch('https://api.contoso.com/data');
  return response.json();
}

const options = {
  timeout: 3000,                  // Fail the call if it takes longer than 3 seconds
  errorThresholdPercentage: 50,   // Open the circuit when half the calls fail
  resetTimeout: 30000             // Try a test request after 30 seconds
};

const breaker = new CircuitBreaker(callExternalService, options);
breaker.fallback(() => 'Service temporarily unavailable');

// Monitor circuit breaker state during upgrades
breaker.on('open', () => console.log('Circuit breaker opened'));
breaker.on('halfOpen', () => console.log('Circuit breaker half-open'));

To learn more, see Circuit breaker pattern, Retry pattern, and Application resilience.

Step 4: Database connection resilience

# connection-pool-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: db-config
data:
  database.yml: |
    production:
      adapter: postgresql
      pool: 25
      timeout: 5000
      retry_attempts: 3
      retry_delay: 1000
      connection_validation: true
      validation_query: "SELECT 1"
      test_on_borrow: true

Success metrics

  • Error rate: <0.01% during upgrades
  • Response time: <10% degradation
  • Recovery time: <30 seconds after node replacement

Monitoring and alerting setup

To learn more, see the AKS monitoring overview, Container Insights, and Prometheus metrics.

Essential metrics to monitor

# upgrade-monitoring.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: upgrade-monitoring
spec:
  groups:
  - name: upgrade.rules
    rules:
    - alert: UpgradeInProgress
      expr: kube_node_spec_unschedulable > 0
      for: 1m
      annotations:
        summary: "Node upgrade in progress"
    
    - alert: HighErrorRate
      expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.01
      for: 2m
      annotations:
        summary: "High error rate during upgrade"
    
    - alert: PodEvictionFailed
      expr: increase(kube_pod_container_status_restarts_total[5m]) > 5
      for: 1m
      annotations:
        summary: "Multiple pod restarts detected"

Dashboard configuration

{
  "dashboard": {
    "title": "AKS Upgrade Dashboard",
    "panels": [
      {
        "title": "Upgrade Progress",
        "targets":
        [
          "kube_node_info",
          "kube_node_status_condition"
        ]
      },
      {
        "title": "Application Health",
        "targets":
        [
          "up{job='kubernetes-pods'}",
          "http_request_duration_seconds"
        ]
      }
    ]
  }
}

Troubleshooting guide

To learn more, see the AKS troubleshooting guide, Node and pod troubleshooting, and Upgrade error messages.

Common issues and solutions

| Issue | Symptoms | Solution |
|---|---|---|
| Stuck node drain | Pods won't evict. | Check the PDB configuration and increase the drain timeout (see the example after this table). |
| High error rates | 5xx responses are increasing. | Verify health checks and check resource limits. |
| Slow upgrades | Upgrade takes >2 hours. | Increase maxSurge and optimize container startup. |
| DNS resolution | Service discovery is failing. | Verify CoreDNS pods and check service endpoints. |
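
For the stuck node drain case, a first pass is to find budgets that currently allow zero disruptions and then retry the upgrade with a longer drain timeout. The sketch reuses the flags shown in scenario 4; the version and timeout are placeholders.

# Find PDBs that currently allow zero disruptions
kubectl get pdb -A

# Retry the node pool upgrade with a longer drain timeout (in minutes)
az aks nodepool upgrade \
  --resource-group production-rg \
  --cluster-name aks-prod \
  --name nodepool1 \
  --kubernetes-version 1.29.0 \
  --drain-timeout 10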

Emergency rollback procedures

#!/bin/bash
# Quick rollback script
echo "Initiating emergency rollback..."

# Switch traffic back to previous cluster
az network traffic-manager endpoint update \
  --resource-group traffic-rg \
  --profile-name production-tm \
  --type azureEndpoints \
  --name current-endpoint \
  --target-resource-id "/subscriptions/.../clusters/aks-previous"

# Verify rollback success
curl -f https://api.production.com/health
echo "Rollback completed at $(date)"
