A chaos experiment is only useful if you can measure the impact. While you can view metrics on individual resources, a centralized dashboard using Azure Workbooks provides a "single pane of glass" view to correlate the fault with its impact across multiple resources. This workbook serves as a reusable tool for all chaos experiments.
Visualizing the direct correlation between a chaos experiment and its impact on system metrics is crucial for understanding system resilience. Azure Workbooks lets you build dynamic, customizable dashboards that can be reused across different experiments, resource groups, and subscriptions. By centralizing your fault analysis in a single dashboard, you can quickly identify how different types of faults affect your infrastructure and applications, enabling you to make informed decisions about system improvements and disaster recovery planning.
Create your fault analysis workbook
Follow these steps to create a reusable workbook for analyzing the impact of your chaos experiments:

1. Navigate to Azure Monitor in the Azure portal.
2. Select Workbooks from the left menu.
3. Select + New to create an empty workbook.
4. To make the workbook reusable across different experiments, add parameters: select + Add, and then select Add parameters.
5. Create three text parameters that let you specify different target resources for each experiment:
    - Subscription (set "Required")
    - ResourceGroup (set "Required")
    - TargetResource (set "Required")

    You fill in these parameters each time you run an experiment, which makes the workbook flexible enough to use across different resources.
6. Add your first metric chart to visualize the impact: select + Add, and then select Add metric.
7. Configure the metric chart using the parameters you created:
    - Source: Azure
    - Resource type: Virtual machines
    - Subscription: Select "Parameter" and choose the Subscription parameter
    - Resource group: Select "Parameter" and choose the ResourceGroup parameter
    - Resource: Select "Parameter" and choose the TargetResource parameter
8. In the metric configuration, add a common metric like Percentage CPU. You can add multiple metrics (for example, Available Memory Bytes and Network In Total) to the same chart for a more complete view of the resource's performance.
9. Select Done editing, and then save the workbook with a descriptive name like "Chaos Experiment Analysis Dashboard."

You can now add more metric charts for other resource types (like App Service, AKS, or Cosmos DB) by repeating steps 6 through 8 and changing the Resource type. This modular approach lets you build a comprehensive dashboard that covers all the resources involved in your chaos experiments.
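If you want to spot-check the same data outside the portal, you can pull a chart's metrics programmatically. The following is a minimal sketch using the azure-monitor-query Python SDK; the subscription ID, resource group, and VM name are placeholders standing in for the workbook's Subscription, ResourceGroup, and TargetResource parameters.

```python
# Minimal sketch: pull the same platform metrics the workbook chart renders.
# Requires: pip install azure-monitor-query azure-identity
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Placeholders standing in for the workbook's parameters.
SUBSCRIPTION = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "my-chaos-rg"
VM_NAME = "my-target-vm"

resource_id = (
    f"/subscriptions/{SUBSCRIPTION}/resourceGroups/{RESOURCE_GROUP}"
    f"/providers/Microsoft.Compute/virtualMachines/{VM_NAME}"
)

client = MetricsQueryClient(DefaultAzureCredential())
response = client.query_resource(
    resource_id,
    metric_names=["Percentage CPU", "Available Memory Bytes", "Network In Total"],
    timespan=timedelta(hours=1),       # look back over the experiment window
    granularity=timedelta(minutes=1),
    aggregations=[MetricAggregationType.AVERAGE],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(f"{metric.name} @ {point.timestamp}: avg={point.average}")
```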
Recommended metrics for your dashboard
To help you build an effective dashboard, the following tables map each fault in the Chaos Studio library to the key Azure Monitor metrics that reveal its impact.
Agent-based faults
Target Resource: Virtual Machine / Virtual Machine Scale Set
Fault Name | Recommended Azure Monitor Metrics | Notes |
---|---|---|
CPU Pressure | Percentage CPU, CPU Credits Remaining | Look for CPU to spike to the configured pressure level.
Physical/Virtual Memory Pressure | Available Memory Bytes, Percentage Memory | Available memory should drop significantly.
Disk I/O Pressure | OS Disk Read Bytes/sec, OS Disk Write Bytes/sec, OS Disk Queue Depth, Data Disk Latency | Expect a spike in I/O operations and latency.
Kill Process / Stop Service | Application-level: Http Server Errors (5xx), Response Time. Platform-level: Percentage CPU (may drop). | Platform metrics do not show a single process. Look for secondary effects.
Network Disconnect | Network In Total, Network Out Total, Outbound Flows, Inbound Flows | Traffic should drop to zero during the fault.
Network Latency | Platform-level: Outbound Flows RTT (Network Watcher). Application-level: Dependency Duration, Response Time. | Expect a clear increase in round-trip time or dependency call times.
Network Packet Loss | TCP Segments Retransmitted, Outbound Packets Dropped | These metrics indicate the network stack is working harder to resend lost data.
DNS Failure | Application-level: Dependency call failure rate (from Application Insights). | Best observed at the application level as it tries and fails to resolve DNS names.
Time Change | (No direct metric) | Validate impact by checking application or guest OS logs for time-skew errors.
Arbitrary stress-ng Stressor | (Varies) | Use the metric that corresponds to the stressor you enabled.
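To confirm that an agent-based fault actually moved a metric, it can help to compare a baseline window against the fault window rather than eyeballing the chart. The following is a minimal sketch for the CPU Pressure fault, assuming a placeholder VM resource ID and a hypothetical experiment schedule; adjust the times to match your experiment's timeline.

```python
# Hedged sketch: compare average "Percentage CPU" before and during a
# CPU Pressure fault to confirm the configured spike actually occurred.
from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Placeholder resource ID for the targeted VM.
RESOURCE_ID = (
    "/subscriptions/<sub>/resourceGroups/<rg>"
    "/providers/Microsoft.Compute/virtualMachines/<vm>"
)

# Hypothetical schedule: ten minutes of baseline, then a ten-minute fault.
baseline_start = datetime(2024, 1, 1, 11, 50, tzinfo=timezone.utc)
fault_start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fault_end = datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc)

client = MetricsQueryClient(DefaultAzureCredential())

def mean_cpu(start: datetime, end: datetime) -> float:
    """Mean of the per-interval 'Percentage CPU' averages over [start, end]."""
    result = client.query_resource(
        RESOURCE_ID,
        metric_names=["Percentage CPU"],
        timespan=(start, end),
        aggregations=[MetricAggregationType.AVERAGE],
    )
    points = [
        p.average
        for series in result.metrics[0].timeseries
        for p in series.data
        if p.average is not None
    ]
    return sum(points) / len(points) if points else 0.0

baseline = mean_cpu(baseline_start, fault_start)
during = mean_cpu(fault_start, fault_end)
print(f"Baseline CPU: {baseline:.1f}%  During fault: {during:.1f}%")
```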
Azure Kubernetes Service (AKS) faults
Target Resource: AKS Cluster (Metrics typically viewed via Container Insights)
Fault Name | Recommended Azure Monitor Metrics | Notes |
---|---|---|
Pod Chaos | kube_pod_container_status_restarts_total, kube_deployment_status_replicas_ready, kube_pod_status_ready | Look for pod restarts to increment or the number of ready pods to drop.
Network Chaos | ingress_controller_request_duration_seconds (latency), node_network_in_bytes, node_network_out_bytes | Latency faults increase request duration.
Stress Chaos | Container CPU Usage Percentage, Container Memory Working Set Bytes, node_cpu_usage_percentage | Look for resource usage to spike at the container and/or node level.
I/O Chaos | node_disk_io_time_seconds_total, node_disk_read_bytes_total | These node-level metrics show increased I/O activity and latency.
DNS Chaos | Prometheus: coredns_dns_request_failures_total. Application-level: Dependency call failure rate. | Best observed within the cluster's CoreDNS or at the application level.
HTTP Chaos | ingress_controller_requests (with status code dimension), Http Server Errors, apiserver_current_inflight_requests | Look for an increase in HTTP 5xx errors or failed requests.
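Most of the metrics above come from Container Insights or Prometheus rather than Azure platform metrics, so they're queried from the cluster's Log Analytics workspace. Here's a minimal sketch, assuming Container Insights is enabled and using a placeholder workspace ID, that surfaces pod restart counts after a Pod Chaos fault:

```python
# Minimal sketch: query Container Insights (Log Analytics) for pod restart
# counts during a Pod Chaos fault. The workspace ID is a placeholder.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"

# KubePodInventory records a cumulative restart count per pod; a jump during
# the experiment window points at the pods the fault touched.
QUERY = """
KubePodInventory
| summarize Restarts = max(PodRestartCount) by Name, Namespace
| order by Restarts desc
| take 10
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(hours=1))

# Assumes a full (non-partial) result for brevity.
for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```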
PaaS and other resource faults
Fault Category | Fault Name | Target Resource Type | Recommended Azure Monitor Metrics |
---|---|---|---|
App Service | Stop App Service | App Service | Http 5xx, Requests, Response Time, Health check status
Autoscale | Disable Autoscale | Autoscale Settings | Virtual Machine Scale Set Instance Count (no increase), Average Percentage CPU (rise)
Cache for Redis | Reboot Cache Node | Azure Cache for Redis | Connected Clients (dip), Server Load (spike), Errors (Type: Failover), Cache Latency
Cosmos DB | Cosmos DB Failover | Cosmos DB Account | Server Side Latency, Total Requests (by StatusCode), Throttled Requests, Service Availability
Event Hubs | Change Event Hub State | Event Hubs Namespaces | Incoming Requests, Throttled Requests, User Errors, Active Connections
Key Vault | Deny Access / Disable Cert | Key Vault | Availability, Service Api Latency, Service Api Results (by ResultType)
Network | NSG Security Rule | Network Security Group | On affected VM: Inbound/Outbound Flows. In NSG Flow Logs: [Rule Name]_Denied_Packets
Service Bus | Change Queue State | Service Bus Namespace | Incoming Messages, User Errors, Server Errors, Active Messages
Virtual Machine | VM Shutdown/Redeploy | Virtual Machine | VM Availability, Percentage CPU (drop to zero), Network In/Out Total
VM Scale Set | VMSS Shutdown | VM Scale Set | Virtual Machine Scale Set Instance Count, Instance-level Percentage CPU
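Several of these metrics are most informative when split by a dimension, for example Total Requests by StatusCode for the Cosmos DB failover fault. Here's a hedged sketch of that dimension split using the same azure-monitor-query SDK; the Cosmos DB account resource ID is a placeholder.

```python
# Sketch: split Cosmos DB "Total Requests" (metric ID: TotalRequests) by the
# StatusCode dimension to spot throttled or failed responses during the fault.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Placeholder Cosmos DB account resource ID.
RESOURCE_ID = (
    "/subscriptions/<sub>/resourceGroups/<rg>"
    "/providers/Microsoft.DocumentDB/databaseAccounts/<account>"
)

client = MetricsQueryClient(DefaultAzureCredential())
response = client.query_resource(
    RESOURCE_ID,
    metric_names=["TotalRequests"],
    timespan=timedelta(hours=1),
    aggregations=[MetricAggregationType.COUNT],
    filter="StatusCode eq '*'",  # '*' yields one series per status code value
)

for metric in response.metrics:
    for series in metric.timeseries:
        total = sum(p.count or 0 for p in series.data)
        # metadata_values identifies the StatusCode for this series.
        print(series.metadata_values, total)
```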
Orchestration
Fault Name | Notes |
---|---|
Start/Stop Load Test | This isn't a fault. Metrics should be observed on the resources targeted by the Azure Load Testing service. |
Delay | This is a wait action. It has no direct metric impact but is used to control the timing of your experiment steps. |