Monitor and visualize AI inference metrics on Azure Kubernetes Service (AKS) with the AI toolchain operator add-on

Monitoring and observability play a key role in maintaining high performance and low cost of your AI workload deployments in Azure Kubernetes Service (AKS). Visibility into system and performance metrics can indicate the limits of your underlying infrastructure and motivate real-time adjustments and optimizations to reduce workload interruptions. Monitoring also provides valuable insights into resource utilization for cost-effective management of computational resources and accurate provisioning.

The Kubernetes AI Toolchain Operator (KAITO) is a managed add-on for AKS that simplifies deployment and operations for AI models in your AKS cluster.

In KAITO version 0.4.4 and later, the vLLM inference runtime is enabled by default in the AKS managed add-on. vLLM is a library for language model inference and serving. It surfaces key system performance, resource usage, and request processing metrics in Prometheus format that you can use to evaluate your KAITO inference deployments.

In this article, you learn how to monitor and visualize vLLM inference metrics from the AI toolchain operator add-on by using Azure Managed Prometheus and Azure Managed Grafana on your AKS cluster.

Before you begin

  • This article assumes that you have an existing AKS cluster. If you don't have a cluster, create one by using the Azure CLI, Azure PowerShell, or the Azure portal.
  • Install and configure Azure CLI version 2.47.0 or later. To find your version, run az --version. To install or update, see Install the Azure CLI.
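
For example, you can confirm your Azure CLI version and connect to your cluster before you continue. This is a quick sketch; the resource group and cluster names are placeholders for your own values:

    # Confirm that the Azure CLI version is 2.47.0 or later.
    az --version

    # Merge the AKS cluster credentials into your kubeconfig.
    # Replace the placeholder resource group and cluster names with your own.
    az aks get-credentials --resource-group <your-resource-group> --name <your-aks-cluster>

    # Verify that kubectl can reach the cluster.
    kubectl get nodes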

Prerequisites

  • The AI toolchain operator add-on enabled on your AKS cluster.
  • The managed service for Prometheus and Azure Managed Grafana enabled for your AKS cluster.

Deploy a KAITO inference service

In this example, you collect metrics for the Qwen-2.5-coder-7B-instruct language model.

  1. Start by applying the following KAITO workspace custom resource to your cluster (a sketch of the manifest's general shape appears after these steps):

    kubectl apply -f https://raw.githubusercontent.com/Azure/kaito/main/examples/inference/kaito_workspace_qwen_2.5_coder_7b-instruct.yaml
    
  2. Track the live resource changes in your KAITO workspace:

    kubectl get workspace workspace-qwen-2-5-coder-7b-instruct -w
    

    Note

    Machine readiness can take up to 10 minutes, and workspace readiness can take up to 20 minutes depending on the size of your language model.

  3. Confirm that your inference service is running and get the service IP address:

    export SERVICE_IP=$(kubectl get svc workspace-qwen-2-5-coder-7b-instruct -o jsonpath='{.spec.clusterIP}')
    
    echo $SERVICE_IP
    
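For reference, the workspace custom resource applied in step 1 follows the KAITO Workspace schema and has the general shape sketched below. The apiVersion, instance type, and label selector shown here are illustrative assumptions based on common KAITO examples, not the exact contents of the referenced file:

    apiVersion: kaito.sh/v1beta1        # May be kaito.sh/v1alpha1 on older KAITO versions.
    kind: Workspace
    metadata:
      name: workspace-qwen-2-5-coder-7b-instruct
    resource:
      instanceType: "Standard_NC24ads_A100_v4"   # GPU VM size; shown for illustration only.
      labelSelector:
        matchLabels:
          apps: qwen-2-5-coder
    inference:
      preset:
        name: qwen2.5-coder-7b-instruct          # Inference preset name, referenced again in the Grafana steps.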

Surface KAITO inference metrics to the managed service for Prometheus

By default, the KAITO inference service exposes Prometheus metrics at its /metrics endpoint.
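
Optionally, before you wire these metrics into the managed service for Prometheus, you can preview them by port-forwarding the inference service and querying the endpoint directly. This sketch assumes the service exposes the runtime on port 80; adjust the port if your service differs:

    # Forward a local port to the KAITO inference service (assumes service port 80).
    kubectl port-forward svc/workspace-qwen-2-5-coder-7b-instruct 8080:80 &

    # Preview the metric names exposed at the /metrics endpoint.
    # vLLM metrics are typically prefixed with "vllm:".
    curl -s http://localhost:8080/metrics | grep "^vllm:" | head -n 20

    # Stop the port-forward when you're done.
    kill %1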

  1. Add the following label to your KAITO inference service so that a Kubernetes ServiceMonitor deployment can detect it:

    kubectl label svc workspace-qwen-2-5-coder-7b-instruct App=qwen-2-5-coder 
    
  2. Create a ServiceMonitor resource to define the inference service endpoints and the required configurations to scrape the vLLM Prometheus metrics. Export these metrics to the managed service for Prometheus by deploying the following ServiceMonitor YAML manifest in the kube-system namespace:

    cat <<EOF | kubectl apply -n kube-system -f -
    apiVersion: azmonitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: prometheus-kaito-monitor
    spec:
      selector:
        matchLabels:
          App: qwen-2-5-coder
      endpoints:
      - port: http
        interval: 30s
        path: /metrics
        scheme: http
    EOF
    

    Check for the following output to verify that the ServiceMonitor is created:

    servicemonitor.azmonitoring.coreos.com/prometheus-kaito-monitor created
    
  3. Verify that your ServiceMonitor deployment is running successfully:

    kubectl get servicemonitor prometheus-kaito-monitor -n kube-system
    
  4. In the Azure portal, verify that vLLM metrics are successfully collected in the managed service for Prometheus.

    1. In your Azure Monitor workspace, go to Managed Prometheus > Prometheus explorer.

    2. Select the Grid tab and confirm that a metrics item is associated with the job named workspace-qwen-2-5-coder-7b-instruct.

      Note

      The up value of this item should be 1, which indicates that Prometheus is successfully scraping metrics from your AI inference service endpoint. If the job doesn't appear or up stays at 0, check the service labels and port names, as shown after these steps.
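
A common cause of a missing job or an up value of 0 is a mismatch between the ServiceMonitor selector and the service. The following checks confirm the two things this example's ServiceMonitor depends on: the App=qwen-2-5-coder label and a service port named http:

    # Confirm that the service carries the App=qwen-2-5-coder label selected by the ServiceMonitor.
    kubectl get svc workspace-qwen-2-5-coder-7b-instruct --show-labels

    # List the service's port names; the ServiceMonitor endpoint targets a port named "http".
    kubectl get svc workspace-qwen-2-5-coder-7b-instruct -o jsonpath='{.spec.ports[*].name}'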

Visualize KAITO inference metrics in Azure Managed Grafana

  1. The vLLM project provides a Grafana dashboard configuration named grafana.json for inference workload monitoring.

  2. Go to the bottom of the examples page and copy the entire contents of the grafana.json file:

    Screenshot of vLLM Grafana dashboard configuration.

  3. Complete the steps to import the Grafana configurations into a new dashboard in Azure Managed Grafana.

  4. Go to your Managed Grafana endpoint, view the available dashboards, and select the vLLM dashboard.

    Screenshot of available dashboards in Azure Managed Grafana.

  5. To begin collecting data for your selected model deployment, confirm that the datasource shown at the top left of the Grafana dashboard is set to the managed service for Prometheus instance that you created for this example.

  6. Copy the inference preset name defined in your KAITO workspace to the model_name field in the Grafana dashboard. In this example, the model name is qwen2.5-coder-7b-instruct.

  7. After a few moments, verify that the metrics for your KAITO inference service appear in the vLLM Grafana dashboard.

    Screenshot of a vLLM Grafana dashboard and an example inference service deployment.

    Note

    The value of these inference metrics remains 0 until requests are submitted to the model inference server. To see nonzero values, send a few test requests to the service, as sketched below.
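
The following is a minimal sketch for generating that traffic. It assumes the KAITO vLLM runtime exposes an OpenAI-compatible API on service port 80 at /v1/chat/completions and that the model is registered under the preset name qwen2.5-coder-7b-instruct; verify the port, path, and model name against your deployment before using it:

    # Forward a local port to the inference service (assumes service port 80).
    kubectl port-forward svc/workspace-qwen-2-5-coder-7b-instruct 8080:80 &

    # Send a few requests to the OpenAI-compatible chat completions endpoint.
    # The path and model name are assumptions; adjust them to match your deployment.
    for i in 1 2 3; do
      curl -s http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "qwen2.5-coder-7b-instruct", "messages": [{"role": "user", "content": "Write a hello world function in Python."}], "max_tokens": 64}'
    done

    # Stop the port-forward when you're done.
    kill %1

After the requests complete, refresh the Grafana dashboard; the request and token panels should begin to show nonzero values.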