Onboard custom models for inferencing with the AI toolchain operator (KAITO) on Azure Kubernetes Service (AKS)

As an AI engineer or developer, you might have to prototype and deploy AI workloads with a range of different model weights. AKS gives you the option to deploy inferencing workloads using open-source model presets that are supported out of the box and managed in the KAITO model registry, or to dynamically download models from the HuggingFace registry onto your AKS cluster at runtime.

In this article, you learn how to onboard a sample HuggingFace model for inferencing with the AI toolchain operator add-on on Azure Kubernetes Service (AKS), without having to manage custom model images.

Prerequisites

  • An Azure account with an active subscription. If you don't have an account, you can create one for free.
  • An AKS cluster with the AI toolchain operator add-on enabled. For more information, see Enable KAITO on an AKS cluster.
  • This example deployment requires quota for the Standard_NCads_A100_v4 virtual machine (VM) family in your Azure subscription. If you don't have quota for this VM family, request a quota increase. You can check your current quota and usage with the Azure CLI, as shown in the example after this list.
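
To check usage before deploying, list the compute usage for your region and filter for the NCads A100 v4 family. The filter string below is an assumption about how the family name appears in the usage output and might need adjusting for your subscription:

    az vm list-usage --location <azure-region> --output table | grep -i "NCADS A100 v4"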

Choose an open-source language model from HuggingFace

In this example, we use the BigScience Bloom-1B7 small language model. Alternatively, you can choose from thousands of text-generation models supported on HuggingFace.

  1. Connect to your AKS cluster using the az aks get-credentials command.

    az aks get-credentials --resource-group <resource-group-name> --name <aks-cluster-name>
    
  2. Clone the KAITO project GitHub repository using the git clone command.

    git clone https://github.com/kaito-project/kaito.git
    
  3. Confirm that the kaito-gpu-provisioner deployment is running successfully using the kubectl get deployment command.

    kubectl get deployment -n kube-system | grep kaito
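
    Your output should look similar to the following example; the replica counts and age vary by cluster:

    kaito-gpu-provisioner   1/1     1            1           6d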
    

Deploy your model inferencing workload using the KAITO workspace template

  1. Navigate to the kaito directory and open the docs/custom-model-integration/reference_image_deployment.yaml KAITO template. Replace the default values in the following fields with your model's requirements:

    • instanceType: The minimum VM size for this inference service deployment is Standard_NC24ads_A100_v4. For larger models, choose a VM size in the Standard_NCads_A100_v4 family with higher memory capacity.
    • MODEL_ID: Replace with your model's specific HuggingFace identifier, which appears after https://huggingface.co/ in the model card URL.
    • "--torch_dtype": Set to "float16" for compatibility with V100 GPUs. For A100, H100, or newer GPUs, use "bfloat16".
    apiVersion: kaito.sh/v1alpha1
    kind: Workspace
    metadata:
      name: workspace-custom-llm
    resource:
      instanceType: "Standard_NC24ads_A100_v4"
      labelSelector:
        matchLabels:
          apps: custom-llm
    inference:
      template: 
        spec:
          containers:
          - name: custom-llm-container
            image: ghcr.io/kaito-project/kaito/llm-reference-preset:latest
            command: ["accelerate"]
            args:
              - "launch"
              - "--num_processes"
              - "1"
              - "--num_machines"
              - "1"
              - "--gpu_ids"
              - "all"
              - "tfs/inference_api.py"
              - "--pipeline"
              - "text-generation"
              - "--trust_remote_code"
              - "--allow_remote_files"
              - "--pretrained_model_name_or_path"
              - "bigscience/bloom-1b7"
              - "--torch_dtype"
              - "bfloat16"
            volumeMounts:
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
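
    For example, to point the same template at a different HuggingFace text-generation model, you only change the model identifier argument in the args list; if you target V100-class GPUs, also switch the dtype to "float16" as noted above. The identifier shown here is a placeholder, not a real model:

      - "--pretrained_model_name_or_path"
      - "<huggingface-org>/<model-name>"
      - "--torch_dtype"
      - "float16"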
    
  2. Save these changes to your docs/custom-model-integration/reference_image_deployment.yaml file.

  3. Run the deployment in your AKS cluster using the kubectl apply command.

    kubectl apply -f docs/custom-model-integration/reference_image_deployment.yaml
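
    If the workspace is accepted, kubectl confirms the created resource with output similar to:

    workspace.kaito.sh/workspace-custom-llm created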
    

Test your custom model inferencing service

  1. Track the live resource changes in your KAITO workspace using the kubectl get workspace command.

    kubectl get workspace workspace-custom-llm -w
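
    Your output looks similar to the following example; the exact status columns depend on your KAITO version, and the readiness values change to True as provisioning completes:

    NAME                   INSTANCE                   RESOURCEREADY   INFERENCEREADY   WORKSPACEREADY   AGE
    workspace-custom-llm   Standard_NC24ads_A100_v4   True            True             True             21m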
    

    Note

    Machine readiness can take up to 10 minutes, and workspace readiness up to 20 minutes.

  2. Check your language model inference service and get the service IP address using the kubectl get svc command.

    export SERVICE_IP=$(kubectl get svc workspace-custom-llm -o jsonpath='{.spec.clusterIP}')
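
    You can confirm the variable is set by echoing it; the address shown here is illustrative and differs in every cluster:

    echo $SERVICE_IP
    # Example output (cluster-internal IP): 10.0.133.17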
    
  3. Test your custom model inference service with a sample input of your choice using the OpenAI API format:

    kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -X POST http://$SERVICE_IP/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "bloom-1b7",
        "prompt": "What sport should I play in rainy weather?",
        "max_tokens": 20
      }'
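
    The service returns a completion in the OpenAI response format. The following output is illustrative only; the generated text, identifiers, and token counts differ on each request:

    {
      "id": "cmpl-...",
      "object": "text_completion",
      "model": "bloom-1b7",
      "choices": [
        {
          "index": 0,
          "text": " Indoor sports such as badminton or table tennis are good options.",
          "finish_reason": "length"
        }
      ],
      "usage": {
        "prompt_tokens": 9,
        "completion_tokens": 20,
        "total_tokens": 29
      }
    }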
    

Clean up resources

If you no longer need these resources, you can delete them to avoid incurring extra Azure compute charges.

  1. Delete the KAITO inference workspace using the kubectl delete workspace command.

    kubectl delete workspace workspace-custom-llm
    
  2. Delete the GPU node pool that KAITO created for this workspace. One way to identify and remove it is shown in the sketch below.
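
    This is a minimal sketch, assuming the provisioned GPU nodes carry the apps: custom-llm label from the workspace's labelSelector; the node pool name is a placeholder that you look up from the node's agentpool label first:

    # Find the GPU node(s) provisioned for the workspace and note the agentpool label
    kubectl get nodes -l apps=custom-llm --show-labels

    # Delete the corresponding node pool (name is a placeholder)
    az aks nodepool delete --resource-group <resource-group-name> --cluster-name <aks-cluster-name> --name <kaito-gpu-node-pool-name>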

Next steps

In this article, you learned how to onboard a HuggingFace model for inferencing with the AI toolchain operator add-on directly to your AKS cluster. To learn more about AI and machine learning on AKS, see the following articles: