As an AI engineer or developer, you might have to prototype and deploy AI workloads with a range of different model weights. AKS gives you the option to deploy inferencing workloads using open-source model presets supported out of the box and managed in the KAITO model registry, or to dynamically download models from the HuggingFace registry at runtime onto your AKS cluster.
In this article, you learn how to onboard a sample HuggingFace model for inferencing with the AI toolchain operator add-on, without having to manage custom images, on Azure Kubernetes Service (AKS).
Prerequisites
- An Azure account with an active subscription. If you don't have an account, you can create one for free.
- An AKS cluster with the AI toolchain operator add-on enabled. For more information, see Enable KAITO on an AKS cluster.
- This example deployment requires quota for the `Standard_NCads_A100_v4` virtual machine (VM) family in your Azure subscription. If you don't have quota for this VM family, request a quota increase. You can check your current usage as shown in the sketch after this list.
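If you're not sure whether you have quota available, you can check your current usage with the Azure CLI. A minimal sketch; the region value is a placeholder for your cluster's region:

```bash
# List current vCPU usage and limits for your region; look for the
# Standard NCADS_A100_v4 family row to confirm available quota.
az vm list-usage --location <azure-region> --output table
```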
Choose an open-source language model from HuggingFace
In this example, we use the BigScience Bloom-1B7 small language model. Alternatively, you can choose from thousands of text-generation models supported on HuggingFace.
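To explore alternatives programmatically, you can query the HuggingFace Hub API for popular text-generation models. An optional sketch, assuming the public Hub API endpoint and its standard filter parameters:

```bash
# List a few of the most-downloaded text-generation models on the HuggingFace Hub.
curl -s "https://huggingface.co/api/models?pipeline_tag=text-generation&sort=downloads&limit=5"
```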
Connect to your AKS cluster using the `az aks get-credentials` command.

```bash
az aks get-credentials --resource-group <resource-group-name> --name <aks-cluster-name>
```
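Optionally, confirm that your kubeconfig points at the right cluster before continuing, for example by listing its nodes:

```bash
# Verify the credentials work by listing the cluster's nodes.
kubectl get nodes
```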
Clone the KAITO project GitHub repository using the `git clone` command.

```bash
git clone https://github.com/kaito-project/kaito.git
```
Confirm that your `kaito-gpu-provisioner` deployment is running successfully using the `kubectl get deployment` command.

```bash
kubectl get deployment -n kube-system | grep kaito
```
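If the deployment isn't ready, you can inspect it further. A short troubleshooting sketch, assuming the deployment runs in the `kube-system` namespace as above:

```bash
# Describe the deployment and tail its logs to diagnose startup issues.
kubectl describe deployment kaito-gpu-provisioner -n kube-system
kubectl logs -n kube-system deployment/kaito-gpu-provisioner --tail=50
```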
Deploy your model inferencing workload using the KAITO workspace template
Navigate to the `kaito` directory and open the `docs/custom-model-integration/reference_image_deployment.yaml` KAITO template. Replace the default values in the following fields with your model's requirements:

- `instanceType`: The minimum VM size for this inference service deployment is `Standard_NC24ads_A100_v4`. For larger model sizes, you can choose a VM in the `Standard_NCads_A100_v4` family with higher memory capacity.
- `MODEL_ID`: Replace with your model's specific HuggingFace identifier, which can be found after `https://huggingface.co/` in the model card URL.
- `"--torch_dtype"`: Set to `"float16"` for compatibility with V100 GPUs. For A100, H100, or newer GPUs, use `"bfloat16"`.
```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-custom-llm
resource:
  instanceType: "Standard_NC24ads_A100_v4"
  labelSelector:
    matchLabels:
      apps: custom-llm
inference:
  template:
    spec:
      containers:
        - name: custom-llm-container
          image: ghcr.io/kaito-project/kaito/llm-reference-preset:latest
          command: ["accelerate"]
          args:
            - "launch"
            - "--num_processes"
            - "1"
            - "--num_machines"
            - "1"
            - "--gpu_ids"
            - "all"
            - "tfs/inference_api.py"
            - "--pipeline"
            - "text-generation"
            - "--trust_remote_code"
            - "--allow_remote_files"
            - "--pretrained_model_name_or_path"
            - "bigscience/bloom-1b7"
            - "--torch_dtype"
            - "bfloat16"
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
```
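Before applying the manifest, you can optionally ask the API server to validate your edits against the Workspace CRD schema without creating anything. A minimal sketch using kubectl's server-side dry run:

```bash
# Validate the edited manifest against the cluster's CRD schema; nothing is created.
kubectl apply --dry-run=server -f docs/custom-model-integration/reference_image_deployment.yaml
```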
Save these changes to your `docs/custom-model-integration/reference_image_deployment.yaml` file.

Run the deployment in your AKS cluster using the `kubectl apply` command.

```bash
kubectl apply -f docs/custom-model-integration/reference_image_deployment.yaml
```
Test your custom model inferencing service
Track the live resource changes in your KAITO workspace using the `kubectl get workspace` command.

```bash
kubectl get workspace workspace-custom-llm -w
```
Note
Machine readiness can take up to 10 minutes, and workspace readiness up to 20 minutes.
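If the workspace stays unready longer than expected, inspecting its conditions and events can help. A short troubleshooting sketch:

```bash
# Show the workspace's conditions and recent events for troubleshooting.
kubectl describe workspace workspace-custom-llm
```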
Check your language model inference service and get the service IP address using the `kubectl get svc` command.

```bash
export SERVICE_IP=$(kubectl get svc workspace-custom-llm -o jsonpath='{.spec.clusterIP}')
```
Test your custom model inference service with a sample input of your choice using the OpenAI API format:

```bash
kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -X POST http://$SERVICE_IP/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bloom-1b7",
    "prompt": "What sport should I play in rainy weather?",
    "max_tokens": 20
  }'
```
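If you prefer to test from your local machine instead of a temporary pod, you can port-forward the service. A sketch assuming the service listens on port 80, as implied by the URL above:

```bash
# Forward local port 8080 to the inference service, then call it locally.
kubectl port-forward svc/workspace-custom-llm 8080:80 &
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "bloom-1b7", "prompt": "What sport should I play in rainy weather?", "max_tokens": 20}'
```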
Clean up resources
If you no longer need these resources, you can delete them to avoid incurring extra Azure compute charges.
Delete the KAITO inference workspace using the `kubectl delete workspace` command.

```bash
kubectl delete workspace workspace-custom-llm
```
Delete the GPU node pool created by KAITO in the same namespace as the `kaito-gpu-provisioner` deployment.
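To confirm the GPU nodes are also removed on the Azure side, you can inspect the cluster's node pools with the Azure CLI. A hedged sketch; the name of the node pool KAITO provisions varies, so check the list output before deleting:

```bash
# List node pools on the cluster, then delete the one KAITO provisioned.
az aks nodepool list --resource-group <resource-group-name> --cluster-name <aks-cluster-name> --output table
az aks nodepool delete --resource-group <resource-group-name> --cluster-name <aks-cluster-name> --name <kaito-gpu-nodepool-name>
```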
Next steps
In this article, you learned how to onboard a HuggingFace model for inferencing with the AI toolchain operator add-on directly to your AKS cluster. To learn more about AI and machine learning on AKS, see the following articles:
Azure Kubernetes Service