Hi JeffreyCMI,
Thanks for posting the solution here.
Solution:
User tasks are being scheduled on a node before the NVIDIA GPU runtime stack is fully initialized. I was unable to create a Start Task that checks and waits until the stack is fully initialized. Instead, I created a Start Task that unconditionally forces Docker to use the cgroupfs cgroup driver and restarts Docker before the VM is released for user tasks.
Here is the Start Task script that fixed this bug for me. It must be run as Admin, and I also set wait_for_success=true:
#!/bin/bash
# set -e
#
# Some Azure Batch nodes with GPUs intermittently fail to expose the GPU to containerized tasks.
# This was traced to Docker not explicitly using the cgroupfs driver, which causes inconsistent
# GPU device access inside container user tasks.
#
# The script ensures Docker uses `cgroupfs` and that NVIDIA cgroups are enabled before tasks run.
# By restarting Docker with corrected settings, the node becomes GPU-ready for all containers.
echo "=== Checking native.cgroupdriver=cgroupfs ==="
if ! grep -q "native.cgroupdriver=cgroupfs" /etc/docker/daemon.json; then
    echo "-> Adding native.cgroupdriver=cgroupfs"
    tmp=$(mktemp)
    # Note: this replaces any existing "exec-opts" list.
    # Only overwrite daemon.json if jq succeeded, so a failed run cannot leave an empty config.
    jq '. + { "exec-opts": ["native.cgroupdriver=cgroupfs"] }' /etc/docker/daemon.json > "$tmp" \
        && mv "$tmp" /etc/docker/daemon.json
    RESTART_DOCKER=1
else
    echo " 👍 Already set"
fi
echo
echo "Current 'exec-opts' in /etc/docker/daemon.json:"
jq '.["exec-opts"]' /etc/docker/daemon.json
echo
echo "=== Checking NVIDIA container config ==="
if ! grep -q "^no-cgroups = false" /etc/nvidia-container-runtime/config.toml; then
    echo "-> Disabling no-cgroups in NVIDIA config"
    # Covers both the commented-out line (#no-cgroups = false) and an explicit no-cgroups = true.
    sed -i -E 's/^[[:space:]]*#?[[:space:]]*no-cgroups[[:space:]]*=.*/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml
    RESTART_DOCKER=1
else
    echo " 👍 no-cgroups already false"
fi
echo
echo "Lines around 'no-cgroups' in /etc/nvidia-container-runtime/config.toml:"
grep -C2 "no-cgroups" /etc/nvidia-container-runtime/config.toml
echo
# Restart Docker if changes were made
# (RESTART_DOCKER is only set when a config file was edited, so default it to 0.)
if [ "${RESTART_DOCKER:-0}" = "1" ]; then
    echo "=== Restarting Docker ==="
    systemctl restart docker
    sleep 5
fi
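
One possible refinement, not part of the script that fixed the issue: the fixed pause after the Docker restart could be replaced with a bounded polling loop. The sketch below is only an illustration. `wait_for` is a hypothetical helper name, and the attempt count and delay are arbitrary values you would need to tune for your pool.

```bash
# wait_for ATTEMPTS DELAY CMD [ARGS...]
# Hypothetical helper (not in the original script): runs CMD until it exits 0,
# sleeping DELAY seconds between tries.
# Returns 0 as soon as CMD succeeds, 1 if every attempt fails.
wait_for() {
    local attempts=$1
    local delay=$2
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # Discard the command's output; only its exit status matters here.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # No need to sleep after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, calling `wait_for 30 2 docker info || exit 1` after `systemctl restart docker` would hold the Start Task until the Docker daemon responds, for up to roughly a minute, and fail the Start Task otherwise. This only confirms that the daemon is reachable. It does not prove that the GPU is visible inside containers.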
Please upvote it and accept the answer; doing so will help others in the community.