Tasks fail to detect GPU on some Pool nodes due to early startup race condition

JeffreyCMI 41 Reputation points
2025-06-18T20:33:27.05+00:00

We’re running container tasks on a GPU-enabled Azure Batch pool (Docker-based, Standard_NC4as_T4_v3). All nodes are configured identically, and all tasks are configured identically, running the same workload in the same Docker image with the same entrypoint.

Despite this, the entire first round of tasks on certain pool nodes intermittently fails with:

failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

Investigation shows:

  • Only ~10% of new nodes are affected. The first round of tasks on other (equivalent) nodes does find the GPU, logging: Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3072 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0001:00:00.0, compute capability: 7.5
  • Affected tasks are the first tasks to run on the VM after provisioning. Later Batch tasks on the same node succeed and detect the GPU correctly.
  • nvidia-smi works on the host: the GPU appears idle and no running processes are listed. By contrast, once a task has found the GPU, nvidia-smi shows GPU usage and lists the running processes.
  • No node configuration drift: by the time I can SSH into the VM (some minutes after task start), docker info, nvidia-container-cli info, /etc/docker/daemon.json, and the container runtime versions are consistent across nodes (good and bad).
  • No task configuration drift: same containerRunOptions (I am not using --gpus all), same image name and tag, same entrypoint.

It seems to me there's a race condition at node startup: the GPU driver or NVIDIA container runtime stack is not yet fully initialized when the first task is scheduled. This feels very much within Azure's responsibility and out of my control.
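
For reference, this is the kind of check I run over SSH to compare the host and container view of the GPU on a suspect node (a sketch: <task-image> stands in for our actual task image, and it assumes the image sets NVIDIA_VISIBLE_DEVICES, as the CUDA base images do, since we rely on the pool's default NVIDIA runtime rather than --gpus all):

    # Host view: the driver is loaded and the GPU shows as idle with no processes.
    nvidia-smi

    # Container view: on an affected node, the same check from inside a container
    # fails to see the GPU until the Docker/NVIDIA runtime stack has settled.
    docker run --rm <task-image> nvidia-smi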

We've encountered the bug intermittently since we upgraded our Batch pool image from

    publisher = "microsoft-azure-batch"
    offer     = "ubuntu-server-container-rdma"
    sku       = "20-04-lts"
    version   = "latest"

to

    publisher = "microsoft-dsvm"
    offer     = "ubuntu-hpc"
    sku       = "2204"
    version   = "latest"

Accepted answer
  1. Anusree Nashetty 5,735 Reputation points Microsoft External Staff Moderator
    2025-07-03T18:28:48.3633333+00:00

    Hi JeffreyCMI,

    Thanks for posting the solution here.

    Solution:
User tasks are being scheduled on the node before the NVIDIA GPU runtime stack is fully initialized. A Start Task that checks and waits until the stack is ready did not work out; instead, the fix is a Start Task that forces Docker to use the cgroupfs driver and restarts Docker before the VM is released for user tasks.

Here is the Start Task script that fixed the issue. It must run as Admin, with wait_for_success=true:

    #!/bin/bash
    # set -e
    #
    # Some Azure Batch nodes with GPUs intermittently fail to expose the GPU to containerized tasks.
    # This is tracked down to Docker not explicitly using the cgroupfs driver, causing inconsistencies
    # in GPU device access inside container user tasks.
    #
    # The script ensures Docker uses `cgroupfs` and that NVIDIA cgroups are enabled before tasks run.
    # By restarting Docker with corrected settings, the node becomes GPU-ready for all containers.

    echo "=== Checking native.cgroupdriver=cgroupfs ==="
    if ! grep -q "native.cgroupdriver=cgroupfs" /etc/docker/daemon.json; then
      echo "-> Adding native.cgroupdriver=cgroupfs"
      tmp=$(mktemp)
      jq '. + { "exec-opts": ["native.cgroupdriver=cgroupfs"] }' /etc/docker/daemon.json > "$tmp"
      mv "$tmp" /etc/docker/daemon.json
      RESTART_DOCKER=1
    else
      echo "   👍 Already set"
    fi

    echo
    echo "Current 'exec-opts' in /etc/docker/daemon.json:"
    jq '.["exec-opts"]' /etc/docker/daemon.json

    echo
    echo "=== Checking NVIDIA container config ==="
    if ! grep -q "^no-cgroups = false" /etc/nvidia-container-runtime/config.toml; then
      echo "-> Disabling no-cgroups in NVIDIA config"
      sed -i 's/#no-cgroups = false/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml
      RESTART_DOCKER=1
    else
      echo "   👍 no-cgroups already false"
    fi

    echo
    echo "Lines around 'no-cgroups' in /etc/nvidia-container-runtime/config.toml:"
    grep -C2 "no-cgroups" /etc/nvidia-container-runtime/config.toml

    echo
    # Restart Docker if changes were made
    if [ "$RESTART_DOCKER" == "1" ]; then
      echo "=== Restarting Docker ==="
      systemctl restart docker
      sleep 5
    fi
    

Please upvote and accept the answer; it will be helpful to others in the community.


1 additional answer

  1. JeffreyCMI 41 Reputation points
    2025-07-01T21:11:24.6733333+00:00

As @Anusree Nashetty confirmed in the other answer, user tasks are being scheduled on a node before the NVIDIA GPU runtime stack is fully initialized. I was unable to create a Start Task that checks and waits until the stack is fully initialized. Instead, I was able to create a Start Task that forces Docker to use cgroupfs no matter what and restarts Docker before releasing the VM for user tasks.

    Here is the Start Task script that fixed this bug for me. It must be run as Admin, and I also set wait_for_success=true:

    #!/bin/bash
    # set -e
    #
    # Some Azure Batch nodes with GPUs intermittently fail to expose the GPU to containerized tasks.
    # This is tracked down to Docker not explicitly using the cgroupfs driver, causing inconsistencies
    # in GPU device access inside container user tasks.
    #
    # The script ensures Docker uses `cgroupfs` and that NVIDIA cgroups are enabled before tasks run.
    # By restarting Docker with corrected settings, the node becomes GPU-ready for all containers.
    
    echo "=== Checking native.cgroupdriver=cgroupfs ==="
    if ! grep -q "native.cgroupdriver=cgroupfs" /etc/docker/daemon.json; then
      echo "-> Adding native.cgroupdriver=cgroupfs"
      tmp=$(mktemp)
      jq '. + { "exec-opts": ["native.cgroupdriver=cgroupfs"] }' /etc/docker/daemon.json > "$tmp"
      mv "$tmp" /etc/docker/daemon.json
      RESTART_DOCKER=1
    else
      echo "   👍 Already set"
    fi
    
    echo
    echo "Current 'exec-opts' in /etc/docker/daemon.json:"
    jq '.["exec-opts"]' /etc/docker/daemon.json
    
    echo
    echo "=== Checking NVIDIA container config ==="
    if ! grep -q "^no-cgroups = false" /etc/nvidia-container-runtime/config.toml; then
      echo "-> Disabling no-cgroups in NVIDIA config"
      sed -i 's/#no-cgroups = false/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml
      RESTART_DOCKER=1
    else
      echo "   👍 no-cgroups already false"
    fi
    
    echo
    echo "Lines around 'no-cgroups' in /etc/nvidia-container-runtime/config.toml:"
    grep -C2 "no-cgroups" /etc/nvidia-container-runtime/config.toml
    
    echo
    # Restart Docker if changes were made
    if [ "$RESTART_DOCKER" == "1" ]; then
      echo "=== Restarting Docker ==="
      systemctl restart docker
      sleep 5
    fi
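
    For anyone wiring this up in Terraform, here is roughly how the Start Task is attached to the pool (a sketch, assuming the azurerm provider's azurerm_batch_pool start_task schema: the resource names, pool name, node count, storage URL, and script file name are placeholders):

    resource "azurerm_batch_pool" "gpu_pool" {
      name                = "gpu-pool"
      resource_group_name = azurerm_resource_group.example.name
      account_name        = azurerm_batch_account.example.name
      vm_size             = "Standard_NC4as_T4_v3"
      node_agent_sku_id   = "batch.node.ubuntu 22.04"

      fixed_scale {
        target_dedicated_nodes = 2
      }

      storage_image_reference {
        publisher = "microsoft-dsvm"
        offer     = "ubuntu-hpc"
        sku       = "2204"
        version   = "latest"
      }

      container_configuration {
        type = "DockerCompatible"
      }

      # Run the cgroupfs fix-up script elevated on every new node, and hold back
      # user tasks until it has succeeded.
      start_task {
        command_line     = "/bin/bash -c 'bash fix-gpu-cgroups.sh'"
        wait_for_success = true

        user_identity {
          auto_user {
            elevation_level = "Admin"
            scope           = "Pool"
          }
        }

        resource_file {
          http_url  = "https://<your-storage-account>.blob.core.windows.net/scripts/fix-gpu-cgroups.sh"
          file_path = "fix-gpu-cgroups.sh"
        }
      }
    }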
    
    1 person found this answer helpful.

