Tasks fail to detect GPU on some Pool nodes due to early startup race condition

JeffreyCMI 41 Reputation points
2025-06-18T20:33:27.05+00:00

We’re running container tasks on a GPU-enabled Azure Batch pool (Docker-based, Standard_NC4as_T4_v3). All nodes are configured identically, and all tasks are configured identically, running the same workload in the same Docker image with the same entrypoint.

Despite this, the entire first round of tasks on certain pool nodes intermittently fails with:

failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

Investigation shows:

  • Only ~10% of new nodes are affected. The first round of tasks on other (equivalent) nodes does find the GPU, logging: Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3072 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0001:00:00.0, compute capability: 7.5
  • Affected tasks are the first tasks to run on the VM after provisioning. Later Batch tasks on the same node succeed and detect the GPU correctly.
  • nvidia-smi works on the host: the GPU appears idle and no running processes are listed. By contrast, once a task has found the GPU, nvidia-smi shows GPU usage and lists the running processes.
  • No node configuration drift: by the time I can SSH into the VM (some minutes after task start), docker info, nvidia-container-cli info, /etc/docker/daemon.json, and the container runtime versions are consistent across nodes (good and bad).
  • No task configuration drift: same containerRunOptions (I am not using --gpus all), same image name and tag, same entrypoint.

It seems to me there's a race condition at node startup: the GPU driver or NVIDIA container runtime stack is not yet fully initialized when the first task is scheduled. This feels very much within Azure's responsibility and out of my control.
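
For reference, this is the kind of check I run over SSH to compare the host and container view of the GPU on a suspect node (a sketch: <task-image> stands in for our actual task image, and it assumes the image sets NVIDIA_VISIBLE_DEVICES, as the CUDA base images do, since we rely on the pool's default NVIDIA runtime rather than --gpus all):

    # Host view: the driver is loaded and the GPU shows as idle with no processes.
    nvidia-smi

    # Container view: on an affected node, the same check from inside a container
    # fails to see the GPU until the Docker/NVIDIA runtime stack has settled.
    docker run --rm <task-image> nvidia-smi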

We've encountered the bug intermittently since we upgraded our Batch pool image from

    publisher = "microsoft-azure-batch"
    offer     = "ubuntu-server-container-rdma"
    sku       = "20-04-lts"
    version   = "latest"

to

    publisher = "microsoft-dsvm"
    offer     = "ubuntu-hpc"
    sku       = "2204"
    version   = "latest"

Accepted answer
  1. Anusree Nashetty 5,735 Reputation points Microsoft External Staff Moderator
    2025-07-03T18:28:48.3633333+00:00

    Hi JeffreyCMI,

    Thanks for posting the solution here.

    Solution:
User tasks are being scheduled on the node before the NVIDIA GPU runtime stack is fully initialized. A Start Task that checks and waits until the stack is ready did not work out; instead, the fix is a Start Task that forces Docker to use the cgroupfs driver and restarts Docker before the VM is released for user tasks.

Here is the Start Task script that fixed the issue. It must run as Admin, with wait_for_success=true:

    #!/bin/bash
    # set -e
    #
    # Some Azure Batch nodes with GPUs intermittently fail to expose the GPU to containerized tasks.
    # This is tracked down to Docker not explicitly using the cgroupfs driver, causing inconsistencies
    # in GPU device access inside container user tasks.
    #
    # The script ensures Docker uses `cgroupfs` and that NVIDIA cgroups are enabled before tasks run.
    # By restarting Docker with corrected settings, the node becomes GPU-ready for all containers.

    echo "=== Checking native.cgroupdriver=cgroupfs ==="
    if ! grep -q "native.cgroupdriver=cgroupfs" /etc/docker/daemon.json; then
      echo "-> Adding native.cgroupdriver=cgroupfs"
      tmp=$(mktemp)
      jq '. + { "exec-opts": ["native.cgroupdriver=cgroupfs"] }' /etc/docker/daemon.json > "$tmp"
      mv "$tmp" /etc/docker/daemon.json
      RESTART_DOCKER=1
    else
      echo "   👍 Already set"
    fi

    echo
    echo "Current 'exec-opts' in /etc/docker/daemon.json:"
    jq '.["exec-opts"]' /etc/docker/daemon.json

    echo
    echo "=== Checking NVIDIA container config ==="
    if ! grep -q "^no-cgroups = false" /etc/nvidia-container-runtime/config.toml; then
      echo "-> Disabling no-cgroups in NVIDIA config"
      sed -i 's/#no-cgroups = false/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml
      RESTART_DOCKER=1
    else
      echo "   👍 no-cgroups already false"
    fi

    echo
    echo "Lines around 'no-cgroups' in /etc/nvidia-container-runtime/config.toml:"
    grep -C2 "no-cgroups" /etc/nvidia-container-runtime/config.toml

    echo
    # Restart Docker if changes were made
    if [ "$RESTART_DOCKER" == "1" ]; then
      echo "=== Restarting Docker ==="
      systemctl restart docker
      sleep 5
    fi
    

Please upvote and accept the answer; it will be helpful to others in the community.


1 additional answer

  1. JeffreyCMI 41 Reputation points
    2025-07-01T21:11:24.6733333+00:00

As @Anusree Nashetty confirmed in the other answer, user tasks are being scheduled on a node before the NVIDIA GPU runtime stack is fully initialized. I was unable to create a Start Task that checks and waits until the stack is fully initialized. Instead, I was able to create a Start Task that forces Docker to use cgroupfs no matter what and restarts Docker before releasing the VM for user tasks.

    Here is the Start Task script that fixed this bug for me. It must be run as Admin, and I also set wait_for_success=true:

    #!/bin/bash
    # set -e
    #
    # Some Azure Batch nodes with GPUs intermittently fail to expose the GPU to containerized tasks.
    # This is tracked down to Docker not explicitly using the cgroupfs driver, causing inconsistencies
    # in GPU device access inside container user tasks.
    #
    # The script ensures Docker uses `cgroupfs` and that NVIDIA cgroups are enabled before tasks run.
    # By restarting Docker with corrected settings, the node becomes GPU-ready for all containers.
    
    echo "=== Checking native.cgroupdriver=cgroupfs ==="
    if ! grep -q "native.cgroupdriver=cgroupfs" /etc/docker/daemon.json; then
      echo "-> Adding native.cgroupdriver=cgroupfs"
      tmp=$(mktemp)
      jq '. + { "exec-opts": ["native.cgroupdriver=cgroupfs"] }' /etc/docker/daemon.json > "$tmp"
      mv "$tmp" /etc/docker/daemon.json
      RESTART_DOCKER=1
    else
      echo "   👍 Already set"
    fi
    
    echo
    echo "Current 'exec-opts' in /etc/docker/daemon.json:"
    jq '.["exec-opts"]' /etc/docker/daemon.json
    
    echo
    echo "=== Checking NVIDIA container config ==="
    if ! grep -q "^no-cgroups = false" /etc/nvidia-container-runtime/config.toml; then
      echo "-> Disabling no-cgroups in NVIDIA config"
      sed -i 's/#no-cgroups = false/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml
      RESTART_DOCKER=1
    else
      echo "   👍 no-cgroups already false"
    fi
    
    echo
    echo "Lines around 'no-cgroups' in /etc/nvidia-container-runtime/config.toml:"
    grep -C2 "no-cgroups" /etc/nvidia-container-runtime/config.toml
    
    echo
    # Restart Docker if changes were made
    if [ "$RESTART_DOCKER" == "1" ]; then
      echo "=== Restarting Docker ==="
      systemctl restart docker
      sleep 5
    fi
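
    For anyone wiring this up in Terraform, here is roughly how the Start Task is attached to the pool (a sketch, assuming the azurerm provider's azurerm_batch_pool start_task schema: the resource names, pool name, node count, storage URL, and script file name are placeholders):

    resource "azurerm_batch_pool" "gpu_pool" {
      name                = "gpu-pool"
      resource_group_name = azurerm_resource_group.example.name
      account_name        = azurerm_batch_account.example.name
      vm_size             = "Standard_NC4as_T4_v3"
      node_agent_sku_id   = "batch.node.ubuntu 22.04"

      fixed_scale {
        target_dedicated_nodes = 2
      }

      storage_image_reference {
        publisher = "microsoft-dsvm"
        offer     = "ubuntu-hpc"
        sku       = "2204"
        version   = "latest"
      }

      container_configuration {
        type = "DockerCompatible"
      }

      # Run the cgroupfs fix-up script elevated on every new node, and hold back
      # user tasks until it has succeeded.
      start_task {
        command_line     = "/bin/bash -c 'bash fix-gpu-cgroups.sh'"
        wait_for_success = true

        user_identity {
          auto_user {
            elevation_level = "Admin"
            scope           = "Pool"
          }
        }

        resource_file {
          http_url  = "https://<your-storage-account>.blob.core.windows.net/scripts/fix-gpu-cgroups.sh"
          file_path = "fix-gpu-cgroups.sh"
        }
      }
    }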
    
    1 person found this answer helpful.

