Azure AutoML training randomly (?) fails: Failed to pull image mcr.microsoft.com/azureml/curated/azureml-automl-dnn-gpu:122. Image pull timed out after 7200s

Riissanen Juha 30 Reputation points
2025-08-04T11:11:53.9133333+00:00

I am using Azure ML AutoML SDK V1 (moving to SDK 2 sometime later this autumn). I am not using AutoML designer / portal web interface.

Recently my training jobs have started to fail at random. The common denominator seems to be this error:

"Failed to execute command group with error An unexpected error occurred while executing command due to: Failed to pull image mcr.microsoft.com/azureml/curated/azureml-automl-dnn-gpu:122. Image pull timed out after 7200s"

"Warning: AzureMLCompute job failed OrchestrateJobError: Failed to execute command group with error An unexpected error occurred while executing command due to: Failed to pull image mcr.microsoft.com/azureml/curated/azureml-automl-dnn-gpu:122. Image pull timed out after 7200s Appinsights Reachable: Some(true)"

Can anybody help me understand how to tackle this? Is there a parameter somewhere that I can tweak? Sometimes the error goes away when re-executing the training pipeline step, but it seems to happen more often by the day, and it is quite annoying.

I am running my jobs in West Europe region.

Azure Machine Learning

Accepted answer
  1. Jerald Felix 4,450 Reputation points
    2025-08-05T01:09:36.8766667+00:00

    Hello Riissanen Juha!

    That “image pull timed out after 7200 s” message is almost always a sign that the compute node can’t fetch the curated AutoML GPU Docker image quickly enough, so Docker just gives up after two hours. Below is the why-it-happens rundown and a toolbox of fixes that usually make the flakiness disappear.

    Why the pull stalls

    | Root cause | What's going on under the hood |
    | --- | --- |
    | Cold, short-lived nodes | Every time your AmlCompute cluster scales from 0→1, the new VM has to download the 12-14 GB azureml-automl-dnn-gpu image from mcr.microsoft.com. If the run finishes before the pull, it times out on the next retry. |
    | Slow path to MCR | West Europe has been seeing intermittent throttling on the Microsoft Container Registry edge this week; the first 95 % of the blob downloads fast, then the last layer crawls. |
    | Firewall / Private-link rules | If your workspace sits in a VNet, make sure *.mcr.microsoft.com (and the regional mirrors that start with mcrflowprodweu*) are allowed egress; otherwise the pull falls back to a public hop that's rate-limited. |
    | Tiny OS disk on the VM SKU | Standard_NC6 / ND6 SKUs ship with a 30 GB OS disk. After CUDA drivers and temp data, only ~8 GB remain; just enough to start downloading before Docker runs out of space and quietly stalls. |
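
    If you want to check the firewall point from the node's own point of view (rather than from the portal), a minimal Python probe such as the sketch below can confirm that outbound 443 to MCR is open. The host list is an assumption; extend it with whatever hostnames your firewall logs show being blocked.

        import socket

        # Hosts the image pull needs to reach; add regional mirror hostnames if your logs show them.
        HOSTS = ["mcr.microsoft.com"]

        for host in HOSTS:
            try:
                # A plain TCP connect on 443 is enough to prove egress is open.
                with socket.create_connection((host, 443), timeout=10):
                    print(f"{host}:443 reachable")
            except OSError as exc:
                print(f"{host}:443 blocked -> {exc}")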

    Five things that fix it 9 times out of 10

    1. Keep at least one node warm
       The image is cached on that always-on VM, so subsequent runs start in seconds.

          from azureml.core import Workspace
          from azureml.core.compute import ComputeTarget

          # "gpu-cluster" is a placeholder - use your AmlCompute cluster's name
          ws = Workspace.from_config()
          cluster = ComputeTarget(workspace=ws, name="gpu-cluster")
          cluster.update(min_nodes=1, idle_seconds_before_scaledown=1800)  # keep one node for 30 min

    2. Switch to the smaller tagged image
       Tag 121 is ~3 GB lighter than 122 and still fully supported.

          from azureml.core import Environment
          from azureml.train.automl import AutoMLConfig

          env = Environment.from_docker_image(
              name="automl-gpu-121",
              image="mcr.microsoft.com/azureml/curated/azureml-automl-dnn-gpu:121")
          automl_config = AutoMLConfig(..., featurization='auto', environment=env)

    3. Pin a roomy disk SKU
       In the compute cluster blade set OS disk size = 120 GB, or choose the Standard_NC6s_v3 SKU (940 GB ephemeral disk). No space issues → no silent stalls.

    4. Pre-pull once, then reuse
       Submit a dummy training run that just sleeps for 5 minutes; the image gets pulled and cached. As long as the node isn't deallocated, your real AutoML jobs will reuse it (see the sketch after this list).

    5. Update to SDK v2 + the July 2025 image
       SDK v2's default AutoML image streams layers concurrently and retries faster. Users who upgraded last week report the timeouts are gone.
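
    For point 4, here is a minimal sketch of such a pre-pull run with SDK v1. The script name sleep.py, the experiment name and the cluster name "gpu-cluster" are placeholders (sleep.py would just do import time; time.sleep(300)); the image tag matches the one in your error message.

        from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig

        ws = Workspace.from_config()

        # Reference the same curated image the AutoML job will pull, so the node caches it.
        env = Environment.from_docker_image(
            name="automl-gpu-prepull",
            image="mcr.microsoft.com/azureml/curated/azureml-automl-dnn-gpu:122")

        # sleep.py is a hypothetical one-liner: import time; time.sleep(300)
        src = ScriptRunConfig(
            source_directory=".",
            script="sleep.py",
            compute_target="gpu-cluster",  # placeholder: your AmlCompute cluster name
            environment=env)

        Experiment(ws, "prepull-automl-image").submit(src)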

    Quick health check

    az network watcher test-connectivity \
      --source-resource <your-vm-id> \
      --dest-address mcr.microsoft.com \
      --protocol Tcp --dest-port 443
    

    If that test shows > 200 ms RTT or drops, the bottleneck is network and warm-node caching is the surest fix.


    When it’s still flaky

    Look at /var/lib/docker/tmp on a failing node; low free space means the problem is disk, not network (see the sketch after this list).

    Check Azure Status for “Container Registry – West Europe”; an active incident can last a few hours.

    Please reach out to Q&A support via private message with the failing run ID; the back-end team can confirm whether the blob pull was throttled or the node ran out of disk, and escalate to the product group if needed.
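
    To check the disk point quickly, here is a small sketch you can run on the node itself (for example over an SSH session; from inside the job container you would only see the container filesystem). The /var/lib/docker path is the default Docker data root, so adjust the paths if your nodes differ.

        import shutil

        # Default locations to inspect; low free space here means the pull stalls on disk, not network.
        for path in ("/", "/var/lib/docker"):
            try:
                usage = shutil.disk_usage(path)
                print(f"{path}: {usage.free / 2**30:.1f} GiB free of {usage.total / 2**30:.1f} GiB")
            except FileNotFoundError:
                print(f"{path}: not present on this node")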

    Give those tweaks a spin and your AutoML pipeline should be humming again instead of timing out. Ping me if anything still feels off!

    Best regards,

    Jerald Felix.


1 additional answer

  1. Riissanen Juha 30 Reputation points
    2025-08-06T13:51:17.44+00:00

    The reason for the issue was a faulty firewall appliance. Once that was fixed, everything started to run smoothly again.

    The answer by @Jerald Felix was very informative and taught me a lot about what happens under the hood / behind the scenes. Big thanks!

    1 person found this answer helpful.
