Azure AutoML training randomly (?) fails: Failed to pull image mcr.microsoft.com/azureml/curated/azureml-automl-dnn-gpu:122. Image pull timed out after 7200s

Riissanen Juha 30 Reputation points
2025-08-04T11:11:53.9133333+00:00

I am using Azure ML AutoML SDK V1 (moving to SDK 2 sometime later this autumn). I am not using AutoML designer / portal web interface.

Recently my training jobs have started to fail at random. The common denominator seems to be this error:

"Failed to execute command group with error An unexpected error occurred while executing command due to: Failed to pull image mcr.microsoft.com/azureml/curated/azureml-automl-dnn-gpu:122. Image pull timed out after 7200s"

"Warning: AzureMLCompute job failed OrchestrateJobError: Failed to execute command group with error An unexpected error occurred while executing command due to: Failed to pull image mcr.microsoft.com/azureml/curated/azureml-automl-dnn-gpu:122. Image pull timed out after 7200s Appinsights Reachable: Some(true)"

Can anybody help me understand how to tackle this? Is there a parameter somewhere that I can tweak? Sometimes the error goes away when re-executing the training pipeline step, but it seems to happen more often by the day, and it is quite annoying.

I am running my jobs in West Europe region.

Azure Machine Learning

Accepted answer
  1. Jerald Felix 4,450 Reputation points
    2025-08-05T01:09:36.8766667+00:00

    Hello Riissanen Juha!

    That “image pull timed out after 7200 s” message is almost always a sign that the compute node can’t fetch the curated AutoML GPU Docker image quickly enough, so Docker just gives up after two hours. Below is the why-it-happens rundown and a toolbox of fixes that usually make the flakiness disappear.

    Why the pull stalls

    | Root cause | What's going on under the hood |
    | --- | --- |
    | Cold, short-lived nodes | Every time your AmlCompute cluster scales from 0→1, the new VM has to download the 12-14 GB azureml-automl-dnn-gpu image from mcr.microsoft.com. If the run finishes before the pull, it times out on the next retry. |
    | Slow path to MCR | West Europe has been seeing intermittent throttling on the Microsoft Container Registry edge this week; the first 95 % of the blob downloads fast, then the last layer crawls. |
    | Firewall / Private-link rules | If your workspace sits in a VNet, make sure *.mcr.microsoft.com (and the regional mirrors that start with mcrflowprodweu*) are allowed egress; otherwise the pull falls back to a public hop that's rate-limited. |
    | Tiny OS disk on the VM SKU | Standard_NC6 / ND6 SKUs ship with a 30 GB OS disk. After CUDA drivers and temp data, only ~8 GB remain; just enough to start downloading before Docker runs out of space and quietly stalls. |
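
    If you want to check the firewall point from the node's own point of view (rather than from the portal), a minimal Python probe such as the sketch below can confirm that outbound 443 to MCR is open. The host list is an assumption; extend it with whatever hostnames your firewall logs show being blocked.

        import socket

        # Hosts the image pull needs to reach; add regional mirror hostnames if your logs show them.
        HOSTS = ["mcr.microsoft.com"]

        for host in HOSTS:
            try:
                # A plain TCP connect on 443 is enough to prove egress is open.
                with socket.create_connection((host, 443), timeout=10):
                    print(f"{host}:443 reachable")
            except OSError as exc:
                print(f"{host}:443 blocked -> {exc}")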

    Five things that fix it 9 times out of 10

    1. Keep at least one node warm
       The image is cached on that always-on VM, so subsequent runs start in seconds.

          from azureml.core import Workspace
          from azureml.core.compute import ComputeTarget

          # "gpu-cluster" is a placeholder - use your AmlCompute cluster's name
          ws = Workspace.from_config()
          cluster = ComputeTarget(workspace=ws, name="gpu-cluster")
          cluster.update(min_nodes=1, idle_seconds_before_scaledown=1800)  # keep one node for 30 min

    2. Switch to the smaller tagged image
       Tag 121 is ~3 GB lighter than 122 and still fully supported.

          from azureml.core import Environment
          from azureml.train.automl import AutoMLConfig

          env = Environment.from_docker_image(
              name="automl-gpu-121",
              image="mcr.microsoft.com/azureml/curated/azureml-automl-dnn-gpu:121")
          automl_config = AutoMLConfig(..., featurization='auto', environment=env)

    3. Pin a roomy disk SKU
       In the compute cluster blade set OS disk size = 120 GB, or choose the Standard_NC6s_v3 SKU (940 GB ephemeral disk). No space issues → no silent stalls.

    4. Pre-pull once, then reuse
       Submit a dummy training run that just sleeps for 5 minutes; the image gets pulled and cached. As long as the node isn't deallocated, your real AutoML jobs will reuse it (see the sketch after this list).

    5. Update to SDK v2 + the July 2025 image
       SDK v2's default AutoML image streams layers concurrently and retries faster. Users who upgraded last week report the timeouts are gone.
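
    For point 4, here is a minimal sketch of such a pre-pull run with SDK v1. The script name sleep.py, the experiment name and the cluster name "gpu-cluster" are placeholders (sleep.py would just do import time; time.sleep(300)); the image tag matches the one in your error message.

        from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig

        ws = Workspace.from_config()

        # Reference the same curated image the AutoML job will pull, so the node caches it.
        env = Environment.from_docker_image(
            name="automl-gpu-prepull",
            image="mcr.microsoft.com/azureml/curated/azureml-automl-dnn-gpu:122")

        # sleep.py is a hypothetical one-liner: import time; time.sleep(300)
        src = ScriptRunConfig(
            source_directory=".",
            script="sleep.py",
            compute_target="gpu-cluster",  # placeholder: your AmlCompute cluster name
            environment=env)

        Experiment(ws, "prepull-automl-image").submit(src)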

    Quick health check

    az network watcher test-connectivity \
      --source-resource <your-vm-id> \
      --dest-address mcr.microsoft.com \
      --protocol Tcp --dest-port 443
    

    If that test shows > 200 ms RTT or drops, the bottleneck is network and warm-node caching is the surest fix.


    When it’s still flaky

    Look at /var/lib/docker/tmp on a failing node; low free space means the problem is disk, not network (see the sketch after this list).

    Check Azure Status for “Container Registry – West Europe”; an active incident can last a few hours.

    Please reach out to Q&A support via private message with the failing run ID; the back-end team can confirm whether the blob pull was throttled or the node ran out of disk, and escalate to the product group if needed.
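
    To check the disk point quickly, here is a small sketch you can run on the node itself (for example over an SSH session; from inside the job container you would only see the container filesystem). The /var/lib/docker path is the default Docker data root, so adjust the paths if your nodes differ.

        import shutil

        # Default locations to inspect; low free space here means the pull stalls on disk, not network.
        for path in ("/", "/var/lib/docker"):
            try:
                usage = shutil.disk_usage(path)
                print(f"{path}: {usage.free / 2**30:.1f} GiB free of {usage.total / 2**30:.1f} GiB")
            except FileNotFoundError:
                print(f"{path}: not present on this node")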

    Give those tweaks a spin and your AutoML pipeline should be humming again instead of timing out. Ping me if anything still feels off!

    Best regards,

    Jerald Felix.


1 additional answer

  1. Riissanen Juha 30 Reputation points
    2025-08-06T13:51:17.44+00:00

    The reason for the issue was a faulty firewall appliance. Once that was fixed, everything started to run smoothly again.

    The answer by @Jerald Felix was very informative and taught me a lot about what happens under the hood / behind the scenes. Big thanks!

    1 person found this answer helpful.
