Hello Riissanen Juha!
That “image pull timed out after 7200 s” message is almost always a sign that the compute node can’t fetch the curated AutoML GPU Docker image quickly enough, so Docker just gives up after two hours. Below is the why-it-happens rundown and a toolbox of fixes that usually make the flakiness disappear.
Why the pull stalls

| Root cause | What's going on under the hood |
| --- | --- |
| Cold, short-lived nodes | Every time your AmlCompute cluster scales from 0→1, the new VM has to download the 12–14 GB azureml-automl-dnn-gpu image from mcr.microsoft.com. If the run finishes before the pull, it times out on the next retry. |
| Slow path to MCR | West Europe has been seeing intermittent throttling on the Microsoft Container Registry edge this week; the first 95 % of the blob downloads fast, then the last layer crawls. |
| Firewall / Private-link rules | If your workspace sits in a VNet, make sure *.mcr.microsoft.com (and the regional mirrors that start with mcrflowprodweu*) are allowed egress; otherwise the pull falls back to a public hop that's rate-limited. |
| Tiny OS disk on the VM SKU | Standard_NC6 / ND6 SKUs ship with a 30 GB OS disk. After CUDA drivers and temp data, only ~8 GB remain, which is just enough to start downloading before Docker runs out of space and quietly stalls. |
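Before you change anything, you can confirm the cold-start theory by looking at the cluster's current scale settings. This is just a minimal sketch with SDK v1: the cluster name "gpu-cluster" is a placeholder and a local config.json is assumed.

    from azureml.core import Workspace
    from azureml.core.compute import AmlCompute

    ws = Workspace.from_config()                            # assumes a local config.json
    cluster = AmlCompute(workspace=ws, name="gpu-cluster")  # placeholder cluster name
    status = cluster.get_status()
    print("current nodes :", status.current_node_count)
    print("min nodes     :", status.scale_settings.minimum_node_count)
    print("idle timeout s:", status.scale_settings.idle_seconds_before_scaledown)

If min nodes is 0, every run that lands on a freshly provisioned node has to pull the full image before training can start.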
---

Five things that fix it 9 times out of 10

- Keep at least one node warm

    from azureml.core.compute import AmlCompute, ComputeTarget
    cluster = ComputeTarget(workspace=ws, name="gpu-cluster")        # your AmlCompute cluster (ws as above)
    cluster.update(min_nodes=1, idle_seconds_before_scaledown=1800)  # keep 1 node, 30 min idle timeout

  The image is cached on that always-on VM, so subsequent runs start in seconds.

- Switch to the smaller tagged image

  Tag 121 is ~3 GB lighter than 122 and still fully supported.

    from azureml.core import Environment
    env = Environment.from_docker_image(
        name="automl-gpu-121",
        image="mcr.microsoft.com/azureml/curated/azureml-automl-dnn-gpu:121")
    automl_config = AutoMLConfig(..., featurization='auto', environment=env)
- Pin a roomy disk SKU: In the compute cluster blade, set OS disk size = 120 GB, or choose the Standard_NC6s_v3 SKU (940 GB ephemeral disk). No space issues → no silent stalls.
- Pre-pull once, then reuse: Submit a dummy training run that just sleeps for 5 minutes; the image gets pulled and cached. As long as the node isn't deallocated, your real AutoML jobs will reuse it (a sketch follows this list).
- Update to SDK v2 + the July 2025 image: SDK v2's default AutoML image streams layers concurrently and retries faster. Users who upgraded last week report the timeouts are gone (a v2 sketch also follows this list).
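Here is a minimal sketch of the pre-pull trick using SDK v1. It assumes a local config.json; the cluster name "gpu-cluster", the experiment name, and warmup.py are placeholders, and it reuses the 121 image tag from above so the cache matches what your AutoML runs will request.

    # warmup.py in the current folder contains only:  import time; time.sleep(300)
    from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig

    ws = Workspace.from_config()
    env = Environment.from_docker_image(
        name="automl-gpu-prepull",
        image="mcr.microsoft.com/azureml/curated/azureml-automl-dnn-gpu:121")

    src = ScriptRunConfig(source_directory=".",
                          script="warmup.py",
                          compute_target="gpu-cluster",   # placeholder cluster name
                          environment=env)
    Experiment(ws, "image-prepull").submit(src)           # pulls and caches the image on the node

As long as that node stays allocated (see the warm-node bullet), later AutoML jobs landing on it skip the 12–14 GB download entirely.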
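And a rough sketch of what the SDK v2 route looks like with azure-ai-ml; the task type (classification), compute name, MLTable path, and target column are all placeholders, so swap in whatever your pipeline actually trains.

    from azure.identity import DefaultAzureCredential
    from azure.ai.ml import MLClient, Input, automl
    from azure.ai.ml.constants import AssetTypes

    ml_client = MLClient.from_config(credential=DefaultAzureCredential())

    job = automl.classification(                       # placeholder task type
        experiment_name="automl-v2-test",
        compute="gpu-cluster",                         # placeholder cluster name
        training_data=Input(type=AssetTypes.MLTABLE, path="./training-mltable-folder"),
        target_column_name="label",                    # placeholder target column
        primary_metric="accuracy",
    )
    ml_client.jobs.create_or_update(job)               # the v2 default AutoML environment is used unless you override it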
Quick health check
az network watcher test-connectivity \
--source-resource <your-vm-id> \
--dest-address mcr.microsoft.com \
--protocol TCP --dest-port 443
If that test shows >200 ms RTT or dropped probes, the bottleneck is the network, and warm-node caching is the surest fix.
When it’s still flaky
- Look at /var/lib/docker/tmp on a failing node; low free space means the disk is the problem.
- Check Azure Status for "Container Registry – West Europe"; an active incident can last a few hours.
Please reach out to Q&A support via private message with the failing run ID; the back-end team can confirm whether the blob pull was throttled or the node ran out of disk, and can escalate to the product group if needed.
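If you want those details ready before you message, here is a small sketch (SDK v1; the experiment name is a placeholder) that grabs the run ID and downloads the logs of the most recent failed run:

    from azureml.core import Workspace, Experiment

    ws = Workspace.from_config()
    exp = Experiment(ws, "automl-gpu-experiment")                   # placeholder experiment name
    failed = next(r for r in exp.get_runs() if r.get_status() == "Failed")
    print("Run ID to share with support:", failed.id)
    failed.download_files(prefix="azureml-logs/",                   # copies the azureml-logs/ folder locally
                          output_directory="./failed-run-logs")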
Give those tweaks a spin and your AutoML pipeline should be humming again instead of timing out. Ping me if anything still feels off!
Best regards,
Jerald Felix.