VMSS nodes stuck while deallocating with NetworkingInternalOperation error during maintenance

Niels Witte 0

We received a maintenance notification for our VMSS with tracking ID: 7SY2-VV0

Action Required: Urgent preventive repair may affect one or more of your virtual machines in westeurope

You are receiving this notice because you currently use Azure Virtual Machines.

Maintenance Summary: Azure has identified a degradation to a network device that is connected to one or more of your virtual machines that requires urgent repair to prevent unplanned failures. During this preventive repair event, Virtual Machines (VM) will be rebooted.

To avoid any disruption, we will attempt to migrate VMs to a server under a healthy switch. You can also take preventive actions on VMs by following the steps mentioned in the Recommended Action section of this notification. If the virtual machine isn’t moved by the platform and no action is taken before maintenance begins, each virtual machine listed in this notification may be unavailable for up to 15 minutes while it is being migrated to a healthy server. This will occur sometime between 7/29/2025 5:24:05 AM UTC until 8/2/2025 5:24:05 AM UTC in westeurope.

Since we do not want to risk our nodes being unavailable for 15 minutes, we decided to perform the manual steps listed in the notification:

For virtual machine scale sets:

Log into the Azure portal and navigate to the virtual machine scale sets.

Select a virtual machine scale set and select 'Instances' from the left-side settings menu.

Select the virtual machine you want to migrate.

Select 'Deallocate' to stop and deallocate the virtual machine. Please ensure the virtual machine's status has transitioned to 'Stop (deallocated)'.

Select 'Start' only after the previous change in status has been completed.
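The portal steps above can also be approximated with the Azure CLI. The following is a minimal sketch, not part of the notification: the function name `migrate_instance`, the `MAX_ATTEMPTS` and `POLL_SECONDS` variables, and the polling approach are illustrative assumptions. It deallocates one instance, waits until the instance view reports `PowerState/deallocated`, and only then starts it again.

```shell
# Hypothetical helper (name and polling defaults are assumptions):
# deallocate one scale set instance, wait until it reports
# PowerState/deallocated, then start it again.
# Usage: migrate_instance <resource-group> <vmss-name> <instance-id>
migrate_instance() {
  local rg="$1" vmss="$2" id="$3" state="" attempt

  az vmss deallocate --resource-group "$rg" --name "$vmss" --instance-ids "$id" || return 1

  # Poll the instance view until the power state is 'deallocated'.
  for attempt in $(seq 1 "${MAX_ATTEMPTS:-30}"); do
    state=$(az vmss get-instance-view --resource-group "$rg" --name "$vmss" \
      --instance-id "$id" \
      --query "statuses[?starts_with(code, 'PowerState/')].code | [0]" \
      --output tsv)
    [ "$state" = "PowerState/deallocated" ] && break
    sleep "${POLL_SECONDS:-10}"
  done

  if [ "$state" != "PowerState/deallocated" ]; then
    echo "instance $id did not reach the deallocated state (last: $state)" >&2
    return 1
  fi

  # Start only after the deallocated state has been confirmed.
  az vmss start --resource-group "$rg" --name "$vmss" --instance-ids "$id"
}
```

Called as `migrate_instance <your-resource-group> <your-vmss-name> <instance-id>`, the function returns a non-zero exit code if the instance never reaches the deallocated state, so a stuck instance is not started blindly.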

Our scale set consists of 5 nodes, so we deallocated one node of the scale set (in this case the last one). It has been stuck in the 'Failed' state ever since.

Provisioning failed
Error
ProvisioningState/failed/NetworkingInternalOperation
An unexpected error occured while processing the network profile of the VM. Please retry later.

We tried removing that node and replacing it with a new one, but all operations fail with a similar error: "NetworkingInternalOperation".
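The status code and message shown in the portal can usually also be read from the instance view with the Azure CLI. A small sketch, assuming the failure is surfaced in the instance's `statuses` list; the function name `show_instance_statuses` is hypothetical:

```shell
# Hypothetical helper: print the code and message of every status entry
# reported for a single scale set instance, one entry per line.
# Usage: show_instance_statuses <resource-group> <vmss-name> <instance-id>
show_instance_statuses() {
  az vmss get-instance-view --resource-group "$1" --name "$2" --instance-id "$3" \
    --query "statuses[].[code, message]" --output tsv
}
```

For the failed node in this thread, a call such as `show_instance_statuses <your-resource-group> <your-vmss-name> 4` would be expected to list a `ProvisioningState/failed/...` entry alongside the power state.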

Some more background information: the cluster is hosted in West Europe, consists of 5 nodes of type Standard_B4ms, and runs a number of Service Fabric applications and services. The nodes are connected through a virtual network to other Azure services such as Azure SQL databases, storage accounts, Service Bus, and load balancers.

Can you please help us complete the migration to the healthy network environment?

Durga Reshma Malthi 9,355 Reputation points Microsoft External Staff Moderator

2025-07-25T13:22:17.7833333+00:00
Hi Niels Witte

Could you please refer to this document - https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machine-scale-sets/extensions/vm-extension-provisioning-errors

Check if there are any NSGs associated with the subnet or the VMSS that might be blocking traffic.

If the node remains in a failed state, you can try to replace the instance. To delete it, use the command below:
az vmss delete-instances --resource-group <your-resource-group> --name <your-vmss-name> --instance-ids <instance-id>

Try to update the NIC:
az network nic update --name <nic-name> --resource-group <your-resource-group> --set provisioningState="Succeeded"

Additional References:

https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machine-scale-sets/restart-stop/instances-not-repaired

https://github.com/Azure/vm-scale-sets/issues/82

Hope this helps!

Please let me know if you have any queries.

Niels Witte 0

Hi @Durga Reshma Malthi

Thank you for the response. Running the command for checking extension failures returns the following for the affected node; the other nodes all have a state of "Succeeded".

az vmss list-instances --resource-group ao-sf-cluster-rg --name nodeType1 --query "[].{instanceId:instanceId, extension:resources[].id, extProvisioningState:resources[].provisioningState}"

[
  {
    "extProvisioningState": ["Updating", "Updating", "Updating"],
    "extension": [
      "/subscriptions/<Subscription ID>/resourceGroups/ao-sf-cluster-rg/providers/Microsoft.Compute/virtualMachines/nodeType1_4/extensions/nodeType1_ServiceFabricNode",
      "/subscriptions/<Subscription ID>/resourceGroups/ao-sf-cluster-rg/providers/Microsoft.Compute/virtualMachines/nodeType1_4/extensions/VMDiagnosticsVmExt_vmNodeType0Name",
      "/subscriptions/<Subscription ID>/resourceGroups/ao-sf-cluster-rg/providers/Microsoft.Compute/virtualMachines/nodeType1_4/extensions/VsDebuggerService-7y868dttu4"
    ],
    "instanceId": "4"
  }
]

Removing the node is not an option, since it results in an error. Running this az command:

az vmss delete-instances --resource-group ao-sf-cluster-rg --name nodeType1 --instance-ids 4

(OperationNotAllowed) Seed node removal operation has been detected, and will be rejected. Reason : This operation would result in only 4 potential seed nodes to remain in the cluster, while 5 are needed as a minimum. Code: OperationNotAllowed Message: Seed node removal operation has been detected, and will be rejected. Reason : This operation would result in only 4 potential seed nodes to remain in the cluster, while 5 are needed as a minimum.

So I tried adding a new node, but the new node also ends up in a failed state. Its extensions are stuck in 'Creating':

[
  {
    "extProvisioningState": ["Creating", "Creating", "Creating"],
    "extension": [
      "/subscriptions/<Subscription ID>/resourceGroups/ao-sf-cluster-rg/providers/Microsoft.Compute/virtualMachines/nodeType1_5/extensions/nodeType1_ServiceFabricNode",
      "/subscriptions/<Subscription ID>/resourceGroups/ao-sf-cluster-rg/providers/Microsoft.Compute/virtualMachines/nodeType1_5/extensions/VMDiagnosticsVmExt_vmNodeType0Name",
      "/subscriptions/<Subscription ID>/resourceGroups/ao-sf-cluster-rg/providers/Microsoft.Compute/virtualMachines/nodeType1_5/extensions/VsDebuggerService-7y868dttu4"
    ],
    "instanceId": "5"
  }
]

So now I have 6 nodes: one that failed to update after deallocating and one that failed after creation. I tried removing the newly created failed node with the same command using --instance-ids 5, but that also fails. All other actions I try on nodes 4 and 5 result in a NetworkingInternalOperation error.
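To get an overview of which instances are affected, the provisioning state of every instance can be listed in one call. This is a sketch under assumptions: the function names `list_instance_states` and `list_unhealthy_instances` are hypothetical, and "unhealthy" here simply means any provisioning state other than `Succeeded`.

```shell
# Hypothetical helper: print "<instanceId><TAB><provisioningState>"
# for every instance in the scale set.
# Usage: list_instance_states <resource-group> <vmss-name>
list_instance_states() {
  az vmss list-instances --resource-group "$1" --name "$2" \
    --query "[].[instanceId, provisioningState]" --output tsv
}

# Hypothetical helper: print only the ids of instances whose
# provisioning state is anything other than 'Succeeded'.
list_unhealthy_instances() {
  list_instance_states "$1" "$2" | awk '$2 != "Succeeded" { print $1 }'
}
```

In the situation described above, `list_unhealthy_instances <your-resource-group> <your-vmss-name>` would be expected to report instances 4 and 5.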

As for the last command you provided, about updating the NIC: the only NIC resource we have is linked to a private endpoint, and updating it results in the following error:

It can not be modified by user. Code: CannotModifyNicAttachedToPrivateEndpoint

Answer 1

Niels Witte 0

Completely shutting down the scale set allowed the deallocation to succeed, and starting the scale set again resolved the issue.

Stop-AzVmss -ResourceGroupName $myResourceGroup -VMScaleSetName $myVM
Start-AzVmss -ResourceGroupName $myResourceGroup -VMScaleSetName $myVM
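For readers using the Azure CLI commands from earlier in the thread rather than PowerShell, a roughly equivalent sketch follows. It is an assumption, not what was run here: the function name `restart_scale_set` is hypothetical, and `az vmss deallocate` is used instead of `az vmss stop` because the latter powers off the VMs without deallocating them.

```shell
# Hypothetical helper: deallocate the whole scale set, then start it again.
# Usage: restart_scale_set <resource-group> <vmss-name>
restart_scale_set() {
  local rg="$1" vmss="$2"
  # Without --instance-ids the operation targets the whole scale set.
  az vmss deallocate --resource-group "$rg" --name "$vmss" || return 1
  az vmss start --resource-group "$rg" --name "$vmss"
}
```

Because every node is deallocated at once, the services hosted on the scale set are unavailable until the start operation completes.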

Nikhil Duserla 8,515 Reputation points Microsoft External Staff Moderator

2025-07-29T15:50:14.6366667+00:00

Thank you for letting us know the issue has been resolved. @Niels Witte
