Hello Kevin, I am Henry and I've reviewed the details of the cluster issue you experienced.
The behavior you saw is a symptom of a node losing network connectivity with its partners, which triggers a protective shutdown of the Cluster Service. To find the specific root cause and prevent this from happening again, we will collect the logs from both nodes. This will allow us to see both sides of the network conversation at the moment of failure.
Step 1: Please run the following PowerShell commands from one of the cluster nodes, or from a management PC that has the Failover Clustering PowerShell module installed:
- Get the log from the failing node
Get-ClusterLog -Node HVHOST2 -Destination C:\Temp -UseLocalTime
- Get the log from the owner node (replace HVHOST1 with its actual name)
Get-ClusterLog -Node HVHOST1 -Destination C:\Temp -UseLocalTime
Note that -Destination is a folder, not a file name. Each log is written to that folder with the node name in its file name.
Before sending the files, please quickly verify that they contain the necessary information:
- Open each log file created in C:\Temp.
- Check the timestamps on the first and last lines to ensure the time of the incident (around 10:32 AM on 7/23/2025) falls within that range.
- Once confirmed, please upload both logs, and I will be happy to review them for you.
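Cluster logs can be large, so it may be quicker to do the range check above from PowerShell than to open each file. The following is only a sketch: Get-ClusterLogRange is a hypothetical helper name, not a built-in cmdlet. It assumes the logs are in C:\Temp with "cluster" in their file names, and that timestamps look like 2025/07/23-10:32:01.

```powershell
# Hypothetical helper (not a built-in cmdlet): reports the first and last
# timestamp found in each cluster log in a folder.
function Get-ClusterLogRange {
    param(
        [string]$Folder = 'C:\Temp'   # assumed location of the collected logs
    )
    # Assumed timestamp shape: yyyy/MM/dd-HH:mm:ss
    $stamp = '\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}'
    Get-ChildItem -Path $Folder -Filter '*cluster*.log' | ForEach-Object {
        # Collect every line that carries a timestamp (header lines are skipped)
        $hits = @(Select-String -Path $_.FullName -Pattern $stamp)
        if ($hits.Count -gt 0) {
            [pscustomobject]@{
                File  = $_.Name
                First = $hits[0].Matches[0].Value
                Last  = $hits[-1].Matches[0].Value
            }
        }
    }
}
```

Running Get-ClusterLogRange with no arguments should return one row per log. The incident time should fall between First and Last for both files. On very large logs this may take a minute or two.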
Step 2: Because the nodes are on separate subnets with a firewall between them, the network path has more points where a brief interruption can occur. Please check the following:
- NIC Teaming: This is a common point of failure.
- Check the Event Logs on HVHOST2 for the NIC Teaming provider (e.g., Microsoft-Windows-MsLbfoSysEvtProvider/Admin). Did one of the physical NICs in the team flap or disconnect?
- Ensure the drivers and firmware for the physical network adapters are up to date on both hosts. Outdated NIC drivers are a common cause of intermittent link drops.
- Physical Switches and Firewall:
- Even though you confirmed that no traffic is being blocked, check the switch port counters and the firewall logs for high error counts, dropped packets, or port flaps that line up with the time of the failure.
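To line up host-side events with the incident, something like the following could pull warnings and errors from the System log on HVHOST2 around the failure time. Treat it as a sketch: the 10-minute window on either side is an arbitrary choice, the time is interpreted in the local time zone of the machine running the command, and it assumes remote event log access to HVHOST2 is allowed from where you run it.

```powershell
# Incident time from the report; the +/- 10 minute window is an assumption.
$incident = [datetime]'2025-07-23 10:32'

$filter = @{
    LogName   = 'System'
    Level     = 1, 2, 3                      # Critical, Error, Warning
    StartTime = $incident.AddMinutes(-10)
    EndTime   = $incident.AddMinutes(10)
}

# List matching events in time order so NIC or teaming events stand out
Get-WinEvent -ComputerName HVHOST2 -FilterHashtable $filter |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, ProviderName, Id, Message -Wrap
```

If nothing matches, Get-WinEvent reports that no events were found, which is itself useful to know.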
Step 3: For clusters with nodes on different subnets (a "stretched cluster"), the default heartbeat settings can sometimes be too aggressive.
- Check your current heartbeat thresholds with PowerShell:
(Get-Cluster).SameSubnetThreshold
(Get-Cluster).CrossSubnetThreshold
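- The CrossSubnetThreshold is the key one for you. It is the number of consecutive heartbeats that can be missed before a node on another subnet is considered down. The default is 20 on Windows Server 2016 and later, and heartbeats are sent every CrossSubnetDelay milliseconds (1000 by default), so the default works out to about 20 seconds. Consider increasing the threshold to make the cluster more tolerant of brief network interruptions over the stretched link. For example, to set it to 30: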
(Get-Cluster).CrossSubnetThreshold = 30
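As a rough guide to what a given threshold means in wall-clock terms, the tolerated outage is approximately the threshold multiplied by the heartbeat interval. The helper below is a hypothetical sketch (the function name is made up for illustration), not a cluster cmdlet, and it assumes a 1000 ms interval unless told otherwise.

```powershell
# Hypothetical helper: approximate seconds of missed heartbeats tolerated
# before a node is declared down (threshold x delay).
function Get-HeartbeatToleranceSeconds {
    param(
        [int]$Threshold,        # e.g. (Get-Cluster).CrossSubnetThreshold
        [int]$DelayMs = 1000    # e.g. (Get-Cluster).CrossSubnetDelay
    )
    ($Threshold * $DelayMs) / 1000
}

Get-HeartbeatToleranceSeconds -Threshold 20   # threshold of 20
Get-HeartbeatToleranceSeconds -Threshold 30   # proposed threshold of 30
```

On the cluster itself, you could pass the live values instead, for example -Threshold (Get-Cluster).CrossSubnetThreshold -DelayMs (Get-Cluster).CrossSubnetDelay, to see what the current settings allow.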
I look forward to reviewing the logs and helping you get to the bottom of this.