What caused the failover cluster service to fail on the non-owner node?

Kevin Griggs 0 Reputation points
2025-07-23T19:36:32.8333333+00:00

Hello. Today the cluster service failed on the non-owner node of our 2-node cluster. The node has been stable since we rebooted it, but we are trying to work out the cause. Here is what I have been able to determine so far:

  1. The CSV tried and failed to enter redirected mode (likely due to a backup being performed)
  2. NetFt marked a route for deletion right before the error (but didn't actually delete it until later)
  3. There were a lot of missed heartbeat events, resulting in the connection timing out
  4. The owner node was unaffected.
  5. This event appeared before the cluster service failed in the NetFt logs: Pending event IRP 0xFFFFAF8F898555B0 is being cancelled

Other things to note about the infrastructure:

  1. The physical network adapters on the node are teamed and configured on specific VLANs.
  2. All nodes are on separate subnets
  3. Nodes connect through a firewall (with open rules, can confirm no traffic was blocked between nodes in that time frame)
  4. The CSV is connected directly via Fibre Channel

It seems to me that this was a networking issue, but I'm not sure where to start looking for a related setting to fix the issue. If you need any information, please feel free to ask.

Windows for business | Windows Server | Storage high availability | Clustering and high availability

1 answer

  1. Henry Mai 2,375 Reputation points Independent Advisor
    2025-08-05T15:06:09.7866667+00:00

    Hello Kevin, I am Henry and I've reviewed the details of the cluster issue you experienced.

    The behavior you saw is a symptom of a node losing network connectivity with its partners, which triggers a protective shutdown of the Cluster Service. To find the specific root cause and prevent this from happening again, we will collect the logs from both nodes. This will allow us to see both sides of the network conversation at the moment of failure.

    Step 1: Collect the cluster logs. Please run the following PowerShell commands from one of the cluster nodes (or from a management PC that has the FailoverClusters module installed):

    • Get the log from the failing node. Note that -Destination takes a folder, not a file name; Get-ClusterLog names the log file after the node.

    Get-ClusterLog -Node HVHOST2 -Destination C:\Temp -UseLocalTime

    • Get the log from the owner node (replace HVHOST1 with its actual name)

    Get-ClusterLog -Node HVHOST1 -Destination C:\Temp -UseLocalTime

    Before sending the files, please quickly verify that they contain the necessary information:

    • Open each log file created in C:\Temp.
    • Check the timestamps on the first and last lines to ensure the time of the incident (around 10:32 AM on 7/23/2025) falls within that range.
    • Once confirmed, please upload both logs, and I will be happy to review them for you.
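
    If you would rather not open the files by hand, the check above can be sketched in PowerShell. This is only a rough sketch: it assumes the logs ended up somewhere under C:\Temp, and it assumes the usual cluster log timestamp format (yyyy/MM/dd-HH:mm:ss), so adjust the search string if your log looks different.

```powershell
# Assumption: the logs are somewhere under C:\Temp (as in the commands above).
# Assumption: cluster log lines carry timestamps like 2025/07/23-10:32:01.123.
$incident = '2025/07/23-10:3'   # matches 10:30 to 10:39 local time on the day of the failure

Get-ChildItem -Path 'C:\Temp' -Filter '*cluster*.log' -File -Recurse | ForEach-Object {
    # Count the lines that fall inside the incident window.
    $hits = @(Select-String -Path $_.FullName -Pattern $incident -SimpleMatch)

    [pscustomobject]@{
        File          = $_.FullName
        SizeMB        = [math]::Round($_.Length / 1MB, 1)
        IncidentLines = $hits.Count
    }
} | Format-Table -AutoSize
```

    A log with zero IncidentLines most likely does not cover the time of the failure and would need to be collected again.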

    Step 2: Check the network path between the nodes. Because the nodes are on separate subnets and their traffic crosses a firewall, there are more places where heartbeats can be delayed or dropped.

    • NIC Teaming: This is a common point of failure.
      • Check the Event Logs on HVHOST2 for the NIC Teaming provider (e.g., Microsoft-Windows-MsLbfoSysEvtProvider/Admin). Did one of the physical NICs in the team flap or disconnect?
      • Ensure the drivers and firmware for the physical network adapters are fully up to date on both hosts. Driver and firmware bugs are a common cause of this kind of intermittent connectivity loss.
    • Physical Switches and Firewall:
      • Even though you confirmed no blocked traffic, check the logs on the switch ports and the firewall for high error counts, dropped packets, or port flaps that align with the time of the failure.
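
    As a starting point for the checks in Step 2, something like the following could be run against the failing node. Treat it as a sketch rather than a complete diagnostic: the node name and the time window are assumptions taken from this thread, and the LBFO cmdlet only returns results if the team is a classic NIC team (not Switch Embedded Teaming).

```powershell
# Assumptions: HVHOST2 is the failing node; the window brackets the ~10:32 AM incident.
$node  = 'HVHOST2'
$start = Get-Date '2025-07-23 10:15'
$end   = Get-Date '2025-07-23 10:45'

# Critical (1), error (2), and warning (3) events from the System log in that window.
Get-WinEvent -ComputerName $node -FilterHashtable @{
    LogName   = 'System'
    Level     = 1, 2, 3
    StartTime = $start
    EndTime   = $end
} -ErrorAction SilentlyContinue |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, ProviderName, Message -Wrap

# The next two commands are meant to be run locally on the node itself.
# Current state of each team member (classic LBFO teams only).
Get-NetLbfoTeamMember | Format-Table Name, Team, OperationalStatus, FailureReason

# Driver details for the physical adapters, to compare against the vendor's latest release.
Get-NetAdapter -Physical |
    Format-Table Name, InterfaceDescription, LinkSpeed, DriverVersion, DriverDate
```

    Network adapter or teaming events that cluster just before 10:32 AM would point toward the NIC team rather than the firewall or switches.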

    Step 3: For clusters with nodes on different subnets (a "stretched cluster"), the default heartbeat settings can sometimes be too aggressive.

    • Check your current heartbeat thresholds with PowerShell:
      • (Get-Cluster).SameSubnetThreshold
      • (Get-Cluster).CrossSubnetThreshold
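    • The CrossSubnetThreshold is the key one for you. On Windows Server 2016 and later the default is 20 missed heartbeats; with the default CrossSubnetDelay of 1000 ms between heartbeats, that works out to about 20 seconds before the node is considered down. Consider increasing the threshold to make the cluster more tolerant of brief latency or packet loss on the link between subnets. For example, to set it to 30: (Get-Cluster).CrossSubnetThreshold = 30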
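
    To see how the delay and threshold combine, here is a small sketch. Get-HeartbeatToleranceSeconds is a hypothetical helper written for this post, not a built-in cmdlet; it simply multiplies the heartbeat interval by the number of missed heartbeats allowed.

```powershell
# Hypothetical helper: tolerance in seconds = delay between heartbeats (ms) x threshold / 1000.
function Get-HeartbeatToleranceSeconds {
    param(
        [Parameter(Mandatory)][int]$DelayMs,
        [Parameter(Mandatory)][int]$Threshold
    )
    return ($DelayMs * $Threshold) / 1000
}

# Only query the cluster if the FailoverClusters cmdlets are available (run on a cluster node).
if (Get-Command Get-Cluster -ErrorAction SilentlyContinue) {
    $cluster = Get-Cluster
    $cluster | Format-List Name, SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold

    # Current tolerance for nodes on different subnets.
    Get-HeartbeatToleranceSeconds -DelayMs $cluster.CrossSubnetDelay -Threshold $cluster.CrossSubnetThreshold
}
```

    For example, Get-HeartbeatToleranceSeconds -DelayMs 1000 -Threshold 30 returns 30, which is the tolerance you would get after the change suggested above if the delay is left at 1000 ms.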

    I look forward to reviewing the logs and helping you get to the bottom of this.
