Windows Failover FC Quorum Disk Persistent Reservation

Question

Joseph Boban

We have a 2-node Windows Server 2022 cluster with node votes plus an FC-connected quorum disk.

If Node1 owns the VMs, shared disks, cluster, and quorum, and Node1 loses its network connectivity, how will the voting work? Node1 will lose the heartbeat with Node2 but will still have its FC connection to the quorum disk and will keep holding the quorum disk persistent reservation. In this scenario, will the cluster fail completely, since Node1 will not release the quorum disk (it is still visible via FC) and Node2 cannot get quorum while Node1 holds it?

We saw that if we remove the FC cables from Node1 or shut down the FC ports to Node1, failover happens.

Restarting Node1 immediately moves the quorum to Node2, and everything then works as expected.

MPIO is already configured.

Storage: HP MSA 2060

1 answer

Answer 1

Hello Joseph, I am Henry and I want to share my insight about your concern.

The behavior you're describing aligns with the cluster's split-brain prevention mechanism functioning as intended. I can describe the sequence of events that typically occurs in your network loss scenario:

1. Network Failure: Node1 loses network connectivity. The cluster heartbeat between Node1 and Node2 stops.
2. Arbitration Attempt: Node2 sees Node1 as "down" (from its network perspective). It attempts to form a new cluster and, to gain a majority vote, tries to take ownership of the Quorum Disk by placing its own persistent reservation on it.
3. The Conflict: Node1 is not down. It is network-isolated, but its cluster service is still running and it has a healthy Fibre Channel (FC) connection to the storage. It sees Node2 attempting to "steal" the disk reservation. To prevent a split-brain scenario (where both nodes think they own the cluster), Node1 actively defends its reservation.
4. Result: Cluster Down
- Node2 fails to gain quorum because it cannot acquire the disk witness vote. Its cluster service will not fully start.
- Node1 fails because it has lost network communication with its partner and cannot confirm a majority. Its cluster service will terminate to be safe.
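
To make the "majority vote" in the sequence above concrete, here is a minimal Python sketch of the vote arithmetic. It is an illustration only, not how the cluster service is implemented: the names `Partition`, `majority_threshold`, and `has_quorum` are hypothetical, and the model assumes one vote per node plus one vote for the disk witness, ignoring dynamic quorum and dynamic witness adjustments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """One side of a communication split (hypothetical model, not a cluster API)."""
    node_votes: int       # voting nodes on this side that can still see each other
    holds_witness: bool   # True if this side owns the disk witness reservation


def majority_threshold(total_votes: int) -> int:
    """Smallest number of votes that is strictly more than half of the total."""
    return total_votes // 2 + 1


def has_quorum(partition: Partition, total_nodes: int,
               witness_configured: bool = True) -> bool:
    """Return True if this side holds a majority of all configured votes."""
    total_votes = total_nodes + (1 if witness_configured else 0)
    witness_vote = 1 if (witness_configured and partition.holds_witness) else 0
    return partition.node_votes + witness_vote >= majority_threshold(total_votes)


# Two nodes plus a disk witness: three votes in total, so two are needed.
# Node2 alone, unable to take the witness reservation: one vote.
node2_without_witness = has_quorum(Partition(1, False), total_nodes=2)
# Node2 alone after it can reserve the witness (Node1's FC path removed): two votes.
node2_with_witness = has_quorum(Partition(1, True), total_nodes=2)
```

Under these assumptions, Node2 on its own stays below the two-vote threshold for as long as it cannot take the witness, and reaches it once the witness reservation becomes available, which is consistent with the failover you saw after pulling the FC cables from Node1.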

When the cluster service goes offline, the virtual machines on Node1 shut down. Your observation that disconnecting the FC cables triggers a failover, with Node1 losing its reservation and Node2 successfully taking over, suggests that there is a single point of failure in the cluster's network communication path. I recommend implementing network redundancy so that a failure in one component doesn't isolate a node or interrupt cluster operations:

- Use Multiple NICs: Configure at least two physical network adapters on each node for cluster communications (heartbeat, CSV traffic, live migration).
- Use NIC Teaming (SET): Create a Switch Embedded Team (SET) using these two NICs. This provides a single, resilient virtual adapter for the cluster.
- Use Redundant Switches: For true redundancy, connect each physical NIC from the team to a separate, redundant physical network switch.

I hope this points you in the right direction for troubleshooting.
