Hello Joseph, I am Henry, and I would like to share some insight on the issue you raised.
The behavior you're describing aligns with the cluster's split-brain prevention mechanism functioning as intended. I can describe the sequence of events that typically occurs in your network loss scenario:
- Network Failure: Node1 loses network connectivity. The cluster heartbeat between Node1 and Node2 stops.
- Arbitration Attempt: Node2 sees Node1 is "down" (from its network perspective). It attempts to form a new cluster and, to gain a majority vote, tries to take ownership of the Quorum Disk by placing its own persistent reservation on it.
- The Conflict: Node1 is not down. It is network-isolated but its cluster service is still running and it has a healthy Fibre Channel (FC) connection to the storage. It sees Node2 attempting to "steal" the disk reservation. To prevent a split-brain scenario (where both nodes think they own the cluster), Node1 actively defends its reservation.
- Result: Cluster Down.
  - Node2 fails to gain quorum because it cannot acquire the disk witness vote. Its cluster service will not fully start.
  - Node1 fails because it has lost network communication with its partner and cannot confirm a majority. Its cluster service terminates as a safety measure.
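The vote arithmetic behind the Node2 outcome can be sketched in a few lines. This is only an illustrative model, not how the cluster service is implemented: it assumes a two-node cluster with a disk witness (three votes in total) and a strict-majority rule, and the names `has_quorum` and `TOTAL_VOTES` are hypothetical.

```python
def has_quorum(votes_held: int, total_votes: int) -> bool:
    """Return True when votes_held is a strict majority of total_votes."""
    return votes_held > total_votes // 2


# Assumed layout: Node1 + Node2 + disk witness = 3 votes.
TOTAL_VOTES = 3

# Node2 holds only its own vote because it cannot reserve the witness disk.
node2_without_witness = has_quorum(votes_held=1, total_votes=TOTAL_VOTES)

# Node2 holds its own vote plus the witness vote.
node2_with_witness = has_quorum(votes_held=2, total_votes=TOTAL_VOTES)

print("Node2 without witness:", node2_without_witness)
print("Node2 with witness:", node2_with_witness)
```

Under these assumptions, Node2 on its own holds one of three votes and stays below the majority threshold; it would only reach quorum if it could also claim the witness vote.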
When the cluster service goes offline, the virtual machines on Node1 shut down. You observed that disconnecting the FC cables does trigger a failover: Node1 loses its reservation and Node2 successfully takes over. This suggests that there is a single point of failure in the cluster's network communication path. I recommend implementing network redundancy so that a failure in one component doesn't isolate a node or interrupt cluster operations:
- Use Multiple NICs: Configure at least two physical network adapters on each node for cluster communications (heartbeat, CSV traffic, live migration).
- Use NIC Teaming (SET): Create a Switch Embedded Teaming (SET) team using these two NICs. This provides a single, resilient virtual adapter for the cluster.
- Use Redundant Switches: For true redundancy, connect each physical NIC from the team to a separate, redundant physical network switch.
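To sanity-check a planned layout against the three recommendations above, a small script along these lines may help. It is a rough sketch that works on a hand-written inventory rather than querying the cluster; the `Nic` class, the `find_redundancy_gaps` function, and the node, NIC, and switch names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nic:
    name: str    # adapter label, e.g. "NIC1" (hypothetical)
    switch: str  # physical switch the adapter is cabled to (hypothetical)


def find_redundancy_gaps(nodes: dict[str, list[Nic]]) -> dict[str, list[str]]:
    """Map each node to the redundancy problems found in its cluster NICs."""
    gaps: dict[str, list[str]] = {}
    for node, nics in nodes.items():
        issues = []
        if len(nics) < 2:
            issues.append("fewer than two cluster NICs")
        if len({nic.switch for nic in nics}) < 2:
            issues.append("NICs do not span two switches")
        if issues:
            gaps[node] = issues
    return gaps


# Example inventory: Node1's adapters share one switch, Node2's do not.
layout = {
    "Node1": [Nic("NIC1", "SwitchA"), Nic("NIC2", "SwitchA")],
    "Node2": [Nic("NIC1", "SwitchA"), Nic("NIC2", "SwitchB")],
}

for node, issues in find_redundancy_gaps(layout).items():
    print(node, "->", "; ".join(issues))
```

With this sample inventory, only Node1 is flagged, because both of its adapters depend on the same switch.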
I hope this points you in the right direction for troubleshooting.