Split-brain
Split-brain occurs when all heartbeat links between the source and target hosts are cut and each side mistakenly concludes that the other side is down. To minimize the effects of split-brain, route the cluster heartbeat links through the same physical infrastructure as the replication links, so that if one breaks, so does the other.
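The routing recommendation above can be checked on paper before cabling decisions are final. The following is a minimal sketch, not part of VCS or any vendor tool: it assumes each link is described as the set of physical components it traverses, and that a link goes down as soon as any one of those components fails. The function name and the component names (`wan-1`, `lan-switch`, and so on) are hypothetical.

```python
from itertools import combinations


def uncovered_failures(heartbeat_links, replication_links):
    """Return failure sets that cut every heartbeat link but leave replication up.

    Each link is a set of component names it traverses. A link is considered
    down as soon as any one of its components has failed.
    """
    components = sorted(set().union(*heartbeat_links, *replication_links))
    risky = []
    # Exhaustive search over failure combinations; fine for small topologies.
    for size in range(1, len(components) + 1):
        for combo in combinations(components, size):
            failed = set(combo)
            heartbeats_down = all(link & failed for link in heartbeat_links)
            replication_up = any(not (link & failed) for link in replication_links)
            if heartbeats_down and replication_up:
                risky.append(failed)
    return risky


# Heartbeat and replication share both WAN paths: no risky failure exists.
shared = uncovered_failures(
    heartbeat_links=[{"wan-1"}, {"wan-2"}],
    replication_links=[{"wan-1"}, {"wan-2"}],
)

# Heartbeat and replication use unrelated paths: losing the heartbeat switch
# isolates the clusters while replication keeps running.
separate = uncovered_failures(
    heartbeat_links=[{"lan-switch"}],
    replication_links=[{"san-fabric"}],
)
```

An empty result means that, within this simplified model, every failure that silences the heartbeats also stops replication. Any non-empty result lists the component failures that could leave the heartbeats down while replication is still running.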
In a replicated data cluster, VCS assumes a total disaster because the P-VOL hosts and array are unreachable, and attempts to start the application. Once the heartbeats are restored, VCS stops the applications on one side and restarts the VCS engine (HAD) there to eliminate the concurrency violation caused by the same group being online in two places simultaneously. Administrators must then resynchronize the volumes manually using the pairresync command.
In global cluster environments, administrators can confirm the failure before failing over the service groups, for example by checking with the site administrator to identify its cause. If you do mistakenly fail over, the situation is similar to the replicated data cluster case. However, when the heartbeat is restored, VCS does not stop HAD at either site; instead, VCS forces you to choose which group to take offline. Again, resynchronization must be performed manually.
If it is physically impossible to place the heartbeats alongside the replication links, the cluster heartbeats can be lost while the replication link remains up. In this case, a failover transitions the original P-VOLs to S-VOLs and vice versa, and the originally running application faults because its underlying volumes become write-disabled. This causes the service group to fault, and VCS tries to fail it over to another host, which causes the same consequence in the reverse direction. This phenomenon, sometimes called ping-pong, continues until the group comes online on the final node. You can avoid this situation by setting up your infrastructure so that loss of the heartbeat links also means loss of the replication links.
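The ping-pong sequence described above can be illustrated with a small simulation. This is a toy model under stated assumptions, not VCS behavior captured from a real cluster: it assumes two sides that cannot see each other's heartbeats, that each takeover swaps the P-VOL and S-VOL roles, and that each host is tried at most once. The function name and host names are hypothetical.

```python
from collections import deque


def simulate_ping_pong(side_a_hosts, side_b_hosts):
    """Model repeated takeovers while heartbeats are down but replication is up.

    side_a_hosts[0] is the host already running the group on the P-VOL side.
    Returns (events, final_host).
    """
    remaining = {"A": deque(side_a_hosts), "B": deque(side_b_hosts)}
    running = {"A": remaining["A"].popleft(), "B": None}
    events = [f"{running['A']} online, side A holds P-VOL"]
    writer = "A"      # side whose volumes are currently write-enabled
    contender = "B"   # side that attempts the next takeover

    while remaining[contender]:
        host = remaining[contender].popleft()
        # The takeover swaps roles: the contender's volumes become P-VOLs.
        loser, writer = writer, contender
        running[contender] = host
        events.append(f"{host} online, side {writer} takes P-VOL")
        # The group on the other side faults: its volumes are now write-disabled.
        if running[loser] is not None:
            events.append(f"{running[loser]} faults, side {loser} write-disabled")
            running[loser] = None
        # The faulted side tries its next host, reversing the direction.
        contender = loser
    return events, running[writer]


events, final_host = simulate_ping_pong(["a1", "a2"], ["b1", "b2"])
for line in events:
    print(line)
print("group ends up on", final_host)
```

In this model, with two hosts per side, the group bounces a1, b1, a2, b2 and settles on the last host tried. The data written on each side in between is what makes the subsequent manual resynchronization necessary.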