Product: Cluster Server Guides   
Manual: Cluster Server 4.1 User's Guide   

VCS Operation Without I/O Fencing

This section describes the operation of VCS in clusters without SCSI-III PR storage.

VCS provides many methods to maintain cluster membership. These methods include LLT, low priority links and disk heartbeat. In all heartbeat configurations, VCS determines that a system has faulted when all heartbeats fail.

The traditional VCS design assumed that for all heartbeats to fail at the same time, a system must be dead. To handle situations where two or more heartbeat connections are not available at the time of failure, VCS has a special membership condition known as jeopardy, which is explained in the Jeopardy section.

Non-fencing Cluster Membership

VCS membership operates differently when fencing is disabled with the "UseFence=None" directive or when I/O fencing is not available for membership arbitration.

Reliable Vs. Unreliable Communication Notification

LLT informs GAB whether communication with a peer is reliable or unreliable. A peer connection is considered reliable if more than one network link exists between the two systems. If multiple links fail simultaneously, it is more likely that the node itself has failed.

For the reliable designation to have meaning, it is critical that the networks used fail independently. LLT supports multiple independent links between systems. Using different interfaces and connecting infrastructure decreases the chance of two links failing at the same time, thereby increasing overall reliability. Nodes with a single connection to the cluster are placed in a special membership called a jeopardy membership.
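The reliable/unreliable designation can be illustrated with a minimal Python sketch. This is not LLT code: the function name and the string labels are hypothetical, and the only rule taken from the text is that a peer reached over more than one link counts as reliable.

```python
def classify_peer_connection(active_links: int) -> str:
    """Illustrative designation for a peer, based on its count of working links.

    Hypothetical helper: the labels are invented for this sketch and are
    not LLT or GAB identifiers.
    """
    if active_links < 0:
        raise ValueError("active_links cannot be negative")
    if active_links == 0:
        return "down"        # no link carries heartbeats to this peer
    if active_links == 1:
        return "unreliable"  # a single link: the peer is a jeopardy candidate
    return "reliable"        # more than one link exists to the peer
```

The sketch only counts links. It cannot express whether those links fail independently, which is the property that gives the reliable designation its meaning.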

Low Priority Link

LLT can be configured to use a low priority network link as a backup to normal heartbeat channels. Low priority links are typically configured on the public or administrative network.

The low priority link is not used for cluster membership traffic until it is the only remaining link. During normal operation, the low priority link carries only LLT heartbeat traffic. The frequency of heartbeats is reduced to 50% of normal to reduce network overhead. When the low priority link is the only remaining network link, LLT switches all cluster status traffic over as well. When a configured private link is repaired, LLT switches cluster status traffic back to the high priority link.
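The switching behavior described above can be sketched in Python. The `Link` class and both function names are assumptions made for illustration, not part of LLT. The sketch models only two rules from the text: cluster status traffic stays on high priority links while any of them is up, and a low priority link sends heartbeats at half the normal frequency during normal operation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Link:
    """Hypothetical model of one LLT link."""
    name: str
    low_priority: bool = False
    up: bool = True


def heartbeat_rate_factor(link: Link) -> float:
    """Heartbeat frequency relative to normal, during normal operation."""
    return 0.5 if link.low_priority else 1.0


def status_traffic_links(links: List[Link]) -> List[str]:
    """Names of the links that carry cluster status traffic."""
    up_links = [link for link in links if link.up]
    high_priority = [link.name for link in up_links if not link.low_priority]
    if high_priority:
        # At least one private link is up, so low priority links
        # carry heartbeats only.
        return high_priority
    # Only low priority links remain: status traffic moves onto them.
    return [link.name for link in up_links]
```

For example, with two private links and one low priority link, status traffic uses only the private links. If both private links fail, it moves to the low priority link, and it moves back as soon as one private link is repaired.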

Disk Heartbeats (GABDISK)

Disk heartbeats improve cluster resiliency by allowing a heartbeat to be placed on a physical disk shared by all systems in the cluster. A disk heartbeat uses two small, dedicated regions of a physical disk. Disk heartbeats have the following limitations:

  • The cluster size is limited to 8 nodes.
  • Disk heartbeat channels cannot carry cluster state. Cluster status can only be transmitted on network heartbeat connections.

With disk heartbeating configured, each system in the cluster periodically writes to and reads from specific regions on a dedicated shared disk. If the private network links fail, only the disk heartbeat link remains between the affected system and the remaining nodes in the cluster. Because disk heartbeats do not support cluster communication, that system is given a special jeopardy status. See the next section for information on how VCS handles nodes in jeopardy.

Jeopardy

VCS without I/O fencing requires a minimum of two heartbeat-capable channels between cluster nodes to provide adequate protection against network failure. When a node is down to a single heartbeat connection, VCS can no longer reliably discriminate between loss of a system and loss of the last network connection. It must then handle communication loss on a single network differently than on multiple networks. This handling is called jeopardy.

GAB makes intelligent choices on cluster membership based on information about reliable and unreliable links provided by LLT. It also verifies the presence or absence of a functional disk heartbeat.

If a system's heartbeats are lost simultaneously across all channels, VCS determines that the system has failed. The services running on that system are then restarted on another system. However, if the node had only one heartbeat (that is, the node was in jeopardy), VCS does not restart the applications on a new node. Disabling failover in this way is a safety mechanism to prevent data corruption. A system can be placed in a jeopardy membership under two conditions:

  • One network heartbeat and no disk heartbeat
    In this situation, the node is a member of the regular membership and the jeopardy membership. VCS continues to operate as a single cluster except that failover due to system failure is disabled. Even after the last network connection is lost, VCS continues to operate as partitioned clusters on each side of the failure.
  • A disk heartbeat and no network heartbeat
    In this situation, the node is excluded from regular membership because the disk heartbeat cannot carry cluster status. The node is placed in a jeopardy membership. VCS prevents any actions from being taken on service groups that were running on the departed system. Reconnecting the network without stopping VCS and GAB may result in one or more systems stopping and restarting HAD and associated service groups.
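The two conditions above can be summarized in a short Python sketch. The function name and the membership labels are hypothetical. The sketch also assumes that a node with one network heartbeat plus a working disk heartbeat is not in jeopardy, which follows from the two-channel minimum stated earlier but is not spelled out in the list.

```python
from typing import Set


def classify_membership(network_heartbeats: int, disk_heartbeat: bool) -> Set[str]:
    """Return the memberships a node would hold (illustrative only)."""
    if network_heartbeats < 0:
        raise ValueError("network_heartbeats cannot be negative")
    if network_heartbeats == 0:
        # A disk heartbeat cannot carry cluster status, so the node is
        # excluded from regular membership.
        return {"jeopardy"} if disk_heartbeat else set()
    if network_heartbeats == 1 and not disk_heartbeat:
        # One channel left: still a regular member, but also in jeopardy.
        return {"regular", "jeopardy"}
    # Two or more heartbeat-capable channels remain.
    return {"regular"}
```

An empty set means that no heartbeat of any kind reaches the node, so it appears in neither membership.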

A system can be placed in a jeopardy membership if the system has only one functional network heartbeat. In this situation, the node is a member of both the regular membership and the jeopardy membership. VCS continues to operate as a single cluster except that failover due to system failure is disabled. Even after the last network connection is lost, VCS continues to operate as partitioned clusters on each side of the failure.
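The failover rule described in this section can be sketched in Python as a single decision. This is an assumption-laden illustration, not VCS code: the function name and the returned labels are hypothetical. The input is the number of heartbeat channels the node had immediately before all of them were lost.

```python
def action_on_total_heartbeat_loss(heartbeats_before_loss: int) -> str:
    """Decide what happens to a node's service groups after it loses all heartbeats."""
    if heartbeats_before_loss < 1:
        raise ValueError("the node must have had at least one heartbeat")
    if heartbeats_before_loss == 1:
        # The node was in jeopardy. Its disappearance may be a link failure
        # rather than a system failure, so failover is disabled.
        return "no-failover"
    # Two or more channels failed at the same time, which is treated as a
    # system failure: services are restarted on another system.
    return "failover"
```

The asymmetry is deliberate. Restarting applications while the original node may still be running them risks data corruption, which is why the single-heartbeat case does not trigger failover.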

VERITAS Software Corporation
www.veritas.com