Product: Cluster Server Guides   
Manual: Cluster Server 4.1 Installation Guide   

I/O Fencing

I/O fencing is a feature within a kernel module of VCS designed to guarantee data integrity, even when faulty cluster communications cause a split brain condition.

Understanding Split Brain and the Need for I/O Fencing

Split brain is an issue faced by all cluster solutions. To provide high availability, the cluster must be capable of taking corrective action when a node fails. In VCS, this is carried out by the reconfiguration of CVM and CFS to change membership. Problems arise when the mechanism used to detect the failure of a node breaks down, because the symptoms of a broken interconnect look identical to those of a failed node. For example, if a system in a two-node cluster were to fail, it would stop sending heartbeats over the private interconnects and the remaining node would take corrective action. However, the failure of the private interconnects would present identical symptoms. In this case, both nodes would determine that their peer has departed and attempt to take corrective action. This typically results in data corruption, because both nodes attempt to take control of data storage in an uncoordinated manner.

In addition to a broken set of private networks, other scenarios can cause this situation. If a system is so busy that it appears hung, it can be declared dead. This can also happen on systems where the hardware supports a "break" and "resume" function: dropping the system to PROM level with a break and subsequently resuming it means the system could be declared dead, the cluster could reform, and when the system returns, it could begin writing to shared storage again.

VCS uses a technology called I/O fencing to remove the risk associated with split brain. I/O fencing blocks access to storage from specific nodes. This means even if the node is alive, it cannot cause damage.

SCSI-III Persistent Group Reservations

VCS uses an enhancement to the SCSI specification known as SCSI-III Persistent Group Reservations (SCSI-III PGR). SCSI-III PGR is designed to resolve the issues of using SCSI reservations in a modern clustered SAN environment. SCSI-III PGR supports multiple nodes accessing a device while at the same time blocking access to other nodes. SCSI-III reservations are persistent across SCSI bus resets, and SCSI-III PGR also supports multiple paths from a host to a disk.

SCSI-III PGR uses a concept of registration and reservation. Systems wishing to participate register a "key" with a SCSI-III device. Each system registers its own key. Multiple systems registering keys form a membership. Registered systems can then establish a reservation. This is typically set to "Write Exclusive Registrants Only" (WERO). This means registered systems can write, and all others cannot. For a given disk, there can only be one reservation, while there may be many registrations.
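
To make the registration and reservation model concrete, the following Python sketch models the behavior in memory. It is purely illustrative; the class and key names are hypothetical and are not part of VCS or the SCSI specification. It shows how a WERO reservation lets every registered key write while blocking all others.

    class PGRDisk:
        """In-memory model of a SCSI-III PGR device (illustration only)."""
        def __init__(self):
            self.registrations = set()   # keys currently registered
            self.reservation = None      # ("WERO", holder_key) or None

        def register(self, key):
            # Any system may register its own key with the device.
            self.registrations.add(key)

        def reserve_wero(self, key):
            # Only a registered system may establish the reservation;
            # a device holds at most one reservation at a time.
            if key not in self.registrations:
                raise PermissionError("must register before reserving")
            self.reservation = ("WERO", key)

        def can_write(self, key):
            # Write Exclusive Registrants Only: registered systems can
            # write, all others cannot.
            return self.reservation is None or key in self.registrations

    disk = PGRDisk()
    disk.register("nodeA-key")
    disk.register("nodeB-key")
    disk.reserve_wero("nodeA-key")
    print(disk.can_write("nodeB-key"))   # True  (registered)
    print(disk.can_write("nodeC-key"))   # False (not registered)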

With SCSI-III PGR technology, blocking write access is as simple as removing a registration from a device. Only registered members can "eject" the registration of another member. A member wishing to eject another member issues a "preempt and abort" command that ejects another node from the membership. Nodes not in the membership cannot issue this command. Once a node is ejected, it cannot in turn eject another. This means ejecting is final and "atomic."
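
A similar sketch of the ejection rules (again an in-memory illustration, not actual SCSI command traffic) shows why ejection is final: only a registered member can issue the preempt and abort, and a node that has been ejected no longer holds a key, so it cannot eject anyone in return.

    def preempt_and_abort(registrations, ejector_key, victim_key):
        # Only a member of the current registration set may eject another.
        if ejector_key not in registrations:
            raise PermissionError("ejector is not a registered member")
        # The victim's key is removed and its outstanding I/O is aborted.
        registrations.discard(victim_key)

    keys = {"nodeA-key", "nodeB-key"}
    preempt_and_abort(keys, "nodeA-key", "nodeB-key")   # node A ejects node B
    print(keys)                                         # {'nodeA-key'}

    # Node B has been ejected, so it can no longer eject node A.
    try:
        preempt_and_abort(keys, "nodeB-key", "nodeA-key")
    except PermissionError as err:
        print("rejected:", err)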

In the VCS implementation, a node registers the same key for all paths to the device. This means that a single preempt and abort command ejects a node from all paths to the storage device.

Several important concepts are:

  • Only a registered node can eject another
  • Since a node registers the same key down each path, ejecting a single key blocks all I/O paths from the node
  • Once a node is ejected, it has no key registered and it cannot eject others
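
The following sketch (hypothetical path and key names, in-memory model only) illustrates the multipath point above: because a node registers the same key on every path, removing that one key is enough to block the node's I/O on all of its paths.

    # path -> key registered on that path; each node uses one key on all paths
    path_registrations = {
        "nodeA:path1": "nodeA-key",
        "nodeA:path2": "nodeA-key",
        "nodeB:path1": "nodeB-key",
        "nodeB:path2": "nodeB-key",
    }

    def eject_key(registrations, key):
        # A single preempt-and-abort of one key removes every path
        # registered with that key.
        for path in [p for p, k in registrations.items() if k == key]:
            del registrations[path]

    eject_key(path_registrations, "nodeB-key")
    print(sorted(path_registrations))   # only node A's paths remain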

The SCSI-III PGR specification simply describes the method to control access to disks with the registration and reservation mechanism. The method to determine who can register with a disk and when a registered member should eject another node is implementation specific. The following sections describe the VCS I/O fencing concepts and implementation.

I/O Fencing Components

I/O Fencing, or simply fencing, allows write access to members of the active cluster and blocks access to non-members. I/O fencing in VCS uses several components. The physical components are coordinator disks and data disks. Each has a unique purpose and uses different physical disk devices.

Data Disks

Data disks are standard disk devices used for data storage. These can be physical disks or RAID Logical Units (LUNs). These disks must support SCSI-III PGR. Data disks are incorporated in standard VxVM/CVM disk groups. In operation, CVM is responsible for fencing data disks on a disk group basis. Because VxVM handles the fencing of data disks, several other features follow: disks added to a disk group are automatically fenced, as are new paths discovered to a device.

Coordinator Disks

Coordinator disks are special purpose disks in a VCS environment. Coordinator disks are three (or an odd number greater than three) standard disks, or LUNs, set aside for use by I/O fencing during cluster reconfiguration.

The coordinator disks act as a global lock device during a cluster reconfiguration. This lock mechanism determines which node gets to fence off data disks from other nodes. From a high level, a system must eject a peer from the coordinator disks before it can fence the peer from the data disks. This concept of racing for control of the coordinator disks to gain the capability to fence data disks is key to understanding how fencing prevents split brain.
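
As a short worked example of the majority rule (a sketch only, assuming three coordinator disks; the odd disk count is what guarantees at most one winner):

    coordinator_disks = 3
    majority = coordinator_disks // 2 + 1    # 2 of 3 disks

    # If one side of the split ejects its peer from 2 coordinator disks,
    # the peer can control at most 1, so only one side can reach a majority.
    side_a_wins, side_b_wins = 2, coordinator_disks - 2
    print(side_a_wins >= majority)   # True  -> may fence the data disks
    print(side_b_wins >= majority)   # False -> loses the race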

Coordinator disks cannot be used for any other purpose in the VCS configuration. The user must not store data on these disks, or include the disks in a disk group used by user data. The coordinator disks can be any three disks that support SCSI-III PGR. VERITAS typically recommends the smallest possible LUNs for coordinator use. Since coordinator disks do not store any data, cluster nodes need only register with them and do not need to reserve them.

I/O Fencing Operation

I/O fencing, provided by the kernel-based fencing module (VXFEN), performs identically for node failures and communications failures. When the fencing module on a node is informed of a change in cluster membership by the GAB module, it immediately begins the fencing operation. The node attempts to eject the keys of the departed node(s) from the coordinator disks using the preempt and abort command. When the node has successfully ejected the departed nodes from the coordinator disks, it also ejects them from the data disks. In a split brain scenario, both sides of the split "race" for control of the coordinator disks. The side that ejects its peer from a majority of the coordinator disks wins the race and fences the loser. The loser then panics and reboots.
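
This sequence can be summarized with a conceptual Python sketch. It is not the VXFEN implementation; the helper names and the dictionary-based "disks" are hypothetical stand-ins that only model the order of operations: race for the coordinator disks, fence the data disks on a win, panic on a loss.

    def preempt_and_abort(disk, ejector_key, departed_keys):
        # Stand-in for the SCSI-III preempt and abort command. If a peer
        # has already ejected this node from the disk, the command fails,
        # which is how one side of a split "loses" a coordinator disk.
        if ejector_key not in disk["keys"]:
            return False
        disk["keys"] -= departed_keys
        return True

    def on_membership_change(coordinator_disks, data_disks, my_key, departed_keys):
        # Race: try to eject the departed nodes from every coordinator disk.
        disks_won = sum(
            preempt_and_abort(d, my_key, departed_keys) for d in coordinator_disks
        )
        if disks_won > len(coordinator_disks) // 2:
            # Won a majority of coordinator disks: fence the data disks.
            for d in data_disks:
                preempt_and_abort(d, my_key, departed_keys)
        else:
            # Lost the race: this side of the split panics and reboots.
            raise SystemExit("panic: lost the race for the coordinator disks")

    coordinators = [{"keys": {"A", "B"}} for _ in range(3)]
    data = [{"keys": {"A", "B"}} for _ in range(2)]
    on_membership_change(coordinators, data, my_key="A", departed_keys={"B"})
    print([d["keys"] for d in data])   # node B's key is gone: it is fenced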

The VERITAS Cluster Server User's Guide describes I/O fencing concepts in detail.
