Product: Cluster Server Guides   
Manual: Cluster Server 4.1 Installation Guide   

Troubleshooting I/O Fencing

The headings of the following troubleshooting topics indicate either likely symptoms or the procedures required for a solution.

vxfentsthdw Fails When SCSI TEST UNIT READY Command Fails

The vxfentsthdw utility may fail with a message resembling:


Issuing SCSI TEST UNIT READY to disk reserved by other node FAILED.
Contact the storage provider to have the hardware configuration fixed.

This message indicates that the disk array does not return success for a SCSI TEST UNIT READY command when another host has the disk reserved using SCSI-III persistent group reservations. This happens with Hitachi Data Systems 99XX arrays if bit 186 of the system mode option is not enabled.

vxfentsthdw Fails When Prior Registration Key Exists on Disk

Although unlikely, you may attempt to use the vxfentsthdw utility to test a disk that has a registration key already set. If you suspect a key exists on the disk you plan to test, use the vxfenadm -g command to display it.


vxfenadm -g diskname

  • If the disk is not SCSI-III compliant, an error is returned indicating: Inappropriate ioctl for device.
  • If you have a SCSI-III compliant disk and no key exists, then the output resembles:

      Reading SCSI Registration Keys...
      Device Name: <diskname>
      Total Number Of Keys: 0
      No keys ...

    Proceed to test the disk using the vxfentsthdw utility.
    See Testing Data Storage Disks Using vxfentsthdw.
  • If keys exist, you must remove them before you test the disk.
    Refer to Removing Existing Keys From Disks.
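
The three outcomes listed above can be told apart mechanically. The following is a minimal sketch, not part of the product: the function name classify_vxfenadm_output and the labels it prints are hypothetical, and it assumes the output describes a single disk and matches the message text shown in this section.

```shell
# Classify the output of "vxfenadm -g diskname" read from standard input.
# Hypothetical helper: the labels it prints are illustrative only.
classify_vxfenadm_output() {
    out=$(cat)

    # A disk that is not SCSI-III compliant returns an ioctl error.
    case $out in
        *"Inappropriate ioctl for device"*)
            echo "not-scsi3"
            return 0
            ;;
    esac

    # Extract the first key count reported in the output, if any.
    count=$(printf '%s\n' "$out" |
        sed -n 's/.*Total Number Of Keys:[[:space:]]*\([0-9][0-9]*\).*/\1/p' |
        head -n 1)

    if [ -z "$count" ]; then
        echo "unknown"
    elif [ "$count" -eq 0 ]; then
        echo "no-keys"
    else
        echo "keys-present"
    fi
}

# Possible usage (assumes the error text may arrive on standard error):
#   vxfenadm -g diskname 2>&1 | classify_vxfenadm_output
```

A result of no-keys corresponds to proceeding with vxfentsthdw; keys-present corresponds to removing the keys first.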

Node is Unable to Join Cluster While Another Node is Being Ejected

A cluster that is currently fencing out (ejecting) a node from the cluster prevents a new node from joining the cluster until the fencing operation is completed. The following are example messages that appear on the console for the new node:


...VCS FEN ERROR V-11-1-25 ... Unable to join running cluster 
...VCS FEN ERROR V-11-1-25 ... since cluster is currently fencing 
...VCS FEN ERROR V-11-1-25 ... a node out of the cluster.

...VCS GAB.. Port b closed

If you see these messages when the new node is booting, the startup script (/sbin/init.d/vxfen) on the node makes up to five attempts to join the cluster. If five attempts are not sufficient to allow the node to join the cluster, reboot the new node or attempt to restart the vxfen driver with the command:


/sbin/init.d/vxfen start

Removing Existing Keys From Disks

To remove the registration and reservation keys created by another node from a disk, use the following procedure:

  1. Create a file to contain the access names of the disks:
      # vi /tmp/disklist

    For example:


      /dev/rdsk/c1t12d0
  2. Read the existing keys:
      # vxfenadm -g all -f /tmp/disklist

    The output from this command displays the key:


      Device Name: /dev/rdsk/c1t12d0
      Total Number Of Keys: 1
      key[0]:
         Key Value [Numeric Format]:   65,49,45,45,45,45,45,45
         Key Value [Character Format]: A1------
  3. If you know on which node the key was created, log in to that node and enter the following command:
      # vxfenadm -x -k A1 -f /tmp/disklist

    The key is removed.

  4. If you do not know on which node the key was created, follow step 5 through step 7 to remove the key.
  5. Register a second key "A2" temporarily with the disk:
      # vxfenadm -m -k A2 -f /tmp/disklist
      Registration completed for disk path /dev/rdsk/c1t12d0
  6. Remove the first key from the disk by preempting it with the second key:
      # vxfenadm -p -k A2 -f /tmp/disklist -vA1
      key: A2------ prempted the key: A1------ on disk  
      /dev/rdsk/c1t12d0
  7. Remove the temporary key assigned in step 5.
      # vxfenadm -x -k A2 -f /tmp/disklist
      Deleted the key : [A2------] from device /dev/rdsk/c1t12d0

    No registration keys exist for the disk.
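
The two key formats shown in the step 2 output carry the same value: the numeric format lists the decimal ASCII code of each character (65 is A, 49 is 1, 45 is -). The following sketch converts the numeric form to the character form, which may help when you need to work out the key name to pass with -k. The function name key_numeric_to_chars is hypothetical and is not part of vxfenadm.

```shell
# Convert a key in numeric format (comma-separated decimal byte values)
# to its character format. Hypothetical helper, for illustration only.
key_numeric_to_chars() {
    # Split the argument on commas into the positional parameters.
    old_ifs=$IFS
    IFS=,
    set -- $1
    IFS=$old_ifs

    out=
    for n in "$@"; do
        # Turn each decimal value into a three-digit octal escape and
        # let printf emit the matching character.
        out="$out$(printf "\\$(printf '%03o' "$n")")"
    done
    printf '%s\n' "$out"
}

# Example, using the key from the output in step 2:
#   key_numeric_to_chars 65,49,45,45,45,45,45,45
```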

System Panics to Prevent Potential Data Corruption

When a system experiences a split brain condition and is ejected from the cluster, it panics and displays the following console message:


VXFEN:vxfen_plat_panic: Local cluster node ejected from cluster to prevent potential data corruption.

How vxfen Driver Checks for Pre-existing Split Brain Condition

The vxfen driver prevents an ejected node from rejoining the cluster after the private network links fail and before they are repaired.

For example, suppose the cluster of system 1 and system 2 is functioning normally when the private network links are broken. Also suppose system 1 is the ejected system. When system 1 reboots before the private network links are restored, its membership configuration does not show system 2; however, when it attempts to register with the coordinator disks, it discovers system 2 is registered with them. Given this conflicting information about system 2, system 1 does not join the cluster and returns an error from vxfenconfig that resembles:


vxfenconfig: ERROR: There exists the potential for a preexisting
  split-brain. The coordinator disks list no nodes which are in the
  current membership. However, they also list nodes which are not
  in the current membership.

I/O Fencing Disabled!

Also, the following information is displayed on the console:


<date> <system name> vxfen: WARNING: Potentially a preexisting
<date> <system name> split-brain.
<date> <system name> Dropping out of cluster.
<date> <system name> Refer to user documentation for steps
<date> <system name> required to clear preexisting split-brain.
<date> <system name>
<date> <system name> I/O Fencing DISABLED!
<date> <system name>
<date> <system name> gab: GAB:20032: Port b closed

However, the same error can occur when the private network links are working and both systems go down, system 1 reboots, and system 2 fails to come back up. From the point of view of system 1, system 2 may still have registrations on the coordinator disks.
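
The condition reported by vxfenconfig can be read as a comparison of two sets of nodes: those in the current membership and those registered with the coordinator disks. The following sketch illustrates the rule as it is worded in the error message. It is not the vxfen driver's implementation; the function name and the space-separated node lists are hypothetical.

```shell
# Report a potential preexisting split brain when the coordinator disks
# list no nodes from the current membership but do list other nodes.
#   $1: space-separated node names in the current membership
#   $2: space-separated node names registered with the coordinator disks
check_preexisting_split_brain() {
    members=$1
    registered=$2
    in_membership=0
    not_in_membership=0

    for r in $registered; do
        found=0
        for m in $members; do
            if [ "$r" = "$m" ]; then
                found=1
            fi
        done
        if [ "$found" -eq 1 ]; then
            in_membership=$((in_membership + 1))
        else
            not_in_membership=$((not_in_membership + 1))
        fi
    done

    if [ "$in_membership" -eq 0 ] && [ "$not_in_membership" -gt 0 ]; then
        echo "potential preexisting split brain"
        return 1
    fi
    echo "ok"
    return 0
}

# Example matching the scenario above: system1 has rebooted alone and
# finds only system2 registered with the coordinator disks.
#   check_preexisting_split_brain "system1" "system2"
```

Both scenarios described above produce the same result from this check, which is why the two cases that follow must be distinguished by inspecting the systems themselves.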

Case 1: System 2 Up, System 1 Ejected (Actual Potential Split Brain)

Determine whether system 1 is up. If it is up and running, shut it down and repair the private network links to remove the split brain condition. Then reboot system 1.

Case 2: System 2 Down, System 1 Ejected (Apparent Potential Split Brain)

  1. Physically verify that system 2 is down.
  2. Verify the systems currently registered with the coordinator disks. Use the following command:
      # vxfenadm -g all -f /etc/vxfentab

    The output of this command identifies the keys registered with the coordinator disks.

  3. Clear the keys on the coordinator disks as well as the data disks using the command /opt/VRTSvcs/vxfen/bin/vxfenclearpre. See Using vxfenclearpre Command to Clear Keys After Split Brain.
  4. Make any necessary repairs to system 2 and reboot.
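
When several coordinator disks are listed, the output of step 2 can be long. The following sketch condenses it to one line per disk, showing the device name and the number of keys. It assumes the output format shown under Removing Existing Keys From Disks; the function name summarize_registered_keys is hypothetical.

```shell
# Print "<device> <key count>" for each disk in vxfenadm -g output
# read from standard input. Hypothetical helper, for illustration only.
summarize_registered_keys() {
    awk '
        /Device Name:/ {
            sub(/.*Device Name:[ \t]*/, "")
            sub(/[ \t]+$/, "")
            dev = $0
        }
        /Total Number Of Keys:/ {
            sub(/.*Total Number Of Keys:[ \t]*/, "")
            sub(/[ \t]+$/, "")
            print dev, $0
        }
    '
}

# Possible usage:
#   vxfenadm -g all -f /etc/vxfentab | summarize_registered_keys
```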

Using vxfenclearpre Command to Clear Keys After Split Brain

When you have encountered a split brain condition, use the vxfenclearpre command to remove SCSI-III registrations and reservations on the coordinator disks as well as on the data disks in all shared disk groups.

  1. Shut down all other systems in the cluster that have access to the shared storage. This prevents data corruption.
  2. Start the script:
      # cd /opt/VRTSvcs/vxfen/bin
      # ./vxfenclearpre
  3. Read the script's introduction and warning. Then, you can choose to let the script run.
      Do you still want to continue: [y/n] (default : n)
      y
    Note: Informational messages resembling the following may appear on the console of one of the nodes in the cluster when a node is ejected from a disk/LUN:

    <date> <system name> scsi: WARNING: /sbus@3,0/lpfs@0,0/sd@0,1(sd91):
    <date> <system name> Error for Command: <undecoded cmd 0x5f> Error Level: Informational
    <date> <system name> scsi: Requested Block: 0 Error Block 0
    <date> <system name> scsi: Vendor: <vendor> Serial Number: 0400759B006E
    <date> <system name> scsi: Sense Key: Unit Attention
    <date> <system name> scsi: ASC: 0x2a (<vendor unique code 0x2a>), ASCQ: 0x4, FRU: 0x0

    These informational messages may be ignored.


      Cleaning up the coordinator disks...
      Cleaning up the data disks for all shared disk groups...
      Successfully removed SCSI-III persistent registration and
      reservations from the coordinator disks as well as the shared
      data disks.
      Reboot the server to proceed with normal cluster startup...
      #
  4. Reboot all systems in the cluster.
VERITAS Software Corporation
www.veritas.com