Product: Storage Foundation for Oracle RAC Guides   
Manual: Storage Foundation 4.1 for Oracle RAC Installation and Configuration   

System Panics to Prevent Potential Data Corruption

When a node experiences a split brain condition and is ejected from the cluster, it panics and displays the following console message:


VXFEN:vxfen_plat_panic: Local cluster node ejected from cluster to prevent potential data corruption.

How vxfen Driver Checks for Pre-existing Split Brain Condition

The vxfen driver prevents an ejected node from rejoining the cluster after the private network links fail and before those links are repaired.

For example, suppose a cluster of system 1 and system 2 is functioning normally when the private network links break, and that system 1 is the ejected system. If system 1 restarts before the private network links are restored, its membership configuration does not show system 2; however, when it attempts to register with the coordinator disks, it discovers that system 2 is already registered with them. Given this conflicting information about system 2, system 1 does not join the cluster and returns an error from vxfenconfig that resembles:


vxfenconfig: ERROR: There exists the potential for a preexisting
  split-brain. The coordinator disks list no nodes which are in the
  current membership. However, they also list nodes which are not
  in the current membership.

I/O Fencing Disabled!

Also, the following information is displayed on the console:


<date> <system name> vxfen: WARNING: Potentially a preexisting
<date> <system name> split-brain.
<date> <system name> Dropping out of cluster.
<date> <system name> Refer to user documentation for steps
<date> <system name> required to clear preexisting split-brain.
<date> <system name>
<date> <system name> I/O Fencing DISABLED!
<date> <system name>
<date> <system name> gab: GAB:20032: Port b closed

However, the same error can occur when the private network links are working and both systems go down, system 1 restarts, and system 2 fails to come back up. From the point of view of system 1, system 2 may still have its registrations on the coordinator disks.
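To inspect the state that vxfen is comparing, you can check the GAB membership the node currently sees against the keys registered on the coordinator disks. For example (a sketch using the standard GAB and fencing administration commands, run as root on the node that reports the error):

      # gabconfig -a
      # vxfenadm -g all -f /etc/vxfentab

The first command shows the current port memberships; the second lists the keys registered on the coordinator disks, as in Case 2 below.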

Case 1: System 2 Up, System 1 Ejected (Actual Potential Split Brain)

Determine whether system 1 is up. If it is up and running, shut it down, repair the private network links to remove the split brain condition, and then restart system 1.
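After system 1 comes back up, you can confirm from either node that both private network links are again visible to LLT. For example (a sketch using the standard LLT status command):

      # lltstat -nvv | more

Each configured link to the peer node should report an UP state before you consider the split brain condition resolved.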

Case 2: System 2 Down, System 1 Ejected (Apparent Potential Split Brain)

  1. Physically verify that system 2 is down.
  2. Verify the systems currently registered with the coordinator disks. Use the following command:
      # vxfenadm -g all -f /etc/vxfentab

    The output of this command identifies the keys registered with the coordinator disks.

  3. Clear the keys on the coordinator disks as well as the data disks using the vxfenclearpre command. See Using vxfenclearpre Command to Clear Keys After Split Brain.
  4. Make any necessary repairs to system 2 and restart it.

Using vxfenclearpre Command to Clear Keys After Split Brain

When you have encountered a split brain condition, use the vxfenclearpre command to remove SCSI-3 registrations and reservations on the coordinator disks as well as on the data disks in all shared disk groups.

  1. Shut down all other nodes in the cluster that have access to the shared storage. This prevents data corruption.
  2. Start the script:
      # cd /opt/VRTSvcs/vxfen/bin
      # ./vxfenclearpre
  3. Read the script's introduction and warning. Then, you can choose to let the script run.
      Do you still want to continue: [y/n] (default : n)
      y
    Note: Informational messages resembling the following may appear on the console of one of the nodes in the cluster when a node is ejected from a disk/LUN:

    <date> <system name> scsi: WARNING: /sbus@3,0/lpfs@0,0/sd@0,1(sd91):
    <date> <system name> Error for Command: <undecoded cmd 0x5f> Error Level: Informational
    <date> <system name> scsi: Requested Block: 0 Error Block 0
    <date> <system name> scsi: Vendor: <vendor> Serial Number: 0400759B006E
    <date> <system name> scsi: Sense Key: Unit Attention
    <date> <system name> scsi: ASC: 0x2a (<vendor unique code 0x2a>), ASCQ: 0x4, FRU: 0x0

    These informational messages may be ignored.


      Cleaning up the coordinator disks...

      Cleaning up the data disks for all shared disk groups...
      
      Successfully removed SCSI-3 persistent registration and
      reservations from the coordinator disks as well as the shared
      data disks.

      Reboot the server to proceed with normal cluster startup...
      #
  4. Restart all nodes in the cluster.
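After the nodes restart, you can confirm that fencing has been reconfigured cleanly. For example (a sketch using the standard GAB and fencing status commands):

      # gabconfig -a
      # vxfenadm -d

GAB should again show a membership for port b, and the fencing driver should report that I/O fencing is enabled.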

Removing or Adding Coordinator Disks

This section describes how to:

  • Replace a coordinator disk in the coordinator disk group
  • Destroy a coordinator disk group

  Note: Adding or removing coordinator disks requires that all services be shut down.

Note the following about the procedure:

    • A coordinator disk group requires an odd number of disks/LUNs (a minimum of three).

    • When adding a disk, add the disk to the disk group vxfencoorddg and retest the group for support of SCSI-3 persistent reservations.

    • You can destroy the coordinator disk group such that no registration keys remain on the disks. The disks can then be used elsewhere.
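To check how many disks the coordinator disk group currently contains, you can list all disks together with their disk group membership. For example (a sketch; because the coordinator disk group is normally deported, its name appears in parentheses in the output):

      # vxdisk -o alldgs list | grep vxfencoorddg

The command should return an odd number of lines, and at least three.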

  To remove and replace a disk in the coordinator disk group

  1. Log in as root user on one of the cluster nodes.
  2. If VCS is running, shut it down:
      # hastop -all
  3. Stop the VCSMM driver on each node:
     # /sbin/vcsmmconfig -U
  4. Stop I/O fencing on all nodes:
     # /sbin/init.d/vxfen stop

    This removes any registration keys on the disks.

  5. Import the coordinator disk group. The file /etc/vxfendg includes the name of the disk group (typically, vxfencoorddg) that contains the coordinator disks, so use the command:
      # vxdg -tfC import `cat /etc/vxfendg`

    where:

      -t specifies that the disk group is imported only until the node restarts.

      -f specifies that the import is to be done forcibly, which is necessary if one or more disks is not accessible.

      -C specifies that any import locks are removed.

  6. To remove disks from the disk group, use the VxVM disk administrator utility, vxdiskadm.
    Note: You may also destroy the existing coordinator disk group. For example:

    # vxdg destroy vxfencoorddg

  7. Add the new disk to the node, initialize it as a VxVM disk, and add it to the vxfencoorddg disk group. Refer to Creating the vxfencoorddg Disk Group. (A brief command sketch follows this procedure.)
  8. Test the recreated disk group for SCSI-3 persistent reservations compliance. Refer to Requirements for Testing the Coordinator Disk Group.
  9. After replacing disks in a coordinator disk group, deport the disk group:
      # vxdg deport `cat /etc/vxfendg`
  10. On each node, start the I/O fencing driver:
    # /sbin/init.d/vxfen start
  11. On each node, start the VCSMM driver:
    # /sbin/init.d/vcsmm start
  12. If necessary, restart VCS on each node:
      # hastart
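As a rough illustration of step 7, suppose the replacement device appears on the node as c1t1d0 (a hypothetical device name; the actual device name and the path to vxdisksetup depend on your platform). With the coordinator disk group still imported on this node, the disk could be initialized and added as follows:

      # /etc/vx/bin/vxdisksetup -i c1t1d0
      # vxdg -g vxfencoorddg adddisk c1t1d0

See the referenced sections for the complete procedure and for retesting the group for SCSI-3 persistent reservations support.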