Product: Cluster Server Guides   
Manual: Cluster Server 4.1 User's Guide   

Troubleshooting and Recovery for Global Clusters

This section describes the concept of disaster declaration and provides troubleshooting tips for configurations using global clusters.

Disaster Declaration

When a cluster in a global cluster transitions to the FAULTED state because it can no longer be contacted, the appropriate failover action depends on whether the cause was a split-brain, a temporary outage, or a permanent disaster at the remote cluster.

If you choose to take action on the failure of a cluster in a global cluster, VCS prompts you to declare the type of failure.

  • Disaster, implying permanent loss of the primary data center
  • Outage, implying the primary data center may return in its current form after some time
  • Disconnect, implying a split-brain condition; both clusters are up, but the link between them is broken
  • Replica, implying that data on the takeover target has been made consistent from a backup source and that the RVGPrimary can initiate a takeover when the service group is brought online. This option applies to VVR environments only.

You can select the groups to be failed over to the local cluster, in which case VCS brings the selected groups online on a node based on the group's FailOverPolicy attribute. It also marks the groups as being offline in the other cluster. If you do not select any service groups to fail over, VCS takes no action except implicitly marking the service groups as offline on the downed cluster.

Lost Heartbeats and the Inquiry Mechanism

The loss of internal and all external heartbeats between any two clusters indicates that the remote cluster is faulted, or that all communication links between the two clusters are broken (a wide-area split-brain).

VCS queries clusters to confirm that the remote cluster to which heartbeats have been lost is truly down. This mechanism is referred to as inquiry. In a two-cluster configuration, a connector that loses all heartbeats to the other connector must consider the remote cluster faulted. If there are more than two clusters and a connector loses all heartbeats to a second cluster, it queries the remaining connectors before declaring the cluster faulted. If the other connectors view the cluster as running, the querying connector transitions the cluster to the UNKNOWN state, which minimizes false cluster faults. If all connectors report that the cluster is faulted, the querying connector also considers it faulted and transitions the remote cluster state to FAULTED.
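
The inquiry decision described above can be sketched as a small function. This is an illustrative model only, not VCS code: the names ClusterState, inquire, and peer_views are hypothetical, and the sketch assumes that any remaining connector reporting something other than FAULTED is enough to yield UNKNOWN.

```python
from enum import Enum


class ClusterState(Enum):
    RUNNING = "RUNNING"
    UNKNOWN = "UNKNOWN"
    FAULTED = "FAULTED"


def inquire(peer_views):
    """Decide the state of a remote cluster after all heartbeats to it are lost.

    peer_views holds the state each remaining connector reports for the
    remote cluster. It is empty in a two-cluster configuration.
    """
    # Two-cluster configuration: there is no other connector to query,
    # so the remote cluster must be considered faulted.
    if not peer_views:
        return ClusterState.FAULTED
    # Every remaining connector also reports the cluster as faulted.
    if all(view is ClusterState.FAULTED for view in peer_views):
        return ClusterState.FAULTED
    # Assumption: at least one connector still sees the cluster, so the
    # querying connector records UNKNOWN instead of a false fault.
    return ClusterState.UNKNOWN
```

With three clusters, a single connector that still sees the remote cluster as running is enough to keep the querying connector from declaring a fault.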

VCS Alerts

VCS alerts are identified by the alert ID, which is composed of the following elements:

  • alert_type---The type of the alert, described in Types of Alerts.
  • cluster---The cluster on which the alert was generated
  • system---The system on which this alert was generated
  • object---The name of the VCS object for which this alert was generated. This could be a cluster or a service group.

Alerts are generated in the following format:


 alert_type-cluster-system-object

For example:


GNOFAILA-Cluster1-oracle_grp

This is an alert of type GNOFAILA generated on cluster Cluster1 for the service group oracle_grp.
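
As a minimal sketch of the documented format, the following helper joins the four elements in order. The function name format_alert_id and the system name sysA are hypothetical and used only for illustration. VCS generates alert IDs itself, so this is only a model of the string layout.

```python
def format_alert_id(alert_type, cluster, system, obj):
    """Join the four alert ID elements in the documented order."""
    parts = [alert_type, cluster, system, obj]
    # Reject missing elements rather than emit a malformed ID.
    if not all(parts):
        raise ValueError("every alert ID element must be a non-empty string")
    return "-".join(parts)
```

For example, format_alert_id("GNOFAILA", "Cluster1", "sysA", "oracle_grp") returns GNOFAILA-Cluster1-sysA-oracle_grp. Splitting an ID back into its elements is not attempted here, because a hyphen inside any name would make the split ambiguous.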

Types of Alerts

VCS generates the following types of alerts.

  • CFAULT---Indicates that a cluster has faulted
  • GNOFAILA---Indicates that a global group is unable to fail over within the cluster where it was online. This alert is displayed if the ClusterFailOverPolicy attribute is set to Manual and the wide-area connector (wac) is properly configured and running at the time of the fault.
  • GNOFAIL---Indicates that a global group is unable to fail over to any system within the cluster or in a remote cluster. Some reasons why a global group may not be able to fail over to a remote cluster:

    • The ClusterFailOverPolicy is set to either Auto or Connected and VCS is unable to determine a valid remote cluster to which to automatically fail the group over.
    • The ClusterFailOverPolicy attribute is set to Connected and the cluster in which the group has faulted cannot communicate with one or more remote clusters in the group's ClusterList.
    • The wide-area connector (wac) is not online or is incorrectly configured in the cluster in which the group has faulted.
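
The three reasons listed above can be expressed as a simple checklist, which may help when working out why a GNOFAIL alert appeared. This is a hedged sketch: the function gnofail_remote_reasons and its parameters are hypothetical, and the inputs would have to be gathered by the administrator. It does not query VCS.

```python
def gnofail_remote_reasons(policy, has_valid_target, unreachable_clusters, wac_ok):
    """Return the documented reasons that apply to a failed remote failover.

    policy               -- value of ClusterFailOverPolicy for the group
    has_valid_target     -- whether a valid remote cluster could be determined
    unreachable_clusters -- ClusterList members that cannot be contacted
    wac_ok               -- whether wac is online and correctly configured
    """
    reasons = []
    if policy in ("Auto", "Connected") and not has_valid_target:
        reasons.append("no valid remote cluster could be determined")
    if policy == "Connected" and unreachable_clusters:
        reasons.append("cannot communicate with: " + ", ".join(unreachable_clusters))
    if not wac_ok:
        reasons.append("wac is not online or is incorrectly configured")
    return reasons
```

An empty list means none of the documented reasons matched the inputs supplied.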

Managing Alerts

Alerts require user intervention. You can respond to an alert in the following ways:

  • If the reason for the alert can be ignored, use the Alerts dialog box in the Java or Web consoles or the haalert command to delete the alert. You must provide a comment explaining why you are deleting the alert; VCS logs the comment to the engine log.
  • Take an action on administrative alerts that have actions associated with them. You can do so using the Java or Web consoles. See Actions Associated with Alerts for more information.
  • VCS deletes or negates some alerts when a negating event for the alert occurs. See Negating Events for more information.

An administrative alert persists if none of the above actions are performed and the VCS engine (HAD) is running on at least one node in the cluster. If HAD is not running on any node in the cluster, the administrative alert is lost.

Actions Associated with Alerts

This section describes the actions you can perform from the Java and the Web consoles on the following types of alerts:

  • CFAULT---When the alert is presented, clicking Take Action guides you through the process of failing over the global groups that were online in the cluster before the cluster faulted.
  • GNOFAILA---When the alert is presented, clicking Take Action guides you through the process of failing over the global group to a remote cluster on which the group is configured to run.
  • GNOFAIL---There are no associated actions provided by the consoles for this alert

Negating Events

VCS deletes a CFAULT alert when the faulted cluster returns to the running state.

VCS deletes the GNOFAILA and GNOFAIL alerts in response to the following events:

  • The faulted group's state changes from FAULTED to ONLINE.
  • The group's fault is cleared.
  • The group is deleted from the cluster where the alert was generated.
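
The negating events above amount to a lookup from alert type to the events that clear it. The following sketch models that lookup. The event names (cluster_running, group_online, fault_cleared, group_deleted) and the function is_negated are hypothetical labels chosen for illustration, not VCS identifiers.

```python
# Hypothetical event labels mapped to the alert types they negate.
GROUP_EVENTS = {"group_online", "fault_cleared", "group_deleted"}

NEGATING_EVENTS = {
    "CFAULT": {"cluster_running"},
    "GNOFAILA": GROUP_EVENTS,
    "GNOFAIL": GROUP_EVENTS,
}


def is_negated(alert_type, event):
    """Return True if the event deletes an alert of the given type."""
    return event in NEGATING_EVENTS.get(alert_type, set())
```

For instance, is_negated("GNOFAIL", "fault_cleared") is True, while is_negated("CFAULT", "fault_cleared") is False, because a CFAULT alert is deleted only when the faulted cluster is running again.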

VERITAS Software Corporation
www.veritas.com