Oracle® Database High Availability Architecture and Best Practices
10g Release 1 (10.1)

Part Number B10726-02

11 Restoring Fault Tolerance

This chapter describes how to restore redundancy to your environment after a failure. It includes the following topics:

  • Restoring Full Tolerance

  • Restoring Failed Nodes or Instances in a RAC Cluster

  • Restoring the Standby Database After a Failover

  • Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage

  • Restoring Fault Tolerance after a Standby Database Data Failure

  • Restoring Fault Tolerance After the Production Database Has Opened Resetlogs

  • Restoring Fault Tolerance after Dual Failures

11.1 Restoring Full Tolerance

Whenever a component within an HA architecture fails, the full protection, or fault tolerance, of the architecture is compromised, and possible single points of failure exist until the component is repaired. Restoring the HA architecture to full fault tolerance, and thereby reestablishing full RAC, Data Guard, or MAA protection, requires repairing the failed component. While full fault tolerance may be sacrificed during a scheduled outage, the method of repair is well understood because it is planned, the risk is controlled, and the outage ideally occurs at times best suited for continued application availability. However, for unscheduled outages, the risk of exposure to a single point of failure must be clearly understood.

This chapter describes the steps needed to restore database fault tolerance. It includes the following topics:

For RAC environments:

  • Restoring Failed Nodes or Instances in a RAC Cluster

For Data Guard and MAA environments:

  • Restoring the Standby Database After a Failover

  • Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage

  • Restoring Fault Tolerance after a Standby Database Data Failure

  • Restoring Fault Tolerance After the Production Database Has Opened Resetlogs

  • Restoring Fault Tolerance after Dual Failures

11.2 Restoring Failed Nodes or Instances in a RAC Cluster

Ensuring that application services fail over quickly and automatically within a RAC cluster, or between primary and secondary sites, is important when planning for both scheduled and unscheduled outages. It is also important to understand the steps and processes for restoring failed instances or nodes within a RAC cluster or databases between sites, to ensure that the environment is restored to full fault tolerance after any errors or issues are corrected.

Adding a failed node back into the cluster or restarting a failed RAC instance is straightforward after the core problem that caused the component to fail has been corrected. However, there are additional considerations.

How an application runs within a RAC environment (similar to initial failover) also dictates how to restore the node or instance, and whether other processes or steps are needed.

After the problem that caused the initial node or instance failure has been corrected, a node or instance can be restarted and added back into the RAC environment at any time. However, there may be some performance impact on the current workload when rejoining the node or instance. Table 11-1 summarizes the performance impact of restarting or rejoining a node or instance.

Table 11-1 Performance Impact of Restarting or Rejoining a Node or Instance

Action: Restarting a node or rejoining a node into a cluster
Impact on runtime application: There may be some performance impact while the reconfiguration occurs to add this node back into the cluster. There may or may not be an impact on the performance of the running application, but this should be evaluated.

Action: Restarting or rejoining a RAC instance
Impact on runtime application: When you restart a RAC instance, there may be some performance impact while lock reconfiguration takes place. Evaluation tests show the impact on current applications to be minimal, but it should be evaluated with an appropriate test workload.

Therefore, when restoring a node or RAC instance, it is important to consider the performance impact summarized in Table 11-1, as well as the effect on service availability and client connections.

The rest of this section includes the following topics:

  • Recovering Service Availability

  • Considerations for Client Connections After Restoring a RAC Instance

11.2.1 Recovering Service Availability

After a failed node has been brought back into the cluster and its instance has been started, RAC's Cluster Ready Services (CRS) automatically manages the virtual IP address used for the node and the services supported by that instance. A particular service may or may not be started for the restored instance. The decision by CRS to start a service on the restored instance depends on how the service is configured and whether the proper number of instances are currently providing access to the service. A service is not relocated back to a preferred instance if it is still being provided by an available instance to which it was moved by CRS when the initial failure occurred. CRS restarts services on the restored instance if the number of instances that are providing access to a service across the cluster is less than the number of preferred instances defined for the service. After CRS restarts a service on a restored instance, CRS notifies registered applications of the service change.

For example, suppose the HR service is defined with instances A and B as preferred and instances C and D as available in case of a failure. If instance B fails and CRS starts up the HR service on C automatically, then when instance B is restarted, the HR service remains at instance C. CRS does not automatically relocate a service back to a preferred instance.

Suppose a different scenario in which the HR service is defined with instances A, B, C, and D as preferred and no instances defined as available, spreading the service across all nodes in the cluster. If instance B fails, then the HR service remains available on the remaining three nodes. CRS automatically starts the HR service on instance B when it rejoins the cluster because it is running on fewer instances than configured. CRS notifies the applications that the HR service is again available on instance B.
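After restarting an instance, one way to confirm which instances currently offer a given service is to query the active services across the cluster. The following is a minimal sketch, assuming the HR service from the example above and the GV$ACTIVE_SERVICES view:

SELECT INST_ID, NAME
    FROM GV$ACTIVE_SERVICES
    WHERE NAME = 'HR'
    ORDER BY INST_ID;

Each row returned identifies an instance on which the HR service is currently active.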

11.2.2 Considerations for Client Connections After Restoring a RAC Instance

After a RAC instance has been restored, additional steps may be required, depending on the current resource utilization and performance of the system, the application configuration, and the network load balancing that has been implemented.

Existing connections (which may have failed over or started as new sessions) on the surviving RAC instances are not automatically redistributed or failed back to an instance that has been restarted. Failing back or redistributing users may or may not be necessary, depending on the current resource utilization and the capability of the surviving instances to adequately handle the workload and provide acceptable response times. If the surviving RAC instances do not have adequate resources to run the full workload or to provide acceptable response times, then it may be necessary to move (disconnect and reconnect) some existing user connections to the restarted instance.

New connections are started as they are needed, on the least-used node, assuming connection load balancing has been configured. Therefore, the new connections are automatically load-balanced over time.

An application service can be:

  • Partitioned with services running on a subset of RAC instances

  • Nonpartitioned so that all services run equally across all nodes

This is valuable for modularizing application and database form and function while still maintaining a consolidated data set. For cases in which an application is partitioned, or has a combination of partitioned and nonpartitioned services, the response time and availability of each service should be considered. If redistribution or failback of connections for a particular service is required, then you can rebalance workloads manually with the DBMS_SERVICE.DISCONNECT_SESSION PL/SQL procedure. You can use this procedure to disconnect sessions associated with a service while the service is running, as shown in the sketch that follows.
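For example, the following is a minimal sketch of disconnecting the sessions of one service so that they reconnect through Oracle Net Services; it assumes the single-argument form of the procedure, and SALES is a placeholder service name:

BEGIN
    -- Disconnect sessions attached to the SALES service on this instance so
    -- that they can reconnect and be directed to the restored instance.
    DBMS_SERVICE.DISCONNECT_SESSION('SALES');
END;
/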

For load balancing application services across multiple RAC instances, Oracle Net connect-time failover and connection load balancing are recommended. These features do not require changes for failover or restoration. It is also possible to use hardware-based load balancers. However, they may have limitations in distinguishing separate application services (which Oracle Net Services understands) and in handling the restoration of an instance or a node. For example, when a node or instance is restored and available to start receiving new connections, a manual step may be required to include the restored node or instance in the hardware-based load balancer logic, whereas Oracle Net Services does not require manual reconfiguration.

Table 11-2 summarizes the considerations for new and existing connections after an instance has been restored. The considerations differ depending on whether the application services are partitioned, nonpartitioned, or have a combination of each type. The actual redistribution of existing connections may or may not be required depending on the resource utilization and response times.

Table 11-2 Restoration and Connection Failback

Application services: Partitioned
Failback or restoration of existing connections: Existing sessions are not automatically relocated back to the restored instance. Use DBMS_SERVICE.DISCONNECT_SESSION to manually disconnect sessions so that they are reestablished, through the Oracle Net Services configuration, on an instance that offers the service, including the restored instance.
Failback or restoration of new connections: New connections are routed automatically to the restored instance by the Oracle Net Services configuration.

Application services: Nonpartitioned
Failback or restoration of existing connections: No action is necessary unless the load needs to be rebalanced, because the restored instance initially carries little load. If the load needs to be rebalanced, then the same considerations apply as for partitioned application services.
Failback or restoration of new connections: New connections are routed automatically to the restored instance (because its load should be lowest) by the Oracle Net Services configuration.

Figure 11-1 shows a 2-node partitioned RAC database. Each instance services a different portion of the application (HR and Sales). Client processes connect to the appropriate instance based on the service they require.

Figure 11-1 Partitioned 2-Node RAC Database


Figure 11-2 shows what happens when one RAC instance fails.

Figure 11-2 RAC Instance Failover in a Partitioned Database


If one RAC instance fails, then the service and existing client connections can be automatically failed over to another RAC instance. In this example, the HR and Sales services are both supported by the remaining RAC instance. In addition, new client connections for the Sales service can be routed to the instance now supporting this service.

After the failed instance has been repaired and restored to the state shown in Figure 11-1, and the Sales service has been relocated back to the restored instance, the failed-over clients (and any new clients that had connected to the Sales service on the failed-over instance) may need to be identified and failed back. New client connections that are started after the instance has been restored should automatically connect back to the original instance. Therefore, over time, as older connections disconnect and new sessions connect to the Sales service, the client load migrates back to the restored instance. Whether to rebalance the load immediately after restoration depends on the resource utilization and application response times.

Figure 11-3 shows a nonpartitioned application. Services are evenly distributed across both active instances. Each instance has a mix of client connections for both HR and Sales.

Figure 11-3 Nonpartitioned RAC Instances


If one RAC instance fails, then CRS moves the services that were running on the failed instance. In addition, new client connections are routed only to the available RAC instances that offer that service.

After the failed instance has been repaired and restored to the state shown in Figure 11-3, some clients may need to be moved back to the restored instance. For nonpartitioned applications, identifying appropriate services is not required for rebalancing the client load among all available instances. Also, rebalancing is necessary only if the single remaining instance is not able to adequately service the requests.

New client connections that are started after the instance has been restored should automatically connect back to the restored instance because it has a smaller load. Therefore, over time, as older connections disconnect and new sessions connect to the restored instance, the client load will again evenly balance across all available RAC instances. Rebalancing the load immediately after restoration depends on the resource utilization and application response times.
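To observe whether the client load is evening out, you can periodically count the sessions for each service on each instance. The following is a hedged sketch; HR and SALES are the service names used in the figures:

SELECT INST_ID, SERVICE_NAME, COUNT(*) AS SESSION_COUNT
    FROM GV$SESSION
    WHERE SERVICE_NAME IN ('HR', 'SALES')
    GROUP BY INST_ID, SERVICE_NAME
    ORDER BY INST_ID, SERVICE_NAME;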

11.3 Restoring the Standby Database After a Failover

Following an unplanned production database outage that requires a failover, full fault tolerance is compromised until the physical or logical standby database is reestablished. Full database protection should be restored as soon as possible. Steps for restoring fault tolerance differ slightly for physical and logical standby databases.

Because of Oracle's Flashback Database feature, the previous production database does not need to be reinstantiated from a backup. Flashback Database rewinds the database to the SCN at which the standby database became the new production database, after which the database can be converted into a standby database and resynchronized with redo from the new production database.

This section includes the following topics:

  • Restoring a Physical Standby Database After a Failover

  • Restoring a Logical Standby Database After a Failover

11.3.1 Restoring a Physical Standby Database After a Failover

The following steps are required to restore a physical standby database after a failover. The steps assume that archived redo logs and sufficient flashback log data are available.

11.3.1.1 Step 1P: Retrieve STANDBY_BECAME_PRIMARY_SCN

From the new production database, execute the following query:

SELECT TO_CHAR(STANDBY_BECAME_PRIMARY_SCN) FROM V$DATABASE;

Use this SCN to convert the previous production database to a standby database.

11.3.1.2 Step 2P: Flash Back the Previous Production Database

Log on to the previous production database and execute the following statements:

SHUTDOWN IMMEDIATE; /* if necessary */
STARTUP MOUNT;
FLASHBACK DATABASE TO SCN standby_became_primary_scn;

If there is insufficient flashback data, then see Oracle Data Guard Concepts and Administration about creating a new standby database.
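Before issuing the FLASHBACK DATABASE statement, you can check whether the flashback window still covers the required SCN. The following is a minimal sketch using the V$FLASHBACK_DATABASE_LOG view; the flashback can succeed only if OLDEST_FLASHBACK_SCN is less than or equal to STANDBY_BECAME_PRIMARY_SCN:

SELECT OLDEST_FLASHBACK_SCN, OLDEST_FLASHBACK_TIME
    FROM V$FLASHBACK_DATABASE_LOG;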

11.3.1.3 Step 3P: Mount New Standby Database From Previous Production Database

Mounting the new standby database requires the following substeps:

  1. Turn off flashback mode. This deletes the flashback logs, which are obsolete after the standby control file is restored.

ALTER DATABASE FLASHBACK OFF;

  2. Create the standby control file and shut down the database.

ALTER DATABASE CREATE STANDBY CONTROLFILE AS 'controlfile_name';
SHUTDOWN IMMEDIATE;

  3. Issue operating system copy commands to replace the current control files with copies of the new standby control file.

  4. Mount the new standby database with the corresponding standby control file.

STARTUP MOUNT;

  5. Ensure that the standby listener is running.

LSNRCTL STATUS listener_name

11.3.1.4 Step 4P: Archive to New Standby Database From New Production Database

Before the new standby database was created, the remote archive destination on the new production database that points to the standby site probably stopped with an error and is no longer shipping files to that destination. To restart remote archiving, you may have to reenable the standby archive destination.

Query the V$ARCHIVE_DEST_STATUS view to see the current state of the archive destinations.

SELECT DEST_ID, DEST_NAME, STATUS, PROTECTION_MODE, DESTINATION, ERROR, SRL
    FROM V$ARCHIVE_DEST_STATUS;

Enable the remote archive destination.

ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_n=ENABLE;

Switch a redo log file and verify that it was sent successfully.

ALTER SYSTEM SWITCH LOGFILE;
SELECT DEST_ID, DEST_NAME, STATUS, PROTECTION_MODE, DESTINATION, ERROR, SRL
    FROM V$ARCHIVE_DEST_STATUS;

Shipping the archived redo log from the new production database notifies the standby database of the new production database incarnation number.
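On the standby database, a quick check that the new archived redo logs are arriving is to query V$ARCHIVED_LOG for the highest sequence number received for each thread; this is a hedged sketch rather than a required step:

SELECT THREAD#, MAX(SEQUENCE#) AS LAST_RECEIVED
    FROM V$ARCHIVED_LOG
    GROUP BY THREAD#;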

11.3.1.5 Step 5P: Start Managed Recovery

Start managed recovery or real-time apply managed recovery with one of the following statements:

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;

or

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;

Ensure that recovery is applying the archived redo logs.

SELECT * FROM V$MANAGED_STANDBY;

11.3.1.6 Step 6P: Restart MRP After It Encounters the End-of-Redo Marker

The managed recovery process (MRP) stops after it encounters the end-of-redo marker that demarcates when the Data Guard failover was completed in the redo stream. This is not an error. Restart MRP, and it will continue with no problem.
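For example, MRP can be restarted by reissuing one of the managed recovery statements from Step 5P:

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;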

11.3.2 Restoring a Logical Standby Database After a Failover

The following steps are required to restore a logical standby database after a failover:

11.3.2.1 Step 1L: Retrieve END_PRIMARY_SCN

On the new production database, query for the SCN at which the previous standby database became the new production database.

SELECT VALUE AS STANDBY_BECAME_PRIMARY_SCN FROM DBA_LOGSTDBY_PARAMETERS
WHERE NAME = 'END_PRIMARY_SCN';

11.3.2.2 Step 2L: Flash Back the Previous Production Database

You can create a new logical standby database by mounting the previous production database, flashing it back to STANDBY_BECAME_PRIMARY_SCN, and then enabling the previous guard level setting.

SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
FLASHBACK DATABASE TO SCN standby_became_primary_scn;
ALTER DATABASE GUARD [ALL | STANDBY | NONE];

11.3.2.3 Step 3L: Open New Logical Standby Database and Start SQL Apply

ALTER DATABASE OPEN RESETLOGS;
ALTER DATABASE START LOGICAL STANDBY APPLY NEW PRIMARY dblink;

You need to create a database link from the new logical standby database to the new production database if it does not already exist. Use the following syntax:

CREATE PUBLIC DATABASE LINK dblink
  CONNECT TO system IDENTIFIED BY password
  USING 'service_name_of_new_primary_database';

11.4 Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage

The following steps are required to restore full fault tolerance after a scheduled secondary site or clusterwide outage:

11.4.1 Step 1: Start the Standby Database

You may have to restore the standby database from local backups, local tape backups, or from the primary site backups if the data in the secondary site has been damaged. Re-create the standby database from the new production database by following the steps for creating a standby database in Oracle Data Guard Concepts and Administration.

After the standby database has been reestablished, start the standby database.

Type of Standby Database    SQL Statement
Physical                    STARTUP MOUNT;
Logical                     STARTUP;

11.4.2 Step 2: Start Recovery

Type of Standby Database    SQL Statement
Physical                    RECOVER MANAGED STANDBY DATABASE DISCONNECT;
Logical                     ALTER DATABASE START LOGICAL STANDBY APPLY;

11.4.3 Step 3: Verify Log Transport Services on Production Database

You may have to reenable the production database remote archive destination. Query the V$ARCHIVE_DEST_STATUS view first to see the current state of the archive destinations:

SELECT DEST_ID, DEST_NAME, STATUS, PROTECTION_MODE, DESTINATION, ERROR, SRL 
FROM V$ARCHIVE_DEST_STATUS;
ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_n=ENABLE;
ALTER SYSTEM SWITCH LOGFILE;

Verify log transport services between the production and standby databases by checking for errors. Query V$ARCHIVE_DEST and V$ARCHIVE_DEST_STATUS views.

SELECT STATUS, TARGET, LOG_SEQUENCE, TYPE, PROCESS, REGISTER, ERROR 
    FROM V$ARCHIVE_DEST;
SELECT * FROM V$ARCHIVE_DEST_STATUS WHERE STATUS!='INACTIVE';

11.4.4 Step 4: Verify that Recovery is Progressing on Standby Database

For a physical standby database, verify that there are no errors from the managed recovery process and that the recovery has applied the archived redo logs.

SELECT MAX(SEQUENCE#), THREAD# FROM V$LOG_HISTORY GROUP BY THREAD#;
SELECT PROCESS, STATUS, THREAD#, SEQUENCE#, CLIENT_PROCESS
    FROM V$MANAGED_STANDBY;

For a logical standby database, verify that there are no errors from the logical standby process and that the recovery has applied the archived redo logs.

SELECT THREAD#, SEQUENCE# SEQ# 
    FROM DBA_LOGSTDBY_LOG LOG, DBA_LOGSTDBY_PROGRESS PROG 
    WHERE PROG.APPLIED_SCN BETWEEN LOG.FIRST_CHANGE# AND LOG.NEXT_CHANGE# 
    ORDER BY NEXT_CHANGE#;

11.4.5 Step 5: Restore Production Database Protection Mode

If you had to change the protection mode of the production database from maximum protection to either maximum availability or maximum performance because of the standby database outage, then change the production database protection mode back to maximum protection depending on your business requirements.

ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE [PROTECTION | AVAILABILITY];

11.5 Restoring Fault Tolerance after a Standby Database Data Failure

Following an unplanned outage of the standby database that requires a full or partial datafile restoration (such as data or media failure), full fault tolerance is compromised until the standby database is brought back into service. Full database protection should be restored as soon as possible. Note that using a Hardware Assisted Resilient Database configuration can prevent this type of problem.

The following steps are required to restore full fault tolerance after data failure of the standby database:

11.5.1 Step 1: Fix the Cause of the Outage

The root cause of the outage should be investigated and action taken to prevent the problem from occurring again.

11.5.2 Step 2: Restore the Backup of Affected Datafiles

Only the affected datafiles need to be restored to the standby site.
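One way to identify the affected datafiles while the standby database is mounted is to query V$RECOVER_FILE, which lists the files that need media recovery along with the associated error; this is a hedged sketch rather than a required step:

SELECT FILE#, ERROR, ONLINE_STATUS
    FROM V$RECOVER_FILE;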

11.5.3 Step 3: Restore Required Archived Redo Log Files

Archived redo log files may need to be restored to recover the restored data files up to the configured lag.

For physical standby databases:

  • If the archived redo logs required for recovery are available on the standby system in a configured archive destination, then the managed recovery process automatically finds and applies them as needed. No restoration is necessary.

  • If the required archived redo logs have been deleted from the standby system but are still available on the production system, then the fetch archive log (FAL) process is automatically invoked to transfer them to the standby system. No restoration is necessary.
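For the FAL mechanism to operate, the FAL_SERVER and FAL_CLIENT initialization parameters must be set on the physical standby database. The following is a minimal sketch in which the Oracle Net alias names are placeholders:

ALTER SYSTEM SET FAL_SERVER='new_production_net_alias';
ALTER SYSTEM SET FAL_CLIENT='standby_net_alias';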

For logical standby databases, initiate complete media recovery for the affected files. Consider the following:

  • If the archived redo logs required for recovery are available on the standby system in a configured archive destination, then the recovery process automatically finds and applies them as needed. No restoration is necessary.

  • If the required archived redo logs have been deleted from the standby system, then they must be restored to the standby system. Complete media recovery for the affected files after the necessary archived redo logs are available.

11.5.4 Step 4: Start the Standby Database

After the standby database has been reestablished, start the standby database.

Type of Standby Database    SQL Statement
Physical                    STARTUP MOUNT;
Logical                     STARTUP;

11.5.5 Step 5: Start Recovery or Apply

Type of Standby Database    SQL Statement
Physical                    RECOVER MANAGED STANDBY DATABASE DISCONNECT;
Logical                     ALTER DATABASE START LOGICAL STANDBY APPLY;

11.5.6 Step 6: Verify Log Transport Services On the Production Database

Verify log transport services on the new production database by checking for errors when querying V$ARCHIVE_DEST and V$ARCHIVE_DEST_STATUS.

SELECT STATUS, TARGET, LOG_SEQUENCE, TYPE, PROCESS, REGISTER, ERROR FROM V$ARCHIVE_DEST;

SELECT * FROM V$ARCHIVE_DEST_STATUS WHERE STATUS != 'INACTIVE';

11.5.7 Step 7: Verify that Recovery or Apply Is Progressing On the Standby Database

For a physical standby database, verify that there are no errors from the managed recovery process and that the recovery has applied archived redo logs.

SELECT MAX(SEQUENCE#), THREAD# FROM V$LOG_HISTORY GROUP BY THREAD#;
SELECT PROCESS, STATUS, THREAD#, SEQUENCE#, CLIENT_PROCESS
    FROM V$MANAGED_STANDBY;

For a logical standby database, verify that there are no errors from the logical standby process and that the recovery has applied archived redo logs.

SELECT THREAD#, SEQUENCE# SEQ# 
    FROM DBA_LOGSTDBY_LOG LOG, DBA_LOGSTDBY_PROGRESS PROG
    WHERE PROG.APPLIED_SCN BETWEEN LOG.FIRST_CHANGE# AND LOG.NEXT_CHANGE#
    ORDER BY NEXT_CHANGE#;

11.5.8 Step 8: Restore Production Database Protection Mode

If you had to change the protection mode of the production database from maximum protection to either maximum availability or maximum performance because of the standby database outage, then change the production database protection mode back to maximum protection depending on your business requirements.

ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE [PROTECTION | AVAILABILITY];

11.6 Restoring Fault Tolerance After the Production Database Has Opened Resetlogs

If the production database is activated because it was flashed back to correct a logical error, or because it was restored and recovered to a point in time, then the corresponding standby database may require additional maintenance. No additional work is required if the production database completed recovery without opening with the RESETLOGS option.

After activating the production database, execute the queries in the following table.

Database                     Query
Production database          SELECT TO_CHAR(RESETLOGS_CHANGE# - 2) FROM V$DATABASE;
Physical standby database    SELECT TO_CHAR(CURRENT_SCN) FROM V$DATABASE;
Logical standby database     SELECT APPLIED_SCN FROM DBA_LOGSTDBY_PROGRESS;

11.6.1 Scenario 1: SCN on Standby is Behind Resetlogs SCN on Production

Physical standby database:

  1. Ensure that the standby database has received a new archived redo log file from the production database.

  2. Restart recovery.

Logical standby database:

  Ensure that the standby database has received a new archived redo log file from the production database. See "Step 3: Verify Log Transport Services on Production Database".

11.6.2 Scenario 2: SCN on Standby is Ahead of Resetlogs SCN on Production

Physical standby database:

  1. Ensure that the standby database has received a new archived redo log file from the production database.

  2. Flash back the database to the SCN that is 2 SCNs before the RESETLOGS operation occurred.

SHUTDOWN IMMEDIATE; /* if necessary */
STARTUP MOUNT;
FLASHBACK DATABASE TO SCN resetlogs_change#_minus_2;

  3. Restart recovery. See "Step 5P: Start Managed Recovery".

Logical standby database:

  1. Retrieve the production database flashback time or SCN. The flashback time or SCN needs to be extracted from the production database alert log.

  2. Stop SQL Apply on the logical standby database.

ALTER DATABASE STOP LOGICAL STANDBY APPLY;
SELECT APPLIED_SCN FROM DBA_LOGSTDBY_PROGRESS;

  3. Flash back the logical standby database. Issue the following SQL statements to flash back the logical standby database to the same time used to flash back the primary database.

SHUTDOWN;
STARTUP MOUNT EXCLUSIVE;
FLASHBACK DATABASE TO TIMESTAMP time_of_primary_database_flashback;
ALTER DATABASE OPEN READ ONLY;
SELECT APPLIED_SCN FROM DBA_LOGSTDBY_PROGRESS;

  4. Open the logical standby database with the RESETLOGS option.

SHUTDOWN;
STARTUP MOUNT EXCLUSIVE;
ALTER DATABASE OPEN RESETLOGS;

  5. Archive the current log on the primary database.

ALTER SYSTEM ARCHIVE LOG CURRENT;

  6. Start SQL Apply.

ALTER DATABASE START LOGICAL STANDBY APPLY;

11.7 Restoring Fault Tolerance after Dual Failures

If a dual failure affecting both the standby and production databases occurs, then you need to re-create the production database first. Because the sites are identical, the production database can be created wherever the most recent backup resides.

Table 11-3 summarizes the recovery strategy depending on the type of backups that are available.

Table 11-3 Re-Creating the Production and Standby Databases

Available backups: Local backups on both the production and standby databases
Re-creating the production database: Restore the backup from the production database. Recover and activate the database as the new production database.

Available backups: Local backup only on the standby database, with tape backups on the standby database
Re-creating the production database: Restore the local standby backup to the standby database. Recover and activate the database as the new production database.

Available backups: Tape backups only
Re-creating the production database: Restore the tape backups locally. Recover the database and activate it as the new production database.

After the production database is re-created, follow the steps for creating a new standby database that are described in Oracle Data Guard Concepts and Administration.