Oracle® Database High Availability Overview
11g Release 1 (11.1)

Part Number B28281-01

1 Overview of High Availability

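This chapter contains the following sections:

  • Introduction to High Availability

  • What is Availability?

  • Importance of Availability

  • Causes of Downtime

  • What Does This Book Contain?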

1.1 Introduction to High Availability

Databases and the Internet have enabled worldwide collaboration and information sharing by extending the reach of database applications throughout organizations and communities. This reach emphasizes the importance of high availability in data management solutions. Both small businesses and global enterprises have users all over the world who require access to data 24 hours a day. Without this data access, operations can stop, and revenue is lost. Users, who have become more dependent upon their solutions, now demand service-level agreements from their Information Technology (IT) departments and solutions providers. Increasingly, availability is measured in dollars, euros, and yen, not just in time and convenience.

Enterprises have used their IT infrastructure to provide a competitive advantage, increase productivity, and empower users to make faster and more informed decisions. However, with these benefits has come an increasing dependence on that infrastructure. If a critical application becomes unavailable, then the business can be in jeopardy. Revenue and customers can be lost, penalties can be owed, and bad publicity can have a lasting effect on customers and a company's stock price. It is critical to examine the factors that determine how your data is protected and maximize availability to your users.

1.2 What is Availability?

Availability is the degree to which an application, service, or functionality is available upon demand. Availability is measured by the perception of an application's end user. End users experience frustration when their data is unavailable or not performing within certain expectations, and they do not understand or care to differentiate between the complex components of an overall solution. Performance failures due to higher than expected usage create the same havoc as the failure of critical components in the solution.

Reliability, recoverability, timely error detection, and continuous operations are the primary characteristics of a highly available solution.

More specifically, a high availability architecture should have the following traits:

1.3 Importance of Availability

The importance of high availability varies among applications. However, the need to deliver increasing levels of availability continues to accelerate as enterprises re-engineer their solutions to gain competitive advantage. Most often, these new solutions rely on immediate access to critical business data. When data is not available, the operation can cease to function. Downtime can lead to lost productivity, lost revenue, damaged customer relationships, bad publicity, and lawsuits.

It is not always easy to place a direct cost on downtime. Angry customers, idle employees, and bad publicity are all costly, but not directly measured in currency. On the other hand, lost revenue and legal penalties incurred because service-level agreement (SLA) objectives are not met can easily be quantified. The cost of downtime can grow quickly in industries that depend on their solutions to provide service.

Other factors to consider in the cost of downtime are the maximum tolerable length of a single unplanned outage, and the maximum frequency of allowable incidents. If the event lasts less than 30 seconds, then it may cause very little impact and may be barely perceptible to end users. As the length of the outage grows, the effect may grow exponentially and result in a negative impact on the business. Alternatively, frequent outages, even if short in duration, may similarly disrupt business operations. When designing a solution, it is important to understand the true cost of downtime to understand how the business can benefit by availability improvements.
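
The factors above (total downtime, the longest single outage, and the number of incidents) can be combined in a small calculation. The following is a minimal sketch in Python, not an Oracle tool: the names `Outage`, `availability_pct`, `sla_violations`, and `downtime_cost`, as well as the thresholds and the hourly cost figure, are assumptions chosen for illustration. It covers only directly measurable cost and ignores soft costs such as bad publicity.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One unplanned outage, with its duration in seconds (hypothetical type)."""
    duration_s: float


def availability_pct(outages, period_s):
    """Percentage of the period during which the service was up."""
    downtime_s = sum(o.duration_s for o in outages)
    return 100.0 * (period_s - downtime_s) / period_s


def sla_violations(outages, max_outage_s, max_incidents):
    """Return the reasons the targets were missed (empty list if none)."""
    problems = []
    longest = max((o.duration_s for o in outages), default=0.0)
    if longest > max_outage_s:
        problems.append(
            f"longest outage {longest:.0f}s exceeds {max_outage_s:.0f}s")
    if len(outages) > max_incidents:
        problems.append(
            f"{len(outages)} incidents exceed limit of {max_incidents}")
    return problems


def downtime_cost(outages, cost_per_hour):
    """Directly measurable cost only, for example lost revenue."""
    return sum(o.duration_s for o in outages) / 3600.0 * cost_per_hour


if __name__ == "__main__":
    month_s = 30 * 24 * 3600  # assume a 30-day month
    outages = [Outage(20), Outage(25), Outage(1800)]  # made-up sample data
    print(f"availability: {availability_pct(outages, month_s):.4f}%")
    print(sla_violations(outages, max_outage_s=600, max_incidents=2))
    print(f"direct cost: {downtime_cost(outages, cost_per_hour=10000):.2f}")
```

Tracking the longest outage and the incident count separately from the overall percentage reflects the point made above: a single long outage and many short ones can yield the same availability figure while affecting the business very differently.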

Oracle provides a range of high availability solutions that fit every organization regardless of size. Small workgroups and global enterprises alike are able to extend the reach of their critical business applications. With Oracle and the Internet, applications and their data are now reliably accessible everywhere, at any time.

1.4 Causes of Downtime

One of the challenges in designing a high availability solution is examining and addressing all the possible causes of downtime. It is important to consider causes of both unplanned and planned downtime when designing a fault tolerant and resilient IT infrastructure. Planned downtime can be just as disruptive to operations, especially in global enterprises that support users in multiple time zones.

Table 1-1 describes unplanned outage types and provides examples of each type.

Table 1-1 Causes of Unplanned Downtime

Type Description Examples

Computer failure

A computer failure outage occurs when the system running the database becomes unavailable because it has crashed or is no longer accessible.

  • Database system hardware failure

  • Operating system failure

  • Oracle instance failure

  • Network interface failure

Storage failure

A storage failure outage occurs when the storage holding some or all of the database contents becomes unavailable because it has shut down or is no longer accessible.

  • Disk drive failure

  • Disk controller failure

  • Storage array failure

Human error

A human error outage occurs when unintentional or malicious actions are committed that cause data within the database to become logically corrupt or unusable. The service level impact of a human error outage can vary significantly depending on the amount and critical nature of the affected data.

  • File deletion (at the file system level)

  • Dropped database object

  • Inadvertent data changes

  • Malicious data changes

Data corruption

A corrupt block is a block that has been changed so that it differs from what Oracle Database expects to find. Block corruptions fall under two categories: physical and logical block corruptions. In a physical corruption, which is also called a media corruption, the database does not recognize the block at all: the checksum is invalid, the block contains all zeros, or the header and footer of the block do not match. In a logical corruption, the contents of the block are logically inconsistent. Examples of logical corruption include corruption of a row piece or index entry.

Block corruptions can also be divided into interblock corruption and intrablock corruption. In intrablock corruption, the corruption occurs within the block itself and can be either physical or logical corruption. In an interblock corruption, the corruption occurs between blocks and can only be logical corruption.

A data corruption outage occurs when a hardware, software or network component causes corrupt data to be read or written. The service level impact of a data corruption outage may vary, from a small portion of the database (down to a single database block) to a large portion of the database (making it essentially unusable).

  • Operating system or storage device driver

  • Host bus adapter

  • Disk controller

  • Volume manager error causing bad disk reads or writes

  • Software defects

Lost writes

A lost write is another form of data corruption, but it is much more difficult to detect and repair quickly. A data block lost write occurs when:

  • An I/O subsystem acknowledges the completion of the block write, while in fact the write did not occur in the persistent storage. On a subsequent block read on the primary database, the I/O subsystem returns the stale version of the data block, which might be used to update other blocks of the database, thereby corrupting it.

  • The write I/O completed but it was written somewhere else, and a subsequent read operation returns the stale value.

  • A read I/O from one cluster node returns stale data after a write I/O on another node. For example, this could occur if an NFS caching policy is incompatible with Oracle RAC.

  • Operating system or storage device driver

  • Host bus adapter

  • Disk controller

  • Volume manager error

  • Other application software

  • NFS write visibility across cluster

Hang or slow down

A hang or slowdown occurs when the database or application is unable to process transactions because of resource or lock contention. A perceived hang can be caused by a lack of system resources.

  • Database or application deadlocks

  • Runaway processes that consume system resources

  • Logon storms or system faults

  • Combination of application peaks with lack of system or database resources

  • Archived log destination or flash recovery area becoming full

Site failure

A site failure outage occurs when an event causes all or a significant portion of an application to stop processing or slow to an unusable service level. A site failure may affect all processing at a data center, or a subset of applications supported by a data center.

  • Extended site-wide power failure

  • Site-wide network failure

  • Natural disaster making a data center inoperable

  • Terrorist or malicious attack on operations or the site
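
The physical corruption symptoms described under Data corruption in Table 1-1 (an invalid checksum, a block of all zeros, and a header and footer that do not match) can be illustrated with a toy block format. The following Python sketch is an assumption made for illustration only: the 64-byte layout, the `make_block` and `check_block` names, and the use of CRC32 are hypothetical and do not represent the Oracle Database block format or its checksum algorithm.

```python
import struct
import zlib

BLOCK_SIZE = 64                 # toy size; real database blocks are much larger
HEADER = struct.Struct(">II")   # (sequence tag, CRC32 of the body)
FOOTER = struct.Struct(">I")    # (sequence tag repeated at the end)


def make_block(tag, payload):
    """Build a toy block: header, zero-padded body, footer."""
    room = BLOCK_SIZE - HEADER.size - FOOTER.size
    if len(payload) > room:
        raise ValueError("payload too large")
    body = payload.ljust(room, b"\x00")
    return HEADER.pack(tag, zlib.crc32(body)) + body + FOOTER.pack(tag)


def check_block(block):
    """Return 'ok' or the first physical-corruption symptom found."""
    if len(block) != BLOCK_SIZE:
        return "wrong size"
    if block == bytes(BLOCK_SIZE):
        return "all zeros"
    head_tag, stored_crc = HEADER.unpack_from(block, 0)
    (tail_tag,) = FOOTER.unpack_from(block, BLOCK_SIZE - FOOTER.size)
    if head_tag != tail_tag:
        return "header/footer mismatch"
    body = block[HEADER.size:BLOCK_SIZE - FOOTER.size]
    if zlib.crc32(body) != stored_crc:
        return "invalid checksum"
    return "ok"


if __name__ == "__main__":
    good = make_block(7, b"row data")
    damaged = bytearray(good)
    damaged[10] ^= 0xFF         # flip bits inside the body
    print(check_block(good))
    print(check_block(bytes(damaged)))
    print(check_block(bytes(BLOCK_SIZE)))
```

Checks of this kind can only reveal physical corruption. A logically corrupt block, or a stale block returned after a lost write, can be well formed and pass all of them, which is one reason those failures are harder to detect.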


Table 1-2 describes planned outage types and provides examples of each type.

Table 1-2 Causes of Planned Downtime

Type Description Examples

System and database changes

Planned system changes occur when performing routine and periodic maintenance operations and new deployments.

Planned system changes include any scheduled changes to the operating environment that occur outside the organizational data structure within the database.

The service level impact of a planned system change varies significantly depending on the nature and scope of the planned outage, the testing and validation efforts made prior to implementing the change, and the technologies and features in place to minimize the impact.

  • Adding/removing processors to/from an SMP server

  • Adding/removing nodes to/from a cluster

  • Adding/removing disk drives or storage arrays

  • Changing configuration parameters

  • Upgrading/patching system hardware and software

  • Upgrading/patching Oracle software

  • Upgrading/patching application software

  • System platform migration

  • Database relocation

  • Moving from a 32-bit to a 64-bit platform

  • Migrating to cluster architecture

  • Migrating to new storage

Data changes

Planned data changes occur when there are changes to the logical structure or physical organization of Oracle Database objects. The primary objective of these changes is to improve performance or manageability.

  • Table definition changes

  • Adding table partitioning

  • Creating and rebuilding indexes

Application changes

Planned application changes may include data changes as well as schema and programmatic changes. The primary objective of these changes is to improve performance, manageability, and functionality.

  • Application upgrades


Oracle offers high availability solutions to help avoid both unplanned and planned downtime, as well as recover from failures. Chapter 2 discusses each of these high availability solutions in detail.

1.5 What Does This Book Contain?

Choosing and implementing the architecture that best fits your availability requirements can be a daunting task. This architecture must:

To help you select the most suitable architecture for your organization, this book describes several high availability architectures and provides guidelines for choosing the one that best meets your requirements. Knowledge of the Oracle Database server, Oracle Real Application Clusters, and Oracle Data Guard terminology is required to understand the configuration and implementation details.

Chief technology officers and information technology architects can benefit from reading the following chapters:

Database administrators and network administrators can find useful information in the following chapters:

Oracle high availability best practice recommendations can be found in Oracle Database High Availability Best Practices and in the white papers that can be downloaded from:

http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm