CHAPTER 1
Product Overview
This chapter provides an overview of the Online CPU Diagnostics Monitor software and architecture, and includes the following sections:
The Online CPU Diagnostics Monitor (CDM) is an online CPU diagnostic program for platforms based on the UltraSPARC® III family of processors. CDM continuously verifies that the processors in the system are functioning properly, and improves system reliability by detecting faulty CPUs and taking action on them.
CDM is designed to run on production systems. Because CDM guards against CPU failures, installing it on production systems is highly recommended.
When using the default configuration, the expected performance impact of CDM is either negligible or less than 1% CPU usage for most applications measured in a steady state.
The following diagram depicts the general architecture of CDM.
CDM periodically runs CPU diagnostics on the system and, when it finds a fault, takes appropriate action such as taking the faulty CPU offline and logging error messages. The primary components of CDM are a CPU diagnostics test, a daemon to schedule the test, and a user configuration file.
The main components and interfaces of CDM are:
The CPU Diagnostics Monitor daemon, /usr/lib/sparcv9/cpudiagd, is started from a system startup script during the boot sequence. The daemon reads the configuration file at startup and is responsible for scheduling tests and performing fault management based on test results.
The cpudiagd daemon invocation syntax is as follows:
# /usr/lib/sparcv9/cpudiagd [-vdi]
cpudiagd exits with exit code 0 on success and 1 on error.
The CPU diagnostics test, /usr/platform/sun4u/sbin/sparcv9+vis2/cputst, is provided for all platforms that support the sparcv9+vis2 instruction set architecture. Platforms based on the UltraSPARC III family of processors support the sparcv9+vis2 instruction set.
The test supports options to run at three levels of stress: low, medium, and high. The test invocation syntax is as follows:
# cputst [ -s 1|2|3 ] [-d dev_id] [-vnh]
cputst exits with one of the following exit codes:
The configuration file /etc/cpudiagd.conf can be used to configure CDM. This file is read by the cpudiagd daemon on startup. You can force reconfiguration after updating the configuration file by sending SIGHUP to the daemon.
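A minimal sketch of forcing that reconfiguration from a shell. It assumes the daemon's process name is cpudiagd (taken from its path) and that pgrep(1) is available. The reload_daemon helper is hypothetical and is not part of CDM.

```shell
# Hypothetical helper: send SIGHUP to every process with the given name.
# Assumes pgrep(1) is installed; returns 1 if no such process is found.
reload_daemon() {
    name=$1
    pids=$(pgrep -x "$name") || {
        echo "no running process named $name" >&2
        return 1
    }
    # Word splitting on $pids is intentional: pgrep prints one PID per line.
    kill -HUP $pids
}

# Usage (as root, after editing /etc/cpudiagd.conf):
#   reload_daemon cpudiagd
```

The helper reports failure instead of silently doing nothing, so a startup or maintenance script can tell whether the daemon was actually running when the configuration changed.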
For a detailed description of the parameters supported by this configuration file, see Appendix C or the online manual page for cpudiagd.conf(4). The supported parameters in the configuration file are:
The above parameters can be used to schedule cputst at different stress levels: low, medium and high respectively. The suffixes s, m, and h are used for seconds, minutes, and hours respectively. The value of 0 disables testing at the specified stress level. By default, the tests are scheduled at the system-defined scheduling intervals. (See TABLE 3-1 on page 14 for the system-defined scheduling intervals.)
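The suffix rules can be illustrated with a small shell sketch that converts an interval value into seconds. The function name interval_to_seconds is hypothetical, and the strictness shown here (a suffix is required for nonzero values, and leading zeros are rejected) is an assumption; cpudiagd.conf(4) is the authority on what the daemon actually accepts.

```shell
# Convert an interval such as 30s, 15m, or 12h into seconds.
# Prints 0 for the value 0, which disables testing at that stress level.
# Assumption: nonzero values must carry an s, m, or h suffix.
interval_to_seconds() {
    value=$1
    if [ "$value" = 0 ]; then
        echo 0
        return 0
    fi
    number=${value%[smh]}       # value with one trailing suffix removed
    suffix=${value#"$number"}   # the suffix that was removed, if any
    case $number in
        '' | *[!0-9]* | 0?*)
            echo "invalid interval: $value" >&2
            return 1 ;;
    esac
    case $suffix in
        s) echo "$number" ;;
        m) echo $((number * 60)) ;;
        h) echo $((number * 3600)) ;;
        *) echo "invalid interval: $value" >&2
           return 1 ;;
    esac
}

# Usage:
#   interval_to_seconds 15m    # prints 900
#   interval_to_seconds 12h    # prints 43200
```

The example values are illustrative only and are not the system-defined scheduling intervals.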
This parameter specifies a user-provided script or binary that is executed after a faulty CPU is detected.
The faulty processor ID is passed to the script by setting the environment variable CPU_ID_FAILED to the decimal value of the processor ID.
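A minimal sketch of what such a user-provided script might contain. Only the CPU_ID_FAILED environment variable comes from the description above. The function name handle_cpu_fault and the message text are hypothetical.

```shell
# Hypothetical fault handler body. CDM is described as setting
# CPU_ID_FAILED to the decimal processor ID before running the script.
handle_cpu_fault() {
    if [ -z "${CPU_ID_FAILED:-}" ]; then
        echo "CPU_ID_FAILED is not set" >&2
        return 1
    fi
    echo "CDM reported a fault on processor ${CPU_ID_FAILED}"
}

# Usage: simulate an invocation for processor 3.
#   CPU_ID_FAILED=3 handle_cpu_fault
```

In a real script the message could be sent to an administrator or written to the system log instead of standard output. Checking that the variable is set guards against the script being run by hand outside CDM.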
This parameter specifies the maximum number of backup logs to maintain for CDM-specific information and error logs. The minimum value is 1 and the maximum value is 100. By default, only one backup log is maintained.
This parameter is the maximum log size in Kbytes. The minimum value is 1 and the maximum value is 1000000, which means the minimum log size is 1 Kbyte and the maximum size is 1 Gbyte. The default value is 1000 which means 1 Mbyte.
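The documented limits for the two logging parameters can be checked before editing the configuration file. The sketch below is a hypothetical pre-flight check, not something CDM provides. The check_range helper and the labels passed to it are illustrative names.

```shell
# check_range LABEL VALUE MIN MAX
# Succeeds silently when VALUE is an integer within MIN..MAX;
# otherwise prints a message to stderr and returns 1.
check_range() {
    case $2 in
        '' | *[!0-9]*)
            echo "$1: not a non-negative integer: $2" >&2
            return 1 ;;
    esac
    if [ "$2" -lt "$3" ] || [ "$2" -gt "$4" ]; then
        echo "$1: $2 is outside $3..$4" >&2
        return 1
    fi
}

# Usage, with the limits stated above:
#   check_range "backup log count" 1 1 100
#   check_range "log size in Kbytes" 1000 1 1000000
```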
This parameter enables or disables the logging of test execution statistics. By default, it is enabled (which is equivalent to specifying yes). However, if power management is enabled, this informational logging is disabled by default.
Specifying LOG_ENABLE_INFO_STATS=no disables this logging.
Note - All blank lines are ignored. All lines for which the first non-whitespace character is a pound sign (#) are treated as comments.
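The comment and blank-line rules can be sketched as a small filter that prints only the lines the daemon would treat as settings. The effective_lines function is hypothetical, and treating whitespace-only lines as blank is an assumption.

```shell
# Read configuration text on stdin and print only the effective lines:
# blank (or whitespace-only) lines and comment lines are dropped.
effective_lines() {
    sed -e '/^[[:space:]]*$/d' -e '/^[[:space:]]*#/d'
}

# Usage:
#   effective_lines < /etc/cpudiagd.conf
```

Running the filter over a file that contains a comment, an empty line, and the line LOG_ENABLE_INFO_STATS=no leaves only the LOG_ENABLE_INFO_STATS=no line.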
Copyright © 2003, Sun Microsystems, Inc. All rights reserved.