CHAPTER 1

Product Overview

This chapter provides an overview of the Online CPU Diagnostics Monitor™ software and its architecture, and includes the following sections:

  • What is the Online CPU Diagnostics Monitor?
  • Online CPU Diagnostics Monitor Architecture


What is the Online CPU Diagnostics Monitor?

The Online CPU Diagnostics Monitor (CDM) is an online CPU diagnostic program for UltraSPARC® III based platforms that continuously verifies the proper functioning of the processors in the system. CDM improves system reliability by detecting faulty CPUs and taking action on them.


Online CPU Diagnostics Monitor Architecture

The following diagram depicts the general architecture of CDM.

FIGURE 1-1 CDM General Architecture

[Figure: diagram of the CDM architecture showing cpudiagd, cputst, and cpudiagd.conf.]

CDM periodically runs CPU diagnostics on the system and takes appropriate action when a fault is detected, such as offlining the faulty CPU and logging error messages. The primary components of CDM are a CPU diagnostics test, a daemon that schedules the test, and a user configuration file.

CDM Components

The main components and interfaces of CDM are:

  • cputst - CPU diagnostics test
  • cpudiagd - CPU diagnostics monitor daemon
  • /etc/cpudiagd.conf - CPU diagnostics monitor configuration file

CPU Diagnostics Monitor Daemon

The CPU diagnostics monitor daemon, /usr/lib/sparcv9/cpudiagd, is started from a system startup script during the boot sequence. The daemon reads the configuration file at startup and is responsible for scheduling tests and performing fault management based on the test results.

The cpudiagd daemon invocation syntax is as follows:

# /usr/lib/sparcv9/cpudiagd [-vdi]

TABLE 1-1 cpudiagd Command-Line Syntax

Option    Description
-v        Verbose mode. Prints verbose messages to stdout. Specifying verbose mode also places the daemon in non-daemon mode.
-d        Runs the daemon in the foreground, in non-daemon mode.
-i        Performs initial testing during system boot before becoming a daemon.


cpudiagd Return Exit Code

cpudiagd exits with exit code 0 on success and 1 on error.

CPU Diagnostics Test

The CPU diagnostics test, /usr/platform/sun4u/sbin/sparcv9+vis2/cputst, is provided for all platforms that support the sparcv9+vis2 instruction set architecture. UltraSPARC III based platforms support the sparcv9+vis2 instruction set.

The test can be run at three stress levels: low, medium, and high. The test invocation syntax is as follows:

# cputst [-s 1|2|3] [-d dev_id] [-vnh]

TABLE 1-2 cputst Command-Line Syntax

Option       Description
-s 1|2|3     Specifies one of the following stress levels:
               • 1 - Perform low stress testing
               • 2 - Perform medium stress testing
               • 3 - Perform high stress testing
             By default, low stress testing is performed.
-d dev_id    Specifies the processor ID to be tested. If this option is not specified, all CPUs are tested sequentially.
-v           Prints verbose messages to stdout.
-n           Notifies the cpudiagd daemon on fault. If this option is not specified, errors are printed to stderr, and no information about detected faulty CPUs is communicated to the cpudiagd daemon.
-h           Displays a help message.
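
As an illustration of how the options in TABLE 1-2 combine, the following Python sketch assembles a cputst argument list. The helper name build_cputst_argv is hypothetical and is not part of CDM; only the path and the flags come from this chapter.

```python
# Path to the test binary as given in this chapter.
CPUTST = "/usr/platform/sun4u/sbin/sparcv9+vis2/cputst"


def build_cputst_argv(stress=None, cpu_id=None, verbose=False, notify=False):
    """Return an argument list for cputst (hypothetical helper, not part of CDM)."""
    argv = [CPUTST]
    if stress is not None:
        if stress not in (1, 2, 3):
            raise ValueError("stress level must be 1, 2, or 3")
        argv += ["-s", str(stress)]
    if cpu_id is not None:
        if cpu_id < 0:
            raise ValueError("processor ID must not be negative")
        argv += ["-d", str(cpu_id)]
    if verbose:
        argv.append("-v")
    if notify:
        argv.append("-n")
    return argv
```

On a system where CDM is installed, the resulting list could be passed to subprocess.call() to run the test. With no arguments the helper returns only the binary path, which corresponds to the documented defaults: low stress testing of all CPUs.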


cputst Return Exit Code

cputst exits with one of the following exit codes:

  • 0 - Successful execution; the CPU is functioning properly.
  • 1 - At least one CPU failed.
  • 2 - Internal software error (for example, a malloc failure).
  • 3 - Command-line usage error.
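
A wrapper script that runs cputst might interpret these exit codes along the following lines. This is a minimal sketch; the names CPUTST_EXIT_MEANINGS, describe_cputst_exit, and cpu_fault_detected are hypothetical and are not provided by CDM.

```python
# Exit codes as listed above; any other value is treated as undocumented.
CPUTST_EXIT_MEANINGS = {
    0: "success: the CPU is functioning properly",
    1: "at least one CPU failed",
    2: "internal software error",
    3: "command-line usage error",
}


def describe_cputst_exit(code):
    """Return a readable description of a cputst exit code."""
    return CPUTST_EXIT_MEANINGS.get(code, "undocumented exit code %d" % code)


def cpu_fault_detected(code):
    """Return True only for exit code 1, the one that reports a failed CPU."""
    return code == 1
```

Keeping the fault check separate from the description lets a caller distinguish a real CPU failure (code 1) from a problem with the test run itself (codes 2 and 3).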

Configuration File

The configuration file /etc/cpudiagd.conf can be used to configure CDM. This file is read by the cpudiagd daemon on startup. You can force reconfiguration after updating the configuration file by sending SIGHUP to the daemon.
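
The reload request can also be sent from a program. The following Python sketch sends SIGHUP to a process ID supplied by the caller. How the daemon's PID is obtained is not described in this chapter, so it is left to the caller, and the function name request_reload is hypothetical.

```python
import os
import signal


def request_reload(pid, kill=os.kill):
    """Send SIGHUP to the cpudiagd process identified by pid.

    The kill argument exists only so the function can be exercised
    without signalling a real process.
    """
    if pid <= 0:
        # Zero and negative values address process groups, not one process.
        raise ValueError("pid must be a positive process ID")
    kill(pid, signal.SIGHUP)
```

Sending the signal normally requires sufficient privileges over the target process, typically root for a system daemon.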

For a detailed description of the parameters supported by this configuration file, see Appendix C or the online manual page for cpudiagd.conf(4). The configuration file supports the following parameters:

  • CPU_TEST_FREQ_MIN_STRESS=DEFAULT|[0-9]+[smh]
  • CPU_TEST_FREQ_MED_STRESS=DEFAULT|[0-9]+[smh]
  • CPU_TEST_FREQ_HIGH_STRESS=DEFAULT|[0-9]+[smh]

The above parameters schedule cputst at the low, medium, and high stress levels, respectively. The suffixes s, m, and h denote seconds, minutes, and hours. A value of 0 disables testing at the specified stress level. By default, the tests are scheduled at the system-defined scheduling intervals. (See TABLE 3-1 on page 14 for the system-defined scheduling intervals.)
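
To make the value syntax concrete, the following Python sketch converts a frequency value into seconds. It follows the pattern DEFAULT|[0-9]+[smh] literally, so a bare 0 without a suffix is rejected; whether the daemon itself accepts that form is not stated here. The function name parse_test_interval is hypothetical.

```python
import re

# Grammar taken from the parameter list: DEFAULT, or digits followed by s, m, or h.
_INTERVAL_RE = re.compile(r"(?:DEFAULT|([0-9]+)([smh]))")
_SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600}


def parse_test_interval(value):
    """Return the interval in seconds, 0 if disabled, or None for DEFAULT."""
    match = _INTERVAL_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError("invalid interval: %r" % (value,))
    if match.group(1) is None:
        return None  # DEFAULT: use the system-defined interval
    return int(match.group(1)) * _SECONDS_PER_UNIT[match.group(2)]
```

For example, a value of 30m yields 1800 seconds and 2h yields 7200 seconds, while 0s yields 0, which corresponds to testing being disabled at that stress level.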

CPU_ON_FLT_EXEC=command_line

This parameter specifies a user-provided script or binary that is executed after a faulty CPU is detected.

The faulty processor ID is passed to the script by setting the environment variable CPU_ID_FAILED to the decimal value of the processor ID.
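
A user-provided handler could read the processor ID as shown in the following Python sketch. The function name failed_cpu_id and the messages it prints are illustrative assumptions; only the CPU_ID_FAILED variable and its decimal format come from the description above.

```python
import os


def failed_cpu_id(environ=os.environ):
    """Return the faulty processor ID set by cpudiagd, or None if it is unset."""
    raw = environ.get("CPU_ID_FAILED")
    if raw is None:
        return None
    return int(raw, 10)  # the value is documented as decimal


if __name__ == "__main__":
    cpu_id = failed_cpu_id()
    if cpu_id is None:
        print("CPU_ID_FAILED is not set; was this script started by cpudiagd?")
    else:
        print("cpudiagd reported a fault on CPU %d" % cpu_id)
```

A real handler would replace the print calls with a site-specific action, such as notifying an administrator.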

LOG_MAX_NUM_BACKUPS=[0-9]+

This parameter specifies the maximum number of backup logs to maintain for the CDM-specific information and error logs. The minimum value is 1 and the maximum value is 100. By default, only one backup log is maintained.

LOG_MAX_SIZE=[0-9]+

This parameter specifies the maximum log size in Kbytes. The minimum value is 1 (1 Kbyte) and the maximum value is 1000000 (1 Gbyte). The default value is 1000 (1 Mbyte).

LOG_ENABLE_INFO_STATS=yes|no

This parameter specifies whether statistics about test execution are logged in an information log. By default, statistics logging is enabled (equivalent to specifying yes). Specifying LOG_ENABLE_INFO_STATS=no disables it.
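
The documented ranges and defaults for the LOG_MAX_NUM_BACKUPS, LOG_MAX_SIZE, and LOG_ENABLE_INFO_STATS parameters can be checked before the file is installed. The following Python sketch is one way to do so. The function name check_log_settings is hypothetical, and accepting only lowercase yes and no is an assumption, because case handling is not described here.

```python
def check_log_settings(max_backups=1, max_size_kb=1000, info_stats="yes"):
    """Validate log settings against the documented ranges.

    The defaults mirror the documented ones: one backup log, 1000 Kbytes,
    and statistics logging enabled. Returns the values with the yes/no
    setting converted to a boolean.
    """
    if not 1 <= max_backups <= 100:
        raise ValueError("LOG_MAX_NUM_BACKUPS must be between 1 and 100")
    if not 1 <= max_size_kb <= 1000000:
        raise ValueError("LOG_MAX_SIZE must be between 1 and 1000000 Kbytes")
    if info_stats not in ("yes", "no"):
        raise ValueError("LOG_ENABLE_INFO_STATS must be yes or no")
    return max_backups, max_size_kb, info_stats == "yes"
```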



Note - All blank lines are ignored. All lines whose first non-whitespace character is a pound sign (#) are treated as comments.
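
The file format described in this section can be read with a few lines of code. The following Python sketch applies the blank-line and comment rules from the note and splits each remaining line at the first equals sign. Trimming whitespace around the name and value and rejecting lines without an equals sign are assumptions, not documented daemon behavior, and the function name parse_cpudiagd_conf is hypothetical.

```python
def parse_cpudiagd_conf(text):
    """Parse NAME=value lines, skipping blank lines and # comment lines."""
    settings = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment, ignored per the note
        name, sep, value = stripped.partition("=")
        if not sep:
            raise ValueError("expected NAME=value: %r" % (line,))
        settings[name.strip()] = value.strip()
    return settings
```

Because the line is split only at the first equals sign, a CPU_ON_FLT_EXEC command line that itself contains an equals sign is kept intact as the value.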