CHAPTER 3

Configuring and Using CDM

This chapter describes how to configure and use the CDM software, and includes the following sections:

    "Starting and Stopping CDM"
    "Error Handling"
    "Log Files"
    "Runtime Resource Consumption and Default Scheduling Intervals"


Starting and Stopping CDM

The cpudiagd daemon is started automatically during system startup by the /etc/rc2.d/S22cpudiagd startup script. This file is a hard link to the /etc/init.d/cpudiagd script.

See Appendix D for the contents of the /etc/init.d/cpudiagd script.
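
You can verify the hard-link relationship by comparing the inode numbers of the two files, for example:

# ls -li /etc/init.d/cpudiagd /etc/rc2.d/S22cpudiagd

If both entries report the same inode number, they refer to the same script.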

To start CDM manually, type the following as the root user:

# /usr/lib/sparcv9/cpudiagd

To run cpudiagd in verbose mode in the foreground, type the following as the root user:

# /usr/lib/sparcv9/cpudiagd -v



Note - The /etc/init.d/cpudiagd file used at system startup is a script, not the daemon binary (/usr/lib/sparcv9/cpudiagd).



During system startup, the /usr/lib/sparcv9/cpudiagd daemon is invoked with the -i option, which performs initial low stress CPU testing on all CPUs in parallel before the process becomes a daemon.
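
Based on the description above, the same initial test pass can also be started manually by invoking the daemon with the -i option as the root user:

# /usr/lib/sparcv9/cpudiagd -i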

To stop the cpudiagd daemon and any running cputst processes, type the following as the root user:

# /etc/init.d/cpudiagd stop

To stop only the daemon, type the following as the root user:

# pkill -x cpudiagd

To terminate only the cputst processes, type the following as the root user:

# pkill -x cputst
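
To verify that the processes have terminated, you can check for any remaining cpudiagd or cputst processes, for example:

# pgrep -lx cpudiagd
# pgrep -lx cputst

No output from these commands indicates that no matching processes remain.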


Error Handling

When cputst detects a faulty CPU, it communicates the information about the faulty CPU to the daemon, and the following sequence of operations takes place:

1. The daemon logs an error message about the detected fault using the syslog(3C) mechanism, which writes the error message to the /var/adm/messages file by default. In addition, the CDM-specific error log is updated.

2. The daemon creates a bad CPU history file (/var/cpudiag/data/bad_cpu_id.X, where X is the decimal processor ID). The daemon uses this file to recognize the faulty CPU as a suspected bad CPU across reboots until the system administrator manually deletes the file after the faulty CPU is replaced (see the example following this procedure).

3. The CDM software makes an initial attempt to take the detected faulty CPU offline. The offline attempt fails if any process is bound to the faulty CPU, and it always fails on a single-processor system.

4. If the user has provided a binary/script to be run when a fault is detected, it is invoked. This script could, for example, notify the system administrator about the fault and shut down any user applications that may be explicitly bound to the faulty CPU.

See Appendix C or the online manual pages for cpudiagd.conf(4) for more information on specifying the command line to be executed on fault detection.

5. The CDM software attempts to take the faulty CPU offline again. This reattempt is likely to succeed if the binary/script has shut down the user applications and all processes bound to the faulty CPU have been terminated.

6. If the offline reattempt fails and the faulty CPU is still online, the CDM software reboots the system if the system has multiple online processors; otherwise, the CDM software halts the system.

Emergency-category syslog messages are logged before the system is rebooted or halted. These messages are sent to all users and are displayed on all terminals, including the console. The messages include the specific cause of the problem and the ID of the processor that failed when the system was halted or rebooted.

7. On reboot, the startup script invokes the daemon with the -i option. If a bad CPU history file (/var/cpudiag/data/bad_cpu_id.X, where X is the processor ID) created in Step 2 indicates a suspected faulty CPU, the CDM software runs cputst in high stress mode to test that CPU.

If the CPU is found to be faulty, the CDM software repeats Step 2 through Step 5. If the faulty CPU still cannot be taken offline, the CDM software halts the system. This behavior prevents the system from looping indefinitely through reboots.
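
For example, after the faulty CPU has been replaced, an administrator might confirm the processor states with psrinfo and then remove the corresponding bad CPU history file so that CDM no longer treats the processor as suspect. The processor ID 3 used here is hypothetical:

# psrinfo
# rm /var/cpudiag/data/bad_cpu_id.3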


Log Files

CDM errors are logged using the syslog mechanism (see the online manual page for syslog(3C)), which writes the messages to the /var/adm/messages file by default. In addition, the errors are logged in a CDM-specific error log file, /var/cpudiag/log/error.log. Informational messages, such as test execution start, end, and elapsed-time statistics, are logged in a CDM-specific information log file, /var/cpudiag/log/info.log.
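
For example, to review the most recent entries in the CDM-specific log files, display the end of each file:

# tail /var/cpudiag/log/error.log
# tail /var/cpudiag/log/info.log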

The size to which the log files can grow and the number of additional backup logs that are kept are configurable through configuration file parameters. See Appendix C or the online manual page for cpudiagd.conf(4) for more information.


Runtime Resource Consumption and Default Scheduling Intervals

The approximate memory resources consumed by cputst and its approximate run times are listed in TABLE 3-1. The run times vary greatly depending on the hardware platform and the system load. The figures in TABLE 3-1 are approximate and were taken from a system with 900 MHz UltraSPARC III processors with no user applications running at the same time.

TABLE 3-1 Default Configuration and Resource Consumption

Stress Level    Memory    Run Time                Test Interval
Low             4 MB      80 milliseconds/CPU     30 seconds
Medium          8 MB      1.5 seconds/CPU         15 minutes
High            130 MB    80 seconds/CPU          12 hours


On machines with less than 512 Mbytes of physical memory, high stress testing is disabled by default. However, if the high stress testing interval is explicitly specified in the /etc/cpudiagd.conf file, high stress testing is invoked accordingly.

The cputst invoked by the daemon during normal operation tests all CPUs in the system sequentially. Testing at different stress levels is scheduled independently, so tests at different levels can partially overlap. Even if a very short test interval is configured, the daemon does not start a new cputst invocation for a given stress level until the previous invocation at that level has completed. For example, if the high stress testing interval is specified as 5 seconds, a new high stress cputst invocation is still delayed until the previous invocation completes.

The usage of the CPU that is under test is close to 100% during the brief time of the test run. The average CPU usage of the testing with default parameters has been measured to be less than 1% on UltraSPARC III processor-based systems.
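
If you want to observe this behavior, you can watch per-processor utilization with a standard tool such as mpstat while the daemon is running (this is an illustrative check, not part of CDM):

# mpstat 5

Brief spikes in utilization on individual processors correspond to the test runs described above, while the long-term average remains low.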