C H A P T E R  3

Configuring and Using CDM

This chapter describes how to configure and use the CDM software. It includes the following sections:

- Starting and Stopping CDM
- Error Handling
- Log Files
- Runtime Resource Consumption and Default Scheduling Intervals
- Scheduling Tests at Specific Times
- Customizing Tests on Demand at Arbitrary Check Points


Starting and Stopping CDM

The cpudiagd daemon is started automatically during system startup by the /etc/rc2.d/S22cpudiagd startup script. This file is a hard link to the /etc/init.d/cpudiagd script.

See Appendix D for the contents of the /etc/init.d/cpudiagd script.

To start CDM manually, type the following as the root user:

# /usr/lib/sparcv9/cpudiagd

To run cpudiagd in verbose mode in the foreground, type the following as the root user:

# /usr/lib/sparcv9/cpudiagd -v



Note - The /etc/init.d/cpudiagd file is the system startup script, not the daemon binary.



During system startup, the /usr/lib/sparcv9/cpudiagd daemon is invoked with the -i option, which performs initial low stress CPU testing on all CPUs in parallel before the process becomes a daemon.
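Based on this description, the startup invocation is equivalent to the following command (a sketch only; the actual invocation appears in the /etc/rc2.d/S22cpudiagd script, reproduced in Appendix D):

# /usr/lib/sparcv9/cpudiagd -i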

To stop the cpudiagd daemon and any cputst processes, type the following as the root user:

# /etc/init.d/cpudiagd stop

To stop only the daemon, type the following as the root user:

# pkill -x cpudiagd

To terminate only the cputst processes, type the following as the root user:

# pkill -x cputst


Error Handling

When cputst detects a faulty CPU, it communicates information about the faulty CPU to the daemon. The following sequence of operations then takes place:

1. The daemon logs an error message about the detected fault using the syslog(3C) mechanism, which writes the error message to the /var/adm/messages file by default. In addition, the CDM-specific error log is updated.

2. The daemon creates a bad CPU history file, /var/cpudiag/data/bad_cpu_id.X (where X is the decimal processor ID). The daemon uses this file to recognize the faulty CPU as a suspected bad CPU across reboots, until the system administrator manually deletes the file after the faulty CPU is replaced (see the cleanup example after this list).

3. The CDM software makes an initial attempt to take the detected faulty CPU offline. The attempt fails if any process is bound to the faulty CPU, and it always fails on a single-processor system.

4. If the user has provided a binary or script to be run when a fault is detected, it is invoked. This script could, for example, notify the system administrator about the fault and shut down any user applications that are explicitly bound to the faulty CPU (see the sketch following this list).

See Appendix C or the online manual pages for cpudiagd.conf(4) for more information on specifying the command line to be executed on fault detection.

5. The CDM software attempts to take the faulty CPU offline again. This reattempt is likely to succeed if the binary or script has shut down the user applications and all processes bound to the faulty CPU have terminated.

6. If the offline reattempt fails and the faulty CPU is still online, the CDM software reboots the system if the system has multiple online processors; otherwise, the CDM software halts the system.

Emergency category syslog messages are logged before the system is rebooted or halted. The messages are sent to all users and are displayed on all terminals, including the console. The messages identify the specific cause of the problem and the processor ID of the CPU that failed when the system was halted or rebooted.

7. On reboot, the CDM software invokes the daemon with the -i option from the startup script. If a bad CPU history file (/var/cpudiag/data/bad_cpu_id.X, where X is the processor ID) created in Step 2 indicates a suspected faulty CPU, the CDM software runs cputst in high stress mode to test that CPU.

If the CPU is found faulty, the CDM software performs Step 2 through Step 5. If the faulty CPU still cannot be taken offline, the CDM software halts the system. This step prevents potential indefinite looping on reboot.
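The following is a minimal sketch of a fault-notification script of the kind described in Step 4. It is illustrative only: the mail recipient, the use of mailx for notification, and the assumption that cpudiagd passes no arguments to the configured command are all assumptions here; the actual interface is described in Appendix C and the cpudiagd.conf(4) man page:

#!/bin/sh
#
# Hypothetical fault-notification script invoked by cpudiagd on
# fault detection (illustrative only).
#
ADMIN=root        # assumed recipient of the notification mail
(date; echo "CDM detected a faulty CPU on `uname -n`") | \
    mailx -s "CDM fault detected" $ADMIN
# Shut down any user applications bound to the faulty CPU here,
# so that the offline reattempt in Step 5 can succeed.
# End of script

After a faulty CPU has been replaced, remove its bad CPU history file (Step 2) so that the CDM software no longer treats the CPU as suspect. For example, for processor ID 3:

# rm /var/cpudiag/data/bad_cpu_id.3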


Log Files

CDM errors are logged using the syslog mechanism (see the syslog(3C) online manual page), which writes the messages to the /var/adm/messages file by default. In addition, the errors are logged in a CDM-specific error log file, /var/cpudiag/log/error.log. Informational messages, such as test execution start, end, and elapsed time statistics, are logged in a CDM-specific information log file, /var/cpudiag/log/info.log.

The maximum growth size of the log files and the number of additional backup logs are configurable through configuration file parameters. See Appendix C or the cpudiagd.conf(4) online manual page for more information.
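For illustration, such entries would take the same KEY=value form used elsewhere in /etc/cpudiagd.conf. The parameter names below are hypothetical placeholders, not confirmed names; consult cpudiagd.conf(4) for the actual names and units:

LOG_FILE_MAX_SIZE=100      # hypothetical parameter name and value
LOG_FILE_BACKUP_COUNT=2    # hypothetical parameter name and value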


Runtime Resource Consumption and Default Scheduling Intervals

CPU testing (cputst) can be invoked at three levels of stress (low, med, or high). The operations performed in all three modes are the same; the difference between low, med, and high stress is how many times an operation or set of operations is performed.

cputst performs complex mathematical operations, such as operations with matrices. The stress level is also reflected in the size of the matrices with which cputst operates.

Low stress tests provide functional coverage of the floating-point unit and the caches in the CPU. High stress tests simulate a floating-point intensive application.

Errors are more likely to be caught in high stress mode than in low stress mode. However, the run time increases with the stress level.

The approximate memory resources consumed by cputst and the approximate run times are listed in TABLE 3-1. The run times vary greatly depending on the hardware platform and the system load. The figures in TABLE 3-1 are approximate and were taken from a system with 900 MHz UltraSPARC III family processors and no user applications running at the same time.

TABLE 3-1   Default Configuration and Resource Consumption

Stress Level    Memory    Run Time                Test Interval
------------    ------    --------                -------------
Low             4 MB      80 milliseconds/CPU     30 seconds
Medium          8 MB      1.5 seconds/CPU         15 minutes
High            130 MB    80 seconds/CPU          12 hours


On machines with less than 512 Mbytes of physical memory, high stress testing is disabled by default. However, if the high stress test interval is explicitly specified in the /etc/cpudiagd.conf file, high stress testing is invoked accordingly.
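For example, an entry of the following form in /etc/cpudiagd.conf enables high stress testing explicitly on such a machine. The parameter name appears later in this chapter; the value shown is illustrative only, and its units and valid range are defined in cpudiagd.conf(4):

CPU_TEST_FREQ_HIGH_STRESS=43200    # illustrative value; see cpudiagd.conf(4) for units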

The cputst invoked by the daemon during normal operation tests all CPUs in the system sequentially. Testing at different stress levels is scheduled independently, so tests at different levels can partially overlap. Even if a very short test interval is configured, the daemon does not start a new cputst invocation until the previous invocation at the same stress level has completed. For example, if the high stress test interval is specified as 5 seconds, a new high stress cputst invocation is still delayed until the previous invocation has completed.

The usage of the CPU under test is close to 100% during the brief time of the test run. The average CPU usage of testing with the default parameters has been measured at less than 1% on systems based on the UltraSPARC III family of processors.



Note - Some SPARC desktop models satisfy the United States Environmental Protection Agency's ENERGY STAR® Memorandum of Understanding #3 guidelines. On these systems, power management is enabled by default, and online CPU testing is done only when the CPUs are not in power save mode. However, if the /etc/cpudiagd.conf file is explicitly modified to specify test intervals, the testing is done exactly as configured. See the power.conf(4) man page for more information about power management.



Scheduling Tests at Specific Times

In most cases, changing the /etc/cpudiagd.conf file is sufficient for changing scheduling parameters. However, to schedule testing only during specific times of the day, disable the testing in the corresponding cpudiagd.conf entry and make a crontab(1) entry to invoke the test.

For example, to invoke the cputst program in high stress mode at 1:00 AM daily, make the following entry using crontab(1):

0  1 * * *  /usr/platform/sun4u/sbin/sparcv9+vis2/cputst -s 3 -n



Note - Since the cputst program is invoked with the -n option, any detected CPU failure is automatically reported to the cpudiagd daemon, which takes the appropriate action.



To disable high stress testing invoked by the daemon, add the following entry to the /etc/cpudiagd.conf file:

CPU_TEST_FREQ_HIGH_STRESS=0 

See the cpudiagd.conf(4) man page for more details.

Customizing Tests on Demand at Arbitrary Check Points

In some customized environments, performing quick testing on demand can be desirable. Such quick testing can be done with an administrative script that runs tests on all CPUs in parallel.

See the following examples:

Example 1:

In addition to the active CDM running periodic testing in the background, the following script can be executed by an application at specific checkpoints to verify the reliability of the CPUs at that instant:

#!/bin/sh
#
#   Example 1: Perform medium stress test in parallel on all CPUs.
#
# List the IDs of all online CPUs, then start a medium stress test
# (-s 2) bound to each CPU (-d) in the background; -n notifies the
# cpudiagd daemon of any detected fault.
for cpu in `/usr/sbin/psrinfo | grep on-line | cut -f1`
do
   /usr/platform/sun4u/sbin/sparcv9+vis2/cputst -s 2 -d $cpu -n &
done
wait
#  End of the script



Note - Since cputst is invoked with the -n option, the daemon is notified of any fault.





Note - The test is invoked in parallel on all CPUs in the background. By default, when the -d option is not used, the testing is done sequentially over one CPU at a time, which can take a significantly longer time.





Note - The cpudiagd daemon is capable of handling multiple simultaneous fault notifications from different instances of the cputst processes. Hence, even if multiple cputst processes are running at the same time (some invoked by cpudiagd and some from another script), they are handled properly by the cpudiagd daemon.



Example 2:

Completely disable cputst invocations from the cpudiagd daemon and let other administrative scripts perform testing on demand at different points in time.

To disable test invocations, specify an interval of 0 in /etc/cpudiagd.conf for the low, medium, and high stress test frequencies, as sketched below. (See the cpudiagd.conf(4) man page.)
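For illustration, the entries would look like the following. CPU_TEST_FREQ_HIGH_STRESS is shown earlier in this chapter; the low and medium stress names are assumed to follow the same pattern and should be verified against cpudiagd.conf(4):

CPU_TEST_FREQ_LOW_STRESS=0     # assumed parameter name; verify in cpudiagd.conf(4)
CPU_TEST_FREQ_MED_STRESS=0     # assumed parameter name; verify in cpudiagd.conf(4)
CPU_TEST_FREQ_HIGH_STRESS=0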

Then develop a custom script to do the testing, similar to the script in Example 1.

Example 3:

Completely disable running CDM; perform testing and fault management with administrative scripts.

The /etc/rc2.d/S22cpudiagd script can be modified to prevent the /usr/lib/sparcv9/cpudiagd daemon from starting. This completely disables CDM.
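One conventional way to do this without editing the script is to rename it so that the run-control mechanism skips it, because init executes only rc scripts whose names begin with an uppercase S or K. For example:

# mv /etc/rc2.d/S22cpudiagd /etc/rc2.d/s22cpudiagd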

A script similar to the following can be used for testing and fault management without making use of the cpudiagd daemon for fault management:

#!/bin/sh
#
# Example 3: Test all online CPUs in parallel and log any faults,
# performing fault management without the cpudiagd daemon.
#
CPUTST=/usr/platform/sun4u/sbin/sparcv9+vis2/cputst
for cpu in `/usr/sbin/psrinfo | grep on-line | cut -f1`
do
  (  $CPUTST -s 2 -d $cpu >> /var/adm/cpuerrs.log 2>&1
     # Exit status 1 from cputst indicates a CPU failure.
     if [ $? -eq 1 ]; then
       (date; echo "Error: CPU $cpu is found faulty") >> /var/adm/cpuerrs.log
        # Include application specific fault management actions here.
        # Do: "psradm -f $cpu" to take the faulty cpu offline, etc, if desired
     fi
  ) &
done
wait
# End of script



Note - cputst is invoked without the -n option, so the cpudiagd daemon does not need to be running and is not notified of CPU failures. When the -n option is not specified, the /var/cpudiag/data/bad_cpu_id.X file (where X = processor_id) is not created after a fault detection, because only the daemon creates that file.





Note - A test exit status of 1 is used to detect a CPU failure.



Example 4:

Perform medium stress testing from a script continuously, with five-second intervals between test completion and the next invocation. Do the testing sequentially, on one CPU at a time, to minimize the performance impact. Also, disable test invocations by the CDM daemon.

Disable test invocation by the daemon by modifying /etc/cpudiagd.conf to set the corresponding test intervals to 0.

A script similar to the following can be used:

#!/bin/sh
#
# Example 4: Continuously run medium stress tests (-s 2) sequentially
# over all CPUs, sleeping five seconds between invocations; -n
# notifies the cpudiagd daemon of any detected fault.
#
while true
do
  /usr/platform/sun4u/sbin/sparcv9+vis2/cputst -s 2 -n >> /var/adm/cpuerrs.log 2>&1
  sleep 5
done
# End of script



Note - cputst is invoked with the -n option, which requires the cpudiagd daemon to be running.





Note - Since no -d option is specified, the testing is done sequentially over all CPUs, one CPU at a time.