APPENDIX B

Online Manual Page for cpudiagd

This appendix provides the online manual page for cpudiagd(1M).

Name

cpudiagd - Online CPU Diagnostics Monitor daemon

Synopsis

/usr/lib/sparcv9/cpudiagd [ -vdi ]

Description

The Online CPU Diagnostics Monitor daemon runs in the background and periodically schedules the CPU diagnostics test cputst(1M) to monitor the CPUs in the system and provide high reliability. If a faulty CPU is detected, it is immediately taken offline if possible.

The daemon is started from a system startup script. It reads the /etc/cpudiagd.conf configuration file on startup. After updating the configuration file, users can send a SIGHUP signal to the daemon to force reconfiguration. For a description of the configuration file, see cpudiagd.conf(4).
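The reconfiguration step can be scripted. The following is a minimal sketch, not part of the product, that reads the daemon PID from /var/run/cpudiagd.pid (listed under Files) and sends SIGHUP. It assumes the PID file holds the PID as plain decimal text; the function names are hypothetical.

```python
import os
import signal

PID_FILE = "/var/run/cpudiagd.pid"  # path taken from the Files section


def read_daemon_pid(pid_file=PID_FILE):
    """Return the PID stored in pid_file as an int.

    Assumption: the file contains the PID as decimal text,
    optionally surrounded by whitespace.
    """
    with open(pid_file) as f:
        return int(f.read().strip())


def request_reconfiguration(pid_file=PID_FILE):
    """Send SIGHUP to the daemon so that it rereads /etc/cpudiagd.conf."""
    pid = read_daemon_pid(pid_file)
    os.kill(pid, signal.SIGHUP)  # requires sufficient privileges
    return pid
```

Calling request_reconfiguration() after editing /etc/cpudiagd.conf has the same effect as sending the signal by hand with kill(1).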

The daemon schedules the CPU diagnostics test cputst(1M) periodically at system-defined default intervals. The scheduling frequency can also be configured explicitly in the cpudiagd.conf file.

cputst reports any faulty CPU that it detects to the daemon. On learning of the fault, the daemon immediately attempts to take the faulty CPU offline. It then creates a bad CPU history data file (described below). Finally, it executes the user-configured CPU fault action executable, if one is specified in the cpudiagd.conf(4) configuration file.

After executing the fault action executable, the daemon again attempts to take the faulty CPU offline if it is not already offline. If the faulty CPU still cannot be taken offline, the machine is rebooted or halted as appropriate. On single-processor systems, the machine is halted if the only online CPU is found to be faulty. On multiprocessor systems, the machine is rebooted if the faulty CPU is likely to be taken offline after the next reboot. If the faulty CPU cannot be taken offline during early system startup testing (when the daemon is invoked with the -i flag), the machine is halted.

One of the primary reasons that a bad CPU cannot be taken offline is that processes are bound to it. To force a successful CPU offline in such cases, the user fault action script can kill or unbind any processes bound to the faulty CPU, if that is desired. The environment variable CPU_ID_FAILED exports the faulty processor ID to the user executable. See cpudiagd.conf(4) for more information.
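As an illustration of how a fault action executable might pick up the exported processor ID, here is a minimal sketch. It assumes CPU_ID_FAILED holds a non-negative decimal integer; the function names are hypothetical, and the actual kill or unbind policy is site-specific and left as a comment.

```python
import os
import sys


def failed_cpu_id(environ=os.environ):
    """Return the faulty processor ID from CPU_ID_FAILED, or None.

    Assumption: the daemon exports the ID as a non-negative
    decimal integer.
    """
    raw = environ.get("CPU_ID_FAILED", "").strip()
    try:
        cpu_id = int(raw)
    except ValueError:
        return None
    return cpu_id if cpu_id >= 0 else None


def main(environ=os.environ):
    """Return a process exit code for the fault action."""
    cpu_id = failed_cpu_id(environ)
    if cpu_id is None:
        print("CPU_ID_FAILED is not set or not valid", file=sys.stderr)
        return 1
    # Site-specific policy goes here, for example unbinding or killing
    # processes bound to cpu_id so that the daemon's next offline
    # attempt can succeed.
    print("fault action invoked for processor", cpu_id)
    return 0

# A real script would finish with: sys.exit(main())
```

Returning an exit code from main() rather than exiting inside it keeps the logic easy to exercise outside the daemon, for example by passing a dictionary in place of the real environment.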

Information about a detected faulty CPU is maintained in the bad CPU history data file /var/cpudiag/data/bad_cpu_id.X, where X is the processor ID of the faulty CPU. Other scripts can easily process this file name to identify the faulty processor ID. The contents of the bad CPU history file include the timestamp of the failure on that CPU.

The daemon recognizes the existence of the file and treats the CPU indicated by the file name suffix as a suspected bad component. During system startup (when invoked with the -i flag), it runs high-stress testing on suspected bad CPUs. A system monitoring tool that consumes this data file should not assume that the format of its contents is stable.
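Because the format of the file contents is not guaranteed, a monitoring script is safest relying only on the file names. The following minimal sketch lists the suspected processor IDs from the data directory. It assumes X is a decimal integer; the function name is hypothetical.

```python
import os
import re

DATA_DIR = "/var/cpudiag/data"  # path taken from the Files section

# Assumption: the suffix X is a decimal processor ID.
_BAD_CPU_FILE = re.compile(r"bad_cpu_id\.(\d+)")


def suspected_bad_cpus(data_dir=DATA_DIR):
    """Return the sorted processor IDs that have a bad CPU history file.

    Only file names are inspected; the file contents are ignored
    because their format should not be assumed to be stable.
    """
    try:
        names = os.listdir(data_dir)
    except FileNotFoundError:
        return []
    ids = []
    for name in names:
        match = _BAD_CPU_FILE.fullmatch(name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)
```

An empty list means that no history files are present, or that the data directory does not exist.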

After replacing a bad CPU, the user should manually remove this file to avoid an additional system startup delay of around 1.5 to 2 minutes caused by testing of the suspected bad CPU. If any bad CPU history data file is left over, appropriate warnings are displayed and logged when the daemon starts.

All critical errors, such as those related to the detection and offlining of faulty CPUs, are logged using syslog(3C) and in the /var/cpudiag/log/error.log file. Informational messages, such as test execution statistics, are maintained in the /var/cpudiag/log/info.log file. The maximum size of the log files and the number of additional backup logs are user-configurable through cpudiagd.conf file entries.
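A script that collects these logs needs to pick up the numbered backups as well as the live file. The following minimal sketch relies only on the backup naming shown under Files (error.log.0, error.log.1, and so on). It makes no assumption about which backup number is the most recent; the function name is hypothetical.

```python
import os
import re

LOG_DIR = "/var/cpudiag/log"  # path taken from the Files section


def log_with_backups(base_name, log_dir=LOG_DIR):
    """Return paths of base_name and its numbered backups that exist.

    The live log comes first, followed by the backups in ascending
    numeric order of their suffix.
    """
    pattern = re.compile(re.escape(base_name) + r"\.(\d+)")
    try:
        names = os.listdir(log_dir)
    except FileNotFoundError:
        return []
    backups = []
    for name in names:
        match = pattern.fullmatch(name)
        if match:
            backups.append((int(match.group(1)), name))
    ordered = [base_name] if base_name in names else []
    ordered += [name for _, name in sorted(backups)]
    return [os.path.join(log_dir, name) for name in ordered]
```

For example, log_with_backups("error.log") returns the error log followed by whatever numbered backups are present.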

During a dynamic reconfiguration operation that removes a CPU board, cpudiagd temporarily suspends scheduling of CPU testing until the operation is complete. cpudiagd is notified of CPU board dynamic reconfiguration events by the rcmscript(4) plug-in script /usr/platform/sun4u/lib/rcm/scripts/SUNW,cpudiagd.

Options

The following options are supported:

TABLE B-1 cpudiagd Options

Option  Description
-v      Prints verbose messages to stdout. Also implies the -d option.
-d      Runs in non-daemon mode; used for debugging.
-i      Invoked during early system startup. Performs minimum initial testing before becoming a daemon. Use of this option is discouraged unless the daemon is started from an rc script during the early system startup sequence.


Exit Status

cpudiagd exits with 0 on success and exits with 1 on failure.

Environment Variables

CPU_ID_FAILED

The processor ID of the detected faulty CPU. This environment variable is exported to the user script or binary that is configured to run on fault detection.

Files

cpudiagd uses the following files:

TABLE B-2 Files Used by cpudiagd

File                                                              Description
/etc/cpudiagd.conf                                                User configuration file
/var/cpudiag/log                                                  Log files directory
/var/cpudiag/log/error.log                                        Log file containing error messages
/var/cpudiag/log/info.log                                         Log file containing informational messages
/var/cpudiag/log/error.log.0, /var/cpudiag/log/error.log.1, etc.  Backup error logs
/var/cpudiag/log/info.log.0, /var/cpudiag/log/info.log.1, etc.    Backup information logs
/var/cpudiag/data                                                 Data files directory
/var/cpudiag/data/bad_cpu_id.X                                    Bad CPU history data file, where X = processor ID of the faulty CPU
/etc/init.d/cpudiagd                                              Start/stop script
/var/run/cpudiagd_door                                            Door file used for communication with cputst(1M)
/var/run/cpudiagd.pid                                             Contains PID of the daemon
/usr/platform/sun4u/lib/rcm/scripts/SUNW,cpudiagd                 rcmscript(4) plug-in script for cpudiagd


Attributes

See attributes(5) for descriptions of the following attributes:

TABLE B-3 Attributes for cpudiagd

Attribute            Description
Availability         SUNWcdiax
Interface Stability  Evolving


See Also

cputst(1M), cpudiagd.conf(4)

Notes

Some SPARC desktop models satisfy the United States Environmental Protection Agency's ENERGY STAR(R) Memorandum of Understanding #3 guidelines. On these systems, power management is enabled by default, and online CPU testing is done only when the CPUs are not in power save mode. However, if the /etc/cpudiagd.conf file is explicitly modified to specify test intervals, testing is done exactly as configured. See power.conf(4) for more information about power management.