APPENDIX B
Online Manual Page for cpudiagd
This appendix provides the online manual page for cpudiagd(1M).
cpudiagd - Online CPU Diagnostics Monitor daemon
/usr/lib/sparcv9/cpudiagd [ -vdi ]
The Online CPU Diagnostics Monitor daemon runs in the background and periodically schedules the CPU diagnostics test cputst(1M) to monitor the CPUs in the system and provide high reliability. If a faulty CPU is detected, it is taken offline immediately if possible.
The daemon is started from a system startup script. It reads the /etc/cpudiagd.conf configuration file on startup. Users can send a SIGHUP signal to the daemon to force reconfiguration after updating the configuration file. For the description of the configuration file, see cpudiagd.conf(4).
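The reconfiguration step can be sketched as follows. This is a minimal illustration, not part of the cpudiagd distribution: the function name is hypothetical, and the caller must supply the daemon's process ID, because this manual page does not say how to obtain it.

```python
import os
import signal


def request_reconfiguration(pid: int) -> None:
    """Send SIGHUP to the process with the given PID.

    For cpudiagd, SIGHUP forces the daemon to reread its configuration
    file. The PID is supplied by the caller (assumption: this page does
    not document how to locate the daemon's PID).
    """
    os.kill(pid, signal.SIGHUP)
```

Sending the signal normally requires the same privileges as the daemon itself, so a sketch like this would typically be run as root after editing /etc/cpudiagd.conf.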
The daemon schedules the CPU diagnostics test cputst(1M) periodically at system-defined default intervals. The scheduling frequency can also be configured explicitly in the cpudiagd.conf file.
cputst reports any faulty CPU it detects to the daemon. On receiving such a report, the daemon immediately attempts to take the faulty CPU offline. It then creates a bad CPU history data file (described below), and finally runs the user-configured CPU fault action executable, if one is specified in the cpudiagd.conf(4) configuration file.
After running the fault action executable, the daemon again attempts to take the faulty CPU offline if it is not already offline. If the faulty CPU still cannot be taken offline, the machine is rebooted or halted as appropriate. On single-processor systems, the machine is halted if the only online CPU is found faulty. On multiprocessor systems, the machine is rebooted if the faulty CPU is likely to be taken offline after the next reboot. If the faulty CPU cannot be taken offline during early system startup testing (when the daemon is invoked with the -i flag), the machine is halted.
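The reboot-or-halt rules above can be summarized as a small decision function. This is only an illustrative sketch of the documented behavior, not the daemon's actual code. The function and parameter names are hypothetical, and the final fallback (halting a multiprocessor system when a reboot is not expected to help) is an assumption, since the text does not state that case explicitly.

```python
def action_when_offline_fails(online_cpus: int,
                              early_startup: bool,
                              offline_likely_after_reboot: bool) -> str:
    """Return "halt" or "reboot" for a faulty CPU that cannot be offlined.

    online_cpus: number of CPUs currently online, including the faulty one.
    early_startup: True when the daemon was invoked with the -i flag.
    offline_likely_after_reboot: True when the faulty CPU is expected to be
        taken offline successfully after the next reboot.
    """
    if early_startup:
        # Early system startup testing: the machine is halted.
        return "halt"
    if online_cpus == 1:
        # The only online CPU is faulty: the machine is halted.
        return "halt"
    if offline_likely_after_reboot:
        # Multiprocessor system where a reboot should allow the offline.
        return "reboot"
    # Assumption: the page does not describe this case; halting is the
    # conservative choice used in this sketch.
    return "halt"
```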
One of the primary reasons a bad CPU cannot be taken offline is that processes are bound to it. To force the offline to succeed in such cases, the user fault action script can kill or unbind any processes bound to the faulty CPU, if desired. The daemon exports the faulty processor ID to the user executable in the environment variable CPU_ID_FAILED. See cpudiagd.conf(4) for more information.
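A fault action executable might begin by reading CPU_ID_FAILED, as in the sketch below. The function names are hypothetical, and the sketch assumes the variable holds a non-negative decimal integer, which this page does not state. The step that actually kills or unbinds processes is site-specific and is left as a comment.

```python
import os
import sys


def failed_cpu_id(environ=None):
    """Return the faulty processor ID from CPU_ID_FAILED, or None.

    Assumption: the value is a non-negative decimal integer.
    """
    if environ is None:
        environ = os.environ
    raw = environ.get("CPU_ID_FAILED")
    if raw is None:
        return None
    try:
        cpu_id = int(raw)
    except ValueError:
        return None
    return cpu_id if cpu_id >= 0 else None


def run_fault_action(environ=None):
    """Entry point for a fault action; returns a process exit status."""
    cpu_id = failed_cpu_id(environ)
    if cpu_id is None:
        print("CPU_ID_FAILED is missing or malformed", file=sys.stderr)
        return 1
    print(f"fault action invoked for CPU {cpu_id}")
    # Site-specific step: kill or unbind processes bound to cpu_id here.
    return 0
```

A script built on this sketch would call `sys.exit(run_fault_action())` as its last statement.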
Information about each detected faulty CPU is kept in a bad CPU history data file, /var/cpudiag/data/bad_cpu_id.X, where X is the processor ID of the faulty CPU. Other scripts can use the file name to identify the faulty processor ID. The contents of the bad CPU history file include the timestamp of the failure on that CPU.
The daemon recognizes the existence of the file and treats the CPU indicated by the file name suffix as a suspected bad component. During system startup (when invoked with the -i flag), it runs high-stress testing on suspected bad CPUs. System monitoring tools that consume this data file should not assume that the format of its contents is stable.
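A monitoring script can recover the suspected processor IDs from the file names alone, which avoids depending on the unstable file contents. The sketch below is one way to do this. The function name is hypothetical, and the pattern assumes the suffix X is a plain decimal number.

```python
import re
from pathlib import Path

# Matches names such as "bad_cpu_id.3" and captures the processor ID.
_BAD_CPU_FILE = re.compile(r"bad_cpu_id\.(\d+)")


def suspected_bad_cpus(data_dir="/var/cpudiag/data"):
    """Return the sorted processor IDs that have a bad CPU history file.

    Only file names are inspected; file contents are never parsed.
    Returns an empty list if the directory does not exist.
    """
    directory = Path(data_dir)
    if not directory.is_dir():
        return []
    cpu_ids = []
    for entry in directory.iterdir():
        match = _BAD_CPU_FILE.fullmatch(entry.name)
        if match:
            cpu_ids.append(int(match.group(1)))
    return sorted(cpu_ids)
```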
After replacing a bad CPU, the user should manually remove this file. Otherwise, testing of the suspected bad CPU adds a delay of roughly 1.5 to 2 minutes to each system startup. If any bad CPU history data file is left over, appropriate warnings are displayed and logged when the daemon starts.
All critical errors, such as those related to detecting faulty CPUs and taking them offline, are logged using syslog(3C) and also written to the /var/cpudiag/log/error.log file. Informational messages, such as test execution statistics, are written to the /var/cpudiag/log/info.log file. The maximum size of the log files and the number of additional backup logs are user-configurable through cpudiagd.conf file entries.
The following options are supported:
cpudiagd exits with 0 on success and exits with 1 on failure.
CPU_ID_FAILED: The processor ID of the detected faulty CPU. This environment variable is exported to the user script or binary that is configured to run on fault detection.
cpudiagd uses the following files:
See attributes(5) for descriptions of the following attributes:
Copyright © 2003, Sun Microsystems, Inc. All rights reserved.