APPENDIX C

Online Manual Page for cpudiagd.conf

This appendix provides the online manual page for cpudiagd.conf(4).

File format - cpudiagd.conf(4)

Name

cpudiagd.conf - configuration file for cpudiagd(1M)

Synopsis

/etc/cpudiagd.conf

Description

The file /etc/cpudiagd.conf contains information used by the online CPU diagnostics daemon, cpudiagd(1M). The daemon periodically runs the CPU diagnostics test, cputst(1M), in the background. The daemon reads this file during startup and rereads it whenever it receives a SIGHUP signal.
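The SIGHUP behavior described above suggests a simple way to apply configuration changes without restarting the daemon. The following is a minimal sketch, not part of the original manual page: the function name reload_daemon is hypothetical, and the sketch assumes that the daemon's process name is cpudiagd and that pgrep(1) is available.

```shell
#!/bin/sh
# Hypothetical helper: ask a running daemon to reread its configuration
# by sending it SIGHUP. The default process name "cpudiagd" is an assumption.
reload_daemon() {
    name=${1:-cpudiagd}
    # pgrep -x matches the process name exactly; empty output means no match.
    pids=`pgrep -x "$name" 2>/dev/null` || pids=""
    if [ -z "$pids" ]; then
        echo "$name is not running" >&2
        return 1
    fi
    kill -HUP $pids
}
```

After editing /etc/cpudiagd.conf, calling reload_daemon with no arguments would signal the daemon; the function returns 1 if no matching process is found.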

Each parameter entry in the configuration file is of the form Name=Value. The following entries are supported in the configuration file:

The parameters CPU_TEST_FREQ_MIN_STRESS, CPU_TEST_FREQ_MED_STRESS, and CPU_TEST_FREQ_HIGH_STRESS (shown in Example 1) specify the time interval at which the CPU diagnostics test cputst(1M) is scheduled. The test can run at three stress levels: low, medium, and high. The scheduling interval for each level can be configured independently by using these three parameters, respectively.

The value for these parameters can be either the string DEFAULT or a non-negative integer followed by one of the letters s, m, or h, which specify seconds, minutes, and hours respectively. For example, 30s specifies 30 seconds, 15m specifies 15 minutes, and 12h specifies 12 hours.

If the value is specified as DEFAULT, the system schedules the tests at the system-defined default intervals. On machines with less than 512 Mbytes of total memory, the high stress test is not invoked by default. An explicitly specified interval overrides the default behavior.

If the value is specified as 0, the invocation of the test at the corresponding stress level is disabled.
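The interval syntax described above can be checked before the file is installed. The following POSIX shell function is an illustrative sketch, not the daemon's own parser: the name interval_to_seconds is hypothetical, and accepting a bare 0 without a unit letter is an assumption based on the wording of the previous paragraph.

```shell
#!/bin/sh
# Hypothetical helper: convert an interval value such as 30s, 15m, or 12h
# to seconds. DEFAULT and a bare 0 are printed unchanged.
# Returns 1 (printing nothing) for any other form.
interval_to_seconds() {
    case "$1" in
        DEFAULT) echo DEFAULT; return 0 ;;
        0)       echo 0; return 0 ;;
    esac
    num=${1%[smh]}        # digits with the unit letter removed
    unit=${1#"$num"}      # the unit letter, or empty if it is missing
    case "$num" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a non-negative integer
    esac
    # Strip leading zeros so the shell does not read the number as octal.
    while [ "${#num}" -gt 1 ] && [ "${num#0}" != "$num" ]; do
        num=${num#0}
    done
    case "$unit" in
        s) echo "$num" ;;
        m) echo $((num * 60)) ;;
        h) echo $((num * 3600)) ;;
        *) return 1 ;;
    esac
}
```

For example, interval_to_seconds 15m prints 900, and interval_to_seconds 15 fails because the unit letter is missing.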

CPU_ON_FLT_EXEC=command_line

This parameter specifies the user configurable script/binary that should be executed after detecting a faulty CPU. The command can have optional command line options specified at the end.

After a faulty CPU is detected, an initial attempt is made to offline it. The user-specified command is then invoked, regardless of whether the initial offline attempt succeeded. After the user-specified executable has run, one more offline attempt is performed if the faulty CPU has not already been offlined.

The faulty processor ID is passed to the script by setting the environment variable CPU_ID_FAILED to the decimal value of the processor ID.

The command line specified should be the absolute executable path name with optional command line arguments.

The user-provided script can be used to notify the system administrator of the failure. It can also shut down user applications so that the subsequent attempt by the cpudiagd daemon to offline the faulty CPU is more likely to succeed. Note that the primary reason for a CPU offline attempt to fail is that processes are bound to the faulty CPU.
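As an illustration of the notification use case, the following sketch shows what a minimal CPU_ON_FLT_EXEC script might look like. It is an assumption-laden example rather than a supplied script: the function name fault_message, the syslog tag cpuflt, and the daemon.crit priority are all hypothetical choices, and the message is recorded with logger(1).

```shell
#!/bin/sh
# Hypothetical CPU_ON_FLT_EXEC handler: record the faulty CPU in syslog.

# Build the notification text from a processor ID.
fault_message() {
    echo "cpudiagd detected a faulty CPU: processor ID ${1:-unknown}"
}

# cpudiagd sets CPU_ID_FAILED before invoking the script; do nothing
# if the variable is absent (for example, when the script is run by hand).
if [ -n "${CPU_ID_FAILED:-}" ]; then
    # Tag and priority are illustrative; adjust them to the local syslog setup.
    logger -p daemon.crit -t cpuflt "`fault_message "$CPU_ID_FAILED"`"
fi
```

The script would be referenced from the configuration file by its absolute path, as in Example 1.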

LOG_MAX_NUM_BACKUPS=[0-9]+

This parameter specifies the maximum number of backup logs to maintain for the CPU diagnostics monitor's information and error logs. The minimum value is 1 and the maximum value is 100. By default, only one backup log is maintained.

LOG_MAX_SIZE=[0-9]+

This parameter specifies the maximum log size in Kbytes. The minimum value is 1 and maximum value is 1000000, which means the minimum log size is 1 Kbyte and the maximum size is 1 Gbyte. The default value is 1000 which means 1 Mbyte.

LOG_ENABLE_INFO_STATS=yes/no

This parameter specifies whether the statistics about test execution should be logged in the information log or not. By default, it is enabled (which is equivalent to specifying "yes").

Specifying "LOG_ENABLE_INFO_STATS=no" will disable this feature.
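The three logging parameters can be combined in one file. The fragment below is only an illustrative sketch; the values are arbitrary choices within the documented ranges, not recommended settings.

```
# Keep up to 5 backup logs (allowed range: 1 to 100).
LOG_MAX_NUM_BACKUPS=5
# Limit each log to 2000 Kbytes (allowed range: 1 to 1000000).
LOG_MAX_SIZE=2000
# Do not record test execution statistics in the information log.
LOG_ENABLE_INFO_STATS=no
```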



Note - Blank lines in the cpudiagd.conf file are ignored. Lines whose first non-whitespace character is a pound sign (#) are treated as comments.
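To see which entries remain once comments and blank lines are discounted, a short filter can approximate the rule in the note above. This is a sketch, not the daemon's parser: the function name active_entries is hypothetical, and a POSIX-compatible awk is assumed.

```shell
#!/bin/sh
# Hypothetical helper: print only the lines that are neither blank nor
# comments. Reads the named files, or standard input if none are given.
active_entries() {
    # NF > 0 skips blank and whitespace-only lines; the second test skips
    # lines whose first non-whitespace character is '#'.
    awk 'NF > 0 && $1 !~ /^#/' "$@"
}
```

For example, active_entries /etc/cpudiagd.conf would list only the Name=Value entries in the file.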



Environment Variables

CPU_ID_FAILED

The processor ID of the detected faulty CPU. This environment variable is exported to the user script or binary that is configured to run on fault detection. See the description of the CPU_ON_FLT_EXEC parameter.

Examples

Example 1: A Sample cpudiagd.conf Configuration File

# Example configuration file start.
CPU_TEST_FREQ_MIN_STRESS=30s
CPU_TEST_FREQ_MED_STRESS=15m
CPU_TEST_FREQ_HIGH_STRESS=6h
CPU_ON_FLT_EXEC=/home/admin/bin/cpuflt.sh
# end of example configuration file.

The above configuration file specifies that cputst(1M) should be scheduled to execute at the minimum stress level once every 30 seconds, at the medium stress level once every 15 minutes, and at the high stress level once every 6 hours. This configuration file also specifies that the script /home/admin/bin/cpuflt.sh should be executed when a faulty CPU is detected.

Example 2: Another Sample cpudiagd.conf Configuration File

# Example configuration file start.
CPU_ON_FLT_EXEC=/home/admin/bin/killprocs.sh
# end of example configuration file.

Example killprocs.sh script:

#!/bin/sh
#
# Example CPU_ON_FLT_EXEC script: kill bound processes, then offline the CPU.

if [ "`psrinfo -s $CPU_ID_FAILED`" != "1" ]; then
    exit 0     # The bad CPU is not online now.
fi

# The bad CPU is still online.
# Kill all processes bound to any CPU (not only the faulty one).
# (It is not necessary to kill all bound processes, but the example works.)
BOUND_PIDS=`/usr/sbin/pbind -q | /usr/bin/tr ':' ' ' |
            /usr/bin/awk '{print $3;}'`

# Skip kill when nothing is bound; kill with no arguments is an error.
if [ -n "$BOUND_PIDS" ]; then
    kill -9 $BOUND_PIDS
fi
/usr/sbin/psradm -f $CPU_ID_FAILED  # Offline the faulty CPU now.

# script end.

The above script kills all bound processes. If unbinding is preferred to killing, the pbind(1M) command can be used to unbind the processes instead.
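A gentler variant of this approach unbinds only the processes bound to the faulty CPU. The sketch below is hypothetical and untested on a live system: the function name pids_bound_to is invented for illustration, a POSIX-compatible awk is assumed, and the pbind -q output is assumed to have the form "process id 1234: 2", the same layout that the killprocs.sh example relies on.

```shell
#!/bin/sh
# Hypothetical helper: read `pbind -q` output on standard input and print
# the PIDs bound to the processor given as $1.
# Assumes lines of the form "process id 1234: 2".
pids_bound_to() {
    tr ':' ' ' | awk -v cpu="$1" '$4 == cpu { print $3 }'
}

# Run the unbind step only when cpudiagd has supplied the faulty CPU ID.
if [ -n "${CPU_ID_FAILED:-}" ]; then
    BOUND_PIDS=`/usr/sbin/pbind -q | pids_bound_to "$CPU_ID_FAILED"`
    if [ -n "$BOUND_PIDS" ]; then
        /usr/sbin/pbind -u $BOUND_PIDS      # remove the bindings, keep the processes
    fi
    /usr/sbin/psradm -f "$CPU_ID_FAILED"    # offline the faulty CPU now
fi
```

Processes bound to other processors are left untouched, which limits the disruption to the workload that actually blocks the offline attempt.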

Example 3: Another Sample cpudiagd.conf Configuration File

# Example configuration file start.
CPU_ON_FLT_EXEC=/usr/sbin/halt -d
# end of example configuration file.

The above command halts the system immediately on fault detection (irrespective of whether the bad CPU could be offlined or not) and forces a system crash dump as specified by the -d option.

Files

cpudiagd.conf(4) uses the following files:

TABLE C-1 Files Used by cpudiagd.conf

File                             Description
/etc/cpudiagd.conf               cpudiagd configuration file
/var/cpudiag/log                 Log files directory
/var/cpudiag/log/error.log       Error log file
/var/cpudiag/log/info.log        Information log file
error.log.0, error.log.1, etc.   Backup error logs
info.log.0, info.log.1, etc.     Backup information logs


See Also

cpudiagd(1M), cputst(1M)