CHAPTER 7

Maintenance and Troubleshooting

This chapter describes procedures you can use for preventive maintenance and troubleshooting. The topics covered are: Cleaning Up Defunct Sun CRE Jobs, Cleaning Up After RSM Failures, Using Diagnostics, Interpreting Sun CRE Error Messages, Anticipating Common Problems, Understanding Protocol-Related Errors, Recovering From System Failure, and Configuring Out Network Controllers.


Cleaning Up Defunct Sun CRE Jobs

One beneficial preventive maintenance practice is the routine cleanup of defunct jobs. There are several types of such jobs, described in the following subsections.

Removing Sun CRE Jobs That Have Exited

When a job does not exit cleanly, it is possible for all of a job's processes to have reached a final state, but the job object itself to not be removed from the Sun CRE database. The following are two indicators of such incompletely exited jobs:

If you see a job in one of these defunct states, perform the following steps to clear the job from the Sun CRE database:

1. Execute mpps -e again in case Sun CRE has had time to update the database (and remove the job).
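For example:

% mpps -e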

2. If the job is still running, kill it, specifying its job ID.

% mpkill jid

If mpps continues to report the killed job, use the -C option to mpkill to remove the job object from the Sun CRE database. This must be done as superuser from the master node.

# mpkill -C jid

Removing Sun CRE Jobs That Have Not Terminated

The second type of defunct job includes jobs that are waiting for signals from processes on nodes that have gone off line. The mpps utility displays such jobs in states such as RUNNING, EXITING, SEXTNG, or CORNG.



Note - If the job-killing option of tm.watchd (-Yk) is enabled, Sun CRE handles such situations automatically. This section assumes this option is not enabled.



Kill the job using:

% mpkill jid

The mpkill command accepts several signal options, similar to those of the Solaris kill command. You can also use:

% mpkill -9 jid 

or

% mpkill -I jid

If these do not succeed, execute mpps -pe to display the unresponsive processes. Then, execute the Solaris ps command on each of the nodes listed. If those processes still exist on any of the nodes, you can remove them using kill -9 pid.
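For example, on each node that mpps -pe lists, a minimal check might look like the following. This is a sketch only; it assumes, for illustration, that the job's executable is named a.out and that ps reports process ID 12345:

% ps -ef | grep a.out
% kill -9 12345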

Once you have eliminated defunct jobs, data about the jobs may remain in the Sun CRE database. As superuser from the master node, use mpkill -C to remove this residual data.

Killing Orphaned Processes

When the tm.watchd -Yk option has been enabled, the watch daemon marks processes ORPHAN if they run on nodes that have gone off line. If the node resumes communication with the Sun CRE daemons, the watch daemon will kill the ORPHAN processes. If not, you will have to kill the processes manually using the Solaris kill command. Otherwise, such processes will continue to consume resources.

You can detect symptoms of orphaned processes by examining error log files or, if you are running from a terminal, stdout. You can also search for errors such as RPC: cannot connect or RPC: timeout. These errors appear under the user.err priority in syslog.
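For example, assuming the default Solaris syslog configuration routes user.err messages to /var/adm/messages (check /etc/syslog.conf for the actual destination on your system), you can search for them with:

% grep "RPC:" /var/adm/messages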



Note - If an mprun process becomes unresponsive on a system, even where tm.watchd -Yk has been enabled, it may be necessary to use Ctrl-c to kill mprun.




Cleaning Up After RSM Failures

The daemon hpc_rsmd, which is started when the cluster is booted, manages access to remote shared memory services on behalf of MPI processes. If an instance of hpc_rsmd exits abnormally, you can use the following script to clean up any files and System V shared memory segments that it leaves behind.

# /etc/init.d/sunhpc.hpc_rsmd  [ start | stop | clean ]
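For example, to clean up after an instance of hpc_rsmd has exited abnormally:

# /etc/init.d/sunhpc.hpc_rsmd clean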


Using Diagnostics

The following sections describe Solaris diagnostics that may be useful in troubleshooting various types of error conditions.

Using Network Diagnostics

You can use /usr/sbin/ping to check whether you can connect to the network interface on another node. For example:

% ping hpc-node3

tests (over the default network) the connection to hpc-node3.

You can use /usr/sbin/spray to determine whether a node can handle significant network traffic. spray indicates the amount of dropped traffic. For example:

% spray -c 100 hpc-node3 

sends 100 small packets to hpc-node3.

Checking Load Averages

You can use mpinfo -N or, if Sun CRE is not running, /usr/bin/uptime, to determine load averages. These averages can help to determine the current load on the machine and how quickly it reached that load level.
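For example:

% mpinfo -N

or, if Sun CRE is not running:

% /usr/bin/uptime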

Using Interval Diagnostics

The diagnostic programs described below check the status of various parameters. Each accepts a numerical argument that specifies the time interval, in seconds, between status checks. If you omit this interval, the diagnostics report an average value for the respective parameter since boot time; specify the interval at the end of the command to get current information.

Use /usr/bin/netstat to check local system network traffic. For example:

% netstat -ni 3 

checks and reports traffic every three seconds.

Use /usr/bin/iostat to display disk and system usage. For example:

% iostat -c 2 

displays percentage utilizations every two seconds.

Use /usr/bin/vmstat to generate additional information about the virtual memory system. For example:

% vmstat -S 5

reports on swapping activity every five seconds.

It can be useful to run these diagnostics periodically, monitoring their output for multiple intervals.


Interpreting Sun CRE Error Messages

This section presents sample error messages and their interpretations.

No nodes in partition satisfy RRS:

Aug 27 11:02:30 ops2a tm.rdb[462]: Cond_set: unable to connect to ops2a/45126: connect: Connection refused

If these errors can be correlated to jobs being killed, then they can be safely ignored. One way to check this correlation would be to look at the accounting logs for jobs that were signaled during this time.

mprun: unique partition: No such object
Query returned excess results:
a.out: (TMTL UL) TMRTE_Abort: Not yet initialized
The attempt to kill your program failed

This might happen, for example, when mpps shows running processes that are actually no longer running. Use the mpkill -C nn command to clear out such stale jobs.


Note - Before removing the job's information from the database, the mpkill -C option verifies that the processes of the job are in fact no longer running.




Anticipating Common Problems

This section presents some guidelines for preventing and troubleshooting common problems.



Note - If you have set the Cluster-level attribute logfile, all error messages generated by user code will be handled by Sun CRE (not syslog) and will be logged in a file specified by an argument to logfile.



Sun CRE RPC timeouts in user code are generally not recoverable. The job might continue to run, but processes probably will not be able to communicate with each other. There are two ways to deal with this:

The Sun MPI shared memory protocol module uses files in /tmp for interprocess communication between processes on the same node. These files consume swap space. They have names of the form:

/tmp/.hpcshm_mmap.jid.*

Smaller files have file names of the form:

/tmp/.hpcshm_acf.jid.*
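If a job has exited but its backing files remain, you can reclaim the swap space by removing the files manually. The following sketch assumes, for illustration, that the defunct job's ID is 123; first confirm with mpps that the job is no longer running, then remove its files:

% mpps -e
# rm /tmp/.hpcshm_*.123.*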


Understanding Protocol-Related Errors

Errors may occur at cluster startup or at program initialization because of problems finding or loading protocol modules. Such errors are not fatal to the runtime environment (that is, to the Sun CRE daemons), but they do mean that the protocol in question is not available for communication on the cluster.

This section describes some error conditions that may occur in relation to protocol modules (PMs) and the RSM daemon hpc_rsmd.

Errors When Sun CRE Daemons Load Protocol Modules

The errors below are generated when the Sun CRE daemons first start up. These errors can occur because of problems in the hpc.conf file, or because of problems loading the PMs.

All these errors are considered nonfatal to the daemon, but the PM that causes the error will not be usable. The errors below cause the Sun CRE daemons to generate calls to syslog that result in self-explanatory error messages.

The daemons generate a warning when there are duplicate PM entries in the PMODULES section of hpc.conf. If there are multiple PM entries with the same name, only the first one is loaded.
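For example, a PMODULES section like the following would trigger the warning, and only the first rsm entry would be loaded. This is a hypothetical sketch; the exact column layout of the PMODULES section may differ on your system, but it follows the Begin/End style used by the PM=rsm section shown later in this chapter:

# NAME
Begin PMODULES
shm
tcp
rsm
rsm
End PMODULES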

Errors When Protocol Modules Discover Interfaces

The errors below are generated at program startup when a PM attempts interface discovery.

These errors are nonfatal to the Sun CRE daemons, but they may mean that the PM causing the error will not be usable. Appropriate error messages are written to syslog.

-WARNING- Problem detected initializing tcp PM: PM=tcp entry hme is missing tokens
-WARNING- Problem detected initializing tcp PM: Interface hme0 has missing or broken entry in hpc.conf. Will use: Rank=1000,stripe=0, mtu=1500 latency=-1, bandwidth=-1
-WARNING- Problem detected initializing tcp PM: PM=tcp entry hme has extra tokens

Errors When the RSM Protocol Module Reads MPI Options

During the startup of an MPI job, the RSM PM interprets values of options in the MPIOptions section of hpc.conf.

If the MPIOptions section does not exist, the RSM PM uses the default values. The PM ignores any option names it does not recognize. If an option is given a value that is either malformed (for instance, a character instead of a number) or out of bounds, the PM uses a default value and causes a message like the following to be generated:

-WARNING- ignoring rsm_shortmsgsize=10 in hpc.conf, using default value of 384 bytes. Please contact system administrator.
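To resolve such a warning, correct the option's value in the MPIOptions section of hpc.conf. The following sketch assumes that the MPIOptions section uses the same Begin/End layout as the other hpc.conf sections shown in this chapter, and it uses the 384-byte default named in the warning above:

Begin MPIOptions
rsm_shortmsgsize   384
End MPIOptions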

The rsm_links option has other error messages associated with it, indicating an invalid instance number for an interface type or an invalid interface name. In both cases, the invalid item is ignored. The message generated is similar to the following:

-WARNING- ignoring rsm_links entry "hpc-comm0wrsm0,1" in hpc.conf... 

Action of the RSM Daemon

The RSM daemon hpc_rsmd writes messages to syslog upon detection of unusual events. It writes three different levels of syslog messages: error, warning, and information.

Error events that result in the termination of a running MPI job are recorded in the syslog at the error level. Error events that do not result in the termination of a running MPI job are recorded at the warning level. All other syslog messages are written at the information level. You can configure syslogd manually to enable or limit the output of these levels.
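For example, to capture messages at the information level and above in a separate file, you could add an entry like the following to /etc/syslog.conf and then signal syslogd to reread its configuration. This is a sketch only: the destination file name is arbitrary, the facility under which hpc_rsmd logs may differ on your system, and the selector and file name must be separated by a tab:

daemon.info     /var/adm/hpc_rsmd.log

# pkill -HUP syslogd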

During hpc_rsmd initialization, configuration information is written to syslog at the information level. An error causes a message to be written at the error level, and hpc_rsmd exits.

After initialization, the hpc_rsmd will never spontaneously exit. If the hpc_rsmd detects a recoverable error in a request while it is being sent, it will retransmit the request. If the hpc_rsmd detects a non-recoverable error while a request is being sent, it will abort the request and return an error to the MPI process that made the request. This, in turn, will cause the MPI job to abort. However, the hpc_rsmd path monitor will periodically ping all paths it believes are connected and, upon failure of a ping, will mark the path not available for use.

The RSM PM does not detect error signals itself. If the RSM PM is initialized when the RSM daemon hpc_rsmd is not running, the RSM PM prints an appropriate error message and aborts the MPI job.


Recovering From System Failure

Recovering from system failure involves rebooting Sun CRE and recreating the Sun CRE resource database.

The sunhpc.cre_master reboot and sunhpc.cre_node reboot commands should be used only as a last resort, if the system is not responding (for example, if programs such as mprun, mpinfo, or mpps hang).


To Reboot Sun CRE:

1. Run sunhpc.cre_master reboot on the master node:

# /etc/init.d/sunhpc.cre_master reboot

2. Run sunhpc.cre_node reboot on all the nodes (including the master node if it is running tm.spmd and tm.omd):

# /etc/init.d/sunhpc.cre_node reboot

The procedure attempts to save the system configuration (in the same way as using the mpadmin dump command), kill all the running jobs, and restore the system configuration. Note that the Cluster Console Manager applications may be useful in executing commands on all the nodes in the cluster simultaneously. For information about the Cluster Console Manager applications, see Appendix A.



Note - sunhpc.cre_master reboot saves the existing rdb-log and rdb-save files in /var/hpc/rdb-log.1 and /var/hpc/rdb-save.1. The rdb-log file is a running log of the resource database activity and rdb-save is a snapshot of the database taken at regular intervals.



To recover Sun CRE after a partial failure, that is, when some but not all daemons have failed, you can clean up bad database entries without losing the configuration information. For example, run the following commands on the master node to clear out the dynamic data while preserving the configuration:

# /opt/SUNWhpc/sbin/ctstartd -l
# /etc/init.d/sunhpc.cre_master reboot

The ctstartd command is necessary in case some of the daemons are not running. The -l option causes the command to run on the local system; in this case, the master node.
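Once the reboot completes, a quick way to confirm that the nodes are again reporting to Sun CRE is to check node status, for example:

% mpinfo -N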


Configuring Out Network Controllers

During maintenance or replacement of Sun Fire Link high-performance cluster interconnect hardware, the cluster administrator may wish to configure out the controller(s) that have become unavailable. This can be done in the hpc.conf file in either of two ways: by changing the PM=rsm section or by changing the rsm_links option in the MPIOptions section. This section describes both methods.

Using the PM Section

Stop the Sun CRE and RSM daemons when editing a PM section of hpc.conf and restart them when finished. See Stopping and Restarting Sun CRE and RSM Daemon.

If all controllers on the same network are going to be unavailable, the administrator can add a new line to the PM=rsm section that names the controller instance and sets its AVAIL field to 0.

# RSM Settings
# NAME     RANK  AVAIL
Begin PM=rsm
wrsm       15    1
wrsm0      15    0
End PM=rsm

Once the daemons are restarted, controller 0 will be unavailable on all nodes on the cluster. The wildcard entry wrsm is still set to 1, indicating that all controllers on that network other than controller 0 will be available.

Using the MPIOptions Section

By editing the rsm_links option in the MPIOptions section, the administrator can either remove a controller instance across all nodes or remove a controller instance from one or more specified nodes.

Using the MPIOptions method does not require you to stop and restart the daemons. However, MPI jobs that are already running are not notified that the controller(s) are being configured out and may continue to use them, which can cause a job to abort.

To remove an instance across all nodes, set the rsm_links option to exclude that instance. For example, suppose that a cluster has controllers wrsm0 and wrsm1 and you wish to configure out wrsm0. To do so, set the value of rsm_links to wrsm1 only.

rsm_links wrsm1

To remove a controller instance only on specified nodes, use the syntax node.controller to indicate which controllers on the node(s) in question should remain available. For example, consider a three-node cluster with controllers wrsm0 and wrsm1 on all three nodes. The following entry has the effect of removing controller wrsm0 from node1:

rsm_links wrsm1 node0.wrsm0 node2.wrsm0

This entry specifies that controller wrsm1 is available on all nodes, while controller wrsm0 is available only on nodes node0 and node2.