CHAPTER 8

Runtime Considerations and Tuning

To understand runtime tuning, you need to understand what happens on your cluster at runtime--that is, how hardware characteristics can impact performance and what the current state of the system is.

This chapter discusses the performance implications of:

  • Running on a dedicated system
  • Setting Sun MPI environment variables
  • Launching jobs on a multinode cluster
  • Multinode job launch under CRE

Running on a Dedicated System

The primary consideration in achieving maximum performance from an application at runtime is giving it dedicated access to the resources. Useful commands include:

                              CRE              UNIX
How high is the load?         % mpinfo -N      % uptime
What is causing the load?     % mpps -e        % ps -e

The UNIX commands give information only for the node where the command is issued. The CRE commands return information for all nodes in a cluster.

CRE's mpps command shows only those processes running under the resource manager. For more complete information, try the UNIX ps command. For example, either

% /usr/ucb/ps augx

or

% /usr/bin/ps -e -o pcpu -o pid -o comm | sort -n

will list the busiest processes on a particular node.

Note that small background loads can have a dramatic impact. For example, the fsflush daemon flushes memory periodically to disk. On a server with a lot of memory, the default behavior of this daemon might cause a background load of only about 0.2, representing a small fraction of 1 percent of the compute resource of a 64-way server. Nevertheless, if you attempted to run a "dedicated" 64-way parallel job on this server with tight synchronization among the processes, this background activity could potentially disrupt not only one CPU for 20 percent of the time, but in fact all CPUs, because MPI processes are often very tightly coupled. (For the particular case of the fsflush daemon, a system administrator should tune the behavior to be minimally disruptive for large-memory machines.)
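
For illustration only (a sketch, not a recommendation from this guide): on Solaris systems, the fsflush scan is governed by the autoup and tune_t_fsflushr kernel parameters, and a system administrator might spread the flushing work over a longer interval on a large-memory server with /etc/system entries such as the following (the values are purely illustrative):

* /etc/system: reduce fsflush overhead on a large-memory server
* (illustrative values: spread the full memory scan over 240 seconds
* instead of the default 30, and wake fsflush every 15 seconds)
set autoup = 240
set tune_t_fsflushr = 15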

In short, it is desirable to leave at least one CPU idle per cluster node. In any case, it is useful to realize that the activity of background daemons is potentially very disruptive to tightly coupled MPI programs.


Setting Sun MPI Environment Variables

Sun MPI uses a variety of techniques to deliver high-performance, robust, and memory-efficient message passing under a wide set of circumstances. In most cases, performance will be good without tuning any environment variables. In certain situations, however, applications will benefit from nondefault behaviors. The Sun MPI environment variables discussed in this section enable you to tune these default behaviors.

If you need a quick and approximate evaluation of your environment variable settings, you can skip this section entirely and rely on the MPProf profiling tool, described further in Chapter 9, to recommend Sun MPI environment variable settings based on collected profiling data.

More detailed information is available in Appendix A and Appendix B.

Are You Running on a Dedicated System?

If your system's capacity is sufficient for running your Sun MPI job, you can commit processors aggressively to your job. Your CPU load should not exceed the number of physical processors. Load is basically defined as the number of MPI processes in your job, but it can be greater if other jobs are running on the system or if your job is multithreaded. Load can be checked with the uptime or mpinfo command, as discussed at the beginning of this chapter.

To run more aggressively, use either of these settings (example commands follow this list):

  • Set the MPI_SPIN environment variable to 1. This setting causes Sun MPI to "spin" aggressively, regardless of whether it is doing any useful work. If you use this setting, you should leave at least one idle processor per node to service system daemons. If you intend to use all processors on a node, this aggressive spin behavior can slow performance, so some experimentation is needed.
  • Set the MPI_PROCBIND environment variable to 1. While the Solaris Operating Environment schedules processes in generally optimal ways, performance in a dedicated environment is sometimes improved by binding processes to processors. Detailed control over the binding can be achieved by listing specific processors; see the MPI man page for more details on MPI_PROCBIND. Note that performance can deteriorate dramatically if multiple processes are bound to the same processor or if the processes are multithreaded.
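
For example, in a csh-style shell, the two settings described above might be applied as follows (this assumes MPI_SPIN is the variable that controls aggressive spinning, as named in the list above):

% setenv MPI_SPIN 1
% setenv MPI_PROCBIND 1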

Does the Code Use System Buffers Safely?

In some MPI programs, processes send large volumes of data with blocking sends before starting to receive messages. The MPI standard specifies that users must explicitly provide buffering in such cases, such as by using MPI_Bsend() calls. In practice, however, some users rely on the standard send (MPI_Send()) to supply unlimited buffering. By default, Sun MPI prevents deadlock in such situations through general polling, which drains system buffers even when no receives have been posted by the user code.
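
As an illustration, the following hypothetical two-rank exchange (a sketch written for this discussion, not code from this guide) is unsafe in the sense of the MPI standard: both ranks issue a large standard send before posting any receive, so completion depends entirely on system buffering.

include "mpif.h"
integer, parameter :: n = 100000
integer ier, me, iother, status(MPI_STATUS_SIZE)
real*8 buf(n), rbuf(n)

call MPI_Init(ier)
call MPI_Comm_rank(MPI_COMM_WORLD, me, ier)
iother = 1 - me        ! assumes the job runs with exactly 2 ranks
buf = dble(me)

! both ranks send before either receives; the standard sends can
! complete only if the system buffers the outgoing messages
call MPI_Send(buf, n, MPI_REAL8, iother, 0, MPI_COMM_WORLD, ier)
call MPI_Recv(rbuf, n, MPI_REAL8, iother, 0, MPI_COMM_WORLD, status, ier)

call MPI_Finalize(ier)
end

A safe version would post the receives before the sends on one of the ranks, use MPI_Sendrecv(), or supply explicit buffering with MPI_Buffer_attach() and MPI_Bsend().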

For best performance on typical, safe programs, general polling should be suppressed by using the setting shown in the following example:

% setenv MPI_POLLALL 0

If this setting causes deadlock, you can still keep it for best performance by resolving the deadlock with increased buffering, as discussed in the next section.

Are You Willing to Trade Memory for Performance?

Messages traveling from one MPI process to another are staged in intermediate buffers, internal to Sun MPI. If this buffering is insufficient, senders can stall unnecessarily while receivers drain the buffers.

One alternative is to increase the internal buffering using Sun MPI environment variables. For example, try these settings before you run:

% setenv MPI_SHM_SBPOOLSIZE 8000000
% setenv MPI_SHM_NUMPOSTBOX 256

Another alternative is to run your program with the MPProf tool, which suggests environment variable settings if it detects internal buffer congestion. See Chapter 9 for more information on MPProf.

For a more detailed understanding of these environment variables, see Appendix A and Appendix B.

Do You Want to Initialize Sun MPI Resources?

Certain Sun MPI resources can be relatively expensive the first time they are used, and this first-use cost can distort performance profiles and timings. While it is best, in any case, to ensure that performance has reached equilibrium before profiling starts, you can set a Sun MPI environment variable to move some of this resource initialization into the MPI_Init() call. Use:

% setenv MPI_FULLCONNINIT 1

Note that this setting does not tend to improve overall performance; it simply moves initialization costs into the MPI_Init() call, which can make subsequent MPI calls faster and easier to profile while slowing down MPI_Init() itself. In extreme cases, the initialization can take minutes to complete.

Is More Runtime Diagnostic Information Needed?

Some environment variable settings are advisory and might be ignored because of system administration policies or system resource limitations. Others might be ignored simply because a variable name was misspelled. To confirm which Sun MPI environment variable values are being used, set the MPI_PRINTENV environment variable:

% setenv MPI_PRINTENV 1

When multiple interconnects are available on your cluster, you can check which interconnects your program actually uses by setting the MPI_SHOW_INTERFACES environment variable:

% setenv MPI_SHOW_INTERFACES 2

 


Launching Jobs on a Multinode Cluster

In a cluster configuration, the mapping of MPI processes to the nodes of the cluster can impact application performance significantly. This section describes some of the important issues.

Minimizing Communication Costs

Communication between MPI processes on the same shared-memory node is much faster than between processes on different nodes. Thus, by collocating processes on the same node, application performance can be increased. Indeed, if one of your servers is very large, you might want to run your entire "distributed-memory" application on a single node.

Meanwhile, not all processes within an MPI job need to communicate efficiently with all others. For example, the MPI processes might logically form a square "process grid," in which there are many messages traveling along rows and columns, or predominantly along one or the other. In such a case, it might not be essential for all processes to be collocated, but only for a process to be collocated with its partners within the same row or column.

Load Balancing

Running all the processes on a single node can improve performance if the node has sufficient resources available to service the job, as explained in the preceding section. At a minimum, it is important to have no more MPI processes on a node than there are CPUs. It might also be desirable to leave at least one CPU per node idle (see Running on a Dedicated System). Additionally, if bandwidth to memory is more important than interprocess communication, you might prefer to underpopulate nodes with processes so that processes do not compete unduly for limited server backplane bandwidth. Finally, if the MPI processes are multithreaded, it is important to have a CPU available for each lightweight process (LWP) within an MPI process. This last consideration is especially tricky because the resource manager (CRE or LSF) might not know at job launch that processes will spawn other LWPs.

Controlling Bisection Bandwidth

Clusters configured with commodity interconnects typically provide little internodal bandwidth per node. Meanwhile, bisection bandwidth might be the limiting factor for performance on a wide range of applications. In this case, if you must run on multiple nodes, you might prefer to run on more nodes rather than on fewer.

This point is illustrated qualitatively in FIGURE 8-1. The high-bandwidth backplanes of large Sun servers provide excellent bisection bandwidth for a single node. Once you have multiple nodes using a commodity interconnect, however, the interface between each node and the network will typically become the bottleneck. Bisection bandwidth starts to recover again when the number of nodes--actually, the number of network interfaces--increases.
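
As a rough illustration (a simple model, not a measurement from this guide): if each node attaches to the network through a single interface of bandwidth B, then a job spread over N nodes has a cluster bisection bandwidth of at most (N/2) x B, since at most N/2 interfaces can send across any bisection of the machine. Going from 2 nodes to 8 nodes therefore raises the available cross-machine bandwidth from roughly 1 x B to 4 x B, even though each node then hosts fewer processes.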

FIGURE 8-1   Relationship Between Bisection Bandwidth and Number of Nodes

In practice, every application benefits at least somewhat from increased locality, so collocating more processes per node by reducing the number of nodes has some positive effect. Nevertheless, for codes that are dominated by all-to-all types of communication, increasing the number of nodes can improve performance.

Considering the Role of I/O Servers

The presence of I/O servers in a cluster affects the other issues we have been discussing in this section. If, for example, a program will make heavy use of a particular I/O server, executing the program on that I/O node might improve performance. If the program makes scant use of I/O, you might prefer to avoid I/O nodes, since they might consume nodal resources. If multiple I/O servers are used, you might want to distribute MPI processes in a client job to increase aggregate ("bisection") bandwidth to I/O.

Running Jobs in the Background

Performance experiments conducted in the course of tuning often require multiple runs under varying conditions. It might be desirable to run such jobs in the background.

To run jobs in the background, perhaps from a shell script, use the -n switch with the CRE mprun command when the standard input is not being used. Otherwise, the job could block. The following example shows the use of this switch:

% mprun -n -np 4 a.out &

% cat a.csh
#!/bin/csh
mprun -n -np 4 a.out

% a.csh

Limiting Core Dumps

Core dumps can provide valuable debugging information, but they can also exact a heavy price for trivial mistakes. In particular, core dumps of Sun HPC processes can be very large. For multiprocess jobs, the problem can be compounded, and the effect of dumping multiple large core files over a local network to a single, NFS-mounted file system can be crippling.

To limit core dumps for jobs submitted with the CRE mprun command, simply limit core dumps in the parent shell before submitting the job. If the parent shell is csh, use the command limit coredumpsize 0. If the parent shell is sh, use the ulimit -c 0 command.
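
For example, using the commands named above (the prompts are illustrative):

In csh:

% limit coredumpsize 0

In sh:

$ ulimit -c 0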

Using Line-Buffered Output

When multiple MPI ranks are writing to the same output device, the multiple output streams may interfere with one another, such that output from different ranks can be interleaved in the middle of an output line.

One way of handling this is to tell CRE to use line-buffered output, for example by using the -o or -I switch to the mprun command.

The -I syntax is not simple, but it allows detailed control over a job's I/O streams. For example, consider the sample Fortran MPI code:

include "mpif.h"
 
call MPI_Init(ier)
call MPI_Comm_rank(MPI_COMM_WORLD,me,ier)
call MPI_Barrier(MPI_COMM_WORLD,ier)
do i = 1, 1000
  write(6,'("rank",i4,";  iteration", i6)') me, i
enddo
call MPI_Finalize(ier)
end

Executing the job without line buffering can lead to output lines from different ranks being combined (as shown in this example):

% mprun -np 16 a.out
[...]
rank   2;  iteration    34
rank   2;  iteration    35
rank   2;  iteration    36
rank   2;  iteration    37
rank   2;  iteration    3rank   7;  iteration     1
rank   7;  iteration     2
rank   7;  iteration     3
rank   7;  iteration     4
[...]

In contrast, you can use the -I switch:

% mprun -np 16 -I 0r=/dev/null,1wl,2w=errorfile a.out

Using this switch directs the job:

  • To read stdin from /dev/null
  • To use line buffering for stdout
  • To direct stderr to errorfile

For more information on this syntax, see the section of the mprun man page that covers file descriptor strings.


Multinode Job Launch Under CRE

CRE provides a number of ways to control the mapping of jobs to the respective nodes of a cluster.

Collocated Blocks of Processes

CRE supports the collocation of blocks of processes--that is, all processes within a block are mapped to the same node.

Assume you are performing an LU decomposition on a 4x8 process grid using Sun S3L. If minimizing communication within each block of four consecutive MPI ranks matters most, then these 32 processes can be launched in blocks of 4 collocated MPI processes by using the -Zt or -Z option:

% mprun -np 32 -Zt 4 a.out
% mprun -np 32 -Z  4 a.out

In either case, MPI ranks 0 through 3 will be mapped to a single node. Likewise, ranks 4 through 7 will be mapped to a single node, and so on: each block of four consecutive MPI ranks is mapped to a node as a unit. With the -Zt option, no two blocks will be mapped to the same node, so eight nodes will be used. With the -Z option, multiple blocks might be mapped to the same node; for example, the entire job might be mapped to a single node if that node has at least 32 CPUs.

Multithreaded Job

Consider a multithreaded MPI job in which there is one MPI process per node, with each process multithreaded to make use of all the CPUs on the node. You could specify 16 such processes on 16 different nodes by using:

% mprun -Ns -np 16 a.out

Round-Robin Distribution of Processes

Imagine that you have an application that depends on bandwidth for uniform, all-to-all communication. If the code requires more CPUs than can be found on any node within the cluster, it should be run over all the nodes in the cluster to maximize bisection bandwidth. For example, for 32 processes, this can be effected with the command:

% mprun -Ns -W -np 32 a.out

That is, CRE tries to map processes to distinct nodes (because of the -Ns switch, as in the preceding multithreaded case), but it will resort to "wrapping" multiple processes (-W switch) onto a node as necessary.

Detailed Mapping

For more complex mapping requirements, use the mprun switch -m or -l to specify a rankmap as a file or a string, respectively. For example, if the file nodelist contains:

node0
node0 2
node0
node1 4
node2 8

then the command:

% mprun -np 16 -m nodelist a.out

maps the first 4 processes to node0, the next 4 to node1, and the next 8 to node2. Refer to the Sun HPC ClusterTools User's Guide for more information about process mappings.