APPENDIX B

Sun MPI Environment Variables

This appendix describes some Sun MPI environment variables and their effects on program performance. It covers the following topics:

  • Yielding and Descheduling
  • Polling
  • Shared-Memory Point-to-Point Message Passing
  • Shared-Memory Collectives
  • Running Over TCP
  • RSM Point-to-Point Message Passing
  • Summary Table of Environment Variables

Prescriptions for using MPI environment variables for performance tuning are provided in Chapter 8. Additional information on these and other environment variables can be found in the Sun MPI Programming and Reference Guide.

These environment variables are closely related to the details of the Sun MPI implementation, and their use requires an understanding of the implementation. More details on the Sun MPI implementation can be found in Appendix A.


Yielding and Descheduling

A blocking MPI communication call might not return until its operation has completed. If the operation has stalled, perhaps because there is insufficient buffer space to send or because there is no data ready to receive, Sun MPI will attempt to progress other outstanding, nonblocking messages. If no productive work can be performed, then in the most general case Sun MPI will yield the CPU to other processes, ultimately escalating to the point of descheduling the process by means of the spind daemon.

Setting MPI_COSCHED=0 specifies that processes should not be descheduled. This is the default behavior.

Setting MPI_SPIN=1 suppresses yields. The default value, 0, allows yields.
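
For example, on a dedicated system where each MPI process has its own CPU, you might combine these two settings to keep processes spinning and prevent descheduling (csh syntax, as in the other examples in this appendix):

% setenv MPI_SPIN 1
% setenv MPI_COSCHED 0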


Polling

By default, Sun MPI polls generally for incoming messages, regardless of whether receives have been posted. To suppress general polling, use MPI_POLLALL=0.
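
For example, if your application posts receives for all the messages it expects and you want to avoid the overhead of general polling, you might set:

% setenv MPI_POLLALL 0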


Shared-Memory Point-to-Point Message Passing

The size of each shared-memory buffer is fixed at 1 Kbyte. Most other quantities in shared-memory message passing are settable with MPI environment variables.

For any point-to-point message, Sun MPI will determine at runtime whether the message should be sent via shared memory, remote shared memory, or TCP. The flowchart in FIGURE B-1 illustrates what happens if a message of B bytes is to be sent over shared memory.

 FIGURE B-1 Message of B Bytes Sent Over Shared Memory (flowchart not reproduced here)

For pipelined messages, MPI_SHM_PIPESIZE bytes are sent under the control of any one postbox. If the message is shorter than 2 x MPI_SHM_PIPESIZE bytes, the message is split roughly into halves.

For cyclic messages, MPI_SHM_CYCLESIZE bytes are sent under the control of any one postbox, so that the footprint of the message in shared memory buffers is 2 x MPI_SHM_CYCLESIZE bytes.

The postbox area consists of MPI_SHM_NUMPOSTBOX postboxes per connection.

By default, each connection has its own pool of buffers, each pool of size MPI_SHM_CPOOLSIZE bytes.

By setting MPI_SHM_SBPOOLSIZE, users can specify that each sender has a pool of buffers, each pool having MPI_SHM_SBPOOLSIZE bytes, to be shared among its various connections. If MPI_SHM_CPOOLSIZE is also set, then any one connection might consume only that many bytes from its send-buffer pool at any one time.
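
For example, to give each sender a single 1-Mbyte pool shared among its connections, while limiting any one connection to 64 Kbytes of that pool at a time, you might set (the values are illustrative only):

% setenv MPI_SHM_SBPOOLSIZE 1048576
% setenv MPI_SHM_CPOOLSIZE 65536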

Memory Considerations

In all, the size of the shared-memory area devoted to point-to-point messages is

n x ( n - 1 ) x ( MPI_SHM_NUMPOSTBOX x ( 64 + MPI_SHM_SHORTMSGSIZE ) + MPI_SHM_CPOOLSIZE )

bytes when per-connection pools are used (that is, when MPI_SHM_SBPOOLSIZE is not set), and

n x ( n - 1 ) x MPI_SHM_NUMPOSTBOX x ( 64 + MPI_SHM_SHORTMSGSIZE ) + n x MPI_SHM_SBPOOLSIZE

bytes when per-sender pools are used (that is, when MPI_SHM_SBPOOLSIZE is set).
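
As a worked example using the defaults from TABLE B-1 (MPI_SHM_NUMPOSTBOX = 16, MPI_SHM_SHORTMSGSIZE = 256, MPI_SHM_CPOOLSIZE = 24576) with per-connection pools, a node running n = 4 processes devotes

4 x 3 x ( 16 x ( 64 + 256 ) + 24576 ) = 12 x 29696 = 356352

bytes of shared memory to point-to-point messages.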

Performance Considerations

A sender should be able to deposit its message and complete its operation without waiting for any other process. The following considerations help achieve this.

In theory, rendezvous can improve performance for long messages if their receives are posted in a different order than their sends. In practice, the right set of conditions for overall performance improvement with rendezvous messages is rarely met.
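
If you nonetheless want to experiment with rendezvous for long shared-memory messages, the relevant controls are the eager-only policy and the shared-memory rendezvous threshold listed in TABLE B-1. For example (illustrative values, assuming that setting MPI_EAGERONLY to 0 enables the rendezvous protocol for messages above the threshold):

% setenv MPI_EAGERONLY 0
% setenv MPI_SHM_RENDVSIZE 131072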

Send-buffer pools can be used to provide reduced overall memory consumption for a particular value of MPI_SHM_CPOOLSIZE. If a process will only have outstanding messages to a few other processes at any one time, then set MPI_SHM_SBPOOLSIZE to the number of other processes times MPI_SHM_CPOOLSIZE. Multithreaded applications might suffer, however, since then a sender's threads would contend for a single send-buffer pool instead of for multiple, distinct connection pools.
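
For example, if each process has outstanding messages to at most four other processes at any one time and MPI_SHM_CPOOLSIZE keeps its default of 24576 bytes, the suggestion above works out to 4 x 24576 bytes:

% setenv MPI_SHM_SBPOOLSIZE 98304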

Pipelining, including for cyclic messages, can roughly double the point-to-point bandwidth between two processes. This is a secondary performance effect, however, since processes tend to get considerably out of step with one another, and since the nodal backplane can become saturated with multiple processes exercising it at the same time.

Restrictions

The shared-memory environment variables are subject to the following restrictions:

  • ( MPI_SHM_SHORTMSGSIZE - 8 ) x 1024 / 8 should be at least as large as
    max( MPI_SHM_PIPESTART, MPI_SHM_PIPESIZE, MPI_SHM_CYCLESIZE )
  • MPI_SHM_CPOOLSIZE should be at least as large as
    max( MPI_SHM_PIPESTART, MPI_SHM_PIPESIZE, 2 x MPI_SHM_CYCLESIZE )
  • MPI_SHM_SBPOOLSIZE should be at least ( ( np - 1 ) + 1 ) x MPI_SHM_CYCLESIZE
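
As a check using the defaults in TABLE B-1: ( 256 - 8 ) x 1024 / 8 = 31744, which is at least max( 2048, 8192, 8192 ) = 8192, and MPI_SHM_CPOOLSIZE = 24576 is at least max( 2048, 8192, 2 x 8192 ) = 16384, so the default settings satisfy these restrictions.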


Shared-Memory Collectives

Collective operations in Sun MPI are highly optimized and make use of a general buffer pool within shared memory. MPI_SHM_GBPOOLSIZE sets the amount of space (in bytes) available on a node for the "optimized" collectives. By default, it is set to 20971520 bytes. This space is used by MPI_Bcast(), MPI_Reduce(), MPI_Allreduce(), MPI_Reduce_scatter(), and MPI_Barrier(), provided that two or more of the MPI processes are on the node.

Memory is allocated from the general buffer pool in three different ways:

  • For broadcast operations, up to

    ( n / 4 ) x 2 x MPI_SHM_BCASTSIZE

    bytes are borrowed from the general buffer pool, where n is the number of MPI processes on the node. If less memory is needed than this, then less memory is used. After the broadcast operation, the memory is returned to the general buffer pool.

  • For reduce operations,

    n x n x MPI_SHM_REDUCESIZE

    bytes are borrowed from the general buffer pool and returned after the operation.
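
For example, with 16 MPI processes on a node and the defaults in TABLE B-1, a large broadcast borrows up to ( 16 / 4 ) x 2 x 32768 = 262144 bytes and a large reduce borrows 16 x 16 x 256 = 65536 bytes, both well within the default 20971520-byte general buffer pool.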

In essence, MPI_SHM_BCASTSIZE and MPI_SHM_REDUCESIZE set the pipeline sizes for broadcast and reduce operations on large messages. Larger values can improve the efficiency of these operations for very large messages, but the amount of time it takes to fill the pipeline can also increase. Typically, the default values are suitable, but if your application relies exclusively on broadcasts or reduces of very large messages, then you can try doubling or quadrupling the corresponding environment variable using one of the following:

% setenv MPI_SHM_BCASTSIZE 65536 
% setenv MPI_SHM_BCASTSIZE 131072 
% setenv MPI_SHM_REDUCESIZE 512 
% setenv MPI_SHM_REDUCESIZE 1024

If MPI_SHM_GBPOOLSIZE proves to be too small and a collective operation happens to be unable to borrow memory from this pool, the operation will revert to slower algorithms. Hence, under certain circumstances, performance optimization could dictate increasing MPI_SHM_GBPOOLSIZE.
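
For example, to double the default general buffer pool, you might set:

% setenv MPI_SHM_GBPOOLSIZE 41943040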


Running Over TCP

TCP ensures reliable dataflow, even over loss-prone networks, by retransmitting data as necessary. When the underlying network loses a lot of data, the rate of retransmission can be very high, and delivered MPI performance will suffer accordingly. Increasing synchronization between senders and receivers by lowering the TCP rendezvous threshold with MPI_TCP_RENDVSIZE might help in certain cases. Generally, increased synchronization hurts performance, but over a loss-prone network it might help mitigate performance degradation.
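
For example, to lower the TCP rendezvous threshold from its default of 49152 bytes on a loss-prone network, you might try (the value is illustrative):

% setenv MPI_TCP_RENDVSIZE 16384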

If the network is not lossy, then lowering the rendezvous threshold would be counterproductive and, indeed, a Sun MPI safeguard might be lifted. For reliable networks, use

% setenv MPI_TCP_SAFEGATHER 0

to speed MPI_Gather() and MPI_Gatherv() performance.


RSM Point-to-Point Message Passing

The RSM protocol has some similarities with the shared-memory protocol, but it also differs substantially, and environment variables are used differently.

The maximum size of a short message is MPI_RSM_SHORTMSGSIZE bytes, with a default value of 3918 bytes. Short RSM messages can span multiple postboxes, but they still do not use any buffers.

The most data that will be sent under any one postbox using buffers for pipelined messages is MPI_RSM_PIPESIZE bytes.

There are MPI_RSM_NUMPOSTBOX postboxes for each RSM connection.

If MPI_RSM_SBPOOLSIZE is unset, then each RSM connection has a buffer pool of MPI_RSM_CPOOLSIZE bytes. If MPI_RSM_SBPOOLSIZE is set, then each process has a pool of buffers that is MPI_RSM_SBPOOLSIZE bytes per remote node for sending messages to processes on the remote node.

Unlike the case of the shared-memory protocol, values of the MPI_RSM_PIPESIZE, MPI_RSM_CPOOLSIZE, and MPI_RSM_SBPOOLSIZE environment variables are merely requests. Values set with the setenv command or printed when MPI_PRINTENV is used might not reflect effective values. In particular, only when connections are actually established are the RSM parameters truly set. Indeed, the effective values could change over the course of program execution if lazy connections are employed.
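
To see the values that were requested (which, as noted, are not necessarily the effective values), you might set the following before launching the job:

% setenv MPI_PRINTENV 1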

Striping refers to passing a message over multiple hardware links to get the speedup of their aggregate bandwidth. The number of hardware links used for a single message is limited to the smallest of these values:

  • MPI_RSM_MAXSTRIPE
  • rsm_maxstripe (if specified by the system administrator in the hpc.conf file)
  • the number of available links

When a connection is established between an MPI process and a remote destination process, the links that will be used for that connection are chosen. A job can use different links for different connections. Thus, even if MPI_RSM_MAXSTRIPE or rsm_maxstripe is set to 1, the overall job could conceivably still benefit from multiple hardware links.
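
For example, to restrict each connection to a single hardware link, you might set:

% setenv MPI_RSM_MAXSTRIPE 1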

Use of rendezvous for RSM messages is controlled with MPI_RSM_RENDVSIZE.

Memory Considerations

Memory is allocated on a node for each remote MPI process that sends messages to it over RSM. If np_local is the number of processes on a particular node, then the memory requirement on the node for RSM message passing from any one remote process is

np_local x ( MPI_RSM_NUMPOSTBOX x 128 + MPI_RSM_CPOOLSIZE )

bytes when MPI_RSM_SBPOOLSIZE is unset, and

np_local x MPI_RSM_NUMPOSTBOX x 128 + MPI_RSM_SBPOOLSIZE

bytes when MPI_RSM_SBPOOLSIZE is set.
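
As a worked example using the defaults in TABLE B-1 (MPI_RSM_NUMPOSTBOX = 128, MPI_RSM_CPOOLSIZE = 256 Kbytes = 262144 bytes), a node running np_local = 4 processes requires

4 x ( 128 x 128 + 262144 ) = 4 x 278528 = 1114112

bytes for each remote process that sends to it over RSM, before the adjustments described below.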

The amount of memory actually allocated might be higher or lower than this requirement.

  • The memory requirement is rounded up to some multiple of 8192 bytes with a minimum of 32768 bytes.
  • This memory is allocated from a 256-Kbyte (262,144-byte) segment.
    • If the memory requirement is greater than 256 Kbytes, then insufficient memory will be allocated.
    • If the memory requirement is less than 256 Kbytes, some allocated memory will go unused. (There is some, but only limited, sharing of segments.)

If less memory is allocated than is required, then requested values of MPI_RSM_CPOOLSIZE or MPI_RSM_SBPOOLSIZE (specified with a setenv command and echoed if MPI_PRINTENV is set) can be reduced at runtime. This can cause the requested value of MPI_RSM_PIPESIZE to be overridden as well.

Each remote MPI process requires its own allocation on the node as described previously.

If multi-way stripes are employed, the memory requirement increases correspondingly.

Performance Considerations

The pipe size should be at most half as big as the connection pool:

2 x MPI_RSM_PIPESIZE ≤ MPI_RSM_CPOOLSIZE

Otherwise, pipelined transfers will proceed slowly. The library adjusts MPI_RSM_PIPESIZE appropriately.
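
For example, a request for a 128-Kbyte pipe size is consistent with the default 256-Kbyte connection pool, since 2 x 131072 = 262144. The value is an illustrative request only, subject to the runtime adjustments described above:

% setenv MPI_RSM_PIPESIZE 131072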

For pipelined messages, a sender must synchronize with its receiver to ensure that remote writes to buffers have completed before postboxes are written. Long pipelined messages can absorb this synchronization cost, but performance for short pipelined messages will suffer. In some cases, increasing the value of MPI_RSM_SHORTMSGSIZE can mitigate this effect.

Restriction

If the short message size is increased, there must be enough postboxes to accommodate the largest size. The first postbox can hold 23 bytes of payload, while each subsequent postbox in a short message can take 63 bytes of payload. Thus, the restriction is

MPI_RSM_SHORTMSGSIZE ≤ 23 + ( MPI_RSM_NUMPOSTBOX - 1 ) x 63
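
For example, with MPI_RSM_NUMPOSTBOX = 15, the postboxes can hold at most 23 + 14 x 63 = 905 bytes of payload, so MPI_RSM_SHORTMSGSIZE cannot usefully be raised above 905 in that configuration.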


Summary Table of Environment Variables

TABLE B-1 Sun MPI Environment Variables

Name                    Units          Range                           Default
---------------------------------------------------------------------------------------------
Informational
MPI_PRINTENV            (None)         0 or 1                          0
MPI_QUIET               (None)         0 or 1                          0
MPI_SHOW_ERRORS         (None)         0 or 1                          0
MPI_SHOW_INTERFACES     (None)         0 - 3                           0

Shared Memory Point-to-Point
MPI_SHM_NUMPOSTBOX      Postboxes      ≥ 1                             16
MPI_SHM_SHORTMSGSIZE    Bytes          Multiple of 64                  256
MPI_SHM_PIPESIZE        Bytes          Multiple of 1024                8192
MPI_SHM_PIPESTART       Bytes          Multiple of 1024                2048
MPI_SHM_CYCLESIZE       Bytes          Multiple of 1024                8192
MPI_SHM_CYCLESTART      Bytes          --                              0x7fffffff (32-bit) or
                                                                       0x7fffffffffffffff (64-bit)*
MPI_SHM_CPOOLSIZE       Bytes          Multiple of 1024                24576, or MPI_SHM_SBPOOLSIZE
                                                                       if that variable is set
MPI_SHM_SBPOOLSIZE      Bytes          Multiple of 1024                (Unset)

Shared Memory Collectives
MPI_SHM_BCASTSIZE       Bytes          Multiple of 128                 32768
MPI_SHM_REDUCESIZE      Bytes          Multiple of 64                  256
MPI_SHM_GBPOOLSIZE      Bytes          > 256                           20971520

TCP
MPI_TCP_CONNTIMEOUT     Seconds        ≥ 0                             600
MPI_TCP_CONNLOOP        Occurrences    ≥ 0                             0
MPI_TCP_SAFEGATHER      (None)         0 or 1                          1

RSM
MPI_RSM_NUMPOSTBOX      Postboxes      1 - 15                          128
MPI_RSM_SHORTMSGSIZE    Bytes          23 - 905                        3918
MPI_RSM_PIPESIZE        Bytes          Multiple of 1024 up to 15360    64 Kbytes
MPI_RSM_CPOOLSIZE       Bytes          Multiple of 1024                256 Kbytes
MPI_RSM_SBPOOLSIZE      Bytes          Multiple of 1024                (Unset)
MPI_RSM_MAXSTRIPE       Links          ≥ 1                             rsm_maxstripe, if set by the
                                                                       system administrator in the
                                                                       hpc.conf file; otherwise 2
MPI_RSM_DISABLED        (None)         0 or 1                          0

One-Sided Communication
MPI_RSM_GETSIZE         Bytes          ≥ 1                             16 Kbytes
MPI_RSM_PUTSIZE         Bytes          ≥ 1                             16 Kbytes
MPI_USE_AGENT_THREAD    (None)         0 or 1                          0

Polling and Flow
MPI_FLOWCONTROL         Messages       ≥ 0                             0
MPI_POLLALL             (None)         0 or 1                          1

Dedicated Performance
MPI_PROCBIND            (None)         0 or 1                          0
MPI_SPIN                (None)         0 or 1                          0

Full vs. Lazy Connections
MPI_FULLCONNINIT        (None)         0 or 1                          0

Eager vs. Rendezvous
MPI_EAGERONLY           (None)         0 or 1                          1
MPI_SHM_RENDVSIZE       Bytes          ≥ 1                             24576
MPI_TCP_RENDVSIZE       Bytes          ≥ 1                             49152
MPI_RSM_RENDVSIZE       Bytes          ≥ 1                             256 Kbytes

Coscheduling
MPI_COSCHED             (None)         0 or 1                          (Unset, or "2")
MPI_SPINDTIMEOUT        Milliseconds   ≥ 0                             1000

Handles
MPI_MAXFHANDLES         Handles        ≥ 1                             1024
MPI_MAXREQHANDLES       Handles        ≥ 1                             1024

* That is, by default there is no cyclic message passing.