APPENDIX B

Sun MPI Environment Variables

This appendix describes some Sun MPI environment variables and their effects on program performance. It covers the following topics:

  • Yielding and Descheduling
  • Polling
  • Shared-Memory Point-to-Point Message Passing
  • Shared-Memory Collectives
  • Running Over TCP
  • RSM Point-to-Point Message Passing
  • Summary Table of Environment Variables

Prescriptions for using MPI environment variables for performance tuning are provided in Chapter 8. Additional information on these and other environment variables can be found in the Sun MPI Programming and Reference Guide.

These environment variables are closely related to the details of the Sun MPI implementation, and their use requires an understanding of the implementation. More details on the Sun MPI implementation can be found in Appendix A.


Yielding and Descheduling

A blocking MPI communication call might not return until its operation has completed. If the operation has stalled, perhaps because there is insufficient buffer space to send or because there is no data ready to receive, Sun MPI will attempt to progress other outstanding, nonblocking messages. If no productive work can be performed, then in the most general case Sun MPI will yield the CPU to other processes, ultimately escalating to the point of descheduling the process by means of the spind daemon.

Setting MPI_COSCHED=0 specifies that processes should not be descheduled. This is the default behavior.

Setting MPI_SPIN=1 suppresses yields. The default value, 0, allows yields.
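
For example, on a dedicated system where each MPI process has its own CPU, you might combine these two settings to keep processes spinning and prevent descheduling (csh syntax, as in the other examples in this appendix):

% setenv MPI_SPIN 1
% setenv MPI_COSCHED 0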


Polling

By default, Sun MPI polls generally for incoming messages, regardless of whether receives have been posted. To suppress general polling, use MPI_POLLALL=0.
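
For example, if your application posts receives for all the messages it expects and you want to avoid the overhead of general polling, you might set:

% setenv MPI_POLLALL 0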


Shared-Memory Point-to-Point Message Passing

The size of each shared-memory buffer is fixed at 1 Kbyte. Most other quantities in shared-memory message passing are settable with MPI environment variables.

For any point-to-point message, Sun MPI will determine at runtime whether the message should be sent via shared memory, remote shared memory, or TCP. The flowchart in FIGURE B-1 illustrates what happens if a message of B bytes is to be sent over shared memory.

 FIGURE B-1 Message of B Bytes Sent Over Shared Memory (flowchart not reproduced here)

For pipelined messages, MPI_SHM_PIPESIZE bytes are sent under the control of any one postbox. If the message is shorter than 2 x MPI_SHM_PIPESIZE bytes, the message is split roughly into halves.

For cyclic messages, MPI_SHM_CYCLESIZE bytes are sent under the control of any one postbox, so that the footprint of the message in shared memory buffers is 2 x MPI_SHM_CYCLESIZE bytes.

The postbox area consists of MPI_SHM_NUMPOSTBOX postboxes per connection.

By default, each connection has its own pool of buffers, each pool of size MPI_SHM_CPOOLSIZE bytes.

By setting MPI_SHM_SBPOOLSIZE, users can specify that each sender has a pool of buffers, each pool having MPI_SHM_SBPOOLSIZE bytes, to be shared among its various connections. If MPI_SHM_CPOOLSIZE is also set, then any one connection might consume only that many bytes from its send-buffer pool at any one time.
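
For example, to give each sender a single 1-Mbyte pool shared among its connections, while limiting any one connection to 64 Kbytes of that pool at a time, you might set (the values are illustrative only):

% setenv MPI_SHM_SBPOOLSIZE 1048576
% setenv MPI_SHM_CPOOLSIZE 65536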

Memory Considerations

In all, the size of the shared-memory area devoted to point-to-point messages is

n x ( n - 1 ) x ( MPI_SHM_NUMPOSTBOX x ( 64 + MPI_SHM_SHORTMSGSIZE ) + MPI_SHM_CPOOLSIZE )

bytes when per-connection pools are used (that is, when MPI_SHM_SBPOOLSIZE is not set), and

n x ( n - 1 ) x MPI_SHM_NUMPOSTBOX x ( 64 + MPI_SHM_SHORTMSGSIZE ) + n x MPI_SHM_SBPOOLSIZE

bytes when per-sender pools are used (that is, when MPI_SHM_SBPOOLSIZE is set).
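
As a worked example using the defaults from TABLE B-1 (MPI_SHM_NUMPOSTBOX = 16, MPI_SHM_SHORTMSGSIZE = 256, MPI_SHM_CPOOLSIZE = 24576) with per-connection pools, a node running n = 4 processes devotes

4 x 3 x ( 16 x ( 64 + 256 ) + 24576 ) = 12 x 29696 = 356352

bytes of shared memory to point-to-point messages.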

Performance Considerations

A sender should be able to deposit its message and complete its operation without waiting for any other process. The following considerations help achieve this.

In theory, rendezvous can improve performance for long messages if their receives are posted in a different order than their sends. In practice, the right set of conditions for overall performance improvement with rendezvous messages is rarely met.
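
If you nonetheless want to experiment with rendezvous for long shared-memory messages, the relevant controls are the eager-only policy and the shared-memory rendezvous threshold listed in TABLE B-1. For example (illustrative values, assuming that setting MPI_EAGERONLY to 0 enables the rendezvous protocol for messages above the threshold):

% setenv MPI_EAGERONLY 0
% setenv MPI_SHM_RENDVSIZE 131072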

Send-buffer pools can be used to provide reduced overall memory consumption for a particular value of MPI_SHM_CPOOLSIZE. If a process will only have outstanding messages to a few other processes at any one time, then set MPI_SHM_SBPOOLSIZE to the number of other processes times MPI_SHM_CPOOLSIZE. Multithreaded applications might suffer, however, since then a sender's threads would contend for a single send-buffer pool instead of for multiple, distinct connection pools.
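
For example, if each process has outstanding messages to at most four other processes at any one time and MPI_SHM_CPOOLSIZE keeps its default of 24576 bytes, the suggestion above works out to 4 x 24576 bytes:

% setenv MPI_SHM_SBPOOLSIZE 98304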

Pipelining, including for cyclic messages, can roughly double the point-to-point bandwidth between two processes. This is a secondary performance effect, however, since processes tend to get considerably out of step with one another, and since the nodal backplane can become saturated with multiple processes exercising it at the same time.

Restrictions

The shared-memory environment variables are subject to the following restrictions:

  • ( MPI_SHM_SHORTMSGSIZE - 8 ) x 1024 / 8 should be at least as large as
    max( MPI_SHM_PIPESTART, MPI_SHM_PIPESIZE, MPI_SHM_CYCLESIZE )
  • MPI_SHM_CPOOLSIZE should be at least as large as
    max( MPI_SHM_PIPESTART, MPI_SHM_PIPESIZE, 2 x MPI_SHM_CYCLESIZE )
  • MPI_SHM_SBPOOLSIZE should be at least ( ( np - 1 ) + 1 ) x MPI_SHM_CYCLESIZE
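
As a check using the defaults in TABLE B-1: ( 256 - 8 ) x 1024 / 8 = 31744, which is at least max( 2048, 8192, 8192 ) = 8192, and MPI_SHM_CPOOLSIZE = 24576 is at least max( 2048, 8192, 2 x 8192 ) = 16384, so the default settings satisfy these restrictions.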


Shared-Memory Collectives

Collective operations in Sun MPI are highly optimized and make use of a general buffer pool within shared memory. MPI_SHM_GBPOOLSIZE sets the amount of space (in bytes) available on a node for the "optimized" collectives. By default, it is set to 20971520 bytes. This space is used by MPI_Bcast(), MPI_Reduce(), MPI_Allreduce(), MPI_Reduce_scatter(), and MPI_Barrier(), provided that two or more of the MPI processes are on the node.

Memory is allocated from the general buffer pool in three different ways:

  • For broadcast operations, up to

    ( n / 4 ) x 2 x MPI_SHM_BCASTSIZE

    bytes are borrowed from the general buffer pool, where n is the number of MPI processes on the node. If less memory is needed than this, then less memory is used. After the broadcast operation, the memory is returned to the general buffer pool.

  • For reduce operations,

    n x n x MPI_SHM_REDUCESIZE

    bytes are borrowed from the general buffer pool and returned after the operation.
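
For example, with 16 MPI processes on a node and the defaults in TABLE B-1, a large broadcast borrows up to ( 16 / 4 ) x 2 x 32768 = 262144 bytes and a large reduce borrows 16 x 16 x 256 = 65536 bytes, both well within the default 20971520-byte general buffer pool.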

In essence, MPI_SHM_BCASTSIZE and MPI_SHM_REDUCESIZE set the pipeline sizes for broadcast and reduce operations on large messages. Larger values can improve the efficiency of these operations for very large messages, but the amount of time it takes to fill the pipeline can also increase. Typically, the default values are suitable, but if your application relies exclusively on broadcasts or reduces of very large messages, then you can try doubling or quadrupling the corresponding environment variable using one of the following:

% setenv MPI_SHM_BCASTSIZE 65536 
% setenv MPI_SHM_BCASTSIZE 131072 
% setenv MPI_SHM_REDUCESIZE 512 
% setenv MPI_SHM_REDUCESIZE 1024

If MPI_SHM_GBPOOLSIZE proves to be too small and a collective operation happens to be unable to borrow memory from this pool, the operation will revert to slower algorithms. Hence, under certain circumstances, performance optimization could dictate increasing MPI_SHM_GBPOOLSIZE.
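
For example, to double the default general buffer pool, you might set:

% setenv MPI_SHM_GBPOOLSIZE 41943040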


Running Over TCP

TCP ensures reliable dataflow, even over loss-prone networks, by retransmitting data as necessary. When the underlying network loses a lot of data, the rate of retransmission can be very high, and delivered MPI performance will suffer accordingly. Increasing synchronization between senders and receivers by lowering the TCP rendezvous threshold with MPI_TCP_RENDVSIZE might help in certain cases. Generally, increased synchronization hurts performance, but over a loss-prone network it might help mitigate performance degradation.
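
For example, to lower the TCP rendezvous threshold from its default of 49152 bytes on a loss-prone network, you might try (the value is illustrative):

% setenv MPI_TCP_RENDVSIZE 16384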

If the network is not lossy, then lowering the rendezvous threshold would be counterproductive and, indeed, a Sun MPI safeguard might be lifted. For reliable networks, use

% setenv MPI_TCP_SAFEGATHER 0

to speed MPI_Gather() and MPI_Gatherv() performance.


RSM Point-to-Point Message Passing

The RSM protocol has some similarities with the shared-memory protocol, but it also differs substantially, and environment variables are used differently.

The maximum size of a short message is MPI_RSM_SHORTMSGSIZE bytes, with a default value of 3918 bytes. Short RSM messages can span multiple postboxes, but they still do not use any buffers.

The most data that will be sent under any one postbox using buffers for pipelined messages is MPI_RSM_PIPESIZE bytes.

There are MPI_RSM_NUMPOSTBOX postboxes for each RSM connection.

If MPI_RSM_SBPOOLSIZE is unset, then each RSM connection has a buffer pool of MPI_RSM_CPOOLSIZE bytes. If MPI_RSM_SBPOOLSIZE is set, then each process has a pool of buffers that is MPI_RSM_SBPOOLSIZE bytes per remote node for sending messages to processes on the remote node.

Unlike the case of the shared-memory protocol, values of the MPI_RSM_PIPESIZE, MPI_RSM_CPOOLSIZE, and MPI_RSM_SBPOOLSIZE environment variables are merely requests. Values set with the setenv command or printed when MPI_PRINTENV is used might not reflect effective values. In particular, only when connections are actually established are the RSM parameters truly set. Indeed, the effective values could change over the course of program execution if lazy connections are employed.
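
To see the values that were requested (which, as noted, are not necessarily the effective values), you might set the following before launching the job:

% setenv MPI_PRINTENV 1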

Striping refers to passing a message over multiple hardware links to get the speedup of their aggregate bandwidth. The number of hardware links used for a single message is limited to the smallest of these values:

  • MPI_RSM_MAXSTRIPE
  • rsm_maxstripe (if specified by the system administrator in the hpc.conf file)
  • the number of available links

When a connection is established between an MPI process and a remote destination process, the links that will be used for that connection are chosen. A job can use different links for different connections. Thus, even if MPI_RSM_MAXSTRIPE or rsm_maxstripe is set to 1, the overall job could conceivably still benefit from multiple hardware links.
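
For example, to restrict each connection to a single hardware link, you might set:

% setenv MPI_RSM_MAXSTRIPE 1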

Use of rendezvous for RSM messages is controlled with MPI_RSM_RENDVSIZE.

Memory Considerations

Memory is allocated on a node for each remote MPI process that sends messages to it over RSM. If np_local is the number of processes on a particular node, then the memory requirement on the node for RSM message passing from any one remote process is

np_local x ( MPI_RSM_NUMPOSTBOX x 128 + MPI_RSM_CPOOLSIZE )

bytes when MPI_RSM_SBPOOLSIZE is unset, and

np_local x MPI_RSM_NUMPOSTBOX x 128 + MPI_RSM_SBPOOLSIZE

bytes when MPI_RSM_SBPOOLSIZE is set.
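
As a worked example using the defaults in TABLE B-1 (MPI_RSM_NUMPOSTBOX = 128, MPI_RSM_CPOOLSIZE = 256 Kbytes = 262144 bytes), a node running np_local = 4 processes requires

4 x ( 128 x 128 + 262144 ) = 4 x 278528 = 1114112

bytes for each remote process that sends to it over RSM, before the adjustments described below.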

The amount of memory actually allocated might be higher or lower than this requirement.

  • The memory requirement is rounded up to some multiple of 8192 bytes with a minimum of 32768 bytes.
  • This memory is allocated from a 256-Kbyte (262,144-byte) segment.
    • If the memory requirement is greater than 256 Kbytes, then insufficient memory will be allocated.
    • If the memory requirement is less than 256 Kbytes, some allocated memory will go unused. (There is some, but only limited, sharing of segments.)

If less memory is allocated than is required, then requested values of MPI_RSM_CPOOLSIZE or MPI_RSM_SBPOOLSIZE (specified with a setenv command and echoed if MPI_PRINTENV is set) can be reduced at runtime. This can cause the requested value of MPI_RSM_PIPESIZE to be overridden as well.

Each remote MPI process requires its own allocation on the node as described previously.

If multi-way stripes are employed, the memory requirement increases correspondingly.

Performance Considerations

The pipe size should be at most half as big as the connection pool:

2 x MPI_RSM_PIPESIZE ≤ MPI_RSM_CPOOLSIZE

Otherwise, pipelined transfers will proceed slowly. The library adjusts MPI_RSM_PIPESIZE appropriately.
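
For example, a request for a 128-Kbyte pipe size is consistent with the default 256-Kbyte connection pool, since 2 x 131072 = 262144. The value is an illustrative request only, subject to the runtime adjustments described above:

% setenv MPI_RSM_PIPESIZE 131072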

For pipelined messages, a sender must synchronize with its receiver to ensure that remote writes to buffers have completed before postboxes are written. Long pipelined messages can absorb this synchronization cost, but performance for short pipelined messages will suffer. In some cases, increasing the value of MPI_RSM_SHORTMSGSIZE can mitigate this effect.

Restriction

If the short message size is increased, there must be enough postboxes to accommodate the largest size. The first postbox can hold 23 bytes of payload, while each subsequent postbox in a short message can take 63 bytes of payload. Thus, the restriction is

MPI_RSM_SHORTMSGSIZE ≤ 23 + ( MPI_RSM_NUMPOSTBOX - 1 ) x 63
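
For example, with MPI_RSM_NUMPOSTBOX = 15, the postboxes can hold at most 23 + 14 x 63 = 905 bytes of payload, so MPI_RSM_SHORTMSGSIZE cannot usefully be raised above 905 in that configuration.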


Summary Table of Environment Variables

TABLE B-1 Sun MPI Environment Variables

Name                    Units          Range                           Default
---------------------------------------------------------------------------------------------
Informational
MPI_PRINTENV            (None)         0 or 1                          0
MPI_QUIET               (None)         0 or 1                          0
MPI_SHOW_ERRORS         (None)         0 or 1                          0
MPI_SHOW_INTERFACES     (None)         0 - 3                           0

Shared Memory Point-to-Point
MPI_SHM_NUMPOSTBOX      Postboxes      ≥ 1                             16
MPI_SHM_SHORTMSGSIZE    Bytes          Multiple of 64                  256
MPI_SHM_PIPESIZE        Bytes          Multiple of 1024                8192
MPI_SHM_PIPESTART       Bytes          Multiple of 1024                2048
MPI_SHM_CYCLESIZE       Bytes          Multiple of 1024                8192
MPI_SHM_CYCLESTART      Bytes          --                              0x7fffffff (32-bit) or
                                                                       0x7fffffffffffffff (64-bit)*
MPI_SHM_CPOOLSIZE       Bytes          Multiple of 1024                24576, or MPI_SHM_SBPOOLSIZE
                                                                       if that variable is set
MPI_SHM_SBPOOLSIZE      Bytes          Multiple of 1024                (Unset)

Shared Memory Collectives
MPI_SHM_BCASTSIZE       Bytes          Multiple of 128                 32768
MPI_SHM_REDUCESIZE      Bytes          Multiple of 64                  256
MPI_SHM_GBPOOLSIZE      Bytes          > 256                           20971520

TCP
MPI_TCP_CONNTIMEOUT     Seconds        ≥ 0                             600
MPI_TCP_CONNLOOP        Occurrences    ≥ 0                             0
MPI_TCP_SAFEGATHER      (None)         0 or 1                          1

RSM
MPI_RSM_NUMPOSTBOX      Postboxes      1 - 15                          128
MPI_RSM_SHORTMSGSIZE    Bytes          23 - 905                        3918
MPI_RSM_PIPESIZE        Bytes          Multiple of 1024 up to 15360    64 Kbytes
MPI_RSM_CPOOLSIZE       Bytes          Multiple of 1024                256 Kbytes
MPI_RSM_SBPOOLSIZE      Bytes          Multiple of 1024                (Unset)
MPI_RSM_MAXSTRIPE       Links          ≥ 1                             rsm_maxstripe, if set by the
                                                                       system administrator in the
                                                                       hpc.conf file; otherwise 2
MPI_RSM_DISABLED        (None)         0 or 1                          0

One-Sided Communication
MPI_RSM_GETSIZE         Bytes          ≥ 1                             16 Kbytes
MPI_RSM_PUTSIZE         Bytes          ≥ 1                             16 Kbytes
MPI_USE_AGENT_THREAD    (None)         0 or 1                          0

Polling and Flow
MPI_FLOWCONTROL         Messages       ≥ 0                             0
MPI_POLLALL             (None)         0 or 1                          1

Dedicated Performance
MPI_PROCBIND            (None)         0 or 1                          0
MPI_SPIN                (None)         0 or 1                          0

Full vs. Lazy Connections
MPI_FULLCONNINIT        (None)         0 or 1                          0

Eager vs. Rendezvous
MPI_EAGERONLY           (None)         0 or 1                          1
MPI_SHM_RENDVSIZE       Bytes          ≥ 1                             24576
MPI_TCP_RENDVSIZE       Bytes          ≥ 1                             49152
MPI_RSM_RENDVSIZE       Bytes          ≥ 1                             256 Kbytes

Coscheduling
MPI_COSCHED             (None)         0 or 1                          (Unset, or "2")
MPI_SPINDTIMEOUT        Milliseconds   ≥ 0                             1000

Handles
MPI_MAXFHANDLES         Handles        ≥ 1                             1024
MPI_MAXREQHANDLES       Handles        ≥ 1                             1024

* That is, by default there is no cyclic message passing.