CHAPTER 2
Sun MPI Library
This chapter describes the Sun MPI library:
Note - Sun MPI I/O is described separately, in Chapter 4.
Sun MPI contains four types of libraries, which represent two categories.
For full information about linking to libraries, see Compiling and Linking.
This section gives a brief description of the routines in the Sun MPI library. All the Sun MPI routines are listed in Appendix A with brief descriptions and their C syntax. For detailed descriptions of individual routines, see the man pages or the MPI Standard. The routines are divided into these categories:
Point-to-point communication routines include the basic send and receive routines in both blocking and nonblocking forms and in four modes.
A blocking send blocks until its message buffer can be written with a new message.
A blocking receive blocks until the received message is in the receive buffer.
Nonblocking sends and receives differ from blocking sends and receives in that they return immediately and their completion must be waited or tested for. It is expected that eventually nonblocking send and receive calls will allow the overlap of communication and computation.
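For illustration, the following minimal sketch combines the two forms: it posts a nonblocking receive, issues a blocking standard-mode send, and then waits for the receive to complete. It assumes the job runs with exactly two processes; the message contents are arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int         rank, other, sendval, recvval;
    MPI_Request req;
    MPI_Status  stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other   = 1 - rank;              /* assumes exactly two processes */
    sendval = rank;

    /* Post the receive early; the call returns immediately. */
    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &req);

    /* Blocking standard-mode send of one integer to the partner process. */
    MPI_Send(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD);

    /* ... computation that does not depend on recvval could overlap here ... */

    MPI_Wait(&req, &stat);           /* completion must be waited or tested for */
    printf("Rank %d received %d\n", rank, recvval);

    MPI_Finalize();
    return 0;
}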
The four modes for MPI point-to-point communication are as follows:
Standard MPI communication is two-sided. To complete a transfer of information, both the sending and receiving processes must call appropriate functions. The operation proceeds in two stages, as shown in the following figure.
This form of communication requires regular synchronization between the sending and receiving processes. That synchronization can become complicated if the receiving process does not know which process is sending it the data it needs. One-sided communication was developed to solve this problem and to reduce the amount of synchronization required even when both sending and receiving processes know each other's identities.
In one-sided communication, a process opens a window in memory and exposes it to all processes that belong to a particular communicator, provided they reside on the same node. As long as that window is open, any process in the communicator and node can put data into it and get data out of it.
The put requires no complementary operation from the process that opened the window, and is equivalent to the combination of a send and receive operation in two-sided MPI communication.
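The following sketch illustrates this model with standard MPI-2 one-sided calls; the window size and data values are arbitrary, and it assumes at least two processes running on the same node. (A real program might also allocate the window memory with MPI_Alloc_mem(); see the memory allocation considerations later in this section.)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int     rank, winbuf = 0, value;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process exposes one int as a window. */
    MPI_Win_create(&winbuf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                  /* open an access epoch */
    if (rank == 0) {
        value = 99;
        /* Put one int into rank 1's window; rank 1 makes no matching call. */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);                  /* close the epoch */

    if (rank == 1)
        printf("Rank 1's window now holds %d\n", winbuf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}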
The functions used to implement one-sided MPI communication fall into three categories and are summarized in TABLE 2-1. You can find their definitions in the MPI Standard. Also, Appendix A of this document provides syntax summaries.
Some special considerations apply to allocating memory for one-sided communications. For example:
Several one-sided communications routines support info keys. These keys, and their descriptions, are listed in TABLE 2-2.
Several one-sided communications routines support assertions. These assertions, and their descriptions, are listed in TABLE 2-3.
Collective communication routines are blocking routines that involve all processes in a communicator and, in most cases, an intercommunicator. Collective communication includes broadcasts and scatters, reductions and gathers, all-gathers and all-to-alls, scans, and a synchronizing barrier call.
Many of the collective communication calls have alternative vector forms, with which various amounts of data can be sent to or received from various processes. In addition, MPI_Alltoallw() allows a separate count, displacement, and datatype to be specified for each individual data block.
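For illustration, the following sketch uses MPI_Gatherv(), the vector form of MPI_Gather(), to collect a different number of integers from each process onto the root; the per-process counts chosen here are arbitrary.

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int  rank, size, i, total, sendcount;
    int *sendbuf, *recvbuf = NULL, *recvcounts = NULL, *displs = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes a different amount of data: rank+1 integers. */
    sendcount = rank + 1;
    sendbuf = (int *)malloc(sendcount * sizeof(int));
    for (i = 0; i < sendcount; i++)
        sendbuf[i] = rank;

    if (rank == 0) {                        /* only the root needs these arrays */
        recvcounts = (int *)malloc(size * sizeof(int));
        displs     = (int *)malloc(size * sizeof(int));
        for (i = 0, total = 0; i < size; i++) {
            recvcounts[i] = i + 1;
            displs[i]     = total;
            total        += recvcounts[i];
        }
        recvbuf = (int *)malloc(total * sizeof(int));
    }

    /* The vector form: each process may contribute a different amount. */
    MPI_Gatherv(sendbuf, sendcount, MPI_INT,
                recvbuf, recvcounts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}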
The syntax and semantics of these routines are basically consistent with the point-to-point routines (upon which they are built), but there are restrictions to keep them from becoming too complex:
Several collectives can pass MPI_IN_PLACE as the value of send-buffer at the root. When they do, sendcount and sendtype are ignored, and the contribution of the root to the gathered vector is assumed to be already in the correct location in the receive buffer. The collectives are as follows:
Sometimes within an inner loop of a parallel computation, a communication with the same argument list is executed repeatedly. The communication can be slightly improved by using a persistent communication request, which reduces the overhead for communication between the process and the communication controller. A persistent request can be thought of as a communication port or half-channel.
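For illustration, the following sketch sets up persistent send and receive requests once, outside the loop, and then reuses them with MPI_Startall(); it assumes exactly two processes, and the loop count and message contents are arbitrary.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int         rank, other, i, inbuf, outbuf;
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                      /* assumes exactly two processes */

    /* Build the "half-channels" once, outside the loop. */
    MPI_Send_init(&outbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(&inbuf,  1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);

    for (i = 0; i < 100; i++) {
        outbuf = rank * 1000 + i;
        MPI_Startall(2, reqs);             /* reuse the same argument lists */
        MPI_Waitall(2, reqs, stats);
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}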
Process topologies are associated with communicators; they are optional attributes that can be given to an intracommunicator (not to an intercommunicator).
Recall that processes in a group are ranked from 0 to n-1. This linear ranking often reflects nothing of the logical communication pattern of the processes, which may be, for instance, a two- or three-dimensional grid. The logical communication pattern is referred to as a virtual topology (separate and distinct from any hardware topology). In MPI, two types of virtual topologies can be created: Cartesian (grid) topology and graph topology.
You can use virtual topologies in your programs by taking physical processor organization into account to provide a ranking of processors that optimizes communication.
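For illustration, the following sketch creates a two-dimensional Cartesian topology over however many processes the job is run with; MPI_Dims_create() chooses the grid shape, and MPI_Cart_shift() finds each process's neighbors along one dimension.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int      dims[2] = {0, 0}, periods[2] = {0, 0};
    int      size, rank, coords[2], left, right;
    MPI_Comm grid;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Dims_create(size, 2, dims);        /* choose a balanced 2-D grid */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);

    MPI_Comm_rank(grid, &rank);            /* ranks may be reordered */
    MPI_Cart_coords(grid, rank, 2, coords);
    MPI_Cart_shift(grid, 0, 1, &left, &right);   /* neighbors in dimension 0 */

    printf("Rank %d at (%d,%d); dim-0 neighbors %d and %d\n",
           rank, coords[0], coords[1], left, right);

    MPI_Comm_free(&grid);
    MPI_Finalize();
    return 0;
}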
Name-publishing routines enable client applications to retrieve system-supplied port names. A server calls the MPI_Publish_name() function to publish the name of the service associated with a particular port name. A client application calls the MPI_Lookup_name(), passing it the published service name, and in return gets its associated port name. The server can also call the MPI_Unpublish_name() function to stop publishing names.
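A minimal server-side sketch follows. The service name my_service is invented for illustration; a matching client would call MPI_Lookup_name() and MPI_Comm_connect(), as indicated in the comment.

#include <mpi.h>

int main(int argc, char *argv[])
{
    char     port[MPI_MAX_PORT_NAME];
    MPI_Comm client;

    MPI_Init(&argc, &argv);

    MPI_Open_port(MPI_INFO_NULL, port);                    /* system-supplied port name */
    MPI_Publish_name("my_service", MPI_INFO_NULL, port);   /* "my_service" is a made-up name */

    /* Blocks until a client connects. A client would call:
       MPI_Lookup_name("my_service", MPI_INFO_NULL, port);
       MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server); */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);

    /* ... exchange messages with the client over the intercommunicator ... */

    MPI_Unpublish_name("my_service", MPI_INFO_NULL, port);
    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}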
Sun's implementation of the MPI Standard does not provide a scope for the published names and does not allow a server to publish the same service name twice. The implementation consists of three routines:
Environmental inquiry routines are used for starting up and shutting down, for error handling, and for timing.
Few MPI routines can be called before MPI_Init() or after MPI_Finalize(). Examples include MPI_Initialized() and MPI_Get_version(). MPI_Finalize() can be called only if there are no outstanding communications involving that process.
The set of errors handled by MPI depends upon the implementation. See Appendix C for tables listing the Sun MPI error classes.
Sun's implementation of the MPI Standard provides functions for packing and unpacking messages to be exchanged within an MPI implementation, and in the external32 format used to exchange messages between MPI implementations.
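For illustration, the following sketch packs four integers into the external32 representation using the standard MPI-2 pack routines; the values are arbitrary.

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int      values[4] = {1, 2, 3, 4};
    MPI_Aint bufsize, position = 0;
    void    *packbuf;

    MPI_Init(&argc, &argv);

    /* How many bytes will four ints occupy in the portable external32 format? */
    MPI_Pack_external_size("external32", 4, MPI_INT, &bufsize);
    packbuf = malloc(bufsize);

    /* Pack into external32; the buffer could then be exchanged with
       another MPI implementation. */
    MPI_Pack_external("external32", values, 4, MPI_INT,
                      packbuf, bufsize, &position);

    free(packbuf);
    MPI_Finalize();
    return 0;
}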
A distinguishing feature of the MPI Standard is that it includes a mechanism for creating separate worlds of communication, accomplished through communicators, groups, and contexts.
A communicator specifies a group of processes that will conduct communication operations within a specified context without affecting or being affected by operations occurring in other groups or contexts elsewhere in the program. A communicator also guarantees that, within any group and context, point-to-point and collective communication are isolated from each other.
A group is an ordered collection of processes. Each process has a rank in the group; the rank runs from 0 to n-1. A process can belong to more than one group; its rank in one group has nothing to do with its rank in any other group.
A context is the internal mechanism by which a communicator guarantees safe communication space to the group.
At program startup, two default communicators are defined:
The process group that corresponds to MPI_COMM_WORLD is not predefined, but can be accessed using MPI_Comm_group(). One MPI_COMM_SELF communicator is defined for each process, each of which has rank zero in its own communicator. For many programs, these are the only communicators needed.
Communicators are of two kinds: intracommunicators, which conduct operations within a given group of processes; and intercommunicators, which conduct operations between two groups of processes.
Communicators provide a caching mechanism, which allows an application to attach attributes to communicators. Attributes can be user data or any other kind of information.
New groups and new communicators are constructed from existing ones. Group constructor routines are local, and their execution does not require interprocessor communication. Communicator constructor routines are collective, and their execution can require interprocess communication.
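For illustration, the following sketch uses the collective constructor MPI_Comm_split() to divide MPI_COMM_WORLD into two new intracommunicators, one containing the even-ranked processes and one the odd-ranked processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int      rank, color, subrank;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Collective constructor: split MPI_COMM_WORLD into "even" and "odd" groups. */
    color = rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subcomm);

    MPI_Comm_rank(subcomm, &subrank);      /* rank within the new intracommunicator */
    printf("World rank %d has rank %d in subcommunicator %d\n",
           rank, subrank, color);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}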
You can also create an intercommunicator from two MPI processes that are connected by a socket. Use the MPI_Comm_join() function.
Note - Users who do not need any communicator other than the default MPI_COMM_WORLD communicator--that is, who do not need any sub- or supersets of processes--can plug in MPI_COMM_WORLD wherever a communicator argument is requested. In these circumstances, users can ignore this section and the associated routines. (These routines can be identified from the listing in Appendix A.)
All Sun MPI communication routines have a data type argument. These data types can be primitive, such as integers or floating-point numbers, or they can be user-defined, derived data types that are specified in terms of primitive types.
Derived data types enable users to specify more general, mixed, and noncontiguous communication buffers, such as array sections and structures that contain combinations of primitive data types.
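For illustration, the following sketch builds a derived data type with MPI_Type_vector() that describes one column of a two-dimensional C array (a noncontiguous buffer) and transfers that column between two processes; the array dimensions are arbitrary, and at least two processes are assumed.

#include <mpi.h>

#define ROWS 4
#define COLS 5

int main(int argc, char *argv[])
{
    int          rank, i, j;
    double       a[ROWS][COLS];
    MPI_Datatype column;
    MPI_Status   status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            a[i][j] = rank * 100.0 + i * COLS + j;

    /* One column of a ROWS x COLS array: ROWS blocks of one double,
       with consecutive blocks COLS doubles apart (noncontiguous). */
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)                         /* send column 2 to rank 1 */
        MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&a[0][2], 1, column, 0, 0, MPI_COMM_WORLD, &status);

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}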
Fortran data types are listed in TABLE 2-5. Data types of Fortran used with the -r8 flag are listed in TABLE 2-6. C data types are listed in TABLE 2-7.
These tables also list the pair data types defined for use with the MPI_MAXLOC and MPI_MINLOC reduction operations. In Fortran (TABLE 2-5), MPI_2DOUBLE_PRECISION describes a pair of DOUBLE PRECISION variables, MPI_2REAL a pair of REALs, and MPI_2INTEGER a pair of INTEGERs; TABLE 2-6 gives the corresponding pairs (for example, pairs of REAL*8 and pairs of REAL*4) for programs compiled with the -r8 flag. In C (TABLE 2-7), the pair types are MPI_2INT (pair of int), MPI_FLOAT_INT (float and int), MPI_DOUBLE_INT (double and int), MPI_LONG_DOUBLE_INT (long double and int), MPI_LONG_INT (long and int), and MPI_SHORT_INT (short and int).
If you plan to launch a job that uses the MPI_Comm_spawn or MPI_Comm_spawn_multiple functions, you must first reserve the resources with the resource manager that will run the job. As explained in the mprun.1 man page and the Sun HPC ClusterTools Software User's Guide, you can reserve those resources by adding the -nr flag to the mprun command.
When you launch a job with the mprun command from within a resource manager, the number of processes allocated for that job is stored in the environment variable MPI_UNIVERSE_SIZE. This number is the sum of the processes allocated with the mprun command's -np flag and reserved with its -nr flag.
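The following sketch shows a parent program calling MPI_Comm_spawn(). The executable name worker and the number of spawned processes are hypothetical; the request must fit within the resources reserved with the -nr flag.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int      errcodes[2];
    MPI_Comm children;

    MPI_Init(&argc, &argv);

    /* Spawn two copies of a hypothetical executable named "worker". */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &children, errcodes);

    /* "children" is an intercommunicator joining the parent processes
       and the newly spawned processes. */

    MPI_Finalize();
    return 0;
}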
Although there are about 190 non-I/O routines in the Sun MPI library, you can write programs for a wide range of problems using only six routines, as described in TABLE 2-8.
This set of six routines includes the basic send and receive routines. Programs that depend heavily on collective communication might also include MPI_Bcast() and MPI_Reduce().
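Assuming the six routines are MPI_Init(), MPI_Comm_size(), MPI_Comm_rank(), MPI_Send(), MPI_Recv(), and MPI_Finalize(), the following minimal sketch is a complete program written with only those calls; it assumes at least two processes, and the message value is arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int        rank, size, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 && size > 1) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}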
The functionality of these routines means you can have the benefit of parallel operations without having to learn the whole library at once. As you become more familiar with programming for message passing, you can start learning the more complex and esoteric routines and add them to your programs as needed.
See Appendix A for a complete list of Sun MPI routines.
Sun MPI provides extended Fortran support, as described in Section 10.2 of the MPI-2 Standard. In other words, it provides basic Fortran support, plus additional functions that specifically support Fortran 90:
Basic Fortran support provides the original Fortran bindings and an mpif.h file specified in the MPI-1 Standard. The mpif.h file is valid for both fixed- and free-source forms, as specified in the MPI-2 Standard.
The MPI interface is known to violate the Fortran standard in several ways, but it causes few problems for FORTRAN 77 programs. Violations of the standard can cause more significant problems for Fortran 90 programs, however, if you do not follow the guidelines recommended in the standard. If you are programming in Fortran, and particularly if you are using Fortran 90, you should consult Section 10.2 of the MPI-2 Standard for detailed information about basic Fortran support in an MPI implementation.
The Sun MPI library uses the TCP protocol to communicate over a variety of networks. MPI depends on TCP to ensure reliable, correct data flow. TCP's reliability compensates for unreliability in the underlying network, as the TCP retransmission algorithms handle any segments that are lost or corrupted. In most cases, this works well with good performance characteristics. However, when doing all-to-all and all-to-one communication over certain networks, a large number of TCP segments can be lost, resulting in poor performance.
You can compensate for this diminished performance over TCP in these ways:
When running the MPI library over TCP, nonfatal SIGPIPE signals can be generated. To handle them, the library sets the signal handler for SIGPIPE to ignore, overriding the default setting (terminate the process). In this way the MPI library can recover in certain situations. You should therefore avoid changing the SIGPIPE signal handler.
The SIGPIPEs can occur when a process first starts communicating over TCP. This happens because the MPI library creates connections over TCP only when processes actually communicate with one another. There are some unavoidable conditions where SIGPIPEs can be generated when two processes establish a connection. If you want to avoid any SIGPIPEs, set the environment variable MPI_FULLCONNINIT, which creates all connections during MPI_Init() and avoids any situations that might generate a SIGPIPE. For more information about environment variables, see Appendix B.
When you are linked to one of the thread-safe libraries, Sun MPI calls are thread safe, in accordance with basic tenets of thread safety for MPI mentioned in the MPI-2 specification. As a result:
Use MPI_Init_thread() in place of MPI_Init() to initialize the MPI execution environment with a predetermined level of thread support. Use the MPI_Is_thread_main() function to find out whether a thread is the one that called MPI_Init_thread().
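For illustration, the following sketch requests the highest level of thread support, checks the level the library actually granted, and then verifies that the calling thread is the main thread.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, flag;

    /* Request full multithreaded support; the library reports what it grants. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        printf("Requested MPI_THREAD_MULTIPLE, got level %d\n", provided);

    /* In any thread, check whether it is the one that called MPI_Init_thread(). */
    MPI_Is_thread_main(&flag);
    if (flag)
        printf("This is the main thread\n");

    MPI_Finalize();
    return 0;
}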
Each thread within an MPI process can issue MPI calls; however, threads are not separately addressable. That is, the rank of a send or receive call identifies a process, not a thread, which means that no order is defined for the case in which two threads call MPI_Recv() with the same tag and communicator. Such threads are said to be in conflict.
If threads within the same application post conflicting communication calls, data races will result. You can prevent such data races by using distinct communicators or tags for each thread.
In general, adhere to these guidelines:
The following sections describe more specific guidelines that apply for some routines. They also include some general considerations for collective calls and communicator operations that you should be aware of.
In a program in which two or more threads call one of these routines, you must ensure that they are not waiting for the same request. Similarly, the same request cannot appear in the array of requests of multiple concurrent wait calls.
One thread must not cancel a request while that request is being serviced by another thread.
A call to MPI_Probe() or MPI_Iprobe() from one thread on a given communicator should not have a source rank and tags that match those of any other probes or receives on the same communicator. Otherwise, correct matching of message to probe call might not occur.
Collective calls are matched on a communicator according to the order in which the calls are issued at each processor. All the processes on a given communicator must make the same collective call. You can avoid the effects of this restriction on the threads on a given processor by using a different communicator for each thread.
No process that belongs to the communicator may omit making a particular collective call; that is, none should be left "dangling."
Each of the communicator (or intercommunicator) functions operates simultaneously with each of the noncommunicator functions, regardless of what the parameters are and whether the functions are on the same or different communicators. However, if you are using multiple instances of the same communicator function on the same communicator where all parameters are the same, it cannot be determined which threads belong to which resultant communicator. Therefore, when concurrent threads issue such calls, you must ensure that the calls are synchronized in such a way that threads in separate processes participating in the same communicator operation are grouped together. Do this either by using a different base communicator for each call or by making the calls in single-thread mode before actually using them within the separate threads.
Note also these special situations:
When an error occurs as a result of an MPI call, the handler might not run on the thread that made the call; that is, you cannot assume that the error handler will execute in the local context of that thread. The error handler can be executed by another thread on the same process, distinct from the thread that returns the error code. Therefore, you cannot rely on local variables for error handling in threads; instead, use global variables from the process.
The Sun HPC ClusterTools software suite includes MPProf, a profiling tool to be used with applications that call Sun MPI library routines. When enabled, MPProf collects information about a program's message-passing activities in a set of intermediate files, one file per MPI process. Once the information is collected, you can invoke the MPProf command-line utility mpprof, which generates a report based on the profiling data stored in the intermediate files. You must enable MPProf before starting an MPI program. You do this by setting the environment variable MPI_PROFILE to 1.
If MPProf is enabled, it creates and initializes the intermediate files with header information when the program's MPI_Init call ends. It also creates an index file that contains a map of the intermediate files. mpprof uses this index file to find the intermediate files.
mpprof includes an interface for interacting with loadable protocol modules (loadable PMs). If an MPI program uses a loadable PM, this interface allows MPProf to collect profiling data that is specific to loadable PM activities.
An mpprof report contains the following classes of performance information:
You can control aspects of mpprof behavior with the following environment variables:
The Sun HPC ClusterTools software suite also provides a conversion utility, mpdump, which converts the data from each intermediate file into a raw (unevaluated) user-readable format. You can use the ASCII files generated by mpdump as input to a report generator of your choice.
Once you have enabled MPProf profiling by setting MPI_PROFILE to 1 (and run a job using mprun), you will find a file in your working directory of the form

mpprof.index.rm.jid

where rm identifies the resource manager and jid the job ID. To generate a report, pass this index file to the mpprof command:

% mpprof mpprof.index.rm.jid
Further instructions for using mpprof and mpdump are provided in the Sun HPC ClusterTools Software User's Guide.
Sun MPI meets the profiling interface requirements described in Chapter 8 of the MPI-1 Standard. This means you can write your own profiling library or choose from a number of available profiling libraries, such as those in the multiprocessing environment (MPE) from Argonne National Laboratory. (See MPE: Extensions to the Library for more information.) The User's Guide for mpich, a Portable Implementation of MPI includes more detailed information about using profiling libraries.
FIGURE 2-1 provides a generic illustration of how the software fits together. In this example, the user is linking against a profiling library that collects information on MPI_Send(). No profiling information is being collected for MPI_Recv().
The Sun MPI Fortran and C++ bindings are implemented as wrappers on top of the C bindings. The profiling interface is implemented using weak symbols. This means a profiling library need contain only a profiled version of the C bindings.
C profiling interfaces are needed even for Fortran programs. If profiled versions of both the Fortran and C bindings of an MPI function exist, a Fortran call passes through both layers of profiling.
Be sure you make the library dynamic. A static library can experience the linker problems described in Section 8.4.3 of the MPI 1.1 Standard.
For compiling the program, the user's link line would look like this:
# cc ..... -llibrary-name -lmpi
To clarify the layering of PMPI profiling, users need to understand the role of weak symbols. A weak symbol is one whose definition can be overridden: if the user defines the symbol, the user's definition is used; otherwise, the associated library function is used. The relation of weak symbols to associated functions is illustrated in FIGURE 2-2.
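The sketch below shows how such a wrapper can be written for the MPI_Send() example of FIGURE 2-1: the user-level MPI_Send() collects a call count and then calls PMPI_Send(), which performs the actual communication. The call counter and the report printed from the MPI_Finalize() wrapper are illustrative additions, not part of any particular profiling library.

/* profmpi.c -- a sketch of a profiling wrapper for MPI_Send(). */
#include <stdio.h>
#include <mpi.h>

static int send_calls = 0;

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    send_calls++;                                             /* collect profile data */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);  /* do the real work */
}

int MPI_Finalize(void)
{
    printf("MPI_Send() was called %d times\n", send_calls);
    return PMPI_Finalize();
}

Compiled into a dynamic library (for example, with cc -G) and linked ahead of -lmpi as in the link line shown above, this wrapper intercepts every MPI_Send() call in the application while MPI_Recv() and all other routines pass straight through to the MPI library.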
Although the Sun MPI library does not include or support the multiprocessing environment (MPE) available from Argonne National Laboratory (ANL), it is compatible with MPE. If you would like to use these extensions to the MPI library, see the following instructions for downloading them from ANL and building MPE yourself. Note that this procedure may change if ANL makes changes to MPE.
The MPE software is available from Argonne National Laboratory. The mpe.tar.gz file is about 240 Kbytes.
1. Use ftp to obtain the file.
ftp://ftp.mcs.anl.gov/pub/mpi/misc/mpe.tar.gz
2. Use gunzip and tar to decompress the software.
# gunzip mpe.tar.gz
# tar xvf mpe.tar
3. Change your current working directory to the mpe directory, and execute configure with the arguments shown.
# cd mpe
# configure -cc=cc -fc=f77 -opt=-I/opt/SUNWhpc/include
4. Use make to build the MPE library.

# make
Note - Sun MPI does not include the MPE error handlers. You must call the debug routines MPE_Errors_call_dbx_in_xterm() and MPE_Signals_call_debugger() yourself.
Refer to the User's Guide for mpich, a Portable Implementation of MPI for information on how to use MPE. It is available at the Argonne National Laboratory web site:
Copyright © 2002, Sun Microsystems, Inc. All rights reserved.