CHAPTER 4

Performance Programming

This chapter discusses approaches to consider when you are writing new message-passing programs.

The general rules of good programming apply to any code, serial or parallel. This chapter therefore focuses primarily on optimizing MPI interprocess communications and concludes with an extended example.

When you are working with legacy programs, you need to consider the costs of recoding in relation to the benefits.


General Good Programming

The general rules of good programming apply when your goal is to achieve top performance along with robustness and, perhaps, portability.

Clean Programming

The first rule of good performance programming is to employ "clean" programming. Good performance is more likely to stem from good algorithms than from clever "hacks." While tweaking your code for improved performance might work well on one hardware platform, those very tweaks might be counterproductive when the same code is deployed on another platform. A clean source base is typically more useful than one laden with many small performance tweaks. Ideally, you should emphasize readability and maintenance throughout the code base. Use performance profiling to identify any hot spots, and then do low-level tuning to fix the hot spots.

One way to garner good performance while simplifying source code is to use library routines. Advanced algorithms and techniques are available to users simply by issuing calls to high-performance libraries. In certain cases, calls to routines from one library might be speeded up simply by relinking to a higher-performing library. As examples, the following table shows selected operations and suggests how these operations might be speeded up.

Operations...                       might be speeded up by...
---------------------------------   --------------------------------------------------------
BLAS routines                       linking to Sun Performance Library software
Collective MPI operations           formulating in terms of MPI collectives and using Sun MPI
Certain ScaLAPACK routines          linking to Sun S3L


Optimizing Local Computation

The most dramatic impact on scalability in distributed-memory programs comes from optimizing the data decomposition and communication. Aside from parallelization issues, a great deal of performance enhancement can be achieved by optimizing local (on-node) computation. Common techniques include loop rewriting and cache blocking. Compilers can be leveraged by exploring compilation switches (see Chapter 7).

For the most part, the important topic of optimizing serial computation within a parallel program is omitted here. To learn more about this and other areas of performance optimization, consult Techniques For Optimizing Applications: High Performance Computing, by Rajat Garg and Ilya Shapov, Prentice-Hall, 2001, ISBN 0-13-093476-3. That volume covers serial optimization and various parallelization models. It deals with programming, compilation, and runtime issues and provides numerous concrete examples.


Optimizing MPI Communications

The default behavior of Sun MPI accommodates many programming practices efficiently. Tuning environment variables at runtime can result in even better performance. However, the best performance typically stems from writing better programs. This section describes good programming practices under the following headings:

  • Reducing Message Volume
  • Reducing Serialization
  • Load Balancing
  • Synchronization
  • Buffering
  • Nonblocking Operations
  • Polling
  • Sun MPI Collectives
  • Contiguous Data Types
  • Special Considerations for Message Passing Over TCP

These topics are all interwoven. Clearly, reducing the number and volume of messages can reduce communication overheads, but some such overhead is inherent in parallelizing a serial computation. Serialization is one extreme of load balancing. Load imbalances manifest themselves as performance issues only because of synchronization. Synchronization, in turn, can be mitigated with message buffering, nonblocking operations, or general polling.

Following the general discussion of these issues, this chapter illustrates them in a case study.

Reducing Message Volume

An obvious way to reduce message-passing costs is to reduce the amount of message passing. One method is to reduce the total number of bytes sent among processes. Further, because a latency cost is associated with each message, aggregating short messages can also improve performance.
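As a rough sketch of aggregation (the subroutine and its arguments are illustrative and not part of the examples later in this chapter), two short arrays can be copied into one scratch buffer and sent as a single message, paying the per-message latency only once:

subroutine send_aggregated(a,na,b,nb,idest)
include 'mpif.h'
integer na, nb, idest, ier
real(8) a(na), b(nb)
real(8) packed(na+nb)

! copy both short arrays into one contiguous scratch buffer
packed(1:na)       = a
packed(na+1:na+nb) = b

! one message instead of two, so the latency cost is paid only once
call MPI_Send(packed,na+nb,MPI_REAL8,idest,1,MPI_COMM_WORLD,ier)

end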

Reducing Serialization

Serialization can take many different forms. In multithreaded programming, contention for a lock might induce serialization. In multiprocess programming, serialization might be induced, for example, by I/O operations that are funneled through a particular process, which gathers or scatters data accordingly.

Serialization can also appear as operations that are replicated among all the processes.

Load Balancing

Generally, the impediment to great scalability is not as blatant as serialization, but simply a matter of poor work distribution or load balancing among the processes. A multiprocess job completes only when the process with the most work has finished.

More so than for multithreaded programming, load balancing is an issue in message-passing programming because work distribution or redistribution is expensive in terms of programming and communication costs.

Temporally or spatially localized load imbalances sometimes balance against one another. Imagine, for example, a weather simulation in which simulation of daytime weather typically is more computationally demanding than that of nighttime weather because of expensive radiation calculations. If different processes compute on different geographical domains, then over the course of a simulation day each process should see daytime and nighttime. Such circadian imbalances would average out.

As the degree of synchronization in the simulation is increased, however, the extent to which localized load imbalances degrade overall performance grows. In our weather example, this means that if MPI processes are synchronized many times over the course of a simulation day, then all processes will run at the slower, daytime rate, even if this forces nighttime processes to sit idle at synchronization points.

Synchronization

The cost of interprocess synchronization is often overlooked. Indeed, the cost of interprocess communication is often due not so much to data movement as to synchronization. Further, if processes are highly synchronized, they tend to congest shared resources such as a network interface or SMP backplane, at certain times and leave those resources idle at other times. Sources of synchronization can include:

  • Message-passing protocols that conserve internal buffering. For example, the Sun MPI cyclic and rendezvous message-passing protocols induce extra synchronization between senders and receivers in order to reduce use of buffers. The choice of protocol and the size of internal buffering might be changed at runtime with Sun MPI environment variables, which are discussed in Chapter 8.
  • Data dependencies between processes. For example, a receiver must wait if it issues an MPI_Recv before its partner issues the corresponding MPI_Send.

Typically, you should minimize synchronization for best performance.

If a send can be posted very early and the corresponding receive much later, then there would be no problem with data dependency, because the data would be available before it was needed. If internal system buffering is not provided to hold the in-transit message, however, the completion of the send will in some way become synchronized with the receive. This consideration brings up the topics of buffering and nonblocking operations.

Buffering

In most MPI point-to-point communication, for example, using MPI_Send and MPI_Recv calls, data starts in a user buffer on the sending process and ends up in a user buffer on the receiving process. In transit, that data might also be buffered internally multiple times by an MPI implementation.

There are performance consequences to such buffering: each internal copy adds overhead, but buffering can also decouple a sender from its receiver so that the sender does not have to wait for the receive to be posted.

The MPI standard does not require any particular amount of internal buffering for standard MPI_Send operations. Specifically, the standard warns against issuing too many MPI_Send calls without receiving any messages, as this can lead to deadlock. (See Example 3.9 in the MPI 1.1 standard.) MPI does, however, allow users to provide buffering with MPI_Buffer_attach and then to use such buffering with MPI_Bsend or MPI_Ibsend calls.
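For reference, a minimal sketch of user-provided buffering follows; the buffer size and message parameters are illustrative assumptions, not values from this chapter:

subroutine bsend_sketch(x,n,idest)
include 'mpif.h'
integer n, idest, ier, nold
integer, parameter :: nbufbytes = 10*1024*1024   ! illustrative: 10 Mbytes
character, allocatable :: userbuf(:)
real(8) x(n)

! attach a user-provided buffer for buffered sends
allocate(userbuf(nbufbytes))
call MPI_Buffer_attach(userbuf,nbufbytes,ier)

! the buffered send copies x into userbuf and returns without waiting
! for the matching receive
call MPI_Bsend(x,n,MPI_REAL8,idest,1,MPI_COMM_WORLD,ier)

! detaching blocks until all buffered messages have been delivered
call MPI_Buffer_detach(userbuf,nold,ier)
deallocate(userbuf)

end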

Sun MPI, as a particular implementation of the standard, allows users to increase internal buffering in two ways. One way, of course, is with the standard, portable MPI_Buffer_attach call. Another is with Sun MPI-specific runtime environment variables, as discussed in Chapter 8.

There are several drawbacks to using MPI_Buffer_attach. They stem from the fact that a buffered send copies data out of the user buffer into a hidden buffer and then issues a nonblocking send (like MPI_Isend) without giving the user access to the request handle. Because of this extra copy and the hidden request handle, nonblocking sends (like MPI_Isend) should be used in preference to buffered sends (like MPI_Bsend).

Typically, performance will benefit more if internal buffering is increased by setting Sun MPI environment variables. This is discussed further in Chapter 8.

Sun MPI environment variables might not be a suitable solution in every case. For example, you might want finer control of buffering or a solution that is portable to other systems. (Beware that the MPI standard provides few, if any, performance portability guarantees.) In such cases, it might be preferable to use nonblocking MPI_Isend sends in place of buffered MPI_Bsend calls. The nonblocking calls give finer control over the buffers and better decouple the senders and receivers.

For best results, make sure that enough buffering is available, whether internal or user-provided, to decouple senders and receivers, and prefer nonblocking operations when you need finer control.

Other examples of internal MPI buffering include MPI_Sendrecv_replace calls and unexpected incoming messages (that is, messages for which no receive operation has yet been posted).

Nonblocking Operations

The MPI standard offers blocking and nonblocking operations. For example, MPI_Send is a blocking send. This means that the call will not return until it is safe to reuse the specified send buffer. On the other hand, the call might well return before the message is received by the destination process.

Nonblocking operations enable you to make message passing concurrent with computation. Basically, a nonblocking operation might be initiated with one MPI call (such as MPI_Isend, MPI_Start, MPI_Startall, and so on) and completed with another (such as MPI_Wait, MPI_Waitall, and so on). Still other calls might be used to probe status, for example, MPI_Test.
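As an illustration of the initiate/complete pattern, the following sketch uses persistent requests (MPI_Send_init, MPI_Recv_init, MPI_Startall). The ring partners iup and idown and the two-buffer layout are assumptions borrowed from the case study later in this chapter:

subroutine ring_persistent(lda,n,x,iup,idown,niter)
include 'mpif.h'
integer lda, n, iup, idown, niter, iter, ier
real(8) x(lda,*)
integer requests(2)

! set up the send and receive once; the same buffers are reused each pass
call MPI_Send_init(x(:,1),n,MPI_REAL8,iup  ,1,MPI_COMM_WORLD,requests(1),ier)
call MPI_Recv_init(x(:,2),n,MPI_REAL8,idown,1,MPI_COMM_WORLD,requests(2),ier)

do iter = 1, niter
  call MPI_Startall(2,requests,ier)                       ! initiate
  ! ... computation that does not touch x(:,1) or x(:,2) goes here ...
  call MPI_Waitall(2,requests,MPI_STATUSES_IGNORE,ier)    ! complete
end do

call MPI_Request_free(requests(1),ier)
call MPI_Request_free(requests(2),ier)

end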

Nonblocking operations might entail a few extra overheads. Indeed, use of a standard MPI_Send and MPI_Recv provides the best performance with Sun MPI for highly synchronized processes, such as in simple ping-pong tests. Generally, however, the benefits of nonblocking operations far outweigh their performance shortcomings.

The way these benefits arise, however, can be subtle. Although nonblocking communications are logically concurrent with user computation, they do not necessarily proceed in parallel: at any instant, a CPU is typically advancing either the computation or the communication, but not both. How performance benefits derive from nonblocking communications is discussed further in the case study at the end of this chapter.

To maximize the benefits of nonblocking operations, post them as early as possible and complete them as late as possible.

Polling

Polling is the activity in which a process searches incoming connections for arriving messages whenever the user code enters an MPI call. Two extremes are:

  • General polling, in which a process searches all connections, regardless of which operations the user code has posted
  • Directed polling, in which a process searches only the connections involved in the operations the user code has specified

General polling helps deplete system buffers, easing congestion and allowing senders to make the most progress. On the other hand, it requires receiver buffering of unexpected messages and imposes extra overhead for searching connections that might never have any data.

Directed polling focuses MPI on user-specified tasks and keeps MPI from rebuffering or otherwise unnecessarily handling messages the user code has not yet asked to receive. On the other hand, it does not aggressively deplete buffers, so improperly written codes might deadlock.

Thus, user code is most efficient when it is written so that directed polling suffices: that is, when receives are posted before the corresponding messages arrive, so that messages are not left unexpected in system buffers.

Sun MPI Collectives

Collective operations, such as MPI_Barrier(), MPI_Bcast(), MPI_Reduce(), MPI_Alltoall(), and the like, are highly optimized in Sun MPI for UltraSPARC servers and clusters of servers. User codes can benefit from collective operations both to simplify programming and to gain these optimizations automatically.

For Sun MPI programming, you need only keep in mind that the collective operations are optimized and that you should use them. The details of the optimizations used in Sun MPI to implement collective operations are available in Appendix A.
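For example, a global sum that might otherwise be hand-coded as a ring or tree of point-to-point messages can be expressed with a single collective call; a minimal sketch (the subroutine name is illustrative):

subroutine global_sum(localsum,globalsum)
include 'mpif.h'
real(8) localsum, globalsum
integer ier

! every process contributes localsum; every process receives the total,
! with Sun MPI choosing an optimized reduction algorithm internally
call MPI_Allreduce(localsum,globalsum,1,MPI_REAL8,MPI_SUM,MPI_COMM_WORLD,ier)

end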

Contiguous Data Types

While interprocess data movement is considered expensive, data movement within a process can also be costly. For example, interprocess data movement via shared memory consists of two bulk transfers. If the data must additionally be packed at one end and unpacked at the other, those steps entail just as much data motion again, and that extra movement is even more expensive because it is slow and fragmented.

You should therefore consider organizing your data so that messages are contiguous, sending them without intermediate packing and unpacking wherever possible, or describing any strided layout to MPI with a derived data type, as in the sketch below.
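The following sketch (array shape and destination rank are illustrative) contrasts sending a contiguous column of a Fortran array, which needs no packing, with sending a strided row, whose layout can be described to MPI with a derived data type instead of being packed by hand:

subroutine send_column_and_row(a,m,n,idest)
include 'mpif.h'
integer m, n, idest, ier, rowtype
real(8) a(m,n)

! a column of a Fortran array is contiguous: no packing is needed
call MPI_Send(a(:,1),m,MPI_REAL8,idest,1,MPI_COMM_WORLD,ier)

! a row is strided: describe the stride with a derived data type rather
! than copying the row into a scratch buffer
call MPI_Type_vector(n,1,m,MPI_REAL8,rowtype,ier)
call MPI_Type_commit(rowtype,ier)
call MPI_Send(a(1,1),1,rowtype,idest,2,MPI_COMM_WORLD,ier)
call MPI_Type_free(rowtype,ier)

end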

Special Considerations for Message Passing Over TCP

Sun MPI supports message passing over any network that runs TCP. While TCP offers reliable data flow, it does so by retransmitting data as necessary. If the underlying network becomes loss-prone under load, TCP might retransmit a runaway volume of data, causing MPI performance to suffer.

For this reason, applications running over TCP might benefit from throttled communications. The following suggestions are likely to increase synchronization and degrade performance. Nonetheless, they might be needed when running over TCP if the underlying network is losing too much data.

To throttle data transfers, you might increase the synchronization between senders and receivers, for example, by using synchronous sends (MPI_Ssend) in place of standard sends, as in the sketch below.
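As one hedged illustration (synchronous sends are a standard MPI mechanism, though this particular recipe is an assumption rather than a recommendation taken from this chapter), a standard send can be replaced with a synchronous send so that the sender cannot run far ahead of the receiver:

subroutine throttled_send(x,n,idest)
include 'mpif.h'
integer n, idest, ier
real(8) x(n)

! MPI_Ssend does not complete until the matching receive has started,
! which limits the volume of unacknowledged data on the network
call MPI_Ssend(x,n,MPI_REAL8,idest,1,MPI_COMM_WORLD,ier)

end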


MPI Communications Case Study

The following examples illustrate many of the issues raised in the preceding conceptual discussion. These examples use a highly artificial test code to look at the performance of MPI communication and its interplay with computational load imbalance.

The main lessons to draw from this series of examples are how internal buffering, polling, and synchronization interact; how nonblocking operations can decouple processes in the presence of computational load imbalance; and how Sun MPI environment variables can be used to influence these behaviors at run time.

Algorithms Used

In these examples, each MPI process computes on some data and then circulates that data among the other processes in a ring pattern. That is, 0 sends to 1, 1 sends to 2, 2 sends to 3, and so on, with process np-1 sending to 0. An artificial load imbalance is induced in the computation.

The basic algorithm of this series of examples is illustrated in FIGURE 4-1 for four processes.

 FIGURE 4-1 Basic Ring Sending Algorithm


In this figure, time advances to the right, and the processes are labeled vertically from 0 to 3. Processes compute, then pass data in a ring upward. There are temporal and spatial load imbalances, but in the end all processes have the same amount of work on average.

Even though the load imbalance in the basic algorithm averages out over time, performance degradation results if the communication operations are synchronized, as illustrated in FIGURE 4-2.

 FIGURE 4-2 Basic Ring Sending Algorithm With Synchronization


Several variations on this basic algorithm are used in the following timing experiments, each of which is accompanied by a brief description.

Algorithm 1

  • Phase 1: compute on buffer
  • Phase 2: send data with blocking MPI_Send
  • Phase 3: receive data with blocking MPI_Recv

This algorithm causes all processes to send data and then all to receive data. Because no process is receiving while they are all sending, the MPI_Send calls must buffer the data to prevent deadlock. Relying on such buffering violates the MPI 1.1 standard (see Example 3.9, along with the associated discussion, in the MPI 1.1 standard).

Nevertheless, Sun MPI can progress messages and avoid deadlock if the messages are sufficiently small or if the Sun MPI environment variable MPI_POLLALL is set to 1, which is the default. (See Appendix A for information on progressing messages.)

CODE EXAMPLE 4-1 Algorithm 1 Implemented in Fortran 90
subroutine compute(lda,n,x,ncompute,me,iup,idown,sum)
include 'mpif.h'
real(8) x(lda,*), sum
 
! phase 1
call compute_kernel(ncompute,n,x(:,1),sum)
 
! phase 2
call MPI_Send(x(:,1),n,MPI_REAL8,iup  ,1,MPI_COMM_WORLD,ier)
 
! phase 3
call MPI_Recv(x(:,1),n,MPI_REAL8,idown,1,MPI_COMM_WORLD,MPI_STATUS_IGNORE,ier)
 
end

The amount of computation to be performed in any iteration on any MPI process is dictated by the variable ncompute and is passed in by the parent subroutine. The array x is made multidimensional because subsequent algorithms will use multibuffering.

Algorithm 2

  • Phase 1: compute on buffer
  • Phase 2: perform communication with MPI_Sendrecv_replace

This algorithm should not deadlock on any compliant MPI implementation, but it entails unneeded overheads for extra buffering and data copying to "replace" the user data.

CODE EXAMPLE 4-2 Algorithm 2 Implemented in Fortran 90
subroutine compute(lda,n,x,ncompute,me,iup,idown,sum)
include 'mpif.h'
real(8) x(lda,*), sum
 
! phase 1
call compute_kernel(ncompute,n,x(:,1),sum)
 
! phase 2
call MPI_Sendrecv_replace(x(:,1),n,MPI_REAL8,iup  ,1, &
                                             idown,1, &
            MPI_COMM_WORLD,MPI_STATUS_IGNORE,ier)
 
end

Algorithm 3

  • Phase 1: compute on buffer
  • Phase 2: perform communication with MPI_Sendrecv

This algorithm removes the "replace" overheads by introducing double buffering.

CODE EXAMPLE 4-3 Algorithm 3 Implemented in Fortran 90
subroutine compute(lda,n,x,ncompute,me,iup,idown,sum)
include 'mpif.h'
 
real(8) :: x(lda,*), sum
integer ibufsend, ibufrecv
save    ibufsend, ibufrecv
data    ibufsend, ibufrecv / 1, 2 /
 
! phase 1
call compute_kernel(ncompute,n,x(:,ibufsend),sum)
 
! phase 2
call MPI_Sendrecv(x(:,ibufsend),n,MPI_REAL8,iup  ,1, &
                  x(:,ibufrecv),n,MPI_REAL8,idown,1, &
                  MPI_COMM_WORLD,MPI_STATUS_IGNORE,ier)
 
! toggle buffers
ibufsend = 3 - ibufsend
ibufrecv = 3 - ibufrecv
 
end

Algorithm 4

  • Phase 1: nonblocking communication with MPI_Isend and MPI_Irecv
  • Phase 2: compute on buffer
  • Phase 3: MPI_Waitall to complete send and receive operations

This algorithm attempts to overlap communication with computation. That is, nonblocking communication is initiated before the computation is started, then the computation is performed, and finally the communication is completed. It employs three buffers: one for data being sent, a second for data being received, and a third for data used in computation.

Sun MPI does not actually overlap communication and computation, as the ensuing discussion makes clear. The real benefit of this approach is in decoupling processes for the case of computational load imbalance.

CODE EXAMPLE 4-4 Algorithm 4 Implemented in Fortran 90
subroutine compute(lda,n,x,ncompute,me,iup,idown,sum)
 
include 'mpif.h'
 
real(8) :: x(lda,*), sum
integer requests(2)
 
integer ibufsend, ibufrecv, ibufcomp
save    ibufsend, ibufrecv, ibufcomp
data    ibufsend, ibufrecv, ibufcomp / 1, 2, 3 /
 
! phase 1
call MPI_Isend &
  (x(:,ibufsend),n,MPI_REAL8,iup ,1,MPI_COMM_WORLD,requests(1),ier)
call MPI_Irecv &
  (x(:,ibufrecv),n,MPI_REAL8,idown,1,MPI_COMM_WORLD,requests(2),ier)
 
! phase 2
call compute_kernel(ncompute,n,x(:,ibufcomp),sum)
 
! phase 3
call MPI_Waitall(2,requests,MPI_STATUSES_IGNORE,ier)
 
! toggle buffers
ibuffree = ibufsend  ! send buffer is now free
ibufsend = ibufcomp  ! next, send what you just computed on
ibufcomp = ibufrecv  ! next, compute on what you just received
ibufrecv = ibuffree  ! use the free buffer to receive next
 
end

Algorithm 5

  • Phase 1: nonblocking communication with MPI_Isend and MPI_Irecv
  • Phase 2: compute on buffer, with frequent calls to MPI_Testall
  • Phase 3: MPI_Waitall to complete send and receive operations

This algorithm is like Algorithm 4 except that it includes calls to MPI_Testall during computation. (The purpose of this is explained in Use of MPI_Testall.)

CODE EXAMPLE 4-5 Algorithm 5 Implemented in Fortran 90
subroutine compute(lda,n,x,ncompute,me,iup,idown,sum)
include 'mpif.h'
 
real(8) :: x(lda,*), sum
integer requests(2)
logical flag
integer ibufsend, ibufrecv, ibufcomp
save    ibufsend, ibufrecv, ibufcomp
data    ibufsend, ibufrecv, ibufcomp / 1, 2, 3 /
integer nblock0
save    nblock0
data    nblock0 / -1 /
character*20 nblock0_input
integer(4) iargc
 
! determine nblock0 first time through
if ( nblock0 .eq. -1 ) then
  nblock0 = 1024                   ! try 1024
  if ( iargc() .ge. 3 ) then       ! 3rd command-line argument overrides
    call getarg(3,nblock0_input)
    read(nblock0_input,*) nblock0
  endif
endif
 
! phase 1
call MPI_Isend &
  (x(:,ibufsend),n,MPI_REAL8,iup ,1,MPI_COMM_WORLD,requests(1),ier)
call MPI_Irecv &
  (x(:,ibufrecv),n,MPI_REAL8,idown,1,MPI_COMM_WORLD,requests(2),ier)
 
! phase 2
do i = 1, n, nblock0
  nblock = min(nblock0,n-i+1)
  call compute_kernel(ncompute,nblock,x(i,ibufcomp),sum)
  call MPI_Testall(2,requests,flag,MPI_STATUSES_IGNORE,ier)
end do
 
! phase 3
call MPI_Waitall(2,requests,MPI_STATUSES_IGNORE,ier)
 
! toggle buffers
ibuffree = ibufsend  ! send buffer is now free
ibufsend = ibufcomp  ! next, send what you just computed on
ibufcomp = ibufrecv  ! next, compute on what you just received
ibufrecv = ibuffree  ! use the free buffer to receive next
 
end

Making a Complete Program

To make a functioning example, one of the preceding subroutines should be combined with other source code and compiled using

% mpf90 -fast source-files -lmpi

CODE EXAMPLE 4-6 shows a sample Fortran 90 program that serves as the driver.

CODE EXAMPLE 4-6 Driver Program for Example Algorithms
program driver
 
include 'mpif.h'
character*20 arg
integer(4), parameter :: maxn = 500000
integer(4), parameter :: maxnbuffers = 3
integer(4) iargc
real(8) x(maxn,maxnbuffers), t
 
! initialize the buffers
 
x = 0.d0
 
! get the number of compute iterations from the command line
 
ncompute_A = 0
if ( iargc() .ge. 1 ) then
  call getarg(1,arg)
  read(arg,*) ncompute_A
endif
 
ncompute_B = ncompute_A
if ( iargc() .ge. 2 ) then
  call getarg(2,arg)
  read(arg,*) ncompute_B
endif
 
! initialize usual MPI stuff
 
call MPI_Init(ier)
call MPI_Comm_rank(MPI_COMM_WORLD, me, ier)
call MPI_Comm_size(MPI_COMM_WORLD, np, ier)
if ( mod(np,2) .ne. 0 ) then
  print *, "expect even number of processes"
  call MPI_Finalize(ier)
  stop
endif 

! pump a lot of data through to warm up buffers
call warm_up_buffers(maxn,x)
 
! iterations
 
if ( me .eq. 0 ) write(6,'("      bytes/msg            sec/iter           Mbyte/sec")')
niter = 10
n = 0
do while ( n .le. maxn )
 
  ! make measurement and report
  call sub(maxn,n,x,niter,ncompute_A,ncompute_B,t)
  t = t / niter
  if ( me .eq. 0 ) write(6,'(i15,2f20.6)') 8 * n, t, 16.d-6 * n / t
 
  ! bump up n
  n = max( nint(1.2 * n), n + 1 )
 
enddo
 
! shut down
 
call MPI_Finalize(ier)
 
end
 
subroutine sub(lda,n,x,niter,ncompute_A,ncompute_B,t)
 
include 'mpif.h'
real(8) :: x(lda,*), sum, t
 
! figure basic MPI parameters
call MPI_Comm_rank(MPI_COMM_WORLD, me, ier)
call MPI_Comm_size(MPI_COMM_WORLD, np, ier)
 
! initialize sum
sum = 0.d0
 
! figure nearest neighbors
idown = me - 1
iup   = me + 1
if ( idown .lt.  0 ) idown = np - 1
if ( iup   .ge. np ) iup   = 0
 
! start timer
call MPI_Barrier(MPI_COMM_WORLD,ier)
t = MPI_Wtime()

! loop
do iter = 1, niter
 
  ! induce some load imbalance
  if ( iand(iter+me,1) .eq. 0 ) then
    ncompute = ncompute_A
  else
    ncompute = ncompute_B
  endif
 
  ! computation (includes communication)
  call compute(lda,n,x,ncompute,me,iup,idown,sum)
 
enddo
 
! stop timer
call MPI_Barrier(MPI_COMM_WORLD,ier)
t = MPI_Wtime() - t
 
! dummy check to keep compiler from optimizing all "computation" away
if ( abs(sum) .lt. -1.d0 ) print *, "failed dummy check"
 
end
 
 
subroutine compute_kernel(ncomplexity,n,x,sum)
real(8) x(n), sum, t
if ( ncomplexity .eq. 0 ) return
 
 
! sweep over all data
do i = 1, n
 
  ! some elemental operation of particular complexity
  t = 1.d0
  do iloop = 1, ncomplexity
    t = t * x(i)
  enddo
  x(i) = t
  sum = sum + t
 
enddo
 
end

subroutine warm_up_buffers(n,x)
include 'mpif.h'
real(8) x(n,*), t
 
! pump a lot of data through to warm up buffers
! (ideally, use the same traffic pattern as in rest of code)
! (all this wouldn't be necessary if MPI_WARMUP were supported)
 
! usual MPI stuff
call MPI_Comm_rank(MPI_COMM_WORLD, me, ier)
call MPI_Comm_size(MPI_COMM_WORLD, np, ier)
 
! figure nearest neighbors
idown = me - 1
iup   = me + 1
if ( idown .lt.  0 ) idown = np - 1
if ( iup   .ge. np ) iup   = 0
 
! figure number of iterations
nMB   = 500                ! about 500 Mbytes, enough to run through all buffers
niter = nMB * 1024 * 1024  ! convert to bytes
niter = niter / ( 8 * n )  ! convert to number of iterations
 
! iterate
if ( me .eq. 0 ) write(6,'("               Mbyte           Mbyte/sec")')
if ( me .eq. 0 ) write(6,'("            (to date)        (this round)")')
do i = 1, niter
  call MPI_Barrier(MPI_COMM_WORLD,ier)
  t = MPI_Wtime()
  call MPI_Sendrecv(x(:,1),n,MPI_REAL8,iup  ,1, &
                    x(:,2),n,MPI_REAL8,idown,1, &
                    MPI_COMM_WORLD,MPI_STATUS_IGNORE,ier)
  call MPI_Barrier(MPI_COMM_WORLD,ier)
  call MPI_Barrier(MPI_COMM_WORLD,ier)
  t = MPI_Wtime() - t
  if ( me .eq. 0 ) write(6,'(2f20.6)') 8.d-6 * i * n, 8.d-6 * n / t
end do
 
end

Note the following features of the driver program in CODE EXAMPLE 4-6:

  • Load Imbalance. The driver introduces an artificial computational load imbalance. On average, the computational load is balanced, that is, each process performs the same total amount of work as every other process. On any one iteration, however, the processes will have different amounts of work. In particular, on one iteration, every other process will do less work and the remaining processes will do more work. The processes switch roles on every iteration.
  • Multiple Buffering. Some of the algorithms use multiple buffering. To keep the subroutine interfaces all the same, all the code examples support multiple buffers, even for the algorithms that do not use the additional buffers.
  • Bandwidth Reporting. The code reports a Mbyte/sec bandwidth, but this figure also includes time for computation and is not, strictly speaking, just a measurement of communication performance.
  • Buffer Warmup. The subroutine warm_up_buffers passes a series of messages to make sure that MPI internal buffers are touched and ready for fast reuse. Otherwise, spurious performance effects can result when particular buffers are used for the first time.

Timing Experiments With the Algorithms

You can construct a functioning test code by choosing one of the preceding algorithms and then compiling and linking it together with the driver code.

This section shows sample results for the various algorithms running as 4 MPI processes on a Sun Fire 6800 server with 900-MHz CPUs and an 8-Mbyte L2 cache. The command line for program execution is:

% mprun -np 4 a.out

Baseline Results

FIGURE 4-3 shows bandwidth as a function of message size for Algorithms 1 and 2.

 FIGURE 4-3 Bandwidth as a function of message size for Algorithms 1 and 2.


Algorithm 2 proves to be slower than Algorithm 1. This is because MPI_Sendrecv_replace entails extra copying to "replace" data into a single buffer. Further, Sun MPI has special optimizations that cut particular overheads for standard MPI_Send and MPI_Recv calls, which is especially noticeable at small message sizes.

For Algorithm 1, the program reports about 700 Mbyte/sec bandwidth for reasonably long messages. Above roughly 24 Kbytes, however, messages complete only because of general polling, and the performance impact of processing unexpected messages can be seen in the figure.

Directed Polling

Now, let us rerun this experiment with directed polling. This is effected by turning general polling off:

% setenv MPI_POLLALL 0
% mprun -np 4 a.out
% unsetenv MPI_POLLALL

It is generally good practice to unset environment variables after each experiment so that settings do not persist inadvertently into subsequent experiments.

FIGURE 4-4 shows the resulting bandwidth as a function of message size.

 FIGURE 4-4 Bandwidth as a function of message size with directed polling.


Although it is difficult to make a direct comparison with the previous figure, it is clear that directed polling has improved the bandwidth values across the range of message sizes. This is because directed polling leads to more efficient message passing: time is not used up in searching all connections needlessly.

The highest bandwidth is delivered by Algorithm 1, but it deadlocks when the message size reaches 24 Kbytes. At this point, the standard send MPI_Send is no longer simply depositing its message into internal buffers and returning. Instead, the receiver is expected to start reading the message out of the buffers before the sender can continue. With general polling (see Baseline Results), processes drained the buffers even before receives were posted.

Algorithm 2 also benefits in performance from directed polling, and it provides an MPI-compliant way of passing the messages. That is, it proceeds deadlock-free even as the messages are made very large. Nevertheless, due to extra internal buffering and copying to effect the "replace" behavior of the MPI_Sendrecv_replace operation, this algorithm has the worst performance of the four.

Algorithm 3 employs MPI_Sendrecv and double buffering to eliminate the extra internal buffering and copying. Algorithm 4 employs nonblocking operations. Both are fast and avoid deadlock.

Now, let us examine how Algorithm 4, employing nonblocking operations, fares once processes have drifted out of tight synchronization because of computational load imbalances.

Here, we run:

% setenv MPI_POLLALL 0
% mprun -np 4 a.out 1 200
% unsetenv MPI_POLLALL

The driver routine shown earlier (CODE EXAMPLE 4-6) picks up the command-line arguments (1 200) to induce an artificial load imbalance among the MPI processes.

FIGURE 4-5 shows bandwidth as a function of message size when nonblocking operations are used.

 FIGURE 4-5 Bandwidth as a function of message size with nonblocking operations.


While Algorithm 3 (MPI_Sendrecv) is comparable to Algorithm 4 (MPI_Isend, MPI_Irecv, MPI_Waitall) for synchronized processes, the nonblocking operations in Algorithm 4 offer the potential to decouple the processes and improve performance when there is a computational load imbalance.

The reported bandwidths are substantially decreased because they now include non-negligible computation times.

The internal buffering of Sun MPI becomes congested at the longest message sizes, however, making the two algorithms perform equally. This behavior sets in at 24 Kbytes.

Increasing Sun MPI Internal Buffering

Twice now, we have seen that the internal buffering becomes congested at 24 Kbytes. This leads to deadlock in the case of Algorithm 1 and directed polling. It also leads to synchronization of processes with Algorithm 4, even though that algorithm employs nonblocking operations.

Note that Sun MPI has access to multiple protocol modules (PMs) to send messages over different hardware substrates. For example, two processes on the same SMP node exchange messages via the SHM (shared-memory) PM. If the two processes were on different nodes of a cluster interconnected by some commodity network, they would exchange messages via the TCP (standard Transmission Control Protocol) PM. The 24-Kbyte limit we are seeing is specific to the SHM PM.

These 24-Kbyte SHM PM buffers can become congested. Buffer sizes might be controlled with the environment variable MPI_SHM_CPOOLSIZE, whose default value is 24576, or with MPI_SHM_SBPOOLSIZE. More information about cyclic message passing and SHM PM buffers might be found in Appendix A; more information about the associated environment variables might be found in Appendix B.

Now, we rerun with:

% setenv MPI_POLLALL 0
% setenv MPI_SHM_SBPOOLSIZE 20000000
% setenv MPI_SHM_NUMPOSTBOX 2048
% mprun -np 4 a.out
% mprun -np 4 a.out 1 200
% unsetenv MPI_POLLALL
% unsetenv MPI_SHM_SBPOOLSIZE
% unsetenv MPI_SHM_NUMPOSTBOX

Here, we have set the environment variables not only to employ directed polling, but also to increase internal buffering.

FIGURE 4-6 shows bandwidth as a function of message size for the case of highly synchronized processes (the command line specifying a.out without additional arguments).

 FIGURE 4-6 Bandwidth as a function of message size with highly synchronized processes.


Having increased the buffering, we see that our illegal Algorithm 1 no longer deadlocks and is once again the performance leader. Strictly speaking, of course, it is still not MPI-compliant and its use remains nonrobust. Algorithm 2, using MPI_Sendrecv_replace, remains the slowest due to extra buffering and copying.

FIGURE 4-7 shows bandwidth as a function of message size for the case of load imbalance (the command line specifying a.out 1 200). Once a computational load imbalance is introduced, Algorithm 4, employing nonblocking operations, becomes the clear leader. All other algorithms are characterized by imbalanced processes advancing in lockstep.

 FIGURE 4-7 Bandwidth as a function of message size with load imbalance.


Use of MPI_Testall

In some cases, for whatever reason, it is not possible to increase Sun MPI internal buffering sufficiently to hold all in-transit messages. For such cases, we can use Algorithm 5, which employs MPI_Testall calls to progress these messages. (For more information on progressing messages, see Progress Engine in Appendix A.)

Here, we run with:

% setenv MPI_POLLALL 0
% mprun -np 4 a.out 1 200     128
% mprun -np 4 a.out 1 200     256
% mprun -np 4 a.out 1 200     384
% mprun -np 4 a.out 1 200     512
% mprun -np 4 a.out 1 200    1024
% mprun -np 4 a.out 1 200    2048
% mprun -np 4 a.out 1 200    4096
% mprun -np 4 a.out 1 200    8192
% mprun -np 4 a.out 1 200   10240
% mprun -np 4 a.out 1 200   12288
% mprun -np 4 a.out 1 200   16384
% unsetenv MPI_POLLALL

The third command-line argument to a.out specifies, in some way, the amount of computation to be performed between MPI_Testall calls.

This is a slightly unorthodox use of MPI_Testall. The standard use of MPI_Test and its variants is to test whether specified messages have completed. The use of MPI_Testall here, however, is to progress all in-transit messages, whether specified in the call or not.

FIGURE 4-8 plots bandwidth against message size for the various frequencies of MPI_Testall calls.

 FIGURE 4-8 Bandwidth as a function of message size with MPI_Testall calls.


There are too many curves to distinguish individually, but the point is clear. While performance used to dip at 24 Kbytes, introducing MPI_Testall calls in concert with nonblocking message-passing calls has maintained good throughput, even as messages grow to be orders of magnitude beyond the size of the internal buffering. Below the 24-Kbyte mark, of course, the MPI_Testall calls are not needed and do not impact performance materially.

Another view of the data is offered in FIGURE 4-9. This figure plots bandwidth as a function of the amount of computation performed between MPI_Testall calls.

 FIGURE 4-9 Bandwidth as a function of computation between MPI_Testall calls.


For clarity, FIGURE 4-9 shows only two message sizes: 64 Kbyte and 1 Mbyte. We see that if too little computation is performed, then slight inefficiencies are introduced. More drastic is what happens when too much computation is attempted between MPI_Testall calls. Then, messages are not progressed sufficiently and long wait times lead to degraded performance.

To generalize, if MPI_Testall is called too often, most calls find nothing to progress and simply add overhead. So, the amount of computation performed between MPI_Testall calls should be large compared with the cost of an ineffective MPI_Testall call, which is on the order of roughly 1 microsecond.

When MPI_Testall is called too seldom, interprocess synchronization can induce a severe degradation in performance. As a rule of thumb, the time it takes to fill or deplete MPI buffers sets the upper bound on how much computation to perform between MPI_Testall calls. These buffers are typically on the order of tens of Kbytes, and memory bandwidths are on the order of hundreds of Mbyte/sec, so the upper bound is some fraction of a millisecond (for example, 24 Kbytes moved at 300 Mbyte/sec takes roughly 0.1 millisecond).

These are rough rules of thumb, but they indicate that there is a wide range of nearly optimal frequencies for MPI_Testall calls.

Nevertheless, such techniques can be difficult to employ in practice. Challenges include restructuring communication and computation so that nonblocking sends and receives are posted as early as possible and completed as late as possible, and injecting progress-inducing calls effectively.