CHAPTER 3

Choosing Your Programming Model and Hardware

This chapter outlines some points to consider in planning how to go about developing or porting an HPC application. It provides a high-level overview of how to compare and assess programming models for use on Sun parallel hardware.


Starting Out

The first step in developing a high-performance application is to settle upon your basic approach. To make the best choice among the Sun HPC tools and techniques, you need to understand the programming models that are available and weigh them against your target hardware, your existing code base, and your development resources.

There are two common models of parallel programming in high performance computing: shared-memory programming and distributed-memory programming. These models are supported on Sun hardware with Sun compilers and with Sun HPC ClusterTools software, respectively. Issues in choosing between the models might include existing source-code base, available software development resources, desired scalability, and target hardware.

The basic Sun HPC ClusterTools programming model is distributed-memory message passing. Such a program executes as a collection of Solaris processes with separate address spaces. The processes compute independently, each on its own local data, and share data only through explicit calls to message-passing routines.

You might choose to use this model regardless of your target hardware. That is, you might run a message-passing program on an SMP cluster or entirely on a single, large SMP server. Or, you might forgo ClusterTools software and use only multithreaded parallelism, running the program on a single SMP server. It is also possible to combine the two approaches.


Programming Models

A high-performance application will almost certainly be parallel, but parallelism comes in many forms. The form you choose depends partly on your target hardware (server versus cluster) and partly on the time you have to invest.

Sun provides development tools for several widely used HPC programming models. These products are categorized by memory model: Sun Forte Developer tools for shared-memory programming and Sun HPC ClusterTools for distributed-memory programming.

Shared memory means that all parts of a program can access one another's data freely. This might be because they share a common address space, which is the case with multiple threads of control within a single process. Or, it might result from employing a software mechanism for sharing memory.

Parallelism that is generated by the Sun compilers or is programmed using Solaris or POSIX threads requires a shared address space running on a single Solaris image.

Distributed memory means that multiple processes exchange data only through explicit message passing.

Message-passing programs, where the programmer inserts calls to the MPI library, are the only programs that can run across a cluster of computers. They can also, of course, run on a single computer or even on a serial processor.

TABLE 3-1 summarizes these two product suites.

TABLE 3-1 Comparison of Sun Compiler Suite and Sun HPC ClusterTools Software

                          Sun Forte Developer Suite or       Sun HPC ClusterTools Suite
                          Sun ONE Studio, Compiler
                          Collection

Target hardware           Any Sun workstation or SMP         Any Sun workstation, SMP, or cluster

Memory model              Shared memory                      Distributed memory

Runtime resource manager  Solaris Operating Environment      CRE (Cluster Runtime Environment)
                                                             or third-party product

Parallel execution        Multithreaded                      Multiprocess with message passing


Thus, available hardware does not necessarily dictate programming model. A message-passing program can run on any configuration, and a multithreaded program can run on a parallel server (SMP). The only constraint is that a program without message passing cannot run on a cluster.

The choice of programming model, therefore, usually depends more on software preferences and available development time. Only when your performance goals demand the combined resources of a cluster of servers is the message-passing model necessarily required.

A closer look at the differences between the shared-memory and distributed-memory models as they pertain to parallelism reveals some other factors in the choice. The differences are summarized in TABLE 3-2.

TABLE 3-2 Comparison of Shared-Memory and Distributed-Memory Parallelism

                                Shared Memory                       Distributed Memory

Parallelization unit            Loop                                Data structure

Compiler-generated parallelism  Available in Fortran 77, Fortran    No established solution; options
                                90, and C via compiler options,     include HPF, split-C, UPC, OpenMP
                                directives/pragmas, and OpenMP      compiled for distributed memory,
                                                                    Co-Array Fortran, and various
                                                                    research projects. None of these
                                                                    are part of the Sun HPC
                                                                    ClusterTools software suite.

Explicit (hand-coded)           C/C++ and threads (Solaris or       Calls to MPI library routines from
parallelism                     POSIX)                              Fortran 77, Fortran 90, C, or C++




Note - Nonuniform memory architecture (NUMA) is starting to blur the lines between shared- and distributed-memory architectures. That is, the architecture functions as shared memory, but typically the difference in cost between local and remote memory accesses is so great that it might be desirable to manage data locality explicitly. One way to do this is to use message passing.



Even without a detailed look, it is clear that the shared-memory model makes more parallelism available with a smaller investment of effort.

To illustrate the difference, consider a simple program that adds the values of an array (a global sum). In serial Fortran, the code is:

 REAL A(N), X
 X = 0.
 DO I = 1, N
    X = X + A(I)
 END DO

Compiler-generated parallelism requires little change. In fact, the compiler might well parallelize this simple example automatically. At most, the programmer might need to add a single directive:

        REAL A(N), X
        X = 0.
 
  C$OMP PARALLEL DO REDUCTION(+:X)
 
        DO I = 1, N
           X = X + A(I)
        END DO

To perform this operation with an MPI program, the programmer needs to parallelize the data structure as well as the computational loop. The program would look like this:

      INCLUDE 'mpif.h'
      REAL A(NLOCAL), X, XTMP
      INTEGER I, IERR

      XTMP = 0.
      DO I = 1, NLOCAL
         XTMP = XTMP + A(I)
      END DO
      CALL MPI_ALLREDUCE
     &     (XTMP, X, 1, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, IERR)

When this program executes, each process can access only its own (local) share of the data array. Explicit message passing is used to combine the results of the multiple concurrent processes.
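
For context, the following sketch shows one way the fragment above might appear in a complete program, with the MPI startup and shutdown calls included. The array size NLOCAL, the initialization of A, and the printing by process 0 are illustrative assumptions, not part of the example above.

      PROGRAM GSUM
      INCLUDE 'mpif.h'
      INTEGER NLOCAL
      PARAMETER (NLOCAL = 1000)
      REAL A(NLOCAL), X, XTMP
      INTEGER I, IERR, MYRANK

C     Start MPI; every process executes this same program.
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYRANK, IERR)

C     Fill the local portion of the array (trivially, with ones).
      DO I = 1, NLOCAL
         A(I) = 1.0
      END DO

C     Sum the local data.
      XTMP = 0.
      DO I = 1, NLOCAL
         XTMP = XTMP + A(I)
      END DO

C     Combine the partial sums; every process receives the sum in X.
      CALL MPI_ALLREDUCE
     &     (XTMP, X, 1, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, IERR)

      IF (MYRANK .EQ. 0) PRINT *, 'Global sum = ', X

      CALL MPI_FINALIZE(IERR)
      END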

Clearly, message passing requires more programming effort than shared-memory parallel programming. But this is only one of several factors to consider in choosing a programming model. The trade-off for the increased effort can be a significant increase in performance and scalability.

In choosing your programming model, consider the following factors:

  • If you are updating an existing code, what programming model does it use? In some cases, it is reasonable to migrate from one model to another, but this is rarely easy. For example, to go from shared memory to distributed memory, you must parallelize the data structures, and that redistribution typically affects the entire source code.
  • What time investment are you willing to make? Compiler-based multithreading (using Forte Developer tools) might allow you to port or develop a program in less time than explicit message passing would require.
  • What is your performance requirement? Is it within or beyond the computing capability associated with a single, uniform memory? Sun SMP servers can be very large (up to 106 processors and 576 Gbytes of memory in the current generation), so a single server satisfies many requirements. Beyond that capability, a cluster, and thus distributed-memory programming, will be required.
  • Is your performance requirement (including problem size) likely to increase in the future? If so, it might be worth choosing the message-passing model even if a single server meets your current needs. You can then migrate easily to a cluster in the future. In the meantime, the application might run faster than a shared-memory program on a single SMP because of the MPI discipline of enforcing data locality.

Mixing models is generally possible, but not common.


Scalability

A part of setting your performance goals is to consider how your application will scale.

The primary purpose of message-passing programming is to introduce explicit data decomposition and communication into an application, so that it will scale to higher levels of performance with increased resources. The appeal of a cluster is that it increases the range of scalability: a potentially limitless amount of processing power might be applied to complex problems and huge data sets.

The degree of scalability you can realistically expect is a function of the algorithm, the target hardware, and certain laws of scaling itself.

Amdahl's Law

Unfortunately, decomposing a problem among more and more processors ultimately reaches a point of diminishing returns. This idea is expressed in a formula known as Amdahl's Law.[1] Amdahl's Law assumes (quite reasonably) that a task has only some fraction f that is parallelizable, while the rest of the task is inherently serial. As the number of processors NP is increased, the execution time T for the task decreases as

T = (1-f) + f / NP 

For example, consider the case in which 90 percent of the workload can be parallelized. That is, f = 0.90. The speedup as a function of the number of processors is shown in TABLE 3-3.

TABLE 3-3 Speedup with Number of Processors

Processors (NP)    Runtime (T)    Speedup (1/T)    Efficiency

 1                 1.000          1.0              100%
 2                 0.550          1.8               91%
 3                 0.400          2.5               83%
 4                 0.325          3.1               77%
 6                 0.250          4.0               67%
 8                 0.213          4.7               59%
16                 0.156          6.4               40%
32                 0.128          7.8               24%
64                 0.114          8.8               14%


As the parallelizable part of the task is more and more subdivided, the non-parallel 10 percent of the program (in this example) begins to dominate. The maximum speedup achievable is only 10-fold, and the program can actually use only about three or four processors efficiently.
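
As a cross-check on these figures, the following small sketch (an illustration added here, not part of the original example) evaluates the Amdahl's Law expression for f = 0.90 and prints the runtime, speedup, and efficiency for each processor count in TABLE 3-3.

      PROGRAM AMDAHL
      REAL F, T, SPEEDUP, EFF
      INTEGER NPLIST(9), NP, K
      DATA NPLIST / 1, 2, 3, 4, 6, 8, 16, 32, 64 /

C     Fraction of the work that can be parallelized.
      F = 0.90

      DO K = 1, 9
         NP = NPLIST(K)
C        Amdahl's Law: serial part plus parallel part divided by NP.
         T = (1.0 - F) + F / REAL(NP)
         SPEEDUP = 1.0 / T
         EFF = SPEEDUP / REAL(NP)
         PRINT *, NP, T, SPEEDUP, EFF
      END DO
      END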

Keep Amdahl's Law in mind when you target a performance level or run prototypes on smaller sets of CPUs than your production target. In the preceding example, if you had started measuring scalability on only two processors, the 1.8-fold speedup would have seemed admirable, but it is actually an indication that scalability beyond that might be quite limited.

In another respect, the scalability story is even worse than Amdahl's Law suggests. As the number of processors increases, so does the overhead of parallelization. Such overhead might include communication costs or interprocessor synchronization. So, observation will typically show that adding more processors will ultimately cause not just diminishing returns but negative returns: eventually, execution time might increase with added resources.

Still, the news is not all bad. With the high-speed interconnects within and between nodes and with the programming techniques described in this manual, your application might well achieve high, and perhaps near linear, speedups for some number of processors. And, in certain situations, you might even achieve superlinear scalability, because adding processors to a problem also provides a greater aggregate cache.

Scaling Laws of Algorithms

Amdahl's Law assumes that the work done by a program is either serial or parallelizable. In fact, an important factor for distributed-memory programming that Amdahl's Law neglects is communication costs. Communication costs increase as the problem size increases, although their overall impact depends on how this term scales vis-a-vis the computational workload.

When the local portion (the subgrid) of a decomposed data set is sufficiently large, local computation can dominate the runtime and amortize the cost of interprocess communication. TABLE 3-4 shows examples of how computation and communication scale for various algorithms. In the table, L is the linear extent of a subgrid while N is the linear extent of the global array.

TABLE 3-4 Scaling of Computation and Communication Times for Selected Algorithms

Algorithm                Communication Type    Communication Count    Computation Count

2-dimensional stencil    nearest neighbor      L                      L^2
3-dimensional stencil    nearest neighbor      L^2                    L^3
matrix multiply          nearest neighbor      N^2                    N^3
multidimensional FFT     all-to-all            N                      N log(N)


With a sufficiently large subgrid, the relative cost of communication can be lowered for most algorithms.
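
To make this amortization concrete, the following sketch (an illustration, not drawn from the text above) prints how the communication-to-computation ratio of a 3-dimensional stencil falls off as 1/L as the subgrid edge L grows.

      PROGRAM SUBGRD
      INTEGER L
      REAL COMM, COMP

C     For a 3-dimensional stencil, communication scales with the
C     subgrid surface (L**2) and computation with its volume (L**3),
C     so the communication-to-computation ratio falls as 1/L.
      DO L = 10, 100, 10
         COMM = REAL(L)**2
         COMP = REAL(L)**3
         PRINT *, L, COMM / COMP
      END DO
      END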

The actual speed-up curve depends also on cluster interconnect speed. If a problem involves many interprocess data transfers over a relatively slow network interconnect, the increased communication costs of a high process count might exceed the performance benefits of parallelization. In such cases, performance might be better with fewer processes collocated on a single SMP. With a faster interconnect, on the other hand, you might see even superlinear scalability with increased process counts because of the larger cache sizes available.


Characterizing Platforms

To set reasonable performance goals, and perhaps to choose among available sets of computing resources, you need to be able to assess the performance characteristics of hardware platforms.

The most basic picture of message-passing performance is built on two parameters: latency and bandwidth. These parameters are commonly cited for point-to-point message passing, that is, simple sends and receives.

  • Latency is the time required to send a null-length message.
  • Bandwidth is the rate at which very long messages are sent.

In this somewhat simplified model, the time required for passing a message between two processes is

time = latency + message-size / bandwidth 

Obviously, short messages are latency-bound and long messages are bandwidth-bound. The crossover message size between the two is given as

crossover-size = latency x bandwidth 
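
As an illustration, the following sketch (not part of the original text) applies these two formulas to one assumed set of parameters: a latency of 150 microseconds and a bandwidth of 40 Mbyte/sec, roughly the TCP-cluster figures shown later in TABLE 3-5.

      PROGRAM MSGMDL
      REAL LAT, BW, MSGSZ, T, XOVER

C     Assumed point-to-point parameters, expressed in seconds and
C     bytes per second: 150-microsecond latency, 40-Mbyte/sec
C     bandwidth.
      LAT = 150.0E-6
      BW  = 40.0E6

C     Crossover size: messages shorter than this are latency-bound.
      XOVER = LAT * BW
      PRINT *, 'Crossover size (bytes):', XOVER

C     Time to pass a 1-Mbyte message under the simple model.
      MSGSZ = 1.0E6
      T = LAT + MSGSZ / BW
      PRINT *, 'Time for 1-Mbyte message (sec):', T
      END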

Another performance parameter is bisection bandwidth, a measure of the aggregate bandwidth a system can deliver to communication-intensive applications that exhibit little data locality. Bisection bandwidth is not simply related to point-to-point bandwidth, because system performance can degrade under load (many active processes) and because multiple CPUs might be needed to take full advantage of the interconnect.

To suggest orders of magnitude, TABLE 3-5 shows sample values of these parameters for two Sun HPC platforms: a large SMP and a 64-node cluster.

TABLE 3-5 Sample Performance Values for MPI Operations on Two Sun Platforms

Platform                      Latency          Bandwidth    Crossover Size        Bisection Bandwidth
                              (microseconds)   (Mbyte/sec)  = lat x bw (bytes)    (Mbyte/sec)

SMP Enterprise 10000 server   ~ 2              ~ 200        ~ 400                 ~ 2500

Cluster of 64 nodes           ~ 150            ~ 40         ~ 6000                ~ 2000
connected with TCP network


The best performance is likely to come from a single server. With Sun servers, this means up to 64 CPUs in the current generation.

For clusters, these values indicate that the TCP cluster is latency-bound. A smaller cluster using a faster interconnect would be less so. On the other hand, many nodes are needed to match the bisection bandwidth of a single node.

Basic Hardware Factors

Typically, you work with a fixed set of hardware factors: your system is what it is. From time to time, however, hardware choices might be available, and, in any case, you need to understand the ways in which your system affects program performance. This section describes a number of basic hardware factors.

Processor speed is directly related to the peak floating-point performance a processor can attain. Because an UltraSPARC processor can execute up to one floating-point add and one floating-point multiply per cycle, peak floating-point performance is twice the processor clock speed. For example, a 900-MHz processor has a peak floating-point performance of 1800 Mflops. In practice, achieved floating-point performance is lower, due to imbalances between additions and multiplies and the need to retrieve data from memory rather than cache. Nevertheless, some floating-point-intensive operations, such as the matrix multiplies that underlie much of dense linear algebra, can achieve a high fraction of peak performance, and increasing processor speed typically improves most performance metrics.

Large L2 (or external) caches can also be important for good performance. While it is desirable to keep data accesses confined to L1 (or on-chip) caches, UltraSPARC processors run quite efficiently from L2 caches as well. When you go beyond L2 cache to memory, however, the drop in performance can be significant. Indeed, though Amdahl's Law and other considerations suggest that performance should scale at best linearly with processor counts, many applications see a range of superlinear scaling, because an increase in processor count implies an increase in aggregate L2 cache size.

The number of processors is, of course, a basic factor in performance, because more processors can deliver more performance. It is not always possible to use many processors efficiently, but it is vital that enough processors be present. This means not only one processor per MPI process, but also, ideally, a few extra processors per node to handle system daemons and other services.

System speed is an important determinant of performance for memory-access-bound applications. For example, a code that often goes out of cache might well perform better on 300-MHz processors with a 100-MHz system clock than on 333-MHz processors with an 83-MHz system clock. Similarly, the speedup from 900-MHz processors to 1200-MHz processors, both with a 150-MHz system clock, is likely to be less than the 4/3 factor suggested by the processor clock rates, because memory runs at the same speed in both cases.

Memory latency is influenced not only by memory clock speed, but also by system architecture. As a rule, as the maximum size of an architecture expands, memory latency goes up. Hence, applications or workloads that do not require much interprocess communication might well perform better on a cluster of 4-CPU workgroup servers than on a 64-CPU Enterprise 10000 server.

Memory bandwidth is the rate at which large amounts of data can be moved between CPU and memory. It is related not only to cacheline size (amount of data moved per memory operation) and memory latency (amount of time to move one cacheline), but also to the system's ability to prefetch data and have multiple outstanding memory operations at any time.

Sufficient memory is required to support large applications efficiently. While the Solaris Operating Environment will run applications even when physical memory is insufficient, such reliance on virtual memory degrades performance dramatically.

When many processes run on a single node, the backplane bandwidth of the node becomes an issue. Large Sun servers scale very well with high processor counts, but MPI applications can nonetheless tax backplane capabilities either due to local memory operations (within an MPI process) or due to interprocess communications via shared memory. MPI processes located on the same node exchange data by copying into and then out of shared memory. Each copy entails two memory operations: a load and a store. Thus, a two-sided MPI data transfer undergoes four memory operations.

On a 24-CPU Sun Fire 6800 server, with a 9.6-Gbyte/s backplane, this means that a large MPI all-to-all operation can run at about 2.4 Gbyte/s aggregate bandwidth: the 9.6-Gbyte/s backplane rate divided by the four memory operations incurred for each byte transferred. Here, MPI bandwidth is the rate at which bytes are sent.

For cluster performance, the interconnect between nodes is typically characterized by its latency and bandwidth. Choices include any network that supports TCP, such as HIPPI, ATM, or Gigabit Ethernet, and the Sun Fire Link high-performance cluster interconnect.

Importantly, there will often be wide gaps between the performance specifications of the raw network and what an MPI application will achieve in practice. Notably:

  • Latency might be degraded by software layers, especially operating system interactions in the case of TCP message passing.
  • Bandwidth might be degraded by the network interface (e.g., SBus or PCI).
  • Bandwidth might further be degraded on a loss-prone network if data is dropped under load.

A cluster's bisection bandwidth might be limited by its switch or by the number of network interfaces that tap nodes into the network. In practice, typically the latter is the bottleneck. Thus, increasing the number of nodes might or might not increase bisection bandwidth.

Other Factors

Still other parameters can enter the picture. Seemingly identical systems can exhibit different performance because of the tunable system parameters in /etc/system, the degree of memory interleaving in the system, the mounting of file systems, and other issues that might be best understood with the help of your system administrator. Further, transient conditions, such as the state of the operating system's free-page list or of virtual-to-physical page mappings, can introduce hard-to-understand performance variations.

For the most part, however, the performance of the underlying hardware is not as complicated an issue as this level of detail implies. As long as your performance goals are in line with your hardware's capabilities, the performance achieved will be dictated largely by the application itself. This manual helps you maximize that potential for MPI applications.


1. G. M. Amdahl, "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," in AFIPS Conference Proceedings, vol. 30 (Atlantic City, N.J., Apr. 18-20), AFIPS Press, Reston, Va., 1967, pp. 483-485.