CHAPTER 6

Sun S3L Performance Guidelines

This chapter discusses a variety of performance issues as they relate to the use of Sun S3L routines. The discussion is organized into the sections that follow.


Link In the Architecture-Specific Version of Sun Performance Library Software

Sun S3L relies on functions in the Sun Performance Library (libsunperf) software for numerous computations within each process. For best performance, make certain your executable uses the architecture-specific version of the libsunperf library. You can do this by linking your program using the -xarch=v8plusa option for 32-bit executables or the -xarch=v9a option for 64-bit executables.
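For example, a 32-bit executable might be built as shown below. The mpcc driver and the -ls3l, -lmpi, and -xlic_lib=sunperf options are shown only as typical values; use the compiler driver and link options appropriate to your installation. For a 64-bit executable, substitute -xarch=v9a.

% mpcc -xarch=v8plusa -o myprog myprog.c -ls3l -lmpi -xlic_lib=sunperf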

At runtime, the environment variable LD_LIBRARY_PATH can be used to override link-time library choices. Ordinarily, you should not use this environment variable, as it might cause a suboptimal library to be loaded, such as the generic SPARC version rather than one optimized for UltraSPARC processors.

To unset the LD_LIBRARY_PATH environment variable, use

% unsetenv LD_LIBRARY_PATH

To confirm which libraries will be linked at runtime, use

% ldd executable
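For example, the following command checks which copy of libsunperf the runtime linker resolves. The library version and path shown are illustrative only and will vary by installation; a path that includes an architecture-specific directory (such as v8plusa or v9a) indicates that an optimized version will be used.

% ldd a.out | grep sunperf
        libsunperf.so.5 =>       /opt/SUNWspro/lib/v8plusa/libsunperf.so.5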

If Sun S3L detects that a suboptimal version of the libsunperf library was linked in, it will print a warning message. For example:

S3L warning: Using libsunperf not optimized for UltraSPARC.
For better performance, link using -xarch=v8plusa



Note - For single-process jobs, most Sun S3L functions call the corresponding Sun Performance Library interface, if such an interface exists. Thus, the performance of Sun S3L functions on a single process is usually similar to that of single-threaded Sun Performance Library functions.




Legacy Code Containing ScaLAPACK Calls

Many Sun S3L functions support ScaLAPACK application programming interfaces (APIs). This means you can increase the performance of many parallel programs that use ScaLAPACK calls simply by linking in Sun S3L instead of the public domain software.
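For example, a program containing ScaLAPACK calls that was previously linked against the public domain libraries could instead be linked against Sun S3L. The mpf77 driver and the -ls3l, -lmpi, and -xlic_lib=sunperf options are shown only as typical values; use the compiler driver and link options appropriate to your installation.

% mpf77 -xarch=v8plusa -o myprog myprog.f -ls3l -lmpi -xlic_lib=sunperf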

Alternatively, you might convert ScaLAPACK array descriptors to Sun S3L array handles and call Sun S3L routines explicitly. By converting the ScaLAPACK array descriptors to the equivalent Sun S3L array handles, you can visualize distributed ScaLAPACK arrays by using Prism and use the Sun S3L simplified array syntax for programming. You will also have full use of the Sun S3L toolkit functions.

Sun S3L provides the function S3L_from_ScaLAPACK_desc that performs this API conversion for you. Refer to the S3L_from_ScaLAPACK_desc man page for details.


Array Distribution

One of the most significant performance-related factors in Sun S3L programming is the distribution of Sun S3L arrays among MPI processes. Sun S3L arrays are distributed, axis by axis, using mapping schemes that are familiar to users of ScaLAPACK or High Performance Fortran. That is, elements along an axis might have any one of the following mappings:

  • Local - All elements are owned by (that is, local to) the same MPI process.
  • Block - The elements are divided into blocks with, at most, one block per process.
  • Cyclic - The elements are divided into small blocks, which are allocated to processes in a round-robin fashion, cycling over processes repeatedly, as needed.

FIGURE 6-1 illustrates these mappings with examples of a one-dimensional array distributed over four processes.

For multidimensional arrays, mapping is specified separately for each axis, as shown in FIGURE 6-2. This diagram illustrates a two-dimensional array's row and column axes being distributed among four processes. Four examples are shown, using a different combination of the three mapping schemes in each. The value represented in each array element is the rank of the process on which that element resides.

 FIGURE 6-1 Array Distribution Examples for a One-Dimensional Array

Graphic depiction of array distribution examples for a one-dimensional array.

 FIGURE 6-2 Array Distribution Examples for Two-Dimensional Array

Graphic depiction of array distribution examples for a two-dimensional array.
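The three mappings can also be expressed as simple ownership rules. The following sketch, which is a generic illustration and not part of the Sun S3L interface, computes the owning process of each element of a 16-element axis distributed over four processes, reproducing the pattern shown in FIGURE 6-1:

    #include <stdio.h>

    /* Owning process of element i (0-based) along an axis of length n,
       distributed over np processes. */
    static int owner_block(int i, int n, int np)    /* at most one block per process */
    {
        int nb = (n + np - 1) / np;                 /* block size = ceil(n / np) */
        return i / nb;
    }

    static int owner_cyclic(int i, int nb, int np)  /* blocks dealt out round-robin */
    {
        return (i / nb) % np;
    }

    int main(void)
    {
        int n = 16, np = 4, i;
        for (i = 0; i < n; i++)   /* local mapping: every element is on process 0 */
            printf("element %2d   local: 0   block: %d   cyclic (block size 1): %d\n",
                   i, owner_block(i, n, np), owner_cyclic(i, 1, np));
        return 0;
    }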

In certain respects, local distribution is simply a special case of block distribution, which is just a special case of cyclic distribution. Although related, the three distribution methods can have very different effects on both interprocess communication and load balancing among processes. TABLE 6-1 summarizes the relative effects of the three distribution schemes on these performance components.

TABLE 6-1 Amount of Communication and Load Balancing With Local, Block, and Cyclic Distribution

Communication (such as near-neighbor communication):
    Local:  None (optimal)
    Block:  Some
    Cyclic: Most (worst)

Load balancing (such as operations on the left half of a data set):
    Local:  None (worst)
    Block:  Some
    Cyclic: Most (optimal)


The next two sections provide guidelines for when you should use local and cyclic mapping. When none of the conditions described next apply, use block mapping.

When To Use Local Distribution

The chief reason to use local mapping is that it eliminates interprocess communication for operations along the locally mapped axis.

The following are two general classes of situations in which local distribution should be used:

  • Along a single axis - The detailed versions of the Sun S3L FFT, sort, and grade routines manipulate data only along a single, specified axis. When using the following routines, performance is best when the target axis is local.
    • S3L_fft_detailed
    • S3L_sort_detailed_up
    • S3L_sort_detailed_down
    • S3L_grade_detailed_up
    • S3L_grade_detailed_down
  • Operations that use the multiple-instance paradigm - When operating on a full array using a multiple-instance Sun S3L routine, make data axes local and distribute instance axes. Refer to the chapter on multiple instance routines in the Sun S3L Programming Guide.

When To Use Cyclic Distribution

Some algorithms in linear algebra operate on portions of an array that diminish as the computation progresses. Examples within Sun S3L include LU decomposition (the S3L_lu_factor and S3L_lu_solve routines), singular value decomposition (the S3L_gen_svd routine), and the least squares solver (the S3L_gen_lsq routine). For these Sun S3L routines, cyclic distribution of the data axes improves load balancing.

Choosing an Optimal Block Size

When declaring an array, you must specify the size of the block to be used in distributing the array axes. Your choice of block size not only affects load balancing, it also determines a trade-off between concurrency and cache use efficiency.



Note - Concurrency is the measure of how many different subtasks can be performed at a time. Load balancing is the measure of how evenly the work is divided among the processes. Cache use efficiency is a measure of how much work can be done without updating the cache.



Specifying a large block size groups multiple computations together into blocks. This blocking enables various optimizations, such as improved cache reuse and lower MPI latency costs. However, blocking computations also reduces concurrency, which inhibits parallelization.

A block size of 1 maximizes concurrency and provides the best load balancing. However, small block sizes degrade cache use efficiency.

Because the goals of maximizing concurrency and cache use efficiency conflict, you must choose a block size that will produce an optimal balance between them. The following guidelines are intended to help you avoid extreme performance penalties:

  • Use the same block size in all dimensions.
  • Limit the block size so that data does not overflow the L2 (external) cache. Cache sizes vary, but block sizes should typically not go over 100.
  • Use a block size of at least 20 to 24 to allow cache reuse.
  • Scale the block size to the size of the matrix. Keep the block size small relative to the size of the matrix to allow ample concurrency.

There is no simple formula for determining an optimal block size that will cover all combinations of matrices, algorithms, numbers of processes, and other such variables. The best guide is experimentation, while keeping the points just outlined in mind.
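As a starting point only, the guidelines above can be collected into a simple heuristic such as the one sketched below. The scaling factor of 4 and the 24-100 bounds are assumptions drawn from the guidelines in this section, not an S3L-supplied formula; tune them by experiment.

    /* Candidate block size for an N x N matrix distributed over np processes:
       scale with the matrix, then clamp to a cache-friendly range. */
    int candidate_block_size(int n, int np)
    {
        int nb = n / (4 * np);     /* keep blocks small relative to the matrix */
        if (nb < 24)  nb = 24;     /* allow cache reuse */
        if (nb > 100) nb = 100;    /* avoid overflowing the external (L2) cache */
        return nb;
    }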

Illustration of Load Balancing

This section demonstrates the load balancing benefits of cyclic distribution for an algorithm that sums the lower triangle of an array.

It begins by showing how block distribution results in a load imbalance for this algorithm (see FIGURE 6-3). In this example, the array's column axis is block-distributed across processes 0-3. Because process 0 must operate on many more elements than the other processes, the total computational time will be bounded by the time it takes process 0 to complete. The other processes, particularly process 3, will be idle for much of that time.

 FIGURE 6-3 LOCAL,BLOCK Distribution of a 16x16 Array Across 4 Processes

Graphic depiction of local, block distribution for a 16X16 array across 4 processes.

FIGURE 6-4 shows how cyclic distribution of the column axis delivers better load balancing. In this case, the axis is distributed cyclically, using a block size of 1. Although process 0 still has more elements to operate on than the other processes, cyclic distribution significantly reduces its share of the array elements.

 FIGURE 6-4 LOCAL,CYCLIC Distribution of a 16x16 Array Across 4 Processes

Graphic depiction of local, cyclic distribution for a 16X16 array across 4 processes.

The improvement in load balancing is summarized in TABLE 6-2. In particular, note the decrease in the number of elements allocated to process 0, from 54 to 36. Because process 0 still determines the overall computational time, this drop in element count corresponds to a computational speed-up by a factor of 1.5 (that is, 50 percent faster).

TABLE 6-2 Number of Elements the Processes Operate on in FIGURE 6-3 and FIGURE 6-4

                FIGURE 6-3 (BLOCK)    FIGURE 6-4 (CYCLIC)
    Process 0   54                    36
    Process 1   38                    32
    Process 2   22                    28
    Process 3    6                    24
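The counts in TABLE 6-2 can be reproduced with a short calculation: in a 16x16 array, column j of the strictly lower triangle holds 15 - j elements, and each process's total is the sum over the columns it owns. The following generic sketch (not S3L code) performs that sum for both distributions:

    #include <stdio.h>

    int main(void)
    {
        int n = 16, np = 4, block[4] = {0}, cyclic[4] = {0}, j, p;

        for (j = 0; j < n; j++) {
            int elems = n - 1 - j;         /* strictly lower-triangle elements in column j */
            block[j / (n / np)] += elems;  /* block: columns 0-3 -> P0, 4-7 -> P1, ...      */
            cyclic[j % np]      += elems;  /* cyclic, block size 1: columns round-robin     */
        }
        for (p = 0; p < np; p++)
            printf("Process %d:  block %2d   cyclic %2d\n", p, block[p], cyclic[p]);
        return 0;   /* prints 54/38/22/6 (block) and 36/32/28/24 (cyclic) */
    }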



Process Grid Shape

Ordinarily, Sun S3L will map an S3L array onto a process grid whose logical organization is optimal for the operation to be performed. You can assume that, with few exceptions, performance will be best on the default process grid.

However, if you have a clear understanding of how a Sun S3L routine will make use of an array and you want to try to improve the routine's performance beyond that provided by the default process grid, you can explicitly create process grids using the S3L_set_process_grid toolkit function. This toolkit function allows you to control the following process grid characteristics:

  • The grid's rank (number of dimensions)
  • The number of processes along each dimension
  • The order in which processes are organized--column order (the default) or row order
  • The rank sequence to be followed in ordering the processes

For some Sun S3L routines, a process grid's layout can affect both load balancing and the amount of interprocess communication that a given application experiences. For example:

  • A 1 x 1 x 1 x...x NP process grid (where NP = number of processes) makes all but the last array axis local to their respective processes. The last axis is distributed across multiple processes. Interprocess communication is eliminated from every axis but the last. This process grid layout provides a good balance between interprocess communication and optimal load balancing for many algorithms. Except for the axis with the greatest stride, this layout also leaves data in the form expected by a serial Fortran program.
  • Use a square process grid for algorithms that benefit from cyclic distributions. This process grid will promote better load balancing, which is usually the primary reason for choosing cyclic distribution.

Note that these generalizations can, in some situations, be outweighed by other parameters that also affect performance. If you choose to create a nondefault process grid, you are most likely to arrive at an optimal layout through experimentation, using the guidelines described here as a starting point.


Runtime Mapping to Cluster

The runtime mapping of a process grid to nodes in a cluster can also influence the performance of Sun S3L routines. Communication within a multidimensional process grid generally occurs along a column axis or along a row axis. Thus, you should map all the processes in a process grid column (or row) onto the same node so that the majority of the communication takes place within the node.

Runtime mapping of process grids is performed in two stages:

  • The multidimensional process grid is mapped to one-dimensional MPI ranks within the MPI_COMM_WORLD communicator. By default, Sun S3L uses column-major ordering. See FIGURE 6-5 for an example of column-major ordering of a 4x3 process grid. FIGURE 6-5 also shows row-major ordering of the same process grid.
  • MPI ranks are mapped to the nodes within the cluster by the CRE (or other) resource manager. This topic is discussed in greater detail in Chapter 8.

 FIGURE 6-5 Examples of Column- and Row-Major Ordering for a 4x3 Process Grid

Graphic depiction of column- and row-major ordering for a 4x3 process grid.
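The two orderings in FIGURE 6-5 correspond to the following rank computations for the process at position (row, col) of an NPR x NPC grid. This is a generic illustration of the orderings, not an S3L interface:

    #include <stdio.h>

    /* Rank of the process at (row, col) in an NPR x NPC process grid. */
    static int rank_column_major(int row, int col, int npr)  /* Sun S3L default */
    {
        return row + col * npr;      /* ranks run down each column first */
    }

    static int rank_row_major(int row, int col, int npc)
    {
        return row * npc + col;      /* ranks run across each row first */
    }

    int main(void)
    {
        int npr = 4, npc = 3, r, c;  /* the 4x3 grid of FIGURE 6-5 */
        for (r = 0; r < npr; r++) {
            for (c = 0; c < npc; c++)
                printf("  %2d/%2d", rank_column_major(r, c, npr), rank_row_major(r, c, npc));
            printf("\n");            /* each entry: column-major rank / row-major rank */
        }
        return 0;
    }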

The two mapping stages are illustrated in FIGURE 6-6.

 FIGURE 6-6 Process Grid and Runtime Mapping Phases (Column-Major Process Grid)

Graphic depiction of process grid and runtime mapping phases (column-major process grid).

Neither stage of the mapping, by itself, controls performance. Rather, it is the combination of the two that determines the extent to which communication within the process grid will stay on a node or will be carried out over a network connection, which is an inherently slower path.

Although the ability to control process grid layout and the mapping of process grids to nodes give the programmer considerable flexibility, it is generally sufficient for good performance to:

  • Group consecutive processes so that communication between processes remains within a node as much as possible.
  • Use column-major ordering, which Sun S3L uses by default.


Note - If you do decide to use the S3L_set_process_grid routine--for example, to specify a nondefault process-grid shape--use S3L_MAJOR_COLUMN for the majorness argument. This format will give the process grid column-major ordering. Also, specify 0 for the plist_length argument. This value will ensure that the default rank sequence is used. That is, the process rank sequence will be 0, 1, 2,..., rather than some other sequence. See the S3L_set_process_grid man page for a description of the routine.



For example, assume that 12 MPI processes are organized as a 4x3, column-major process grid. To ensure that communication between processes in the same column remains on a single node, the first four processes must be mapped to one node, the next four processes to one node (possibly the same node as the first four processes), and so forth.

If your runtime manager is CRE, use:

% mprun -np 12 -Z 4 a.out

For LSF, use:

% bsub -I -n 12 -R "span[ptile=4]" a.out

Note that the semantics of the CRE and LSF examples differ slightly. Although both sets of command-line arguments result in all communication within a column being on-node, they differ in the following way:

  • The CRE command allows multiple columns to be mapped to the same node.
  • The LSF command allows no more than one column per node.

Chapter 8 contains a fuller discussion of runtime mapping.


Use Shared Memory to Lower Communication Costs

Yet another way of reducing communication costs is to run on a single SMP node and allocate Sun S3L data arrays in shared memory. This method allows some Sun S3L routines to operate on data in place. Such memory allocation must be performed with the S3L_declare or S3L_declare_detailed routine.

When declaring an array that will reside in shared memory, you need to specify how the array will be allocated. Specify the method of allocation with the atype argument. TABLE 6-3 lists the two atype values that are valid for declaring an array for shared memory and the underlying mechanism that is used for each.

TABLE 6-3 Using S3L_declare or S3L_declare_detailed Routines to Allocate Arrays in Shared Memory

S3L_USE_MMAP (underlying mechanism: mmap(2))
    Specify this value when memory resources are shared with other processes.

S3L_USE_SHMGET (underlying mechanism: System V shmget(2))
    Specify this value only when there will be little risk of depriving other processes of physical memory.



Smaller Data Types Imply Less Memory Traffic

Smaller data types have higher ratios of floating-point operations to memory traffic, and so generally provide better performance. For example, 4-byte floating-point elements are likely to perform better than double-precision 8-byte elements. Similarly, single-precision complex will generally perform better than double-precision complex.


Performance Notes for Specific Routines

This section contains performance-related information about individual Sun S3L routines. TABLE 6-4 summarizes some recommendations.

Symbols used in TABLE 6-4 are defined in FIGURE 6-7:

 FIGURE 6-7 Symbols Used in the Summary of Performance Guidelines

Graphic table of definitions of symbols used in the summary of performance guidelines.

TABLE 6-4 Summary of Performance Guidelines for Specific Routines

S3L_mat_mult
    Operation count:       2 N^3 (real); 8 N^3 (complex)
    Optimal distribution:  Same block size for both axes
    Optimal process grid:  Square
    SHM optimizations:     No

S3L_matvec_sparse
    Operation count:       2 N Nnonzero (real); 8 N Nnonzero (complex)
    Optimal distribution:  N/A
    Optimal process grid:  N/A
    SHM optimizations:     Yes

S3L_lu_factor
    Operation count:       2 N^3/3 (real); 8 N^3/3 (complex)
    Optimal distribution:  Block-cyclic; same NB for both axes; NB = 24 or 48
    Optimal process grid:  1xNP (small N); square (big N)
    SHM optimizations:     No

S3L_fft, S3L_ifft
    Operation count:       5 Nelem log2(Nelem)
    Optimal distribution:  Block (also see S3L_trans)
    Optimal process grid:  1x1x1x ... xNP
    SHM optimizations:     Yes

S3L_rc_fft, S3L_cr_fft
    Operation count:       5 (Nelem/2) log2(Nelem/2)
    Optimal distribution:  Block (also see S3L_trans)
    Optimal process grid:  1x1x1x ... xNP
    SHM optimizations:     Yes

S3L_fft_detailed
    Operation count:       5 Nelem log2(N)
    Optimal distribution:  Target axis local
    Optimal process grid:  N/A
    SHM optimizations:     N/A

S3L_gen_band_factor, S3L_gen_trid_factor
    Operation count:       (Iterative)
    Optimal distribution:  Block
    Optimal process grid:  1xNP
    SHM optimizations:     No

S3L_sym_eigen
    Operation count:       (Iterative)
    Optimal distribution:  Block; same NB for both axes
    Optimal process grid:  NPRxNPC, where NPR < NPC
    SHM optimizations:     No

S3L_rand_fib
    Operation count:       N/A
    Optimal distribution:  N/A
    Optimal process grid:  N/A
    SHM optimizations:     No

S3L_rand_lcg
    Operation count:       N/A
    Optimal distribution:  Block
    Optimal process grid:  1x1x1x ... xNP
    SHM optimizations:     No

S3L_gen_lsq
    Operation count:       4 N^3/3 + 2 N^2 Nrhs
    Optimal distribution:  Block-cyclic; same NB for both axes
    Optimal process grid:  Square
    SHM optimizations:     No

S3L_gen_svd
    Operation count:       O(N^3) (iterative)
    Optimal distribution:  Block-cyclic; same NB for both axes
    Optimal process grid:  Square
    SHM optimizations:     No

S3L_sort, S3L_sort_up, S3L_sort_down, S3L_grade_up, S3L_grade_down
    Operation count:       N/A
    Optimal distribution:  Block
    Optimal process grid:  1x1x1x ... xNP
    SHM optimizations:     No

S3L_sort_detailed_up, S3L_sort_detailed_down, S3L_grade_detailed_up, S3L_grade_detailed_down
    Operation count:       N/A
    Optimal distribution:  Target axis local
    Optimal process grid:  N/A
    SHM optimizations:     No

S3L_trans
    Operation count:       N/A
    Optimal distribution:  Block
    Optimal process grid:  1x1x1x ... xNP; NP = power of 2
    SHM optimizations:     Yes


The operation count expressions shown in TABLE 6-4 provide a yardstick by which a given routine's performance can be evaluated. They can also be used to predict how runtimes are likely to scale with problem size.

For example, assume a matrix multiply yields 350 MFLOPS on a 250-MHz UltraSPARC processor, which has a peak performance of 500 MFLOPS. The floating-point efficiency is then 70 percent, which can be evaluated for acceptability.

Floating-point efficiency is only an approximate guideline for determining an operation's level of performance. It cannot exceed 100 percent, but it might legitimately be much lower under various conditions, such as when operations require extensive memory references or when there is an imbalance between floating-point multiplies and adds. Often, bandwidth to local memory is the limiting factor. For iterative algorithms, the operation count is not fixed.
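For example, the efficiency calculation described above can be carried out directly from the 2 N^3 operation count for a real matrix multiply given in TABLE 6-4. The problem size and elapsed time below are illustrative values only:

    #include <stdio.h>

    int main(void)
    {
        double n = 1000.0, seconds = 5.7;   /* illustrative problem size and runtime */
        double peak_mflops = 500.0;         /* for example, a 250-MHz UltraSPARC processor */
        double mflops = 2.0 * n * n * n / seconds / 1.0e6;

        printf("%.0f MFLOPS, %.0f%% of peak\n", mflops, 100.0 * mflops / peak_mflops);
        return 0;                           /* roughly 351 MFLOPS, about 70% of peak */
    }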

The S3L_mat_mult Routine

The S3L_mat_mult routine computes the product of two matrices. It is most efficient when:

  • The array is distributed to a large number of processes organized in a square process grid.
  • The same block size is used for both axes.

If it is not possible to provide these conditions for a matrix multiply, ensure that the corresponding axes of the two factors are distributed consistently. For example, for a matrix multiply of size (m,n) = (m,k) x (k,n), use the same block size for the second axis of the first factor and the first axis of the second factor (represented by k in each case).

The S3L_matvec_sparse Routine

Sun S3L employs its own heuristics for distributing sparse matrices over MPI processes. Consequently, you do not need to consider array distribution or process grid layout for the S3L_matvec_sparse routine.

Shared-memory optimizations are performed only when the sparse matrix is in the S3L_SPARSE_CSR format and the input and output vectors are both allocated in shared memory.

The S3L_lu_factor Routine

The S3L_lu_factor routine uses a parallel, block-partitioned algorithm derived from the ScaLAPACK implementation. It provides best performance for arrays with cyclic distribution.

The following are useful guidelines to keep in mind when choosing block sizes for the S3L_lu_factor routine:

  • Use the same block size in both axes.
  • Use a block size in the 24-100 range to promote good cache reuse while preventing cache overflow.
  • Use a smaller block size for smaller matrices or for larger numbers of processes to promote better concurrency.

The S3L_lu_factor routine has special optimizations for double-precision, floating-point matrices. Based on knowledge of the external cache size and other process parameters, it uses a specialized matrix multiply routine to increase overall performance, particularly on large matrices.

These optimizations are available to arrays that meet the following conditions:

  • The array is two-dimensional.
  • It is allocated with the S3L_declare_detailed routine, using S3L_USE_MEMALIGN64 for the atype argument.
  • Its data type is double-precision, floating-point.
  • Both axes have the same block size, which should be 24 or 48.

When deciding on a process grid layout for LU factorization, your choices will involve making a trade-off between load balancing and minimizing communication costs. Pivoting will usually be responsible for most communication. The extreme ends of the trade-off spectrum follow:

  • To minimize the communication cost of pivoting, choose a 1 x NP process grid, where NP is the number of MPI processes.
  • To optimize computational load balancing, choose a nearly square process grid.

Some experimentation will be necessary to arrive at the optimal trade-off for your particular requirements.

The S3L_fft, S3L_ifft, S3L_rc_fft, S3L_cr_fft, and S3L_fft_detailed Routines

Performance is best when the extents of the array can be factored into small, prime factors no larger than 13.
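The following generic sketch (not part of the S3L API) tests whether an extent satisfies that condition, that is, whether it factors entirely into primes no larger than 13:

    #include <stdio.h>

    /* Returns 1 if n has no prime factor larger than 13. */
    static int factors_are_small(long n)
    {
        static const long primes[] = { 2, 3, 5, 7, 11, 13 };
        int i;
        for (i = 0; i < 6; i++)
            while (n % primes[i] == 0)
                n /= primes[i];
        return n == 1;
    }

    int main(void)
    {
        printf("1024: %d   1000: %d   1021: %d\n",
               factors_are_small(1024), factors_are_small(1000), factors_are_small(1021));
        return 0;   /* 1024 = 2^10 and 1000 = 2^3 * 5^3 qualify; 1021 is prime */
    }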

The operation count expressions given in TABLE 6-4 for the FFT family of routines provide a good approximation. However, the actual count will depend to some extent on the radix (factors) used. In particular, for a given problem size, the real-to-complex and complex-to-real FFTs have half the operation counts and half the memory requirements of their complex-to-complex counterparts.

The transformed axis should be local. If a multidimensional transform is desired, make all but the last axis local.

It is likely that the resulting transpose will dominate the computation, at least in a multinode cluster. See The S3L_trans Routine.

The S3L_gen_band_factor, S3L_gen_trid_factor, S3L_gen_band_solve, and S3L_gen_trid_solve Routines

These routines tend to have relatively low communication costs, and so tend to scale well.

For best performance of the factorization routines, make all axes of the array being factored local except the last axis, which should be block distributed.

Conversely, the corresponding solver routines perform best when the first axis of the right-hand side array is block distributed and all other axes are local.

The S3L_sym_eigen Routine

Performance of the S3L_sym_eigen routine is sensitive to interprocess latency.

If both eigenvectors and eigenvalues are computed, execution time might be as much as an order of magnitude longer than if only eigenvalues are computed.

The S3L_rand_fib and S3L_rand_lcg Routines

The S3L_rand_fib and S3L_rand_lcg routines initialize parallel arrays using a Lagged-Fibonacci and a Linear Congruential random number generator, respectively. An array initialized by the Lagged-Fibonacci routine will vary depending on the array distribution. In contrast, array initialization by the Linear Congruential method will produce the same result, regardless of the array's distribution.

Because the Linear Congruential random number generator must ensure that the resulting random numbers do not depend on how the array is distributed, it has the additional task of keeping track of the global indices of the array elements. This extra overhead is minimized when local or block distribution is used and is greatly increased when the array is distributed cyclically. The S3L_rand_lcg routine can be two to three times slower with cyclic distributions than with local or block distributions.

Because the S3L_rand_fib routine fills array elements with random numbers regardless of the elements' global indices, it is significantly faster than the S3L_rand_lcg routine.

The S3L_rand_lcg routine is based on 64-bit strings. This means it performs better on S3L_long_integer data types than on S3L_integer elements.

The S3L_rand_fib routine, on the other hand, is based on 32-bit integers. It generates S3L_integer elements twice as fast as for S3L_long_integer output.

Both algorithms generate floating-point output more slowly than integer output, because they must convert random bit strings into floating-point values. Complex numbers are generated at half the rate of real numbers, because twice as many random values must be generated.

The S3L_gen_lsq Routine

S3L_gen_lsq finds the least squares solution of an overdetermined system. It is implemented with a QR algorithm. The operation count, shown in TABLE 6-4, applies to real, square matrices. For a real, rectangular (M,N) matrix, the operation count scales as:

  • 2 N Nrhs (2M - N) + 2 N^2 (M - N/3), for M >= N
  • 2 N Nrhs (2M - N) + 2 M^2 (N - M/3), for M < N

For complex elements, the operation count is four times as great.
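A minimal sketch that evaluates these expressions for a real M x N system follows; for the square case it reduces to the 4 N^3/3 + 2 N^2 Nrhs count listed in TABLE 6-4:

    #include <stdio.h>

    /* Approximate operation count for a real M x N least squares solve
       with Nrhs right-hand sides, using the expressions above. */
    static double lsq_ops(double m, double n, double nrhs)
    {
        if (m >= n)
            return 2.0 * n * nrhs * (2.0 * m - n) + 2.0 * n * n * (m - n / 3.0);
        else
            return 2.0 * n * nrhs * (2.0 * m - n) + 2.0 * m * m * (n - m / 3.0);
    }

    int main(void)
    {
        printf("%g operations\n", lsq_ops(1000.0, 1000.0, 1.0));  /* square example */
        return 0;
    }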

The S3L_gen_svd Routine

For S3L_gen_svd, the convergence of the iterative algorithm depends on the matrix data. Consequently, the count is not well-defined for this routine. However, the S3L_gen_svd routine does tend to scale as N3.

If the singular vectors are computed, the runtime can be roughly an order of magnitude longer than if only singular values are extracted.

The A, U, and V arrays should all be on the same process grid for best performance.

The S3L_gen_iter_solve Routine

Most of the time spent in this routine is spent in the S3L_matvec_sparse routine.

Overall performance depends on more than just the floating-point rate of that subroutine. It is also significantly influenced by the matrix data and by the choice of solver, preconditioner, initial guess, and convergence criteria.

The S3L_acorr, S3L_conv, and S3L_deconv Routines

The performance of these functions depends on the performance of Sun S3L FFTs and, consequently, on the performance of the Sun S3L transposes.

The Routines S3L_sort, S3L_sort_up, S3L_sort_down, S3L_sort_detailed_up, S3L_sort_detailed_down, S3L_grade_up, S3L_grade_down, S3L_grade_detailed_up, and S3L_grade_detailed_down

These routines do not involve floating-point operations. The operation count can vary greatly, depending on the distribution of keys, but it will typically scale from O(N) to O(N log(N)).

Sorts of 64-bit integers can be slower than sorts of 64-bit floating-point numbers.

The S3L_trans Routine

The S3L_trans routine provides communication support to the Sun FFTs as well as to many other Sun S3L algorithms. Best performance is achieved when axis extents are all multiples of the number of processes.

Sun S3L Toolkit Functions

The Sun S3L Toolkit functions are primarily intended for convenience rather than performance. However, some significant performance variations do occur. For example:

  • The S3L_copy_array routine can be very fast or extremely slow, depending on how well the two arrays are aligned.
  • The S3L_forall routine incurs relatively significant overhead for each element operated on when the function type is S3L_ELEM_FN1 or S3L_INDEX_FN. In contrast, the S3L_ELEM_FNN function type amortizes that overhead over many elemental operations.
  • The S3L_set_array_element, S3L_set_array_element_on_proc, S3L_get_array_element, and S3L_get_array_element_on_proc routines perform very small operations. Consequently, overhead costs are a significant component for these routines (as with the S3L_forall function types S3L_ELEM_FN1 and S3L_INDEX_FN).