C H A P T E R  9

Profiling

An important component of performance tuning is profiling, through which you develop a picture of how well your code is running and what sorts of bottlenecks it might have. Profiling can be a difficult task in the simplest of cases, and the complexities multiply with MPI programs because of their parallelism. Without profiling information, however, code optimization can be wasted effort.

This chapter describes:

This chapter includes a few case studies that examine some of the NAS Parallel Benchmarks 2.3. These are available from the NASA Ames Research Center at

http://www.nas.nasa.gov/Software/NPB/index.html


Note - The runs shown in this chapter were not optimized for the platforms on which they executed.




General Profiling Methodology

It is likely that only a few parts of a program account for most of its runtime. Profiling enables you to identify these "hot spots" and characterize their behavior. You can then focus your optimization efforts on the spots where they will have the most effect.

Profiling can be an experimental, exploratory procedure. So you might find yourself rerunning an experiment frequently. It is a challenge to design such runs so that they complete quickly, while still capturing the performance characteristics you are trying to study. There are several ways you can strip down your runs to achieve this balance, including reducing the data set and performing fewer loop iterations. However, regardless of which streamlining method you employ, keep the following caveats in mind:

Basic Approaches

There are various approaches to profiling Sun HPC ClusterTools programs:

TABLE 9-1 outlines the advantages and disadvantages associated with each of these methods of profiling.

TABLE 9-1 Profiling Alternatives

Method

Advantages

Disadvantages

MPProf

  • Very simple to use.
  • Generates self-explanatory performance report.
  • Gives feedback in simple terms (such as which environment variables to set for better performance).
  • Simplicity.
  • Has no knowledge of user code.
  • Performance suggestions are fallible.

Performance Analyzer Profiling

  • No recompilation or relinking is required.
  • Profiles whole programs: user computation and MPI message passing.
  • Identifies time-consuming routines.
  • With -g recompilation and relinking, gives attribution on a per-source-line basis with negligible loss in optimization level.
  • Shows caller-callee relationships.
  • Uses a style familiar to gprof users.
  • On UltraSPARC III microprocessors, profiles based on hardware counters (floating-point operations, cache misses, and so forth).
  • Timeline functionality.
  • Has very limited knowledge of MPI or message passing.
  • Improper timeline support for multinode runs.

Timers

  • Very versatile.
  • Requires manual instrumentation.
  • Requires that you understand the code.

gprof

  • Familiar tool.
  • Provides an overview of user code.
  • Ignores time spent in MPI.

PMPI

  • You can instrument or modify MPI without modifying source.
  • Allows use of other profiling tools.
  • Profiles MPI usage only.
  • Requires integration effort.

The following are sample scenarios:

Using MPProf, you can easily see if your program is spending a lot of time in MPI. It will suggest environment variable values for you.
Using the Performance Analyzer, you can see which routines consume the most time.
Adding timers and other instrumentation around innermost loops might help you if you already have some idea about your code's performance.
Depending on how those tools were constructed, the MPI profiling interface might allow you to continue using them with Sun HPC ClusterTools programs.


MPProf Profiling Tool

Sun HPC ClusterTools software includes the mpprof profiler, a tool that supports programming with the Sun MPI library.

To collect profiling data on a Sun MPI program, set the MPI_PROFILE environment variable before running the program. To generate a report using this data, invoke the mpprof command-line utility. For example:

% setenv MPI_PROFILE 1
% mprun -np 16 a.out
% mpprof mpprof.index.rm.jid

If profiling has been enabled, Sun MPI creates a binary-format profiling data file for each MPI process. The MPProf utility digests the data from these numerous, intermediate, data files--generating statistical summaries in ASCII text. Wherever possible, MPProf adds direct tuning advice to the statistical summaries of MPI process behavior.

MPProf does not provide detailed information about the ordering of message-passing events or interprocess synchronization. That is, MPProf is not a trace-history viewer.



Note - To get information on user code as well as MPI message-passing activity, use Performance Analyzer.



Sample MPProf Output

Output from the MPProf report generator is self explanatory and should require no further documentation. Nevertheless, sample output is included here for readers who do not have immediate access to this tool.

Overview

The first section (sample output shown in box below) is an overview that includes

  • The name of your program
  • When you ran your program
  • What time period is covered in this report
  • What fraction of time was spent in MPI
  • How many MPI ranks are covered in the report
  • OVERVIEW
    
    ========
    
    The program being reported on is 
    
    "/home/myusername/mycode/a.out,"
    
    which ran as job name "cre.6975" on Fri Nov  8 18:45:04 2002.
    
    Profiled Time Range: 
    
    Start at elapsed time 0.000029 secs
    
    End at elapsed time   9.475821 secs
    
    Total duration is     9.475792 secs
    
    Fraction spent in MPI 21.5%
    
    Elapsed time is measured from the end of MPI_Init. Data is being
    
    reported for 16 processes of a 16-process job.
    

Load Balance

The Load section (sample output shown in box below) shows the distribution of how much time was spent in MPI per process. When a process spends much time in MPI, this may be because it's waiting for another process. Thus, performance bottlenecks may be in processes that spend little time in MPI. Those processes, as well as the nodes they ran on, are identified in this section. Other tools, such as the Performance Analyzer, are needed to pinpoint bottlenecks further if load imbalances are apparent.

LOAD BALANCE
============
Data is being reported on 16 MPI processes. The following histogram
shows how these processes were distributed as a function of the
fraction of the time the processes spent  in MPI routines:
 
Number of MPI Processes
 
10-|
 9-|
 8-|
 7-|                                                 #
 6-|                                                 #
 5-|                                                ##
 4-|                                                ##
 3-|                                               ###
 2-|                                               ###
 1-|  #                                           ####
 0-+----+----+----+----+----+----+----+----+----+----+
 15.0 20.0 25.0 30.0 35.0 40.0 45.0 50.0 55.0 60.0 65.0
 
  Percentage time in MPI
 
Rank   Hostname   MPI Time
0       node0     17.39%
15      node0     61.04%
4       node0     62.03%
13      node0     62.08%
11      node0     62.12%
 
[...]

Sun MPI Environment Variables

As discussed in Chapter 8, Sun MPI environment variables are set by default to values that will be appropriate in many cases, but runtime performance can in cases be enhanced by further tuning.

In particular, Sun MPI implements two-sided message passing between processes that cannot directly access one another's address spaces with internal shared buffers. If these buffers become congested, performance can suffer. MPProf monitors these buffers, watches for congestion, and suggests environment variable values for enhanced performance at the cost of higher memory usage.

The report discusses its recommendations and then provides a summary (sample output shown in box below) that may be "copy-and-pasted" into a session window or run script. The code may then be run again to determine whether better performance results from the recommendations.

MPI ENVIRONMENT VARIABLES
=========================
[...]
 
SUGGESTION SUMMARY
==================
[...]
 
In the C shell, these environment variables may be set by the
following commands:
 
setenv MPI_POLLALL 0
setenv MPI_PROCBIND 1
setenv MPI_SHM_SBPOOLSIZE 368640
setenv MPI_SPIN 1
 
In the Bourne or Korn shell, these environment variables may be
set by the following commands:
 
export MPI_POLLALL=0
export MPI_PROCBIND=1
export MPI_SHM_SBPOOLSIZE=368640
export MPI_SPIN=1

Breakdown by MPI Routine

The BreakdownBy MPI Routine section of the report (sample output shown in box below) gives a breakdown by MPI routine. Attributing bytes sent and received to MPI routines can be tricky due to nonblocking and collective operations. Note that:

  • For nonblocking sends, MPProf attributes bytes sent to the initiating call (such as MPI_Isend), since this is often the call that actually moves the data.
  • For nonblocking receives, MPProf attributes bytes received to the completing call (such as MPI_Wait), since the actual number of bytes received is not known before this call.
  • For collectives, MPProf attributes bytes sent and received as follows:
  • TABLE 9-2 MPProf Attributes Bytes Sent and Received

    Sent

    Received

    MPI_Barrier

    0

    0

    MPI_Bcast

    n (root)

    n (nonroot)

    MPI_Gather

    n

    np * n (root)

    MPI_Scatter

    np * n (root)

    n

    MPI_Allgather

    n

    np * n

    MPI_Alltoall

    np * n

    np * n

    MPI_Reduce

    n

    n (root)

    MPI_Allreduce

    n

    n

    MPI_Scan

    n

    n

    MPI_Reduce_scatter

    np * n

    n


Assumptions:

  • n is the number of elements passed to the MPI collective routine multiplied by the size of the data type used.
  • np is the number of processes participating in the collective operation.
  • Some byte counts are attributed only at the root process or the nonroot processes, as noted.
  • BREAKDOWN BY MPI ROUTINE
    
    ========================
    
     
    
    Here, averages over all MPI processes profiled are reported. The
    
    numbers in parentheses roughly indicate the variations there are
    
    among all of the MPI processes. These variations are computed as
    
    (1-min/max)/2 where "min" and "max" are the minimum and maximum
    
    values, respectively, for each statistic reported. A total of 9
    
    different MPI APIs were called.
    
     
    
    MPI Routine      Time    Calls Made    Sent      Received
    
    MPI_Allreduce  0.15 (39%)  11 (0%)  45276 (0%)  45276  (0%)
    
    MPI_Alltoall   0.00  (8%)  11 (0%)    704 (0%)    704  (0%)
    
    MPI_Alltoallv  0.55  (3%)  11 (0%)  23672 (0%)  23672  (2%)
    
    MPI_Comm_rank  0.00 (16%)   1 (0%)      0 (0%)      0  (0%)
    
    MPI_Comm_size  0.00  (8%)   1 (0%)      0 (0%)      0  (0%)
    
    MPI_Irecv      0.00 (23%)   0 (0%)      0 (0%)      0  (0%)
    
    MPI_Reduce     0.00 (49%)   2 (0%)     12 (0%)      0 (50%)
    
    MPI_Send       0.00 (27%)   0 (0%)      3 (0%)      0  (0%)
    
    MPI_Wait       0.00 (50%)   0 (0%)      0 (0%)      3  (0%)
    
     
    
    Where "Time" is in seconds and "Sent" and "Received" are in bytes.
    

Time Dependence

Performance analysis tools are sometimes classified as profiling tools, which give overall performance characterization of a program, or tracing tools, which give detailed time histories of events that occur during program execution.

While MPProf is a profiling tool, it includes some time-dependent information (a sample of the Time Dependence section of the MPProf report is shown below). This information is coarse but can prove useful in different ways:

  • If your program does not terminate normally (due to program error or your decision to abort the run), some profiling information is still available.
  • You can monitor your program's performance while it is still running. Note that some collected data might not have been flushed to disk yet. This means that if you generate an MPProf report while your program is still running, the report might not represent the latest state of your program.
  • You can sometimes identify different phases of computation (startup, steady-state computation, post processing, and so forth) and focus on any one of them for better analysis using the mpprof switches -s and -e.

In the following example, there is considerable broadcast and barrier activity before the program settles down to steady-state behavior, the performance of which is critical for long-running programs.

TIME DEPENDENCE
===============
Time periods may be missing if no MPI calls were made during the
period. Times for MPI calls that persist over multiple reporting
intervals will only be reported in a single interval; these
reported times may be greater than 100%.
 
period  MPI_Barrier  MPI_Bcast  MPI_Send  MPI_Waitall
 
 1         1.2%       6.9%      0.0%         0.0%
 5         0.0%       0.0%      0.0%       123.0%
 6         0.0%       0.0%      0.0%         4.2%
 7         0.0%     803.3%      0.1%         3.1%
16         0.0%     101.0%      3.9%         0.0%
17         0.0%       7.6%      0.0%         0.0%
18        32.5%       0.0%      0.5%        10.3%
19        18.9%       5.0%     52.4%         0.0%
20       191.9%       0.0%     22.6%        34.5%
21         0.0%       0.0%      0.0%         0.0%
22        10.9%       0.0%      0.0%         0.0%
23        16.7%       2.1%      0.2%         4.3%
24        21.7%       2.1%      0.3%         4.6%
25        21.0%       2.1%      0.3%         5.4%
26        21.1%       3.6%      0.3%         5.2%
27        20.1%       1.9%      0.4%         5.9%
28        20.6%       2.1%      0.3%         5.3%
29        20.4%       2.1%      0.3%         5.5%
30        17.9%       3.6%      0.2%         5.0%
31        22.6%       2.1%      0.3%         5.4%
 

Connections

A connection is a sender/receiver pair. MPProf shows how much point-to-point message traffic there is per connection, with one matrix showing numbers of messages and another showing numbers of bytes. Values are scaled.

A real application is likely to have various communication patterns in it, but the overall matrix may resemble some easy-to-recognize case. The following examples illustrate some simple patterns.

CONNECTIONS
===========
 
               sender
               0  1  2  3  4  5  6  7  8  9
receiver
 
            0  _ 99 99 99 99 99 99 99 99 99
            1  _  _  _  _  _  _  _  _  _  _
            2  _  _  _  _  _  _  _  _  _  _
            3  _  _  _  _  _  _  _  _  _  _
            4  _  _  _  _  _  _  _  _  _  _
            5  _  _  _  _  _  _  _  _  _  _
            6  _  _  _  _  _  _  _  _  _  _
            7  _  _  _  _  _  _  _  _  _  _
            8  _  _  _  _  _  _  _  _  _  _
            9  _  _  _  _  _  _  _  _  _  _

CONNECTIONS
===========
 
               _  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _

 


CONNECTIONS
===========
 
               _ 99  _  _  _  _  _  _  _  _
              99  _ 99  _  _  _  _  _  _  _
               _ 99  _ 99  _  _  _  _  _  _
               _  _ 99  _ 99  _  _  _  _  _
               _  _  _ 99  _ 99  _  _  _  _
               _  _  _  _ 99  _ 99  _  _  _
               _  _  _  _  _ 99  _ 99  _  _
               _  _  _  _  _  _ 99  _ 99  _
               _  _  _  _  _  _  _ 99  _ 99
               _  _  _  _  _  _  _  _ 99  _

 


CONNECTIONS
===========
 
               _ 99  _  _  _  _  _  _  _ 99
              99  _ 99  _  _  _  _  _  _  _
               _ 99  _ 99  _  _  _  _  _  _
               _  _ 99  _ 99  _  _  _  _  _
               _  _  _ 99  _ 99  _  _  _  _
               _  _  _  _ 99  _ 99  _  _  _
               _  _  _  _  _ 99  _ 99  _  _
               _  _  _  _  _  _ 99  _ 99  _
               _  _  _  _  _  _  _ 99  _ 99
              99  _  _  _  _  _  _  _ 99  _
 

CONNECTIONS
===========
 
              _9____99____________________9______
              9_9_____9____________________9_____
              _9_9_____9____________________9____
              __9_9_____9____________________9___
              ___9_9_____9____________________9__
              ____9_9_____9____________________9_
              9____9_______9____________________9
              9_______9____99____________________
              _9_____9_9_____9___________________
              __9_____9_9_____9__________________
              ___9_____9_9_____9_________________
              ____9_____9_9_____9________________
              _____9_____9_9_____9_______________
              ______99____9_______9______________
              _______9_______9____99_____________
              ________9_____9_9_____9____________
              _________9_____9_9_____9___________
              __________9_____9_9_____9__________
              ___________9_____9_9_____9_________
              ____________9_____9_9_____9________
              _____________99____9_______9_______
              ______________9_______9____99______
              _______________9_____9_9_____9_____
              ________________9_____9_9_____9____
              _________________9_____9_9_____9___
              __________________9_____9_9_____9__
              ___________________9_____9_9_____9_
              ____________________99____9_______9
              9____________________9_______9____9
              _9____________________9_____9_9____
              __9____________________9_____9_9___
              ___9____________________9_____9_9__
              ____9____________________9_____9_9_
              _____9____________________9_____9_9
              ______9____________________99____9_
 

Once the connectivity matrices have been displayed, MPProf prints the average message length:

The average length of point-to-point messages was 13916 bytes per message.

This message length can be compared to the product of some characteristic latency in your system and some characteristic bandwidth. For example, if "ping-pong" tests indicate an MPI latency of about 2 microseconds and bandwidth of about 500 Mbyte/sec, then the product is 1000 bytes. An average length of 13916 bytes suggests bandwidth is more important than latency to your application's performance.

Multithreaded Programs

MPProf profiling might be used with multi-threaded MPI programs. If many threads per process are engaged in heavy MPI activity, however, MPProf data collection can slow down measurably. MPProf aggregates data for all threads on a process.

The mpdump Utility

It is sometimes possible to extract more information from MPProf data files than the report generator prints out. You can use the mpdump utility to convert MPProf binary data files to an ASCII format and then process the data directly, perhaps with customized scripts. The format of the ASCII files is undocumented, but it is easy to interpret.

Managing Disk Files.

MPProf writes relatively little data to disk during profiling runs. This means that it is relatively easy to use compared to tracing tools, the data volumes of which must be managed carefully.

Nevertheless, one file per process is generated and it is worth managing these files. Specifically, MPProf profiling generates one data file per process per run, as well as one index file per run that points to the data files. The data files, by default, are stored in /usr/tmp of the respective nodes where the processes ran.

Use

% mpprof -r -g directory mpprof.index.rm.jid

to collect data files from their respective nodes to an archival location if you want to save the files.

Use

% mpprof -r -mpprof.index.rm.jid

to remove data files to clean up after a run.

Incorporating Environment Variable Suggestions

MPProf generates suggestions for tuning Sun MPI environment variables.

These suggestions are often helpful, but they do not guarantee performance improvement. Some experimentation may be necessary. If you incorporate MPProf's suggestions, the next profiling run may indicate further suggestions. Multiple reruns may be required before MPProf no longer has further recommendations, but most of the speedup, if any, is likely to result from the first round or two of experimentation.

Rerunning a program incorporating MPProf's tuning suggestions may be automated. Here is an example.

CODE EXAMPLE 9-1 Sample MPProf Session, Rerunning a Program To Tune Performance
% cat my_mpprof_script.csh
#!/bin/csh
 
set LOGFILE = mpprofrun.logfile
 
# always run with this on 
setenv MPI_PROFILE 1
echo 
echo trial-0
echo setenv MPI_PROFILE 1
 
# iterate at most 10 times
@ iteration = 0
while ( $iteration < 10 )
 
  # run the job
  $* >& $LOGFILE
 
  # archive this run
 
mkdir trial-$iteration
mpprof -r -g trial-$iteration mpprof.index.* < /dev/null
mv $LOGFILE trial-$iteration/log
 
# go to the archive
cd trial-$iteration
 
# report the basics
mpprof mpprof.index.* < /dev/null > mpprof.output
grep "Total duration is" mpprof.output
grep "Fraction spent in" mpprof.output
 
# increment interation count
@ iteration = $iteration + 1
 
# look at the environment variable suggestions
grep "setenv" mpprof.output | grep -v Suggestion > mpprof.envvars
set wclist = ( `wc -l mpprof.envvars` )
echo
 
# start reporting on next round
 
if ( $wclist[1] != 0 ) then
  echo trial-$iteration
  cat mpprof.envvars
  source mpprof.envvars
endif
 
# clean up
rm -f mpprof.envvars
cd ..
 
# convergence?
if ( $wclist[1] == 0 ) then
  break
endif
 
end
 
% my_mpprof_script.csh mprun -np 16 a.out
 
trial-0
setenv MPI_PROFILE 1
  Total duration is    23.702518 secs
  Fraction spent in MPI 21.9
 
%trial-1
setenv MPI_POLLALL 0
setenv MPI_PROCBIND 1
unsetenv MPI_SHM_CPOOLSIZE
setenv MPI_SHM_SBPOOLSIZE 533504
setenv MPI_SPIN 1
  Total duration is    21.697837 secs
  Fraction spent in MPI 14.5%
 
trial-2
setenv MPI_SHM_NUMPOSTBOX 66
setenv MPI_SHM_SBPOOLSIZE 1064960
  Total duration is    19.901631 secs
  Fraction spent in MPI  5.7%
 
trial-3
setenv MPI_SHM_SBPOOLSIZE 1598464
  Total duration is    19.993672 secs
  Fraction spent in MPI  5.9%
 
trial-4
setenv MPI_SHM_SBPOOLSIZE 2129920
  Total duration is    19.946838 secs
  Fraction spent in MPI  5.8%
 
trial-5
setenv MPI_SHM_SBPOOLSIZE 2661376
  Total duration is    19.906594 secs
  Fraction spent in MPI  5.7%
 
 
% ls -d trial-*
 
trial-0  trial-1  trial-2  trial-3  trial-4  trial-5


Performance Analyzer Profiling of Sun MPI Programs

The Performance Analyzer offers a good first step to understanding the performance characteristics of a program on UltraSPARC systems. It combines ease of use with powerful functionality.

As with most profiling tools, there are two basic steps to using Sun's performance analysis tools. The first step is to use the Collector to collect performance data. The second step is to use the Performance Analyzer to examine results. For example, this procedure can be as simple as replacing this:

% mprun -np 16 a.out 3 5 341

with this:

% mprun -np 16 collect a.out 3 5 341
% analyzer test.*.er

Sun's compilers and tools are usually located in the directory /opt/SUNWspro/bin. Check with your system administrator for details for your site.

The following sections show the use of the Performance Analyzer with Sun MPI programs, often revisiting variations of the preceding example.

Data Collection

There are several ways of using the Collector with MPI codes. For example, the simplest usage would be to add collect to the mprun command line:

% mprun -np 16 collect a.out 3 5 341

In this section we discuss some of the issues that can arise when using the Collector, and how to handle them.

Data Volume

The volume of collected data can grow large, especially for long-running or parallel programs. Though the Collector mitigates this problem, the scaling to large volumes remains an issue.

There are a number of useful strategies for managing these data volumes:

  • Increase the profiling interval. The interval might be specified in milliseconds with the -p switch to the collect command. The default value is 10. The actual interval used depends on the resolution of the profiling system. Of course, while increasing the interval reduces the data volume, it can reduce the quality of the sampling data.
For example, to reduce the number of profiled events roughly by a factor of two, use:
% mprun -np 16 collect -p 20 a.out 3 5 341
  • Collect data on only a subset of the MPI processes. In many cases, activity on one MPI process reflects performance behavior on all processes fairly closely. Or, if there are load imbalances among the processes, a larger subset might be used. Of course, limiting data collection to a subset of the processes might bias the profiling data. In particular, a master process might behave unlike any of the other processes. In the following example, we collect on only the first four MPI ranks using a script:
  • % cat csh-script
    
    #!/bin/csh
    
    if ( $MP_RANK < 4 ) then
    
      collect $*
    
    else
    
      $*
    
    endif
    
    % mprun -np 16 csh-script a.out 3 5 341
    

  • Collect data to a local file system. This method does not reduce the volume of data collected, but it helps mitigate the impact. Meanwhile, local file systems are often not mounted on all nodes, so gathering experiment data to a common location should also be part of this approach. In the following example, we collect local data and then gather it centrally:
  • % cat csh-script
    
    #!/bin/csh
    
    collect -d /tmp $*
    
    er_mv /tmp/test.$MP_RANK.er myrun
    
    % mkdir myrun
    
    % mprun -np 16 csh-script a.out 3 5 341
    

This strategy should be a routine part of data collection.

A large parallel job might run out of a central NFS-mounted file system. While this might be adequate for jobs that are not I/O intensive, it might cause a critical bottleneck for profiling sessions. If multiple MPI processes are trying to write large volumes of profiling data simultaneously over NFS to a single file system, that file system, along with network congestion, could lead to tremendous slowdowns and perturbations of the program's performance. It is preferable to collect profiling data to local file systems and, perhaps, gather them to a central directory after program execution.

To identify local file systems, use:

% /usr/bin/df -lk

on each node of the cluster you will use, or ask your system administrator about large-volume, high-performance disk space.

One possible choice of a local file system is /tmp. Note that the /tmp file system on different nodes of a cluster refer to different, respectively local file systems. Also, the /tmp file system might not be very large, and if it becomes filled there might be a great impact on general system operability.

Data Organization

The Collector generates one "experiment" per MPI process. If there are multiple runs of a multiprocess job, therefore, the number of experiments can grow quickly.

To organize these experiments, it often makes sense to gather experiments from a run into a distinctive subdirectory. Use the commands er_mv, er_rm, and er_cp (again, typically under the directory /opt/SUNWspro/bin) to move, remove, or copy experiments, respectively. For more information on these utilities, refer to the corresponding man pages or the Forte Developer (or Sun ONE Compiler Collection) documentation.

If you collect an experiment directly into a directory, make sure that the directory has already been created and, ideally, that no other experiments already exist in it.

Example

% mkdir run1
% er_rm -f run1/*.er
% mprun -np 16 collect -d run1 a.out 3 5 341
% mkdir run2
% er_rm -f run2/*.er
% mprun -np 16 collect -d run2 a.out 3 5 341
% mkdir run3
% er_rm -f run3/*.er
% mprun -np 16 collect -d run3 a.out 3 5 341
%

The er_rm steps are not required because (in this instance) we are using freshly created directories. Nevertheless, these steps serve as reminders to avoid the confusion that can result when too many experiments are gathered in the same directory.

Other Data Collection Issues

The Collector collects sampling information by default. Thus, Sun MPI calls will be profiled if they consume a sufficiently high fraction of runtime. Sometimes it is interesting to capture MPI calls more methodically. In particular, for time-line analysis, it is desirable to mark the beginnings and endings of MPI calls with higher resolution than sampling provides. MPI call tracing may be activated with the
-m on switch to collect.

Sometimes it is desirable to collect data not only on a process, but its descendants, as well. The collect tag will follow descendants with the -F on switch.

For example, the Solaris 9 operating environment supports multiple page sizes with the /usr/bin/ppgsz command-line utility. To collect performance data on a code run with 4-Mbyte memory pages, our sample data collection becomes:

% mprun -np 16 collect -F on ppg sz -o heap=4M,stack=512K a.out 3 5 341

In the particular case of the ppgsz utility, a cleaner alternative may be to preload the mpss.so.1 shared object, so that the ppgsz utility need not appear on the command line. For more information on multiple page size support, see the ppgsz and mpss.so.1 man pages in the Solaris 9 operating environment.

Analyzing Profiling Data

Once data has been gathered with the Collector, it can be viewed with the Performance Analyzer. For example:

% analyzer myrun/test.*.er

Use of the Performance Analyzer is illustrated in the case study.

There are also command-line utilities to aid in data analysis. For more information, see the er_print and er_src man pages. Here are examples of their use:

  • To view a summary of how time was spent in functions, use:
% er_print -functions test.*.er
  • To view a summary of how time was spent in MPI functions, use:
% er_print -functions test.*.er | grep PMPI_
  • To view caller-callee relationships (as with gprof), use:
% er_print -callers-callees test.*.er
  • To view where time is spent on a per-source-line basis, along with compiler commentary, compile and link with the -g switch and use:
% er_print -source myfunction_ 1 test.*.er
The 1 is used to distinguish in case of multiple instances of the named function.
  • To view the compiler commentary associated with an object file without even running your program, compile with the -g switch and use:
% er_src my_file.o

Case Study

In this case study, we examine the NPB 2.3 BT benchmark. We run the program using the following environment variable settings:

% setenv MPI_SPIN 1
% setenv MPI_PROCBIND 1 
% setenv MPI_POLLALL 0 

These settings are not required for Performance Analyzer profiling. We simply use them to profile our code as it would run in production. See Appendix C for more information about using Sun MPI environment variables for high performance. Also refer to the Sun MPI Programming and Reference Guide.

The job is run on a single, shared-memory node using 25 processes.

FIGURE 9-1 shows the Performance Analyzer's first view of the resulting profiling data. This default view shows how time is spent in different functions. Both exclusive and inclusive user CPU times are shown for each function, excluding and including, respectively, time spent in any functions it calls. The top line shows that a total of 3362.710 seconds, over all 25 processes, are profiled.

We see that the functions LHSX(), LHSY(), and LHSZ() account for
523.410+467.960+444.350=1435.72 seconds of that time.

 FIGURE 9-1 Performance Analyzer--Main View

Graphic image of the Performance Analyzer -- main view

Fortran programmers will note that the term function is used in the C sense to include all subprograms, whether they pass return values or not. Further, function names are those generated by the Fortran compiler. That is, by default they are converted to lower case and have an underscore appended.

In FIGURE 9-2, we see how this overview appears when MPI tracing is turned on with the Collector's -m on switch. Note that accounting for bytes sent and received is slightly different from the way accounting is handled by MPProf. In particular, for bytes received, the size of the receive buffer is used rather than the actual number of bytes received. For more information on how bytes are counted for MPI calls, refer to the Performance Analyzer documentation volume Program Performance Analysis Tools.

 

FIGURE 9-2 Performance Analyzer--Main View With Tracing Enabled

Graphic image of the Performance Analyzer -- main view with tracing enabled.

We can see how time is spent within a subroutine if the code was compiled and linked with the -g switch. This switch introduces minimal impact on optimization and parallelization, and it can be employed rather freely by performance-oriented users. When we click on the Source tab, the Performance Analyzer brings up a text editor for the highlighted function. The choice of text editor can be changed under the Options menu with the Text Editor Options selection. The displayed, annotated source code includes the selected metrics, the user source code, and compiler commentary. A small fragment is shown in FIGURE 9-3. In particular, notice that the Performance Analyzer highlights hot (time-consuming) lines of code. Only a small fragment is shown since, in practice, the annotated source can become rather wide.

 FIGURE 9-3 Performance Analyzer--Source View

Graphic image of the Performance Analyzer -- source view

To get a better idea of where time is spent at a high level in the code, you can also click on the Callers-Callees tab. FIGURE 9-4 shows a possible Callers-Callee view. The selected function appears in the middle section, its callers appear above it, and its callees below. Selected metrics are shown for all displayed functions. By clicking on callers, you can find where time incurred in the particular function occurs in the source code at a high level. By clicking on callees, you can find more detail on expensive calls a particular function might be making. This sort of analysis is probably familiar to gprof users, but the Performance Analyzer has features that go beyond some of gprof's limitations. For more information about gprof, see Using the gprof Utility.

 FIGURE 9-4 Performance Analyzer--Callers-Callees View

Graphic image of the Performance Analyzer -- callers-callees view.

Different metrics can be selected for the various displays by clicking on the Metrics button, as seen at the bottom of FIGURE 9-1. You can choose which metrics are shown, the order in which they are shown, and which metric should be used for sorting.

Click on the Timeline tab to see a view similar to the one shown in FIGURE 9-5. Time is shown across the horizontal axis, while experiments appear on the vertical axis. Experiments are numbered sequentially, starting from 1. For example, MPI rank 7 might appear as experiment 8. The timeline view helps you see synchronization points, load imbalances, and other performance characteristics of your parallel MPI program.

 

FIGURE 9-5 Performance Analyzer--Timeline View

Graphic image of the Performance Analyzer -- timeline view

In FIGURE 9-6, another timeline view is shown. Click on an event to see its callstack in the lower right corner. When you zoom in to the time scale of the clock sampling (10 milliseconds by default), clock-sampled events appear discretely on the view. Using MPI tracing helps maintain resolution even to the highest levels of zoom.

 FIGURE 9-6 Performance Analyzer--Timeline View With Callstack

Graphic image of the Performance Analyzer -- timeline view with callstack.

Because clocks on different nodes of a cluster are not guaranteed to be synchronized, and because the Performance Analyzer makes no attempt to synchronize time stamps from different nodes, timeline views of Sun MPI runs that were distributed over multiple nodes in a cluster are not guaranteed to be displayed properly.

Overview of Functions

In a profile, you will typically find many unfamiliar functions that do not appear explicitly in your code. Further, it can happen that none of the familiar MPI calls you do use will appear, such as MPI_Isend(), MPI_Irecv(), MPI_Wait(), or MPI_Waitall().

FIGURE 9-7 shows examples of functions you might find in your profiles, along with explanations of what you are seeing. The functions are organized by load object, such as an executable (here, a.out) or a dynamic library.

 

FIGURE 9-7 Examples of Functions That Might Appear in Profiles.

Table of examples of functions that might appear in profiles

Note that:

  • The top function is _start, which calls the main function. In Fortran programs, the main function calls the MAIN_ function.
  • By default, the Fortran compiler converts subprogram names to lower case and appends underscores. For example, the Fortran routine MY_FORTRAN_ROUTINE() would be converted to my_fortran_routine_().
  • The MPI standard defines a profiling interface, which provides that MPI_* functions should also be accessible by using the shifted names PMPI_*. In the Sun MPI implementation, this means that all user-callable functions are named with their PMPI_* forms, with a pmpi_* wrapper for Fortran use.
For example, a C call to MPI_Send() will enter the function PMPI_Send.
Sun MPI uses a number of internal routines, which will appear in profiles.
  • A Fortran call to MPI_SEND() enters the function pmpi_send_, which in turn calls PMPI_Send.
  • libcollector intercepts particular calls to the Sun MPI or threads libraries to support synchronization tracing. Thus, functions such as PMPI_Recv and mutex_lock might appear twice in profiles -- once belonging to a user-callable library and once belonging to libcollector.
  • Sun MPI calls routines in other Sun HPC ClusterTools libraries.
  • Various Sun HPC ClusterTools libraries call other standard libraries. Notably:
    • The _poll, poll, and yield calls might be called by a Sun MPI process for waiting.
    • The memcpy call is often called when an MPI process is copying data locally, such as for on-node message passing.
    • The _read, read, _write, _writev, and writev calls are used in off-node message passing over TCP.

One way to get an overview of which MPI calls are being made, and which are most important, is to look for the PMPI entry points into the MPI library. For our case study example:

% er_print -function proc-0.er | grep PMPI_

0.050 16.980 PMPI_Wait

0.030 0.810 PMPI_Isend

0.030 16.930 PMPI_Wait

0.020 0.490 PMPI_Irecv

0. 0. PMPI_Finalize

0. 0.150 PMPI_Init

0. 1.630 PMPI_Waitall

0. 1.630 PMPI_Waitall

%


In this example, roughly 20 seconds out of 146.93 seconds profiled are due to MPI calls. The exclusive times are typically small and meaningless. Synchronizing calls, such as PMPI_Wait and PMPI_Waitall, appear twice, once due to libmpi and once to libcollector. Such duplication can be ignored.

If ever there is a question about what role an unfamiliar (but possibly time-consuming) function is playing, within the Performance Analyzer you can:

  • Choose Callers-Callees to see which function is calling it.
  • Choose Show Summary Metrics from the View menu to see what is displayed under Load Object.
  • Choose Select Load Objects Included from the View menu to restrict viewing to functions that belong to specific load objects (such as your executable or libmpi).

MPI Wait Times

Time might be spent in MPI calls for a wide variety of reasons. Specifically, much of that time might be spent waiting for some condition (a message to arrive, internal buffering to become available, and so on), rather than in moving data.

While an MPI process waits, it might or might not tie up a CPU. Nevertheless, such wait time probably has a cost in terms of program performance.

Your options for ensuring that wait time is profiled include:

  • Have Sun MPI spin a CPU aggressively during wait situations. This option requires turning off coscheduling and turning on spin behavior. Both are off by default. To do this, use the following environment variable settings:
  • % setenv MPI_COSCHED 0
    
    % setenv MPI_SPIN    1
    

  • Select wall-clock time, rather than CPU time, as the profiling metric. Profiling metrics are described earlier in this chapter in the case study.


Other Profiling Approaches

Both Sun MPI and the Solaris environment offer useful profiling facilities. Using the MPI profiling interface, you can investigate MPI calls. Using your own timer calls, you can profile specific behaviors. Using the Solaris gprof utility, you can profile diverse multiprocess codes, including those using MPI. You can also use a third-party performance analyzer, VAMPIR, with Sun MPI programs.

Using the MPI Profiling Interface

The MPI standard supports a profiling interface, which allows any user to profile either individual MPI calls or the entire library. This interface supports two equivalent APIs for each MPI routine. One has the prefix MPI_, while the other has PMPI_. User codes typically call the MPI_ routines. A profiling routine or library will typically provide wrappers for the MPI_ routines that simply call the PMPI_ ones, with timer calls around the PMPI_ call.

You can use this interface to change the behavior of MPI routines without modifying your source code. For example, suppose you believe that most of the time spent in some collective call, such as MPI_Allreduce, is due to the synchronization of the processes that is implicit to such a call. Then, you can compile a wrapper such as the one that follows, and link it into your code before the -lmpi switch. The effect will be that time profiled by MPI_Allreduce calls will be due exclusively to the MPI_Allreduce operation, with synchronization costs attributed to barrier operations.

subroutine MPI_Allreduce(x,y,n,type,op,comm,ier)
integer x(*), y(*), n, type, op, comm, ier
call PMPI_Barrier(comm,ier)
call PMPI_Allreduce(x,y,n,type,op,comm,ier)
end 

Profiling wrappers or libraries can be used even with application binaries that have already been linked. Refer to the Solaris man page for ld for more information about the environment variable LD_PRELOAD.

You can get profiling libraries from independent sources for use with Sun MPI. An example of a profiling library is included in the multiprocessing environment (MPE) from Argonne National Laboratory. Several external profiling tools can be made to work with Sun MPI using this mechanism. For more information on this library and on the MPI profiling interface, refer to the Sun MPI Programming and Reference Guide.

Inserting MPI Timer Calls

Sun HPC ClusterTools implements the MPI timer call MPI_Wtime (demonstrated in the example that follows) with the high-resolution timer gethrtime. If you use MPI_Wtime calls, you should use them to measure sections that last more than several microseconds. Times on different processes are not guaranteed to be synchronized. For information about the gethrtime timer, see the gethrtime(3C) man page.

When profiling multiprocess codes, ensure that the timings are not distorted by the asynchrony of the various processes. For this purpose, you usually need to synchronize the processes before starting and before stopping the timer.

In the following example, most processes might accumulate time in the interesting, timed portion, waiting for process 0 (zero) to emerge from uninteresting initialization. This condition would skew your program's timings. For example:

CALL MPI_COMM_RANK(MPI_COMM_WORLD,ME,IER)
IF ( ME .EQ. 0 ) THEN 
    initialization
END IF
! place barrier here
! CALL MPI_BARRIER(MPI_COMM_WORLD, IER)
T_START = MPI_WTIME()
    timed portion
T_END = MPI_WTIME() 

When stopping a timer, remember that measurements of elapsed time will differ on different processes. So, execute another barrier before the "stop" timer. Alternatively, use "maximum" elapsed time for all processes.

Avoid timing very small fragments of code. This is good advice when debugging uniprocessor codes, and the consequences are greater with many processors. Code fragments perform differently when timed in isolation. The introduction of barrier calls for timing purposes can be disruptive for short intervals.

Using the gprof Utility

The Solaris utility gprof might be used for multiprocess codes, such as those that use MPI. Several points should be noted:

  • Compile and link your programs with the -pg (Fortran) or -xpg (C) switch.
  • Use the environment variable PROFDIR to profile multiprocess jobs, such as those that use MPI.
  • Use the gprof command after program execution to gather summary statistics either on individual processes or for multiprocess aggregates.

Note, however, that the gprof utility has several limitations.

  • The gprof utility requires recompilation and relinking.
  • Many libraries do not have gprof versions. For example, activity spent within Sun MPI calls do not appear in gprof profiles.
  • The gprof utility apportions time equally among all callers. For example, assume a matrix-multiply routine is called from one caller for small matrices and an equal number of times from another caller for large matrices. The gprof utility attributes time spent in the matrix multiplication equally to both callers, even if the large-matrix operations are substantially more time consuming.
  • It does not count time spent in sleeps and yields, which can skew results.
  • It loses the relationships between process ids (used to tag profile files) and MPI process ranks.
  • Its profiles from different processes might overwrite one another if a multiprocess job spans multiple nodes.
  • It does not interoperate with Prism software.

Note that the Performance Analyzer is simple to use, provides the profiling information that gprof does, offers additional functionality, and avoids the pitfalls. Thus, gprof users are highly encouraged to migrate to the Performance Analyzer for both MPI and non-MPI codes.

For more information about the gprof utility, refer to the gprof man page. For more information about the Performance Analyzer, refer to the documentation supplied with the Forte Developer 6, update 2 and Sun ONE Studio 7 Compiler Collections.

Using the VAMPIR Performance Analyzer

The VAMPIR performance analysis tool (version 3.0) from Pallas GmbH now supports Sun MPI programs.

Sun MPI Features Tested With VAMPIR

VAMPIR supports:

  • Codes compiled using Sun C, C++, and Fortran compilers
  • MPI-IO functions
  • User-defined datatypes
  • Tag matching of MPI messages
  • Both 32-bit and 64-bit codes
  • Bandwidth and latency measurement using various protocols (for example: rsm and shm)
  • Data collection in clusters operating under different distributed resource managers: SGE, PBS, and LSF
  • Correction for clock skew when debugging parallel applications that run across clustered systems

Limitations of VAMPIR Support

VAMPIR's click to source feature is not implemented for Sun MPI programs compiled using Forte 6 update 2 or SunONE Studio 7 compiler collection compilers.

vampirtrace, VAMPIR's profiling MPI (PMPI) library, is MPI-1 (with I/O operations) compliant. Consequently, all MPI-2 functions other than MPI-IO are unsupported (Process Creation and Management, One-sided Operations, and Extended Collective Operations).



Note - Use of certain MPI-2 constants (such as MPI_STATUS_IGNORE) in MPI-1 functions can cause errors or core dumps.



Notes on Compilation

VAMPIR supports the performance analysis of Sun MPI codes only when static library linkage is used.

Until VAMPIR provides a dynamic library, VAMPIR cannot analyze Sun MPI applications unless the applications are recompiled.

Examples of compilation using a static vampirtrace library (libVT):

for C programs:

mpcc -o connectivity connectivity.c -L/opt/vampir/lib -R/opt/vampir/lib -lVT /
-lmpi -lnsl -lm

for Fortran programs:

mpf77 monte.f -o monte -L/opt/vampir/lib -R/opt/vampir/lib -lVT -lmpi -lnsl 

for C++ programs:

mpCC -o prime prime.cc -L/opt/vampir/lib -R/opt/vampir/lib -lVT -lmpi++ -lmpi 
/ -lnsl