8 - C H A P T E R -

C H A P T E R 8

Profiling Programs With MPPROF

This chapter explains how to use the MPProf utilities and environment variables to extract profiling information from executing MPI programs and generate formatted reports from that data. It also explains how to convert intermediate binary files to unanalyzed ASCII files. It has four sections:

Enabling MPI Profiling

Controlling Data Collection

Using mpprof to Generate Reports

Using mpdump to Convert Intermediate Binary Files to ASCII Files

Enabling MPI Profiling

To enable MPI profiling with MPProf, set the MPI_PROFILE environment variable to 1 before starting the MPI program.

% setenv MPI_PROFILE 1

When MPProf is enabled, it will extract information about calls to Sun MPI routines and store the information in a set of intermediate files. A separate data file is created for each MPI process rank in which MPI calls were made.

MPProf also creates an index file describing the locations of the intermediate files. The index file name has the form:

mpprof.index.rm.jid

where rm is the name of the resource manager and jid is the job ID. The report generator, mpprof, uses this index file to gather the profiling data and associate it with particular MPI processes.

Note - If the MPI program uses an instrumented loadable protocol module (PM), MPProf passes PM-related profiling data back to the PM. This allows the reporting of PM profile data independent of mpprof.

Controlling Data Collection

MPProf provides the following environment variables to control various aspects of MPProf behavior:

MPI_PROFDATADIR - Specify a location where the intermediate files will be stored.

MPI_PROFINDEXDIR - Specify a location where the index file will be stored.

MPI_PROFINTERVAL - Specify an interval between intermediate file updates.

MPI_PROFMAXFILESIZE - Specify the maximum size, in kilobytes, that can be written to any intermediate file.

These environment variables are explained in the following sections.

`MPI_PROFDATADIR`

The environment variable MPI_PROFDATADIR can be used to specify a nondefault location where the intermediate files will be created for each process. This directory must exist.

If profiling is enabled and MPI_PROFDATADIR specifies a directory that does not exist, MPProf will output an error message and abort the program. If MPI_PROFDATADIR is not set and profiling is enabled, intermediate files will be created in /usr/tmp.

`MPI_PROFINDEXFDIR`

The environment variable MPI_PROFINDEXDIR can be used to specify a nondefault location for storing the index file. This directory must exist.

If profiling is enabled and MPI_PROFDATADIR specifies a directory that does not exist, MPProf will output an error message and abort the program. If profiling is enabled and MPI_PROFINDEXDIR is not set, the index file will be stored locally in the current directory.

`MPI_PROFINTERVAL`

The environment variable MPI_PROFINTERVAL can be used to specify a data sampling period. When this value is a number greater than 0, the intermediate files will be updated at the prescribed intervals. The data recorded with each update represent the MPI activity that occurred since the previous update.

Setting MPI_PROFINTERVAL to 0 forces updates to be made for every MPI call. Setting MPI_PROFINTERVAL to Inf (meaning infinity) causes the intermediate files to be updated only once, at MPI_Finalize time. If MPI_PROFINTERVAL is unset or is set without a value, the default value of 60 seconds will be used.t

The following example sets an interval of 5 seconds:

% setenv MPI_PROFINTERVAL 5

In this case, the first update of intermediate files would occur approximately 5 seconds after the MPI_Init call, with additional updates appended to the file at 5-second intervals.

If time intervals are used and an MPI program terminates before the MPI_Finalize call, any updates that were made can be used by mpprof to generate a profile of program operations up to the point of termination.

`MPI_PROFMAXFILESIZE`

The environment variable MPI_PROFMAXFILESIZE can be used to specify, in kilobytes, the maximum amount of data a process can record in its intermediate file. The default value is 51,200 kilobytes (approximately 50 megabytes). Setting MPI_PROFMAXFILESIZE to unlimited removes any limits on the size to which the intermediate files can grow. Setting MPI_PROFMAXFILESIZE to 0 is an error and will cause MPProf to abort the program

If a write to a given intermediate file exceeds the MPI_PROFMAXFILESIZE limit, the write operation will complete, but that process will be unable to record additional profiling data. Profiling can continue for other processes that have not reached the file size limit.

Using mpprof to Generate Reports

This section shows how to use the mpprof command, which is used to generate reports from profiling information that is collected by the MPI library. To enable MPI profiling, you set the MPI_PROFILE environment variable to 1 before executing the mprun command.

When you use the mpprof command, you always specify an index file that points to files that contain profiling data. The naming convention for the index file is mpprof.index.rm.jid, where rm is the resource manager used and jid is the job ID assigned to the MPI program. When profiling is enabled, the MPI library creates the index file in the current directory by default.

Note - The mpdump command, discussed in the next section, is used to translate binary profiling data into ASCII text.

mpprof Command Syntax

The mpprof command syntax is

mpprof [-h] [-V] [-r] [-S] [-g output-directory] [-p process-set]

[-c columns] -s start-time] [-e end-time] [-o output-file]

index-file

where the following command options are available:


Option	Description
-h	Lists the optional arguments and exits.
-V	Displays the program version during startup.
-r	Removes the intermediate files associated with the specified index-file. By default, the remove command will use rsh for each node in order to unlink the files. If the -r option is used with the -S option, the secure shell (ssh) will be used instead of rsh. If the -r option is used with the -p option, only a subset of the files will be removed.
-S	Use the secure ssh and scp commands instead of rsh and rcp.
-g	Gathers (copies) the intermediate files associated with the specified index file to the specified output-directory. The file names will be preserved and will overwrite any existing files of the same name. A new index file will be created in the specified output-directory to be used with the files. If the -p option is used with the -g option, a subset of the files will be copied. .The -g option can be used with the -r option as a way to archive the intermediate data files. If the -g output-directory contains previously stored intermediate files, a new index file will be created. No data will be lost.
-p	Specifies a subset of the processes to be analyzed. The -p process-set option is entered as a comma-separated list of tuples. If the -p option is used with -S and/or -g, the operation will be performed only on the specified subset. If a process set encompasses the complete set of ranks, the -p option will have no effect. If the -p option is given a stride, the upper bound of the stride is omitted. For example, if a stride is specified as 1:4:2, the upper bound (4) is not included in the set. In this example, the resulting set is {1,3}. See the section "Reporting on Specific Processes" below for additional information.
-c	Specifies the output width of the generated report. The smallest number of columns supported is 35. When mpprof prints to a terminal window, the display width will be adjusted automatically unless this option is specified. Wrapping will occur if the terminal window has fewer than 35 columns.
-s	Specifies that any profiling data generated before the specified start-time should be ignored. This time reference is measured from the point at which the program calls BMPI_Init. The -s option is useful only when a profiling interval has been specified for the intermediate files. If no start-time is specified on the mpprof command line, the earliest time specified in any intermediate file is used to begin data recording.
-e	Specifies that any profiling data occurring after the specified end-time should be ignored. This time is also measured in seconds, beginning from the time the program calls MPI_Init. This option can be used with, or in place of, the -s option. If no end-time is specified, the latest time in any intermediate file is used for reporting.
-o	Specifies that the report should be written to the specified output-file. If this option is not used or if the output-file is `-', the report will be written to stdout. If the output-file already exists, it will be overwritten.

Generating a Message Passing Report

The mpprof command generates a report file based on processes listed in a specified index file. The following example generates a report file called report.txt that is based on process profile data stored in files that are specified in an index file called mpprof.index.cre.14:

% mpprof -o report.txt mpprof.index.cre.14

When you enter this command, a new file called report.txt is created in the current directory. If you did not use the -o report.txt command line option, the report would be sent to the standard output.

Reporting on Specific Processes

You can fine-tune profiling output by specifying a subset of processes of the job, using the -p option as shown in the following example:

% mpprof -p 0,3-5,9:12,18:27:3 -o report.txt mpprof.index.cre.14

The output report file is always specified before the index file. The preceding command causes these processes to be analyzed for the report: 0, 3, 4, 5, 9, 10, 11, 12, 18, 21, 24, and 27. To view the report, open the report.txt file in the current directory.

Reporting Processes That Occur After a Specified Time Interval

The following example runs the mpprof command to generate a report on all processes that occur starting 2 minutes (120 seconds) or later after a call to MPI_Init, which initializes the accumulation of data on processes:

% mpprof -s 120 -o report.txt mpprof.index.cre.14

The -s option is useful when you want to exclude startup routines from reporting because startup routines may not be critical to performance. To view the report, open the report.txt file in the current directory.

To Save Report Output for Later Use

If you want to save report output to a specific directory other than the current directory, use the -g option:

% mpprof -g report.txt mpprof.index.cre.14

You can combine the -g reportname option with the -r option, removing the intermediate data files created during the profiling process at the same time as you store the report output:

% mpprof -r -g report.txt mpprof.index.cre.14

A Sample Report

The sample report shown is the result of running the following command, which places the report in a file called report.txt:

% mpprof -p 0,3-5,9:12,18:27:3 -o report.txt mpprof.index.cre.39

Note - The file mpprof.index.cre.39 was created when this MPI command was run:
% mprun -np 28 connectivity
The command generates 28 separate files.

The sample report is shown below:

OVERVIEW

========

The program being reported on is "/home/mscfie/tests/first-string/connectivity," which ran as job name "cre.39" on Tue Oct 29 14:04:29 2002.

Profiled Time Range:

  Start at elapsed time 0.000030 secs

  End at elapsed time   0.008181 secs

  Total duration is     0.008151 secs

  Fraction spent in MPI 85.9%

Elapsed time is measured from the end of MPI_Init. Data is being reported for

12 processes of a 28-process job.

LOAD BALANCE

============

Data is being reported on 12 MPI processes. The following histogram shows how

these processes were distributed as a function of the fraction of the time the

processes spent in MPI routines:

  Number of MPI Processes

 10-|

9-|

8-|

7-|

6-|

5-|

4-|

3-|

2-|

  1-|  # #      #      #          #       # #  #         ##        #      #

    |  # #      #      #          #       # #  #         ##        #      #

  0-+----+----+----+----+----+----+----+----+----+----+----+----+----+------

   84.5 84.7 84.9 85.1 85.3 85.5 85.7 85.9 86.1 86.4 86.6 86.8 87.0 87.2

     Percentage time in MPI

Rank   Hostname   MPI Time

  12  hpc-smp6-0    84.51%

  24  hpc-smp6-0    84.62%

  27  hpc-smp6-0    84.91%

   0  hpc-smp6-0    85.18%

   5  hpc-smp6-0    85.62%

  10  hpc-smp6-0    85.96%

   4  hpc-smp6-0    86.03%

   3  hpc-smp6-0    86.18%

  21  hpc-smp6-0    86.59%

  11  hpc-smp6-0    86.63%

  18  hpc-smp6-0    87.00%

   9  hpc-smp6-0    87.30%

Low MPI time for an MPI process may indicate the process has too much of a compute load. A high compute load forces the other processes to wait, increasing their MPI time.

To focus reporting on one or more particular processes you may use the -p option on the command line for mpprof. Type "mpprof -h" for more information.

MPI ENVIRONMENT VARIABLES

=========================

MPI_POLLALL:

You ran with full polling of connections. This means that Sun MPI monitored all

connections for incoming messages, whether your program explicitly posted receive requests for those connections or not. Typically, this leads to a degradation in performance.

Suggestion: Set the environment variable MPI_POLLALL to "0".

Warning: If your program relies on MPI_Send to provide substantial internal buffering of messages, this suggestion could result in deadlock. On the other

hand, that would be an indication that the program in not MPI compliant. If such deadlock results, it may be resolved by disregarding this suggestion. Even better performance could result, however, by modifying the program to post the appropriate receives or, in some cases, by setting MPI environment variables to increase internal buffers.

SHM

===

The loadable protocol module "shm" has nothing to report.

SUGGESTION SUMMARY

==================

Summary of environment variable suggestions:

  Set: MPI_POLLALL=0

In the C shell, these environment variables may be set by the following commands:

setenv MPI_POLLALL 0

In the Bourne or Korn shell, these environment variables may be set by the following commands:

export MPI_POLLALL=0

BREAKDOWN BY MPI ROUTINE

========================

Here, averages over all MPI processes profiled are reported. The numbers in parentheses roughly indicate the variations there are among all of the MPI processes. These variations are computed as (1-min/max)/2 where "min" and "max"

are the minimum and maximum values, respectively, for each statistic reported.

A total of 5 different MPI APIs were called.

 MPI Routine         Time        Calls Made     Sent       Received

MPI_Barrier    0.001625 (49.8%)   1  (0.0%)    0  (0.0%)    0  (0.0%)

MPI_Comm_rank  0.000003 (13.5%)   1  (0.0%)    0  (0.0%)    0  (0.0%)

MPI_Comm_size  0.000002 (10.1%)   1  (0.0%)    0  (0.0%)    0  (0.0%)

MPI_Recv       0.003758 (39.3%)  27  (0.0%)    0  (0.0%)  108  (0.0%)

MPI_Send       0.001612 (43.2%)  27  (0.0%)  108  (0.0%)    0  (0.0%)

Where "Time" is in seconds and "Sent" and "Received" are in bytes.

TIME DEPENDENCE

===============

Here is a rough depiction of time variations in MPI usage over the reported time range. The fraction of time spent in each of the top 4 MPI routines overall are shown for 60.0-second time periods.

Each time period is specified with an integer. This integer roughly corresponds

to the number of time periods passed, with the first time period lasting from

0-60.0 seconds. There are 1 different time periods. Time periods may be missing

if no MPI calls were made during the period. Times for MPI calls that persist

over multiple reporting intervals will only be reported in a single interval;

these reported times may be greater than 100%.

period  MPI_Recv  MPI_Barrier  MPI_Send  MPI_Comm_rank

     1      0.0%         0.0%      0.0%           0.0%

CONNECTIONS

===========

A connection is a sender/receiver pair. For 28 processes, there are 28x28=784

connections, including send-to-self connections.

You asked to see data on a 12-member subset of the MPI processes. Only point-to-point messages that were sent from this subset are reported. The following statistics represent 12x12=144 connections.

Here are statistics on the messages sent for each connection, reported on a

scale of 0-99 with 99 corresponding to 1 messages:

sender

     0  3  4  5  9 10 11 12 18 21 24 27

receiver

  0  _ 99 99 99 99 99 99 99 99 99 99 99

  1 99 99 99 99 99 99 99 99 99 99 99 99

  2 99 99 99 99 99 99 99 99 99 99 99 99

  3 99  _ 99 99 99 99 99 99 99 99 99 99

  4 99 99  _ 99 99 99 99 99 99 99 99 99

  5 99 99 99  _ 99 99 99 99 99 99 99 99

  6 99 99 99 99 99 99 99 99 99 99 99 99

  7 99 99 99 99 99 99 99 99 99 99 99 99

  8 99 99 99 99 99 99 99 99 99 99 99 99

  9 99 99 99 99  _ 99 99 99 99 99 99 99

 10 99 99 99 99 99  _ 99 99 99 99 99 99

 11 99 99 99 99 99 99  _ 99 99 99 99 99

 12 99 99 99 99 99 99 99  _ 99 99 99 99

 13 99 99 99 99 99 99 99 99 99 99 99 99

 14 99 99 99 99 99 99 99 99 99 99 99 99

 15 99 99 99 99 99 99 99 99 99 99 99 99

 16 99 99 99 99 99 99 99 99 99 99 99 99

 17 99 99 99 99 99 99 99 99 99 99 99 99

 18 99 99 99 99 99 99 99 99  _ 99 99 99

 19 99 99 99 99 99 99 99 99 99 99 99 99

 20 99 99 99 99 99 99 99 99 99 99 99 99

 21 99 99 99 99 99 99 99 99 99  _ 99 99

 22 99 99 99 99 99 99 99 99 99 99 99 99

 23 99 99 99 99 99 99 99 99 99 99 99 99

 24 99 99 99 99 99 99 99 99 99 99  _ 99

 25 99 99 99 99 99 99 99 99 99 99 99 99

 26 99 99 99 99 99 99 99 99 99 99 99 99

 27 99 99 99 99 99 99 99 99 99 99 99  _

Here are statistics on the bytes sent for each connection, reported on a scale

of 0-99 with 99 corresponding to 4 bytes:

    sender

     0  3  4  5  9 10 11 12 18 21 24 27

receiver

  0  _ 99 99 99 99 99 99 99 99 99 99 99

  1 99 99 99 99 99 99 99 99 99 99 99 99

  2 99 99 99 99 99 99 99 99 99 99 99 99

  3 99  _ 99 99 99 99 99 99 99 99 99 99

  4 99 99  _ 99 99 99 99 99 99 99 99 99

  5 99 99 99  _ 99 99 99 99 99 99 99 99

  6 99 99 99 99 99 99 99 99 99 99 99 99

  7 99 99 99 99 99 99 99 99 99 99 99 99

  8 99 99 99 99 99 99 99 99 99 99 99 99

  9 99 99 99 99  _ 99 99 99 99 99 99 99

 10 99 99 99 99 99  _ 99 99 99 99 99 99

 11 99 99 99 99 99 99  _ 99 99 99 99 99

 12 99 99 99 99 99 99 99  _ 99 99 99 99

 13 99 99 99 99 99 99 99 99 99 99 99 99

 14 99 99 99 99 99 99 99 99 99 99 99 99

 15 99 99 99 99 99 99 99 99 99 99 99 99

 16 99 99 99 99 99 99 99 99 99 99 99 99

 17 99 99 99 99 99 99 99 99 99 99 99 99

 18 99 99 99 99 99 99 99 99  _ 99 99 99

 19 99 99 99 99 99 99 99 99 99 99 99 99

 20 99 99 99 99 99 99 99 99 99 99 99 99

 21 99 99 99 99 99 99 99 99 99  _ 99 99

 22 99 99 99 99 99 99 99 99 99 99 99 99

 23 99 99 99 99 99 99 99 99 99 99 99 99

 24 99 99 99 99 99 99 99 99 99 99  _ 99

 25 99 99 99 99 99 99 99 99 99 99 99 99

 26 99 99 99 99 99 99 99 99 99 99 99 99

 27 99 99 99 99 99 99 99 99 99 99 99  _

The average length of point-to-point messages was 4 bytes per message.

Using `mpdump` to Convert Intermediate Binary Files to ASCII Files

The mpdump command converts each raw (unanalyzed) intermediate file that is generated by the MPI library into a readable ASCII file. The mpdump command produces files that have the .txt extension. For example, the following command creates a series of ASCII files in the current directory (the default) based on all the processes that are listed in the index file mpprof.index.cre.14:

% mpdump mpprof.index.cre.14

The `mpdump` Command Syntax

The syntax of the mpdump command is

mpdump [-h] [-V] [-S] [-p process-set] [-o output-directory] index-file

where the following command options are available:


Option	Description
-h	Lists the optional arguments and exits.
-V	Displays the program version during start up.
-S	Uses the secure ssh and scp commands instead of rsh and rcp.
-p	Specifies a subset of the processes to be included in the mpdump output. The `-p` process-set option is entered as a comma-separated list of tuples. If the -p option is used with -S and/or -g, the operation will be performed only on the specified subset. If a process set encompasses the complete set of ranks, the -p option will have no effect. If the -p option is given a stride, the upper bound of the stride is omitted. For example, if a stride is specified as 1:4:2, the upper bound (4) is not included in the set. In this example, the resulting set is {1,3}.
-o	Specifies that the mpdump output files are to be written to the named directory. The named directory must already exist. If the specified output-directory does not exist, an error message will be issued. If -o is not specified, the output files will be written to the current directory.

A Sample `mpdump` File

The following command creates 28 ASCII text files that are based on the 28 files that were created in the previous section using the mprun command.

% mpdump mpprof.index.cre.39

Each of the 28 files contains profiling data about a process in the job. The contents of one of the files is shown below:

H: version="MPProf 1.0" np=28 rank=0 timeperiod=60.000000;

H: jobname="cre.39" pid=8399 date="Tue Oct 29 14:04:29 2002"

H: maxfilesize=52428800 hostname="hpc-smp6-0"

H: arg0="/home/mshonle/tests/first-string/connectivity"

V: variable=MPI_POLLALL actual=1 user=<unset>;

V: variable=MPI_PROCBIND actual=1 user=L;

V: variable=MPI_SPIN actual=1 user=1;

V: variable=MPI_CANONREDUCE actual=0 user=<unset>;

V: variable=MPI_OPTCOLL actual=1 user=<unset>;

V: variable=MPI_EAGERONLY actual=1 user=<unset>;

V: variable=MPI_COSCHED actual=2 user=<unset>;

V: variable=MPI_FLOWCONTROL actual=0 user=<unset>;

V: variable=MPI_FULLCONNINIT actual=0 user=<unset>;

V: variable=MPI_WARMUP actual=0 user=<unset>;

x:shm.2: H: version="MPProf 1.0" hostname="hpc-smp6-0";

x:shm.2: V: variable=MPI_SHM_CPOOLSIZE actual=24576 user=<unset>;

x:shm.2: V: variable=MPI_SHM_CYCLESIZE actual=8192 user=<unset>;

x:shm.2: V: variable=MPI_SHM_CYCLESTART actual=24576 user=<unset>;

x:shm.2: V: variable=MPI_SHM_NUMPOSTBOX actual=16 user=<unset>;

x:shm.2: V: variable=MPI_SHM_PIPESIZE actual=8192 user=<unset>;

x:shm.2: V: variable=MPI_SHM_PIPESTART actual=2048 user=<unset>;

x:shm.2: V: variable=MPI_SHM_SBPOOLSIZE actual=0 user=<unset>;

x:shm.2: V: variable=MPI_SHM_SHORTMSGSIZE actual=256 user=<unset>;

x:shm.2: V: variable=MPI_SHM_RENDVSIZE actual=24576 user=<unset>;

x:shm.2: V: variable=MPI_SHM_GBPOOLSIZE actual=20971520 user=<unset>;

h: start=0.000030 end=0.008181 snapshot=1;

c: routine=MPI_Barrier time=0.005588 bytessent=0 bytesrecv=0 calls=1;

c: routine=MPI_Comm_rank time=0.000003 bytessent=0 bytesrecv=0 calls=1;

c: routine=MPI_Comm_size time=0.000002 bytessent=0 bytesrecv=0 calls=1;

c: routine=MPI_Recv time=0.001103 bytessent=0 bytesrecv=108 calls=27;

c: routine=MPI_Send time=0.000248 bytessent=108 bytesrecv=0 calls=27;

p: destrank=1 bytessent=4 msgssent=1;

p: destrank=2 bytessent=4 msgssent=1;

p: destrank=3 bytessent=4 msgssent=1;

p: destrank=4 bytessent=4 msgssent=1;

p: destrank=5 bytessent=4 msgssent=1;

p: destrank=6 bytessent=4 msgssent=1;

p: destrank=7 bytessent=4 msgssent=1;

p: destrank=8 bytessent=4 msgssent=1;

p: destrank=9 bytessent=4 msgssent=1;

p: destrank=10 bytessent=4 msgssent=1;

p: destrank=11 bytessent=4 msgssent=1;

p: destrank=12 bytessent=4 msgssent=1;

p: destrank=13 bytessent=4 msgssent=1;

p: destrank=14 bytessent=4 msgssent=1;

p: destrank=15 bytessent=4 msgssent=1;

p: destrank=16 bytessent=4 msgssent=1;

p: destrank=17 bytessent=4 msgssent=1;

p: destrank=18 bytessent=4 msgssent=1;

p: destrank=19 bytessent=4 msgssent=1;

p: destrank=20 bytessent=4 msgssent=1;

p: destrank=21 bytessent=4 msgssent=1;

p: destrank=22 bytessent=4 msgssent=1;

p: destrank=23 bytessent=4 msgssent=1;

p: destrank=24 bytessent=4 msgssent=1;

p: destrank=25 bytessent=4 msgssent=1;

p: destrank=26 bytessent=4 msgssent=1;

p: destrank=27 bytessent=4 msgssent=1;

x:shm.2: alloc_mem=MPI_Alloc_mem no_memory_allocated=0;

e:

Enabling MPI Profiling

Controlling Data Collection

MPI_PROFDATADIR

MPI_PROFINDEXFDIR

MPI_PROFINTERVAL

MPI_PROFMAXFILESIZE