Compaq Fortran
User Manual for
Tru64 UNIX and
Linux Alpha Systems




Chapter 5
Performance: Making Programs Run Faster

This chapter contains the following topics:
  • Section 5.1, Efficient Compilation and the Software Environment
  • Section 5.2, Using the time Command to Measure Performance
  • Section 5.3, Using Profiling Tools
  • Section 5.4, Data Alignment Considerations

Note

To invoke the Compaq Fortran compiler, use:
  • f90 on Tru64 UNIX Alpha systems
  • fort command on Linux Alpha systems


This chapter uses f90 to indicate invoking Compaq Fortran on both systems, so replace this command with fort if you are working on a Linux Alpha system.

To invoke the Compaq C compiler, use:
  • cc on Tru64 UNIX Alpha systems
  • ccc on Linux Alpha systems


This chapter uses cc to indicate invoking Compaq C on both systems, so replace this command with ccc if you are working on a Linux Alpha system.

5.1 Efficient Compilation and the Software Environment

Before you attempt to analyze and improve program performance, you should:
  • Install the latest version of Compaq Fortran and performance products (see Section 5.1.1)
  • Compile using multiple source files and appropriate f90 options (see Section 5.1.2)
  • Check the process shell environment and related influences on performance (see Section 5.1.3)

5.1.1 Install the Latest Version of Compaq Fortran and Performance Products

To ensure that your software development environment can significantly improve the run-time performance of your applications, obtain and install the following optional software products:

For More Information:

5.1.2 Compile Using Multiple Source Files and Appropriate f90 Options

During the earlier stages of program development, you can use incremental compilation with minimal optimization. For example:


% f90 -c -O1 sub2.f90
% f90 -c -O1 sub3.f90
% f90 -o main.out -g -O0 main.f90  sub2.o  sub3.o

During the later stages of program development, you should specify multiple source files together and use an optimization level of at least -O4 on the f90 command line to allow more interprocedure optimizations to occur. For instance, the following command compiles all three source files together using the default level of optimization ( -O4 ):


% f90 -o main.out main.f90  sub2.f90  sub3.f90

Compiling multiple source files lets the compiler examine more code for possible optimizations, which results in:

For very large programs, compiling all source files together may not be practical. In such instances, consider compiling source files containing related routines together using multiple f90 commands, rather than compiling source files individually.

Table 5-1 shows f90 options that can improve performance. Most of these options do not affect the accuracy of results; a few improve run-time performance but can change some numeric results.

Compaq Fortran performs certain optimizations by default unless you specify the appropriate f90 command options to disable them. Additional optimizations can be enabled or disabled using f90 command options.

Table 5-1 lists the f90 options that can directly improve run-time performance.

Table 5-1 Options That Affect Run-Time Performance
Option Names Description For More Information
-align keyword Controls whether padding bytes are added between data items within common blocks, derived-type data, and Compaq Fortran record structures to make the data items naturally aligned. Section 5.4
-architecture keyword Determines the type of Alpha architecture code instructions to be generated for the program unit being compiled. All Alpha processors implement a core set of instructions; certain processor versions include additional instruction extensions. Section 3.5
-cord and -feedback file Uses a feedback file created during a previous compilation by specifying the -gen_feedback option. These options use the feedback file to improve run-time performance, optionally using cord to rearrange procedures. Section 5.3.5
-fast Sets the following performance-related options:
-align dcommons
-align sequence
-arch host
-assume bigarrays (TU*X ONLY)
-assume nozsize (TU*X ONLY)
-assume noaccuracy_sensitive (same as -fp_reorder )
-math_library fast
-tune host
See description of each option
-fp_reorder Allows the compiler to reorder code based on algebraic identities to improve performance, enabling certain optimizations. The numeric results can be slightly different from the default ( -no_fp_reorder ) because of the way intermediate results are rounded. This slight difference in numeric results is acceptable to most programs. Section 5.9.7
-gen_feedback Requests generated code that allows accurate feedback information for subsequent use of the -feedback file option (optionally with cord ). Using -gen_feedback changes the default optimization level from -O4 to -O0 . Section 5.3.5
-hpf num and related options (TU*X ONLY) Specifies that the code generated for this program will allow parallel execution on multiple processors Section 3.50
-inline all Inlines every call that can possibly be inlined while generating correct code. Certain recursive routines are not inlined to prevent infinite loops. Section 5.9.3
-inline speed Inlines procedures that will improve run-time performance with a likely significant increase in program size. Section 5.9.3
-inline size Inlines procedures that will improve run-time performance without a significant increase in program size. This type of inlining occurs at optimization levels -O4 and -O5 . Section 5.9.3
-math_library fast Requests the use of certain math library routines (used by intrinsic functions) that provide faster speed. Using this option causes a slight loss of accuracy and provides less reliable arithmetic exception checking to get significant performance improvements in those functions. Section 3.61
-mp (TU*X ONLY) Enables parallel processing using directed decomposition (directives inserted in source code). This can improve the performance of certain programs running on shared memory multiprocessor systems Section 3.64
-On ( -O0 to -O5 ) Controls the optimization level and thus the types of optimization performed. The default optimization level is -O4 , unless you specify -g2 , -g , or -gen_feedback , which changes the default to -O0 (no optimizations). Use -O5 to activate loop transformation optimizations. Section 5.8
-om (TU*X ONLY) Used with the -non_shared option to request certain code optimizations after linking, including nop (No Operation) removal, .lita removal, and reallocation of common symbols. This option also positions the global pointer register so the maximum addresses fall in the global-pointer window. Section 3.73
-omp (TU*X ONLY) Enables parallel processing using directed decomposition (directives inserted in source code). This can improve the performance of certain programs running on shared memory multiprocessor systems Section 3.74
-p , -p1 Requests profiling information, which you can use to identify those parts of your program where improving source code efficiency would most likely improve run-time performance. After you modify the appropriate source code, recompile the program and test the run-time performance. Section 5.3
-pg Requests profiling information for the gprof tool, which you can use to identify those parts of your program where improving source code efficiency would most likely improve run-time performance. After you modify the appropriate source code, recompile the program and test the run-time performance. Section 5.3
-pipeline Activates the software pipelining optimization (a subset of -o4 ). Section 3.76
-speculate keyword (TU*X ONLY) Enables the speculative execution optimization, a form of instruction scheduling for conditional expressions. Section 3.84
-transform_loops Activates a group of loop transformation optimizations (a subset of -o5 ). Section 3.89
-tune keyword Specifies the target processor generation (chip) architecture on which the program will be run, allowing the optimizer to make decisions about instruction tuning optimizations needed to create the most efficient code. Keywords allow specifying one particular Alpha processor generation type, multiple processor generation types, or the processor generation type currently in use during compilation. Regardless of the setting of -tune keyword , the generated code will run correctly on all implementations of the Alpha architecture. Section 5.9.4
-unroll num Specifies the number of times a loop is unrolled ( num ) when specified with optimization level -O3 or higher. If you omit -unroll num , the optimizer determines how many times loops are unrolled. Section 5.8.4.1

Table 5-2 lists options that can slow program performance. Some applications that require floating-point exception handling or rounding might need to use the -fpe n and -fprm dynamic options. Other applications might need to use the -assume dummy_aliases or -vms options for compatibility reasons. Other options listed in Table 5-2 are primarily for troubleshooting or debugging purposes.

Table 5-2 Options that Slow Run-Time Performance
Option Names Description For More Information
-assume dummy_aliases Forces the compiler to assume that dummy (formal) arguments to procedures share memory locations with other dummy arguments or with variables shared through use association, host association, or common block use. These program semantics slow performance, so you should specify -assume dummy_aliases only for the called subprograms that depend on such aliases.

The use of dummy aliases violates the FORTRAN-77 and Fortran 95/90 standards but occurs in some older programs.

Section 5.9.8
-c If you use -c when compiling multiple source files, also specify -o output to compile many source files together into one object file. Separate compilations prevent certain interprocedure optimizations, such as when using multiple f90 commands or using -c without the -o output option. Section 2.1.6
-check bounds Generates extra code for array bounds checking at run time. Section 3.23
-check omp_bindings (TU*X ONLY) Provides run-time checking to enforce the binding rules for OpenMP Fortran API (parallel processing) compiler directives inserted in source code. Section 3.26
-check overflow Generates extra code to check integer calculations for arithmetic overflow at run time. Once the program is debugged, omit this option to reduce executable program size and slightly improve run-time performance. Section 3.28
-fpe n values greater than -fpe0 Using -fpe1 (TU*X ONLY) , -fpe2 (TU*X ONLY) , -fpe3 , or -fpe4 (TU*X ONLY) (or using the for_set_fpe routine to set equivalent exception handling) slows program execution. For programs that specify -fpe3 or -fpe4 (TU*X ONLY) , the impact on run-time performance can be significant. Section 3.44
-fprm dynamic (TU*X ONLY) Certain rounding modes and changing the rounding mode can slow program execution slightly. Section 3.46
-g , -g2 , -g3 Generates extra symbol table information in the object file. Specifying -g or -g2 also reduces the default level of optimization to -O0 . Section 3.48
-inline none , -inline manual Prevents the inlining of all procedures (except statement functions). Section 5.9.3
-O0 , -O1 , -O2 , or -O3 Minimizes the optimization level (and types of optimizations). Use during the early stages of program development or when you will use the debugger. Section 3.72 and Section 5.8
-synchronous_exceptions Generates extra code to associate an arithmetic exception with the instruction that causes it, slowing efficient instruction execution. Use this option only when troubleshooting, such as when identifying the source of an exception. Section 3.86
-vms Controls certain VMS-related run-time defaults, including alignment. If you specify the -vms option, you may need to also specify the -align records option to obtain optimal run-time performance. Section 3.98

For More Information:

5.1.3 Process Shell Environment and Related Influences on Performance

Certain shell commands and system tuning can improve run-time performance:

For More Information:

5.2 Using the time Command to Measure Performance

Use the time command to provide information about program performance.

Run program timings when other users are not active. Your timing results can be affected by other CPU-intensive processes running at the same time as your timings.

Try to run the program under the same conditions each time to provide the most accurate results, especially when comparing execution times with a previous version of the same program. If possible, use the same system (model, amount of memory, version of the operating system, and so on).

If you do need to change systems, you should measure the time using the same version of the program on both systems, so you know each system's effect on your timings.

For programs that run for less than a few seconds, run several timings to ensure that the results are not misleading. Overhead functions like loading shared libraries might influence short timings considerably.

Using the form of the time command that specifies the name of the executable program provides the following:

In the following example timings, the sample program being timed displays the following line:


Average of all the numbers is:    4368488960.000000

Using the Bourne shell, the following program timing reports that the program uses 1.19 seconds of total actual CPU time (0.61 seconds in actual CPU time for user program use and 0.58 seconds of actual CPU time for system use) and 2.46 seconds of elapsed time:


$ time a.out
Average of all the numbers is:    4368488960.000000 
real    0m2.46s 
user    0m0.61s 
sys     0m0.58s 

Using the C shell, the following program timing reports 1.19 seconds of total actual CPU time (0.61 seconds in actual CPU time for user program use and 0.58 seconds of actual CPU time for system use), about 4 seconds (0:04) of elapsed time, the use of 28% of available CPU time, and other information:


% time a.out
Average of all the numbers is:    4368488960.000000 
0.61u 0.58s 0:04 28% 78+424k 9+5io 0pf+0w

Using the bash shell (L*X ONLY), the following program timing reports that the program uses 1.19 seconds of total actual CPU time (0.61 seconds in actual CPU time for user program use and 0.58 seconds of actual CPU time for system use) and 2.46 seconds of elapsed time:


[user@system user]$ time ./a.out
Average of all the numbers is:    4368488960.000000 
elapsed  0m2.46s 
user     0m0.61s 
sys      0m0.58s 

Timings that show a large amount of system time may indicate a lot of time spent doing I/O, which might be worth investigating.

If your program displays a lot of text, you can redirect the output from the program on the time command line. (See Section 5.1.3.) Redirecting output from the program will change the times reported because of reduced screen I/O.

For more information, see time(1).

In addition to the time command, you might consider modifying the program to call routines that measure execution time from within the program, such as the CPU_TIME or SYSTEM_CLOCK intrinsic procedures.

5.3 Using Profiling Tools

To generate profiling information, use the f90 compiler and the prof , gprof , and pixie (TU*X ONLY) tools.

Profiling identifies areas of code where significant program execution time is spent. Along with the f90 command, use the prof and pixie (TU*X ONLY) tools to generate the following profile information:

Once you have determined those sections of code where most of the program execution time is spent, examine these sections for coding efficiency. Suggested guidelines for improving source code efficiency are provided in Section 5.7.

Along with profiling, you can consider generating a listing file annotated with the optimizations performed, by specifying the -V and -annotations options.

5.3.1 Program Counter Sampling (prof)

To obtain program counter sampling data, perform the following steps:

  1. Use the f90 command option -p to compile and link the program:


    % f90 -p -O3 -o profsample profsample.f90
    

    If you specify the -c option to prevent linking, you must specify the -p option when you link the program:


    % f90 -c -O3 profsample.f90
    % f90 -p -O3 -o profsample profsample.o
    

    Consider specifying optimization level -O3 or -inline manual to minimize the inlining of procedures. Once inlined, procedures are not listed as separate routines but as part of the routine into which they have been inlined. Allowing full inlining would result in program counter sampling for a small number of (usually) large routines. This might not help you locate areas of the program where significant program execution time is spent.

  2. Execute the profiled program:


    % profsample
    

    During program execution, profiling data is written to a profile data file, whose default name is mon.out . You can execute the program multiple times to generate multiple profile data files, which can be averaged. Use the PROFDIR environment variable to request a different profile data file name.

  3. Run the prof command, which formats the profiling data and displays it in a readable format:


    % prof  profsample mon.out
    

You can limit the report created by prof by using prof command options, such as -only , -exclude , or -quit .

For example, if you only want reports on procedures calc_max and calc_min , you could use the following command line to read the profile data file named mon.out :


% prof -only calc_max -only calc_min profsample

The time spent in particular areas of code is reported by prof in the form of a percentage of the total CPU time spent by the program. To reduce the size of the report, you can either:
  • Limit the report to specific procedures, using the -only or -exclude options
  • Truncate the report after the most significant entries, using the -quit option

When you use the -only or -exclude options, the percentages are still based on all procedures of the application. To obtain percentages that are based only on those procedures included in the report, use the -Only and -Exclude options instead (the same options with an uppercase initial letter).

You can use the -quit option to reduce the amount of information reported. For example, the following command prints information on only the five most time-consuming procedures:


% prof -quit 5 profsample 

The following command limits information only to those procedures using 10% or more of the total execution time:


% prof  -quit 10% profsample 

For More Information:

5.3.2 Call Graph Sampling (gprof)

To obtain call graph information, use the gprof tool. Perform the following steps:

  1. Use the command-line option -pg when you compile and link the program:


    % f90 -pg -O3 -o profsample profsample.f90
    

    If you specify the -c option to prevent linking, you must then specify the -pg option both when you compile and link the program:


    % f90 -pg -c -O3 profsample.f90
    % f90 -pg -O3 -o profsample profsample.o
    

  2. Execute the profiled program:


    % profsample
    

    During execution, profiling data is saved to the file gmon.out , unless the environment variable PROFDIR is set.

  3. Run the formatting program gprof :


    % gprof profsample gmon.out
    

The output produced by gprof includes:

For More Information:

5.3.3 Basic Block Counting (pixie and prof)

To obtain basic block counting information, perform the following steps:

  1. Compile and link the program without the -p option:


    % f90 -O3 -o profsample profsample.f90
    

    Consider specifying optimization level -O3 or -inline manual to minimize the inlining of procedures (once inlined, procedures are not listed as separate routines but as part of the routine into which they are inlined).

  2. Run the profiling command pixie : (TU*X ONLY)


    % atom -tool pixie profsample
    

    The pixie command creates: (TU*X ONLY)
      • The instrumented program, profsample.pixie
      • The file profsample.addrs , which contains address information used by prof

  3. Execute the profiled program profsample.pixie generated by pixie :


    % profsample.pixie
    

    This program creates the file profsample.counts , which contains the basic block counts.

  4. Run prof with the -pixie option, to extract and display information from the profsample.addrs and profsample.counts files:


    % prof -pixie profsample 
    

    When you specify the -pixie option (TU*X ONLY), the prof command searches for files with a suffix of .addrs and .counts (in this case profsample.addrs and profsample.counts ).
    You can reduce the amount of information in the report created by prof by using the -only , -exclude , -quit , and related options.

To create multiple profile data files, run the program multiple times.

For More Information:

5.3.4 Source Line CPU Cycle Use (prof and pixie)

You can use the same files created by the pixie command for basic block counting (see Section 5.3.3) to estimate the number of CPU cycles used to execute each source file line.

To view a report of the number of CPU cycles estimated for each source file line, use the -pixie and -heavy options with the prof command.

Depending on the level of optimization chosen, certain source lines might be optimized away.

The CPU cycle use estimates are based primarily on the instruction type and its operands and do not include memory effects such as cache misses or translation buffer fills.

For example, the following command sequence compiles and links the program, instruments it with pixie , runs the instrumented program to collect basic block counts, and reports the estimated cycle use of each source line in procedure calc_max :


% f90 -o profsample profsample.f90
% atom -tool pixie profsample
% profsample.pixie
% prof -pixie -heavy -only calc_max profsample

5.3.5 Creating and Using Feedback Files and Optionally cord

You can create a feedback file by using a series of commands. Once created, you can specify a feedback file in a subsequent compilation with the f90 command option -feedback . You can also request that cord use the feedback file to rearrange procedures, by specifying the -cord option on the f90 command line.

To create the feedback file, complete these steps:

  1. Compile and link the program. Omit the -p option, but specify the -gen_feedback option:


    % f90 -o profsample -gen_feedback profsample.f90
    

    The -gen_feedback option changes the default optimization level to -O0 .
    To include libraries in the profiling output, specify -non_shared .

  2. Execute the profiling command pixie (TU*X ONLY):


    % pixie profsample
    

    The pixie command creates:
      • The instrumented program, profsample.pixie
      • The file profsample.addrs , which contains address information used by prof

  3. Execute the profiled program profsample.pixie generated by pixie :


    % profsample.pixie
    

    This program creates the file profsample.counts , which contains the basic block counts.

  4. Run prof with the -pixie and -feedback options:


    % prof -pixie -feedback profsample.feedback profsample
    

    This prof command creates the feedback file profsample.feedback .

You can use the feedback file as input to the f90 compiler:


% f90 -feedback profsample.feedback -o profsample profsample.f90

The feedback file provides the compiler with actual execution information, which the compiler can use to improve such optimizations as inlining function calls.

Specify the desired optimization level ( -On option) for the f90 command with the -feedback name option (in this example the default is -O4 ).

You can use the feedback file as input to the f90 compiler and cord , as follows:


% f90 -cord -feedback profsample.feedback -o profsample profsample.f90

The -cord option invokes cord , which reorders the procedures in an executable program to improve program execution, using the information in the specified feedback file. Specify the desired optimization level ( -On option) for the f90 command with the -feedback name option (in this example -O4 ).

5.3.6 Atom Toolkit

(TU*X ONLY) The Atom toolkit includes a programmable instrumentation tool and several prepackaged tools. The prepackaged tools include:

To invoke atom tools, use the following general command syntax:


% atom -tool tool-name ...

Atom does not work on programs built with the -om option.

For More Information:

5.4 Data Alignment Considerations

For optimal performance on Alpha systems, make sure your data is aligned naturally.

A natural boundary is a memory address that is a multiple of the data item's size (data type sizes are described in Table 9-1). For example, a REAL (KIND=8) data item aligned on natural boundaries has an address that is a multiple of 8. An array is aligned on natural boundaries if all of its elements are.

All data items whose starting address is on a natural boundary are naturally aligned. Data not aligned on a natural boundary is called unaligned data.

Although the Compaq Fortran compiler naturally aligns individual data items when it can, certain Compaq Fortran statements (such as EQUIVALENCE) can cause data items to become unaligned (see Section 5.4.1).

Although you can use the f90 command -align keyword options to ensure naturally aligned data, you should check and consider reordering data declarations of data items within common blocks and structures. Within each common block, derived type, or record structure, carefully specify the order and sizes of data declarations to ensure naturally aligned data. Start with the largest size numeric items first, followed by smaller size numeric items, and then nonnumeric (character) data.

5.4.1 Causes of Unaligned Data and Ensuring Natural Alignment

Common blocks (COMMON statement), derived-type data, and Compaq Fortran 77 record structures (RECORD statement) usually contain multiple items within the context of the larger structure.

The following declaration statements can force data to be unaligned:
  • EQUIVALENCE statements that associate data items at unaligned storage locations
  • COMMON statements that arrange data items in an order that leaves some items unaligned
  • STRUCTURE and RECORD statements whose fields are declared in an order that leaves some fields unaligned
  • Derived-type (TYPE) declarations that specify SEQUENCE, which prevents the compiler from adding padding bytes

To avoid unaligned data in a common block, derived-type data, or record structure (extension), use one or both of the following:
  • Carefully order the data declarations from largest to smallest, with character data last (see Section 5.4.3)
  • Specify the appropriate f90 -align keyword options to add any needed padding bytes (see Section 5.4.4)

Other possible causes of unaligned data include unaligned actual arguments and arrays that contain a derived-type structure or Compaq Fortran record structure.

When actual arguments from outside the program unit are not naturally aligned, unaligned data access will occur. Compaq Fortran assumes all passed arguments are naturally aligned and has no information at compile time about data that will be introduced by actual arguments during program execution.

For arrays where each array element contains a derived-type structure or Compaq Fortran record structure, the size of the array elements may cause some elements (but not the first) to start on an unaligned boundary.

Even if the data items are naturally aligned within a derived-type structure without the SEQUENCE statement or a record structure, the size of an array element might require use of f90 -align options to supply needed padding to avoid some array elements being unaligned.

If you specify -align norecords or specify -vms without -align records , no padding bytes are added between array elements. If array elements each contain a derived-type structure with the SEQUENCE statement, array elements are packed without padding bytes regardless of the f90 command options specified. In this case, some elements will be unaligned.

When the -align records option is in effect, the number of padding bytes the compiler adds to each array element depends on the size of the largest data item within the structure. For a derived-type structure without the SEQUENCE statement, or for a record structure, the compiler rounds the size of each array element up to an exact multiple of the size of its largest data item, adding the appropriate number of padding bytes.

For instance, if a structure contains an 8-byte floating-point number followed by a 3-byte character variable, each element contains five bytes of padding (16 is an exact multiple of 8). However, if the structure contains one 4-byte floating-point number, one 4-byte integer, followed by a 3-byte character variable, each element would contain one byte of padding (12 is an exact multiple of 4).

For More Information:

5.4.2 Checking for Inefficient Unaligned Data

During compilation, the Compaq Fortran compiler naturally aligns as much data as possible. Exceptions that can result in unaligned data are described in Section 5.4.1.

Because unaligned data can slow run-time performance, it is worthwhile to:

There are two ways unaligned data might be reported:
  • During compilation, the compiler issues warning messages for data it can detect is unaligned
  • During program execution, the operating system reports unaligned access messages when unaligned data is referenced

5.4.3 Ordering Data Declarations to Avoid Unaligned Data

For new programs or when the source declarations of an existing program can be easily modified, plan the order of your data declarations carefully to ensure the data items in a common block, derived-type data, record structure, or data items made equivalent by an EQUIVALENCE statement will be naturally aligned.

Use the following rules to prevent unaligned data:
  • Declare the largest numeric data items first
  • Declare smaller numeric data items next, in descending order of size
  • Declare any character data last

When declaring data, consider using explicit length declarations, such as specifying a KIND parameter. For example, specify INTEGER(KIND=4) (or INTEGER(4)) rather than INTEGER. If you do use a default length (such as INTEGER, LOGICAL, COMPLEX, and REAL), be aware that the compiler options -integer_size and -real_size can change the size of an individual field's data declaration and thus can alter the data alignment of a carefully planned order of data declarations.

Using the suggested data declaration guidelines minimizes the need to use the -align keyword options to add padding bytes to ensure naturally aligned data. In cases where the -align keyword options are still needed, using the suggested data declaration guidelines can minimize the number of padding bytes added by the compiler.

5.4.3.1 Arranging Data Items in Common Blocks

The order of data items in a COMMON statement determines the order in which the data items are stored. Consider the following declaration of a common block named X:


LOGICAL (KIND=2) FLAG 
INTEGER          IARRY_I(3) 
CHARACTER(LEN=5) NAME_CH 
COMMON /X/ FLAG, IARRY_I, NAME_CH 

As shown in Figure 5-1, if you omit the appropriate f90 command options, the common block will contain unaligned data items beginning at the first array element of IARRY_I.

Figure 5-1 Common Block with Unaligned Data



As shown in Figure 5-2, if you compile the program units that use the common block with the -align commons options, data items will be naturally aligned.

Figure 5-2 Common Block with Naturally Aligned Data



Because the common block X contains data items whose size is 32 bits or smaller, specify -align commons . If the common block contains data items whose size might be larger than 32 bits (such as REAL (KIND=8) data), use -align dcommons .

If you can easily modify the source files that use the common block data, define the numeric variables in the COMMON statement in descending order of size and place the character variable last. This provides more portability, ensures natural alignment without padding, and does not require the f90 command options -align commons or -align dcommons :


LOGICAL (KIND=2) FLAG 
INTEGER          IARRY_I(3) 
CHARACTER(LEN=5) NAME_CH 
COMMON /X/ IARRY_I, FLAG, NAME_CH 

As shown in Figure 5-3, if you arrange the order of variables from largest to smallest size and place character data last, the data items will be naturally aligned.

Figure 5-3 Common Block with Naturally Aligned Reordered Data



When modifying or creating all source files that use common block data, consider placing the common block data declarations in a module so the declarations are consistent. If the common block is not needed for compatibility (such as file storage or Compaq Fortran 77 use), you can place the data declarations in a module without using a common block.

5.4.3.2 Arranging Data Items in Derived-Type Data

Like common blocks, derived-type data may contain multiple data items (members).

Data item components within derived-type data will be naturally aligned on up to 64-bit boundaries, with certain exceptions related to the use of the SEQUENCE statement and f90 options. See Section 5.4.4 for information about these exceptions.

Compaq Fortran stores a derived data type as a linear sequence of values, with the first component in the first storage location and the last component in the last storage location.

Consider the following declaration of array CATALOG_SPRING of derived-type PART_DT:


MODULE DATA_DEFS 
  TYPE PART_DT 
    INTEGER           IDENTIFIER 
    REAL              WEIGHT 
    CHARACTER(LEN=15) DESCRIPTION 
  END TYPE PART_DT 
  TYPE (PART_DT) CATALOG_SPRING(30) 
  . 
  . 
  . 
END MODULE DATA_DEFS 

As shown in Figure 5-4, the largest numeric data items are defined first and the character data type is defined last. There are no padding characters between data items and all items are naturally aligned. The trailing padding byte is needed because CATALOG_SPRING is an array; it is inserted by the compiler when the -align records option is in effect.

Figure 5-4 Derived-Type Naturally Aligned Data


5.4.3.3 Arranging Data Items in Compaq Fortran Record Structures

Compaq Fortran supports record structures, which use the RECORD statement and optionally the STRUCTURE statement; both are extensions to the FORTRAN-77 and Fortran 95/90 standards. The order of data items in a STRUCTURE statement determines the order in which the data items are stored.

Compaq Fortran stores a record in memory as a linear sequence of values, with the record's first element in the first storage location and its last element in the last storage location. Unless you specify -align norecords , padding bytes are added if needed to ensure data fields are naturally aligned.

The following example contains a structure declaration, a RECORD statement, and diagrams of the resulting records as they are stored in memory:


STRUCTURE /STRA/ 
  CHARACTER*1 CHR 
  INTEGER*4 INT 
END STRUCTURE 
   .
   .
   .
RECORD /STRA/ REC 

Figure 5-5 shows the memory diagram of record REC for naturally aligned records.

Figure 5-5 Memory Diagram of REC for Naturally Aligned Records


5.4.4 Options Controlling Alignment

The following options control whether the Compaq Fortran compiler adds padding (when needed) to naturally align multiple data items in common blocks, derived-type data, and Compaq Fortran record structures:

The default behavior is that multiple data items in derived-type data and record structures will be naturally aligned; data items in common blocks will not ( -align records with -align nocommons ). In derived-type data, using the SEQUENCE statement prevents -align records from adding needed padding bytes to naturally align data items.

If your command line includes the -std , -std90 , or -std95 options, then the compiler ignores -align dcommons and -align sequence . See Section 3.85.

5.5 Using Arrays Efficiently

The following sections discuss:

5.5.1 Accessing Arrays Efficiently

On Alpha systems, many of the array access efficiency techniques described in this section are applied automatically by the Compaq Fortran loop transformation optimizations (see Section 5.8.7) or by the Compaq KAP Fortran/OpenMP for Tru64 UNIX Systems performance preprocessor (described in Section 5.1.1).

Several aspects of array use can improve run-time performance:

5.5.2 Passing Array Arguments Efficiently

In Fortran 95/90, there are two general types of array arguments:

When passing arrays as arguments, either the starting (base) address of the array or the address of an array descriptor is passed:

Passing an assumed-shape array or an array pointer to a routine with an explicit-shape dummy array can slow run-time performance, because the compiler must create an array temporary for the entire array: the passed array may not be contiguous, while the receiving (explicit-shape) array requires contiguous storage. When an array temporary is created, the size of the passed array determines whether the slowdown is slight or severe.

Table 5-3 summarizes what happens with the various combinations of array types. The amount of run-time performance inefficiency depends on the size of the array.

Table 5-3 Output Argument Array Types

Input argument: explicit-shape array

  Output argument explicit-shape array: Very efficient. Does not use an
  array temporary. Does not pass an array descriptor. Interface block
  optional.

  Output argument deferred-shape or assumed-shape array: Efficient.
  Only allowed for assumed-shape arrays (not deferred-shape arrays).
  Does not use an array temporary. Passes an array descriptor. Requires
  an interface block.

Input argument: deferred-shape or assumed-shape array

  Output argument explicit-shape array: When passing an allocatable
  array, very efficient. Does not use an array temporary. Does not pass
  an array descriptor. Interface block optional. When not passing an
  allocatable array, not efficient: an array temporary is used, no
  array descriptor is passed, and the interface block is optional.
  Instead, use allocatable arrays whenever possible.

  Output argument deferred-shape or assumed-shape array: Efficient.
  Requires an assumed-shape or array pointer dummy argument. Does not
  use an array temporary. Passes an array descriptor. Requires an
  interface block.

5.6 Improving Overall I/O Performance

Improving overall I/O performance can minimize both device I/O and actual CPU time. The techniques listed in this section can greatly improve performance in many applications.

A bottleneck limits the maximum speed of execution by being the slowest process in an executing program. In some programs, I/O is the bottleneck that prevents an improvement in run-time performance. The key to relieving I/O bottlenecks is to reduce the actual amount of CPU and I/O device time involved in I/O.

Bottlenecks can be caused by one or more of the following:

Improved coding practices can minimize actual device I/O, as well as the actual CPU time.

Compaq offers software solutions to system-wide problems like minimizing device I/O delays (see Section 5.1.1).

5.6.1 Use Unformatted Files Instead of Formatted Files

Use unformatted files whenever possible. Unformatted I/O of numeric data is more efficient and more precise than formatted I/O. Native unformatted data does not need to be modified when transferred and will take up less space on an external file.

Conversely, when writing data to formatted files, formatted data must be converted to character strings for output, less data can transfer in a single operation, and formatted data may lose precision if read back into binary form.

To write the array A(25,25) in the following statements, S1 is more efficient than S2:


S1         WRITE (7) A 
 
S2         WRITE (7,100) A 
     100   FORMAT (25(' ',25F5.2)) 

Although formatted data files are more easily ported to other systems, Compaq Fortran can convert unformatted data in several formats (see Chapter 10).

5.6.2 Write Whole Arrays or Strings

The general guidelines about array use discussed in Section 5.5 also apply to reading or writing an array with an I/O statement.

To eliminate unnecessary overhead, write whole arrays or strings at one time rather than individual elements at multiple times. Each item in an I/O list generates its own calling sequence. This processing overhead becomes most significant in implied-DO loops. When accessing whole arrays, use the array name (Fortran 95/90 array syntax) instead of using implied-DO loops.
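A minimal sketch of the difference (unit number and file name are hypothetical):

```fortran
! Writing the whole array by name generates a single calling sequence;
! the commented implied-DO form generates one per list element.
REAL A(100,100)
A = 1.0
OPEN (UNIT=7, FILE='data.unf', FORM='UNFORMATTED')
WRITE (7) A                                ! efficient: whole array
! WRITE (7) ((A(I,J), I=1,100), J=1,100)   ! slower: element-by-element
CLOSE (7)
END
```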

5.6.3 Write Array Data in the Natural Storage Order

Use the natural ascending storage order whenever possible. This is column-major order, with the leftmost subscript varying fastest and striding by 1. (See Section 5.5.1, Accessing Arrays Efficiently.) If a program must read or write data in any other order, efficient block moves are inhibited.
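For example (unit number and file name are hypothetical), the implied-DO loop below that varies the leftmost subscript fastest matches the natural storage order, while the commented variant strides through memory:

```fortran
! Column-major order: the leftmost subscript varies fastest.
REAL X(3,4)
X = 0.0
OPEN (UNIT=8, FILE='x.unf', FORM='UNFORMATTED')
! Natural order, stride 1 through memory (same order as WRITE (8) X):
WRITE (8) ((X(I,J), I=1,3), J=1,4)
! Unnatural order, stride 3 between elements; inhibits block moves:
! WRITE (8) ((X(I,J), J=1,4), I=1,3)
CLOSE (8)
END
```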

Even if the whole array is not being written, accessing it in natural storage order is still the best order possible.

If you must use an unnatural storage order, in certain cases it might be more efficient to transfer the data to memory and reorder the data before performing the I/O operation.

5.6.4 Use Memory for Intermediate Results

Performance can improve by storing intermediate results in memory rather than storing them in a file on a peripheral device. One situation that may not benefit from using intermediate storage is when there is a disproportionately large amount of data in relation to physical memory on your system. Excessive page faults can dramatically impede virtual memory performance.

If you are primarily concerned with the CPU performance of the system, consider using a memory file system (mfs) virtual disk to hold any files your code reads or writes (see mfs(1)).

5.6.5 Enable Implied-DO Loop Collapsing

DO loop collapsing reduces a major overhead in I/O processing. Normally, each element in an I/O list generates a separate call to the Compaq Fortran RTL. The processing overhead of these calls can be most significant in implied-DO loops.

Compaq Fortran reduces the number of calls in implied-DO loops by replacing up to seven nested implied-DO loops with a single call to an optimized run-time library I/O routine. The routine can transmit many I/O elements at once.

Loop collapsing can occur in formatted and unformatted I/O, but only if certain conditions are met:

For More Information:

5.6.6 Use of Variable Format Expressions

Variable format expressions (a Compaq Fortran extension) are almost as flexible as run-time formatting, but they are more efficient because the compiler can eliminate run-time parsing of the I/O format. Only a small amount of processing and the actual data transfer are required during run time.

On the other hand, run-time formatting can impair performance significantly. For example, in the following statements, S1 is more efficient than S2 because the formatting is done once at compile time, not at run time:


S1        WRITE (6,400) (A(I), I=1,N) 
     400   FORMAT (1X, <N> F5.2) 
                        .
                        .
                        .
S2        WRITE (CHFMT,500) '(1X,',N,'F5.2)' 
    500   FORMAT (A,I3,A) 
          WRITE (6,FMT=CHFMT) (A(I), I=1,N) 

5.6.7 Efficient Use of Record Buffers and Disk I/O

Records being read or written are transferred between the user's program buffers and one or more disk block I/O buffers, which are established when the file is opened by the Compaq Fortran RTL. Unless very large records are being read or written, multiple logical records can reside in the disk block I/O buffer when it is written to disk or read from disk, minimizing physical disk I/O.

You can specify the size of the disk block physical I/O buffer by using the OPEN statement BLOCKSIZE specifier; the default size can be obtained from fstat(2). If you omit the BLOCKSIZE specifier in the OPEN statement, it is set for optimal I/O use with the type of device the file resides on (with the exception of network access).

The OPEN statement BUFFERCOUNT specifier specifies the number of I/O buffers. The default for BUFFERCOUNT is 1. Any experiments to improve I/O performance should increase the BUFFERCOUNT value and not the BLOCKSIZE value, to increase the amount of data read by each disk I/O.

If the OPEN statement has BLOCKSIZE and BUFFERCOUNT specifiers, then the internal buffer size in bytes is the product of these specifiers. If the OPEN statement does not have these specifiers, then the default internal buffer size is 8192 bytes. This internal buffer will grow to hold the largest single record, but will never shrink.
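As an illustration of that product (unit number and file name are hypothetical), the following OPEN requests four 8192-byte buffers, giving an internal buffer of 4 * 8192 = 32768 bytes:

```fortran
! Hypothetical OPEN: four disk-block I/O buffers of 8192 bytes each.
OPEN (UNIT=9, FILE='big.dat', FORM='UNFORMATTED', &
      BLOCKSIZE=8192, BUFFERCOUNT=4)
```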

The default for the Fortran run-time system is to use unbuffered disk writes. That is, by default, records are written to disk immediately as each record is written instead of accumulating in the buffer to be written to disk later.

To enable buffered writes (that is, to allow the disk device to fill the internal buffer before the buffer is written to disk), use one of the following:

  1. The OPEN statement BUFFERED specifier
  2. The -assume buffered_io command-line option
  3. The FORT_BUFFERED run-time environment variable

The OPEN statement BUFFERED specifier takes precedence over the -assume buffered_io option. If neither one is set (which is the default), the FORT_BUFFERED environment variable is tested at run time.

The OPEN statement BUFFERED specifier applies to a specific logical unit. In contrast, the -assume [no]buffered_io option and the FORT_BUFFERED environment variable apply to all Fortran units.
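As a sketch (unit number and file name are hypothetical), buffered writes could be requested for a single unit like this:

```fortran
! Request buffered writes for this unit only; records accumulate in
! the internal buffer instead of being written to disk immediately.
OPEN (UNIT=10, FILE='out.dat', FORM='UNFORMATTED', BUFFERED='YES')
```

Alternatively, setting the FORT_BUFFERED environment variable (for example, with setenv FORT_BUFFERED true) enables buffered writes for all units without changing the source.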

Using buffered writes usually makes disk I/O more efficient by writing larger blocks of data to the disk less often. However, a system failure when using buffered writes can cause records to be lost, since they might not yet have been written to disk. (Such records would have been written to disk with the default unbuffered writes.)

When performing I/O across a network, be aware that the size of the block of network data sent across the network can impact application efficiency. When reading network data, follow the same advice for efficient disk reads, by increasing the BUFFERCOUNT. When writing data through the network, several items should be considered:

When writing records, be aware that I/O records are written to unified buffer cache (UBC) system buffers. To request that I/O records be written from program buffers to the UBC system buffers, use the flush library routine (see flush(3f) and Chapter 12). Be aware that calling flush also discards read-ahead data in the user buffer.

To request that UBC system buffers be written to disk, use the fsync library routine (see fsync(3f) and Chapter 12).

The time at which UBC buffers are written to disk depends on UBC characteristics of the system, such as the vm-ubcbuffers attribute (see the Compaq Tru64 UNIX System Tuning and Performance guide).

For More Information:

5.6.8 Specify RECL

Choose the record length (RECL specifier in an OPEN statement) so that the record length plus its overhead is a multiple or divisor of the block size, which is device specific. For example, if the block size is 8192 bytes, RECL might be 24576 (3 x 8192) or 1024 (8192 / 8).

The RECL value should fill blocks as close to capacity as possible (but not over capacity). Such values let each operation move as much data as possible and waste the least space in each block. Avoid values slightly larger than the block capacity: the excess data only partially fills an additional block, and both allocating the extra buffer memory and writing partial blocks are inefficient.

The RECL unit for formatted files is always bytes. For unformatted files, the RECL unit is 4-byte units, unless you specify the -assume byterecl option to request 1-byte units (see Section 3.7).
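For instance (unit number and file name are hypothetical), the following direct-access OPEN specifies RECL=256, which by default means 256 four-byte units (1024 bytes per record); with -assume byterecl it would mean 256 bytes:

```fortran
! Hypothetical direct-access unformatted file.
! RECL=256 means 1024-byte records by default (4-byte RECL units).
OPEN (UNIT=11, FILE='rec.dat', FORM='UNFORMATTED', &
      ACCESS='DIRECT', RECL=256)
```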

When porting unformatted data files from non-Compaq systems, see Section 10.6.

5.6.9 Use the Optimal Record Type

Unless a certain record type is needed for portability reasons (see Section 7.4.3), choose the most efficient type, as follows:

5.6.10 Reading from a Redirected Standard Input File

Due to certain precautions that the Fortran run-time system takes to ensure the integrity of standard input, reads can be very slow when standard input is redirected from a file. For example, when you use a command such as myprogram.exe < myinput.data , the data is read using the READ(*) or READ(5) statement, and performance is degraded. To avoid this problem, do one of the following:

To take advantage of these methods, be sure your program does not rely on sharing the standard input file.

For More Information:

5.7 Additional Source Code Guidelines for Run-Time Efficiency

Other source coding guidelines can be implemented to improve run-time performance.

The amount of improvement in run-time performance is related to the number of times a statement is executed. For example, improving an arithmetic expression executed many times within a loop improves performance more than improving a similar expression executed once outside the loop.

5.7.1 Avoid Small Integer and Small Logical Data Items

Avoid using integer or logical data less than 32 bits, because the smallest unit of efficient access on Alpha systems is 32 bits.

Accessing a 16-bit (or 8-bit) data type can result in a sequence of machine instructions to access the data, rather than a single, efficient machine instruction for a 32-bit data item.

To minimize data storage and memory cache misses with arrays, use 32-bit data rather than 64-bit data, unless you require the greater numeric range of 8-byte integers or the greater range and precision of double precision floating-point numbers.
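The declarations below sketch the distinction (variable names are hypothetical):

```fortran
! 32-bit items are accessed with single load/store instructions on
! Alpha; 8- and 16-bit items may need extract/insert sequences.
INTEGER (KIND=4) COUNT       ! efficient: 32 bits
LOGICAL (KIND=4) DONE        ! efficient: 32 bits
! INTEGER (KIND=2) COUNT2    ! slower access on Alpha systems
END
```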

5.7.2 Avoid Mixed Data Type Arithmetic Expressions

Avoid mixing integer and floating-point (REAL) data in the same computation. Expressing all numbers in an arithmetic expression (assignment statement) as the same general data type, all floating-point values or all integer values, eliminates the need to convert data between fixed and floating-point formats and improves run-time performance.

For example, assuming that I and J are both INTEGER variables, expressing a constant number (2.) as an integer value (2) eliminates the need to convert the data:
Original Code: 

    INTEGER I, J 
    I = J / 2. 

Efficient Code: 

    INTEGER I, J 
    I = J / 2 

For applications with numerous floating-point operations, consider using the -fp_reorder option (see Section 5.9.7) if a small difference in the result is acceptable.

You can use different sizes of the same general data type in an expression with minimal or no effect on run-time performance. For example, using REAL, DOUBLE PRECISION, and COMPLEX floating-point numbers in the same floating-point arithmetic expression has minimal or no effect on run-time performance.

5.7.3 Use Efficient Data Types

In cases where more than one data type can be used for a variable, consider selecting the data types based on the following hierarchy, listed from most to least efficient:

However, keep in mind that in an arithmetic expression, you should avoid mixing integer and floating-point (REAL) data (see Section 5.7.2).

5.7.4 Avoid Using Slow Arithmetic Operators

Before you modify source code to avoid slow arithmetic operators, be aware that optimizations convert many slow arithmetic operators to faster arithmetic operators. For example, the compiler optimizes the expression H=J**2 to be H=J*J.

Consider also whether replacing a slow arithmetic operator with a faster arithmetic operator will change the accuracy of the results or impact the maintainability (readability) of the source code.
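As an illustration of this trade-off (variable names are hypothetical), a common such rewrite in a critical loop replaces a division with multiplication by a precomputed reciprocal; note that the rounding of the result may differ slightly:

```fortran
! Division replaced by multiplication with a reciprocal computed
! once outside the loop. The result may round slightly differently.
REAL A(1000), S, RECIP
S = 3.7
A = 1.0
RECIP = 1.0/S            ! computed once
DO I = 1, 1000
  A(I) = A(I) * RECIP    ! instead of A(I) / S
END DO
END
```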

Replacing slow arithmetic operators with faster ones should be reserved for critical code areas. The following hierarchy lists the Compaq Fortran arithmetic operators, from fastest to slowest:

5.7.5 Avoid Using EQUIVALENCE Statements

Avoid using EQUIVALENCE statements. EQUIVALENCE statements can:

5.7.6 Use Statement Functions and Internal Subprograms

Whenever the Compaq Fortran compiler has access to the use and definition of a subprogram during compilation, it may choose to inline the subprogram. Using statement functions and internal subprograms maximizes the number of subprogram references that will be inlined, especially when multiple source files are compiled together at optimization level -o4 or higher.
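A minimal sketch of an internal subprogram (names are hypothetical): because the function definition after CONTAINS is visible to the compiler at the point of the call, the reference is a good inlining candidate.

```fortran
PROGRAM MAIN
  REAL :: X = 2.0, Y
  Y = TWICE(X)           ! call site the compiler may inline
  PRINT *, Y
CONTAINS
  REAL FUNCTION TWICE(V) ! internal subprogram: definition is
    REAL, INTENT(IN) :: V  ! available during compilation
    TWICE = 2.0*V
  END FUNCTION TWICE
END PROGRAM MAIN
```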

For More Information:

5.7.7 Code DO Loops for Efficiency

Minimize the arithmetic operations and other operations performed inside a DO loop whenever possible. Moving loop-invariant computations (those whose values do not change between iterations) outside the loop improves performance.
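A sketch of this kind of rewrite (variable names are hypothetical):

```fortran
! The product X*Y does not change inside the loop, so it is computed
! once before the loop instead of N times within it.
REAL A(100), B(100), X, Y, T
INTEGER I, N
N = 100
X = 2.0
Y = 3.0
B = 1.0
! Original:  DO I = 1, N
!              A(I) = X*Y + B(I)
!            END DO
T = X*Y
DO I = 1, N
  A(I) = T + B(I)
END DO
END
```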

For More Information:

5.8 Optimization Levels: the -On Option

Compaq Fortran performs many optimizations by default. You do not have to recode your program to use them. However, understanding how optimizations work helps you remove any inhibitors to their successful function.

Generally, Compaq Fortran increases compile time in favor of decreasing run time. If an operation can be performed, eliminated, or simplified at compile time, Compaq Fortran does so, rather than have it done at run time. The time required to compile the program usually increases as more optimizations occur.

The program will likely execute faster when compiled at -o4 , but will require more compilation time than if you compile the program at a lower level of optimization.

The size of the object file varies with the optimizations requested. Factors that can increase object file size include increased loop unrolling and procedure inlining.

Table 5-4 lists the levels of Compaq Fortran optimization with different -o options. For example: -o0 specifies no selectable optimizations (some optimizations always occur); -o5 specifies all levels of optimizations, including loop transformation.

Table 5-4 Levels of Optimization with Different -On Options

Optimization Type                -o0  -o1  -o2  -o3  -o4  -o5
Loop transformation                                        X
Software pipelining                                   X    X
Automatic inlining                                    X    X
Additional global optimizations                  X    X    X
Global optimizations                        X    X    X    X
Local (minimal) optimizations          X    X    X    X    X

The default is -o4 (same as -o ). However, if -g2 , -g , or -gen_feedback is also specified, the default is -o0 (no optimizations).

In Table 5-4, the following terms are used to describe the levels of optimization:

5.8.1 Optimizations Performed at All Optimization Levels

The following optimizations occur at any optimization level ( -o0 through -o5 ):

5.8.2 Local (Minimal) Optimizations

To enable local optimizations, use -o1 or a higher optimization level ( -o2 , -o3 , -o4 , or -o5 ).

To prevent local optimizations, specify the -o0 option.

5.8.2.1 Common Subexpression Elimination

If the same subexpressions appear in more than one computation and the values do not change between computations, Compaq Fortran computes the result once and replaces the subexpressions with the result itself:


DIMENSION A(25,25), B(25,25) 
A(I,J) = B(I,J) 

Without optimization, these statements can be compiled as follows:


t1 = ((J-1)*25+(I-1))*4 
t2 = ((J-1)*25+(I-1))*4 
A(t1) = B(t2) 

Variables t1 and t2 represent equivalent expressions. Compaq Fortran eliminates this redundancy by producing the following:


t = ((J-1)*25+(I-1))*4 
A(t) = B(t) 

5.8.2.2 Integer Multiplication and Division Expansion

Expansion of multiplication and division refers to replacing them with bit shifts that produce the same result faster. For example, the integer expression (I*17) can be calculated as I shifted left 4 bits (I*16) plus the original value of I. This can be expressed using the Compaq Fortran ISHFT intrinsic function:


J1 = I*17 
J2 = ISHFT(I,4) + I     ! equivalent expression for I*17 

The optimizer uses machine code that, like the ISHFT intrinsic function, shifts bits to expand multiplication and division by literals.

5.8.2.3 Compile-Time Operations

Compaq Fortran does as many operations as possible at compile time rather than at run time.

Constant Operations

Compaq Fortran can perform many operations on constants (including PARAMETER constants):

Algebraic Reassociation Optimizations

Compaq Fortran delays operations to determine whether they have no effect or can be transformed so that they have no effect; such operations are removed. A typical example involves unary minus and .NOT. operations:


X = -Y * -Z            ! Becomes: Y * Z 

5.8.2.4 Value Propagation

Compaq Fortran tracks the values assigned to variables and constants, including those from DATA statements, and traces them to every place they are used. Compaq Fortran uses the value itself when it is more efficient to do so.

When compiling subprograms, Compaq Fortran analyzes the program to ensure that propagation is safe if the subroutine is called more than once.

Value propagation frequently leads to more value propagation. Compaq Fortran can eliminate run-time operations, comparisons and branches, and whole statements.

In the following example, constants are propagated, eliminating multiple operations from run time:
Original Code: 

    PI = 3.14 
       .
       .
       .
    PIOVER2 = PI/2 
       .
       .
       .
    I = 100 
       .
       .
       .
    IF (I.GT.1) GOTO 10 
 10 A(I) = 3.0*Q 

Optimized Code: 

    PIOVER2 = 1.57 
       .
       .
       .
    I = 100 
       .
       .
       .
 10 A(100) = 3.0*Q 

5.8.2.5 Dead Store Elimination

If a variable is assigned but never used, Compaq Fortran eliminates the entire assignment statement:


X = Y*Z        ! This assignment is eliminated: X is not 
   .           ! used before it is reassigned below. 
   .
   .
X = A(I,J)* PI 

Programs used for performance analysis often contain such unnecessary operations. When you try to measure the performance of such programs compiled with Compaq Fortran, they may show unrealistically good results. Realistic results are possible only when program units use their computed results in output statements.

5.8.2.6 Register Usage

A large program usually has more data that would benefit from being held in registers than there are registers to hold the data. In such cases, Compaq Fortran typically tries to use the registers according to the following descending priority list:

  1. For temporary operation results, including array indexes
  2. For variables
  3. For addresses of arrays (base address)
  4. All other usages

Compaq Fortran uses heuristic algorithms and a modest amount of computation to attempt to determine an effective usage for the registers.

Holding Variables in Registers

Because operations using registers are much faster than using memory, Compaq Fortran generates code that uses the Alpha 64-bit integer and floating-point registers instead of memory locations. Knowing when Compaq Fortran uses registers may be helpful when doing certain forms of debugging.

Compaq Fortran uses registers to hold the values of variables whenever the Fortran language does not require them to be held in memory, such as holding the values of temporary results of subexpressions, even if -o0 (no optimization) was specified.

Compaq Fortran may hold the same variable in different registers at different points in the program:


V = 3.0*Q 
   .
   .
   .
X = SIN(Y)*V 
   .
   .
   .
V = PI*X 
   .
   .
   .
Y = COS(Y)*V 

Compaq Fortran may choose one register to hold the first use of V and another register to hold the second. Both registers can be used for other purposes at points in between. There may be times when the value of the variable does not exist anywhere in the registers. If the value of V is never needed in memory, it is never assigned.

Compaq Fortran uses registers to hold the values of I, J, and K (so long as there are no other optimization effects, such as loops involving the variables):


A(I) = B(J) + C(K) 

More typically, an expression uses the same index variable:


A(K) = B(K) + C(K) 

In this case, K is loaded into only one register and is used to index all three arrays at the same time.

5.8.2.7 Mixed Real/Complex Operations

In mixed REAL/COMPLEX operations, Compaq Fortran avoids the conversion and performs a simplified operation on:

For example, if variable R is REAL and A and B are COMPLEX, no conversion occurs with the following:


COMPLEX A, B 
   .
   .
   .
B = A + R 

5.8.3 Global Optimizations

To enable global optimizations, use -o2 or a higher optimization level ( -o3 , -o4 , or -o5 ). Using -o2 or higher also enables local optimizations ( -o1 ).

Global optimizations include:

Data flow analysis and split lifetime analysis (global data analysis) traces the values of variables and whole arrays as they are created and used in different parts of a program unit. During this analysis, Compaq Fortran assumes that any pair of array references to a given array might access the same memory location, unless a constant subscript is used in both cases.

To eliminate unnecessary recomputations of invariant expressions in loops, Compaq Fortran hoists them out of the loops so they execute only once.

Part of global data analysis is determining which data items are selected for analysis. Some data items are analyzed as a group and some are analyzed individually. Compaq Fortran limits or may disqualify data items that participate in the following constructs, generally because it cannot fully trace their values:

5.8.4 Additional Global Optimizations

To enable additional global optimizations, use -o3 or a higher optimization level ( -o4 or -o5 ). Using -o3 or higher also enables local optimizations ( -o1 ) and global optimizations ( -o2 ).

Additional global optimizations improve speed at the cost of longer compile times and possibly extra code size.

5.8.4.1 Loop Unrolling

At optimization level -o3 or above, Compaq Fortran attempts to unroll certain innermost loops, minimizing the number of branches and grouping more instructions together to allow efficient overlapped instruction execution (instruction pipelining). The best candidates for loop unrolling are innermost loops with limited control flow.

As more loops are unrolled, the average size of basic blocks increases. Loop unrolling generates multiple copies of the code for the loop body (loop code iterations) in a manner that allows efficient instruction pipelining.

The loop body is replicated a certain number of times, substituting index expressions. An initialization loop might be created to align the first reference with the main series of loops. A remainder loop might be created for leftover work.
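As an illustration only (the compiler performs this transformation internally; names are hypothetical), unrolling a simple loop by 4 replicates the body with adjusted index expressions and adds a remainder loop for leftover iterations:

```fortran
! Unroll-by-4 sketch of:  DO I = 1, N;  A(I) = B(I)*2.0;  END DO
REAL A(100), B(100)
INTEGER I, N
N = 100
B = 1.0
DO I = 1, N - MOD(N,4), 4    ! main unrolled loop
  A(I)   = B(I)   * 2.0
  A(I+1) = B(I+1) * 2.0
  A(I+2) = B(I+2) * 2.0
  A(I+3) = B(I+3) * 2.0
END DO
DO I = N - MOD(N,4) + 1, N   ! remainder loop for leftover work
  A(I) = B(I) * 2.0
END DO
END
```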

The loop unroller also inserts data prefetches for arrays with affine subscripts. Prefetches (that is, prefetch instructions) can be inserted even if the unroller chooses not to unroll. On some architectures (21264 and later), write-hint instructions are also generated.

The number of times a loop is unrolled can be determined either by the optimizer or by using the -unroll num option, which can specify the limit for loop unrolling. Unless the user specifies a value, the optimizer will choose an unroll amount that minimizes the overhead of prefetching while also limiting code size expansion.

Array operations are often represented as a nested series of loops when expanded into instructions. The innermost loop for the array operation is the best candidate for loop unrolling (like DO loops). For example, the following array operation (once optimized) is represented by nested loops, where the innermost loop is a candidate for loop unrolling:


A(1:100,2:30) = B(1:100,1:29) * 2.0 

For More Information:

5.8.4.2 Code Replication to Eliminate Branches

In addition to loop unrolling and other optimizations, the number of branches is reduced by replicating code in a way that eliminates branches. Code replication decreases the number of basic blocks and increases instruction-scheduling opportunities.

Code replication normally occurs when a branch is at the end of a flow of control, such as a routine with multiple, short exit sequences. The code at the exit sequence gets replicated at the various places where a branch to it might occur.

For example, consider the following unoptimized routine and its optimized equivalent that uses code replication (R0 is register 0):
Unoptimized Instructions: 

       .
       .
       .
    branch to exit1 
       .
       .
       .
    branch to exit1 
       .
       .
       .
 exit1: move 1 into R0 
        return 

Optimized (Replicated) Instructions: 

       .
       .
       .
    move 1 into R0 
    return 
       .
       .
       .
    move 1 into R0 
    return 
       .
       .
       .
    move 1 into R0 
    return 

Similarly, code replication can also occur within a loop that contains a small amount of shared code at the bottom of a loop and a case-type dispatch within the loop. The loop-end test-and-branch code might be replicated at the end of each case to create efficient instruction pipelining within the code for each case.

5.8.5 Automatic Inlining

To enable optimizations that perform automatic inlining, use -o4 or a higher optimization level ( -o5 ). Using -o4 also enables local optimizations ( -o1 ), global optimizations ( -o2 ), and additional global optimizations ( -o3 ).

The default is -o4 (unless -g2 , -g , or -gen_feedback is specified).

5.8.5.1 Interprocedure Analysis

Compiling multiple source files at optimization level -o4 or higher lets the compiler examine more code for possible optimizations, including multiple program units. This results in:

As more procedures are inlined, the size of the executable program and compile times may increase, but execution time should decrease.

5.8.5.2 Inlining Procedures

Inlining refers to replacing a subprogram reference (such as a CALL statement or function invocation) with the replicated code of the subprogram. As more procedures are inlined, global optimizations often become more effective.

The optimizer inlines small procedures, limiting inlining candidates based on such criteria as:

You can specify:

5.8.6 Software Pipelining

Software pipelining and additional software dependence analysis are enabled by using the -pipeline option, the -o4 option, or the -o5 option. Software pipelining in certain cases improves run-time performance.

Software pipelining applies instruction scheduling to certain innermost loops, allowing instructions within a loop to "wrap around" and execute in a different iteration of the loop. This can reduce the impact of long-latency operations, resulting in faster loop execution.

Software pipelining also includes associated additional software dependence analysis and enables the prefetching of data to reduce the impact of cache misses.

Loop unrolling (enabled at -o3 or above) cannot schedule across iterations of a loop. Because software pipelining can schedule across loop iterations, it can perform more efficient scheduling to eliminate instruction stalls within loops.

For instance, if software dependence analysis of data flow reveals that certain calculations can be done before or after that iteration of the loop, software pipelining reschedules those instructions ahead of or behind that loop iteration, at places where their execution can prevent instruction stalls or otherwise improve performance.
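For example, in a loop such as the following (illustrative), the floating-point divide has a long latency. Software pipelining can start the divide for a later iteration while earlier iterations complete, rather than stalling each iteration on its own divide:


      DO I = 1, N 
          C(I) = A(I) / B(I) 
      END DO 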

Software pipelining can be more effective when you combine -pipeline (or -o4 or -o5 ) with the appropriate -tune keyword for the target Alpha processor generation (see Section 5.9.4).

To specify software pipelining without loop transformation optimizations, do one of the following:

This optimization is not performed at optimization levels below -o2 .

Loops chosen for software pipelining:

By modifying the unrolled loop and inserting instructions as needed before and/or after it, software pipelining generally improves run-time performance. The exception is loops that contain a large number of instructions with many existing overlapped operations; in this case, software pipelining may not have enough registers available to improve execution performance, and run-time performance using -o4 or -o5 (or -pipeline ) may be no better than using -o3 .

This option might increase compilation time and/or program size. For programs that contain loops that exhaust available registers, longer execution times may occur; in this case, specify the -unroll 1 or -unroll 2 option along with the -pipeline option.

To determine whether using -pipeline benefits your particular program, you should time program execution for the same program (or subprogram) compiled with and without software pipelining (such as with -pipeline and -nopipeline ).
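For example, you might build and time the two variants as follows (the program name is illustrative):


% f90 -o pipe.out -o4 -pipeline myprog.f90 
% time ./pipe.out 
% f90 -o nopipe.out -o4 -nopipeline myprog.f90 
% time ./nopipe.out 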

For programs that contain loops that exhaust available registers, longer execution times may result with -o4 or -o5 , requiring use of -unroll n to limit loop unrolling (see Section 3.94).

For More Information:

5.8.7 Loop Transformation

The loop transformation optimizations are enabled by using the -transform_loops option or the -o5 option. Loop transformation attempts to improve performance by rewriting loops to make better use of the memory system. By rewriting loops, the loop transformation optimizations can increase the number of instructions executed, which can degrade the run-time performance of some programs.

To request loop transformation optimizations without software pipelining, do one of the following:

This optimization is not performed at optimization levels below -o2 .

You must specify -notransform_loops if you want this type of optimization disabled and you are also specifying -o5 .

The loop transformation optimizations apply to array references within loops. These optimizations can improve the performance of the memory system and usually apply to multiple nested loops.

The loops chosen for loop transformation optimizations are always counted loops. Counted loops use a variable to count iterations, so the number of iterations is known before the loop is entered. For example, indexed DO loops are normally counted loops, but DO WHILE loops are not.
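For example (the variables are illustrative), the first loop below is a counted loop and the second is not:


      DO I = 1, N 
          A(I) = B(I) + C 
      END DO 
 
      DO WHILE (X .GT. EPS) 
          X = X * 0.5 
      END DO 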

Conditions that typically prevent the loop transformation optimizations from occurring include subprogram references that are not inlined (such as an external function call), complicated exit conditions, and uncounted loops.

The types of optimizations associated with -transform_loops include the following:

To determine whether using -transform_loops benefits your particular program, you should time program execution for the same program (or subprogram) compiled with and without loop transformation optimizations (such as with -transform_loops and -notransform_loops ).

For More Information:

5.9 Other Options Related to Optimization

In addition to the -on options (discussed in Section 5.8), several other f90 command options can prevent or facilitate improved optimizations.

5.9.1 Setting Multiple Options with the -fast Option

Specifying the -fast option sets many performance options. For details, see Section 3.40, -fast --- Set Options to Improve Run-Time Performance.

5.9.2 Controlling the Number of Times a Loop Is Unrolled

You can specify the number of times a loop is unrolled by using the -unroll num option (see Section 3.94).

The -unroll num option can also influence the run-time results of software pipelining optimizations performed when you specify one of the following:

Although unrolling loops usually improves run-time performance, the size of the executable program may increase.
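For example, the following command requests the default optimizations but limits each unrolled loop to two iterations (the file name is illustrative):


% f90 -o prog.out -o4 -unroll 2 prog.f90 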

For More Information:

5.9.3 Controlling the Inlining of Procedures

To specify the types of procedures to be inlined, use the -inline keyword option. Also, compile multiple source files together and specify an adequate optimization level, such as -o4 .

If you omit both -noinline and the -inline keyword option, the optimization level ( -on option) used determines the types of procedures that are inlined.

Maximizing the types of procedures that are inlined usually improves run-time performance, but compile-time memory usage and the size of the executable program may increase.

To determine whether using -inline all benefits your particular program, time program execution for the same program compiled with and without -inline all .

For More Information:

5.9.4 Requesting Optimized Code for a Specific Processor Generation

You can specify the types of optimized code to be generated by using the -tune keyword and -arch keyword options. Regardless of the specified keyword, the generated code will run correctly on all implementations of the Alpha architecture. Tuning for a specific implementation can improve run-time performance; it is also possible that code tuned for a specific target may run slower on another target.

Specifying the correct -tune keyword for the target processor generation usually improves run-time performance slightly. Unless you request software pipelining, the run-time performance cost of using the wrong -tune keyword (such as using -tune ev4 for an ev5 processor) is usually less than 5%. When software pipelining is in use (with -o4 or -o5 ) together with -tune keyword , the difference can be more than 5%.

Regardless of which -tune keyword you specify and which processor generation the program runs on, the program produces the expected correct results.

For More Information:

5.9.5 Requesting the Speculative Execution Optimization

(TU*X ONLY) Speculative execution reduces instruction latency stalls to improve run-time performance for certain programs or routines. Speculative execution evaluates conditional code (including exceptions) and moves instructions that would otherwise be executed conditionally to a position before the test, so they are executed unconditionally.

The default, -speculate none , means that the speculative execution code scheduling optimization is not used and exceptions are reported as expected. You can specify -speculate all or -speculate by_routine to request the speculative execution optimization.

Performance improvements may be reduced because the run-time system must dismiss exceptions caused by speculative instructions. For certain programs, longer execution times may result when using the speculative execution optimization. To determine whether using -speculate all or -speculate by_routine benefits your particular program, you should time program execution for the same program compiled with one of these options and with the default, -speculate none .

Speculative execution does not support some run-time error checking, since exception and signal processing (including SIGSEGV, SIGBUS, and SIGFPE) is conditional. When you need to debug the program or test it for errors, use only -speculate none .

For More Information:

5.9.6 Request Nonshared Object Optimizations

When you specify -non_shared to request a nonshared object file, you can specify the -om option to request code optimizations after linking, including nop (No Operation) removal, .lita removal, and reallocation of common symbols. This option also positions the global pointer register so the maximum addresses fall in the global-pointer window.

For More Information:

5.9.7 Arithmetic Reordering Optimizations

If you use the -fp_reorder option (or -assume noaccuracy_sensitive , which is equivalent), Compaq Fortran may reorder code (based on algebraic identities) to improve performance.

For example, the following expressions are mathematically equivalent but may not compute the same value using finite precision arithmetic:


X = (A + B) + C 
 
X = A + (B + C) 

The results can be slightly different from the default -no_fp_reorder because of the way intermediate results are rounded. However, the -fp_reorder results are not categorically less accurate than those obtained with the default. In fact, dot product summations using -fp_reorder can produce more accurate results than those using -no_fp_reorder .

The effect of -fp_reorder is important when Compaq Fortran hoists divide operations out of a loop. If -fp_reorder is in effect, the compiler transforms the following unoptimized loop:


DO I=1,N 
   . 
   . 
   . 
   B(I) = A(I)/V 
END DO 

into this optimized loop:


T = 1/V 
DO I=1,N 
   . 
   . 
   . 
   B(I) = A(I)*T 
END DO 

The transformation in the optimized loop increases performance significantly, and loses little or no accuracy. However, it does have the potential for raising overflow or underflow arithmetic exceptions.

The compiler can also reorder code based on algebraic identities to improve performance if you specify -fast .

5.9.8 Dummy Aliasing Assumption

Some programs compiled with Compaq Fortran (or Compaq Fortran 77) may produce results that differ from the results of other Fortran compilers. Such programs may alias dummy arguments to each other, or to a variable in a common block or one shared through use association, where at least one of the accesses is a store.

The Fortran 95/90 standards prohibit this behavior in conforming programs, but Compaq Fortran does not diagnose it. Other Fortran compilers allow dummy aliases and check for them to ensure correct results. Compaq Fortran, however, assumes that no dummy aliasing occurs, so it can ignore potential data dependences from this source in favor of faster execution.

The Compaq Fortran default is safe for programs conforming to the Fortran 95/90 standards. It will improve performance of these programs, because the standard prohibits such programs from passing overlapped variables or arrays as actual arguments if either is assigned in the execution of the program unit.

The -assume dummy_aliases option allows dummy aliasing. It ensures correct results by assuming the exact order of the references to dummy and common variables is required. Program units taking advantage of this behavior can produce inaccurate results if compiled with -assume nodummy_aliases .

Example 5-1 is taken from the DAXPY routine in the Fortran-77 version of the Basic Linear Algebra Subroutines (BLAS).

Example 5-1 Using the -assume dummy_aliases Option

      SUBROUTINE DAXPY(N,DA,DX,INCX,DY,INCY) 
 
C     Constant times a vector plus a vector. 
C     uses unrolled loops for increments equal to 1. 
 
      DOUBLE PRECISION DX(1), DY(1), DA 
      INTEGER I,INCX,INCY,IX,IY,M,MP1,N 
C 
      IF (N.LE.0) RETURN 
      IF (DA.EQ.0.0) RETURN 
      IF (INCX.EQ.1.AND.INCY.EQ.1) GOTO 20 
 
C     Code for unequal increments or equal increments 
C     not equal to 1. 
      . 
      . 
      . 
      RETURN 
C     Code for both increments equal to 1. 
C     Clean-up loop 
 
20    M = MOD(N,4) 
      IF (M.EQ.0) GOTO 40 
      DO I=1,M 
          DY(I) = DY(I) + DA*DX(I) 
      END DO 
 
      IF (N.LT.4) RETURN 
40    MP1 = M + 1 
      DO I = MP1, N, 4 
          DY(I) = DY(I) + DA*DX(I) 
          DY(I + 1) = DY(I + 1) + DA*DX(I + 1) 
          DY(I + 2) = DY(I + 2) + DA*DX(I + 2) 
          DY(I + 3) = DY(I + 3) + DA*DX(I + 3) 
      END DO 
 
      RETURN 
      END SUBROUTINE 

The second DO loop contains assignments to DY. If DY is overlapped with DA, any of the assignments to DY might give DA a new value, and this overlap would affect the results. If this overlap is desired, then DA must be fetched from memory each time it is referenced. The repetitious fetching of DA degrades performance.
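For example, a call of the following form (illustrative) makes the dummy argument DA an alias for an element of DY, so each assignment to DY can change the value of DA:


      CALL DAXPY(N, DY(1), DX, 1, DY, 1) 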

Linking Routines with Opposite Settings

You can link routines compiled with the -assume dummy_aliases option to routines compiled with -assume nodummy_aliases . For example, if only one routine is called with dummy aliases, you can use -assume dummy_aliases when compiling that routine, and compile all the other routines with -assume nodummy_aliases to gain the performance value of that option.

Programs calling DAXPY with DA overlapping DY do not conform to the FORTRAN-77 and Fortran 95/90 standards. However, they are supported if -assume dummy_aliases was used to compile the DAXPY routine.


Previous Next Contents Index