Compaq Fortran
User Manual for
Tru64 UNIX and Linux Alpha Systems

Chapter 5
Performance: Making Programs Run Faster

As you read this chapter, remember that the f90 command invokes the Compaq Fortran compiler on Tru64 UNIX Alpha systems while the fort command invokes the Compaq Fortran compiler on Linux Alpha systems. This chapter uses the f90 command to indicate invoking the Compaq Fortran compiler on both systems, so replace this command with fort if you are working on a Linux Alpha system.

Also, remember that the cc command invokes the Compaq C compiler on Tru64 UNIX Alpha systems while the ccc command invokes the Compaq C compiler on Linux Alpha systems. This chapter uses the cc command to indicate invoking the Compaq C compiler on both systems, so replace this command with ccc if you are working on a Linux Alpha system.
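
The command substitution described above can be automated in a build script. The following is a minimal sketch for a Bourne-style shell. The variable names FC and CC are illustrative only, and the sketch assumes that uname -s reports Linux on Linux Alpha systems and that any other result means Tru64 UNIX.

```shell
#!/bin/sh
# Choose the compiler driver names by operating system.
# FC and CC are hypothetical variable names used only in this sketch.
case "$(uname -s)" in
  Linux)
    FC=fort   # Compaq Fortran driver on Linux Alpha
    CC=ccc    # Compaq C driver on Linux Alpha
    ;;
  *)
    FC=f90    # Compaq Fortran driver on Tru64 UNIX (assumed for non-Linux)
    CC=cc     # Compaq C driver on Tru64 UNIX
    ;;
esac
echo "Fortran driver: $FC"
echo "C driver: $CC"
```

A script written this way can then use $FC wherever this chapter shows f90.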

This chapter discusses the following topics related to improving run-time performance of Compaq Fortran programs:

- Important software environment suggestions that apply to nearly all applications, including using the most recent version of the compiler, related performance tools, and efficient ways to compile using the f90 command (Section 5.1)
- Analyzing program performance, including using Compaq Tru64 UNIX time measurement and profiling tools (Section 5.2)
- Guidelines related to avoiding unaligned data (Section 5.3)
- Guidelines for efficient array use (Section 5.4)
- Guidelines related to improving overall I/O performance (Section 5.5)
- Additional performance guidelines related to source code (Section 5.6)
- Understanding the f90 -On optimization level options and the types of optimizations performed (Section 5.7)
- Understanding other f90 optimization options (besides the -On options) (Section 5.8)

This chapter does not address the performance and profiling of programs that execute in parallel using the Compaq Parallel Software Environment. For information about performance and profiling of parallel HPF programs, see the Compaq Parallel Software Environment documentation.

5.1 Software Environment and Efficient Compilation

Before you attempt to analyze and improve program performance, you should:

- Obtain and install the latest version of Compaq Fortran, along with performance products that can improve application performance, such as the Compaq Extended Mathematical Library (CXML).
- Use the f90 command (or, on Linux systems, the fort command) and its options in a manner that lets the Compaq Fortran compiler perform as many optimizations as possible to improve run-time performance.
- Use certain performance capabilities provided by the Compaq Tru64 UNIX operating system.

5.1.1 Install the Latest Version of Compaq Fortran and Performance Products

To ensure that your software development environment can significantly improve the run-time performance of your applications, obtain and install the following optional software products:

The latest version of Compaq Fortran
New releases of the Compaq Fortran compiler and its associated run-time libraries may provide new features that improve run-time performance.
The Compaq Fortran run-time libraries shipped with Compaq Fortran are also shipped with the Compaq Tru64 UNIX operating system. Always install the Compaq Fortran subset with the highest subset number. This number is always available, for both Tru64 and Linux operating systems, at or under the Web page whose URL is:
http://www.compaq.com/fortran
If your application will be run on Compaq Tru64 UNIX systems other than your program development system, be sure to install the same (or later) version of the Compaq Fortran run-time environment on those systems.
You can obtain the appropriate Compaq Services software product maintenance contract to automatically receive new versions of Compaq Fortran. For information on more recent Compaq Fortran releases, contact the Compaq Customer Support Center (CSC) if you have the appropriate support contract, or contact your local Compaq sales representative.
When using a shared memory, multiprocessor system, you can choose either directed parallel processing (see Chapter 6) or KAP for Compaq Fortran for Compaq Tru64 UNIX Systems (described in a following paragraph in this section).
Compaq Extended Mathematical Library (CXML) for Compaq Tru64 UNIX Systems
See Chapter 13 for a summary of the CXML product.
Compaq Parallel Software Environment
(TU*X ONLY) Allows parallel execution of Compaq Fortran programs on multiple Alpha systems running the Compaq Tru64 UNIX operating system (parallel Alpha Cluster). To indicate parallel execution characteristics, use the f90 options described in Section 3.92 and the High Performance Fortran (!HPF$) directives described in the Compaq Fortran Language Reference Manual and the Compaq Parallel Software Environment documentation.
Before compiling a Compaq Fortran program for parallel HPF execution using the Parallel Software Environment, you should develop and debug it as a Compaq Fortran nonparallel program. For information on parallel HPF programming, see the Compaq Parallel Software Environment documentation.
KAP for Compaq Fortran for Compaq Tru64 UNIX Systems (performance preprocessor)
Allows preprocessing of Compaq Fortran source files to improve their run-time performance. You can purchase the KAP for Compaq Fortran performance preprocessor from Compaq.
The KAP performance preprocessor also supports parallel processing using automatic and directed decomposition for a shared memory multiprocessor Alpha system.
You can do one of the following:
- Use the preprocessor-only kapf90 command to produce improved Fortran 95/90 source files before compiling them with the f90 command.
- Use the kf90 command to invoke the preprocessor, compiler, and linker to create an executable program.
For example, the following kf90 command:
- Specifies the KAP preprocessor be run for the free-form file for_cal.f90
- Recognizes the BLAS level 2 and 3 routines
- Searches the CXML library for unresolved references
- Compiles and links the resulting preprocessed source file:
  % kf90 -fkapargs='-lc=blas' for_cal.f90 -lcxml
For more information, see the KAP for Compaq Fortran for Tru64 UNIX Systems User Guide.
Performance profiling and feedback tools provided with Compaq Tru64 UNIX Version 4.0 or later
The standard set of U*X profiling and performance tools includes prof, gprof, pixie (TU*X ONLY), cord, and the use of feedback files. For more information on profiling, see Section 5.2.
Compaq Tru64 UNIX Version 4.0 or later also includes the Atom tool and a collection of prepackaged Atom-based program-analysis tools:
- The Atom tool consists of a set of routines for creating custom-designed program-analysis tools.
- The prepackaged Atom-based program-analysis tools include the profiling tools pixie (TU*X ONLY) and hiprof. For more information, see atom(1) and the Compaq Tru64 UNIX Programmer's Guide.
System-wide performance products
Other products are not specific to a particular programming language or application, but can improve system-wide performance, such as minimizing disk device I/O and handling capacity planning. Such Tru64 UNIX products include DECRaid (shadowing and striping) and such POLYCENTER products as the Capacity Planner, Performance Solution, and Performance Advisor.
Adequate process limits and virtual memory space as well as proper system tuning are especially important when running large programs, such as those accessing large arrays.
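
Before running a program that uses large arrays, it can help to check the per-process limits currently in effect. The following is a minimal sketch for a Bourne-style shell; it assumes the shell's ulimit built-in accepts -d (data segment) and -s (stack), which is common but not guaranteed. Users of csh-style shells would use the limit built-in instead.

```shell
#!/bin/sh
# Report the current per-process limits that most affect large-array programs.
# ulimit is a shell built-in; support for -d and -s is assumed here, so a
# failure is reported as "unknown" rather than stopping the script.
data_limit=$(ulimit -d 2>/dev/null || echo unknown)
stack_limit=$(ulimit -s 2>/dev/null || echo unknown)
echo "data segment limit: $data_limit"
echo "stack limit: $stack_limit"
```

The values are reported in the shell's own units (typically kilobytes) or as unlimited.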

For More Information:

About system-wide tuning and suggestions for other performance enhancements on Compaq Tru64 UNIX systems, see the manual Compaq Tru64 UNIX System Tuning and Performance.

5.1.2 Compile Using Multiple Source Files and Appropriate f90 Options

During the earlier stages of program development, you can use incremental compilation with minimal optimization. For example:

% f90 -c -O1 sub2.f90
% f90 -c -O1 sub3.f90
% f90 -o main.out -g -O0 main.f90 sub2.o sub3.o

During the later stages of program development, you should specify multiple source files together and use an optimization level of at least -O4 on the f90 command line to allow more interprocedure optimizations to occur. For instance, the following command compiles all three source files together using the default level of optimization (-O4):

% f90 -o main.out main.f90 sub2.f90 sub3.f90

Compiling multiple source files lets the compiler examine more code for possible optimizations, which results in:

Inlining more procedures
More complete data flow analysis
Reducing the number of external references to be resolved during linking

For very large programs, compiling all source files together may not be practical. In such instances, consider compiling source files containing related routines together using multiple f90 commands, rather than compiling source files individually.
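
As a rough illustration of grouping related routines, the following sketch prints one f90 command per group instead of one per source file. It is a dry run that only prints the commands. All file names and the build_plan helper are hypothetical, and the use of -c together with -o to produce a single object file per group follows the description of the -c option in Table 5-2.

```shell
#!/bin/sh
# Dry-run sketch: print one compile command per group of related sources.
# All file names are hypothetical. Use fort instead of f90 on Linux Alpha.
FC=f90
build_plan() {
  # Related routines are compiled together into one object file (-c with -o).
  echo "$FC -c -o solver.o solve_lu.f90 solve_qr.f90"
  echo "$FC -c -o fileio.o read_input.f90 write_output.f90"
  # The main program is compiled last and linked with the grouped objects.
  echo "$FC -o main.out main.f90 solver.o fileio.o"
}
plan=$(build_plan)
echo "$plan"
```

On a system with the compiler installed, piping the output of build_plan to sh would run the commands rather than print them.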

Table 5-1 shows f90 options that can improve performance. Most of these options do not affect the accuracy of the results; a few improve run-time performance but can change some numeric results.

Compaq Fortran performs certain optimizations unless you specify the appropriate f90 command options. Additional optimizations can be enabled or disabled using f90 command options.

Table 5-1 lists the f90 options that can directly improve run-time performance.

**Table 5-1 Options Related to Run-Time Performance**
Option Names	Description	For More Information
`-align keyword`	Controls whether padding bytes are added between data items within common blocks, derived-type data, and Compaq Fortran record structures to make the data items naturally aligned.	Section 5.3
`-architecture keyword`	Determines the type of Alpha architecture code instructions to be generated for the program unit being compiled. All Alpha processors implement a core set of instructions; certain processor versions include additional instruction extensions.	Section 3.4
`-cord` and `-feedback file`	Uses a feedback file created during a previous compilation by specifying the `-gen_feedback` option. These options use the feedback file to improve run-time performance, optionally using `cord` to rearrange procedures.	Section 5.2.3
`-fast`	Sets the following performance-related options: `-align dcommons`, `-align sequence`, `-arch host`, `-assume bigarrays` (TU*X ONLY), `-assume nozsize` (TU*X ONLY), `-assume noaccuracy_sensitive` (same as `-fp_reorder`), `-math_library fast`, `-tune host`.	See description of each option
`-fp_reorder`	Allows the compiler to reorder code based on algebraic identities to improve performance, enabling certain optimizations. The numeric results can be slightly different from the default (`-no_fp_reorder`) because of the way intermediate results are rounded. This slight difference in numeric results is acceptable to most programs.	Section 5.8.9
`-gen_feedback`	Requests generated code that allows accurate feedback information for subsequent use of the `-feedback file` option (optionally with `cord`). Using `-gen_feedback` changes the default optimization level from `-O4` to `-O0`.	Section 5.2.3
`-inline all`	Inlines every call that can possibly be inlined while generating correct code. Certain recursive routines are not inlined to prevent infinite loops.	Section 5.8.5
`-inline speed`	Inlines procedures that will improve run-time performance with a likely significant increase in program size.	Section 5.8.5
`-inline size`	Inlines procedures that will improve run-time performance without a significant increase in program size. This type of inlining occurs at optimization level `-O4` and `-O5`.	Section 5.8.5
`-math_library fast`	Requests the use of certain math library routines (used by intrinsic functions) that provide faster speed. Using this option causes a slight loss of accuracy and provides less reliable arithmetic exception checking to get significant performance improvements in those functions.	Section 3.53
`-mp` (TU*X ONLY)	Enables parallel processing using directed decomposition (directives inserted in source code). This can improve the performance of certain programs running on shared memory multiprocessor systems.	Section 3.56
`-O n` (`-O0` to `-O5`)	Controls the optimization level and thus the types of optimization performed. The default optimization level is `-O4`, unless you specify `-g2`, `-g`, or `-gen_feedback`, which changes the default to `-O0` (no optimizations). Use `-O5` to activate loop transformation optimizations and the software pipelining optimization.	Section 5.7
`-om` (TU*X ONLY)	Used with the `-non_shared` option to request certain code optimizations after linking, including nop (No Operation) removal, .lita removal, and reallocation of common symbols. This option also positions the global pointer register so the maximum addresses fall in the global-pointer window.	Section 3.63
`-omp` (TU*X ONLY)	Enables parallel processing using directed decomposition (directives inserted in source code). This can improve the performance of certain programs running on shared memory multiprocessor systems.	Section 3.64
`-p`, `-p1`	Requests profiling information, which you can use to identify those parts of your program where improving source code efficiency would most likely improve run-time performance. After you modify the appropriate source code, recompile the program and test the run-time performance.	Section 5.2.2
`-pg`	Requests profiling information for the `gprof` tool, which you can use to identify those parts of your program where improving source code efficiency would most likely improve run-time performance. After you modify the appropriate source code, recompile the program and test the run-time performance.	Section 5.2.2
`-pipeline`	Activates the software pipelining optimization (a subset of `-O5`).	Section 3.66
`-speculate keyword` (TU*X ONLY)	Enables the speculative execution optimization, a form of instruction scheduling for conditional expressions.	Section 3.74
`-transform_loops`	Activates a group of loop transformation optimizations (a subset of `-O5`).	Section 3.79
`-tune keyword`	Specifies the target processor generation (chip) architecture on which the program will be run, allowing the optimizer to make decisions about instruction tuning optimizations needed to create the most efficient code. Keywords allow specifying one particular Alpha processor generation type, multiple processor generation types, or the processor generation type currently in use during compilation. Regardless of the setting of `-tune keyword`, the generated code will run correctly on all implementations of the Alpha architecture.	Section 5.8.6
`-unroll num`	Specifies the number of times a loop is unrolled (num) when specified with optimization level `-O3` or higher. If you omit `-unroll num`, the optimizer determines how many times loops are unrolled.	Section 5.7.4.1
`-wsf num` and related options (TU*X ONLY)	Specifies that the code generated for this program will allow parallel execution on multiple processors using the Compaq Parallel Software Environment.	Section 3.92 and the Compaq Parallel Software Environment documentation
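
Because -fast is shorthand for a group of options, it can be useful to see the expansion for a given platform. The following sketch prints the options that Table 5-1 lists for -fast, leaving out the two marked (TU*X ONLY) when the target is Linux. The expand_fast helper and its tru64/linux argument are hypothetical conveniences for this illustration and are not part of Compaq Fortran.

```shell
#!/bin/sh
# Print the options that -fast sets, as listed in Table 5-1.
# expand_fast is a hypothetical helper; its argument is "tru64" or "linux".
expand_fast() {
  opts="-align dcommons -align sequence -arch host"
  if [ "$1" = "tru64" ]; then
    # These two options are marked (TU*X ONLY) in the table.
    opts="$opts -assume bigarrays -assume nozsize"
  fi
  opts="$opts -assume noaccuracy_sensitive -math_library fast -tune host"
  echo "$opts"
}
expand_fast tru64
expand_fast linux
```

Listing the options explicitly makes it easier to drop one of them, such as -assume noaccuracy_sensitive, when a program cannot tolerate small differences in numeric results.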

Table 5-2 lists options that can slow program performance. Some applications that require floating-point exception handling or rounding might need to use the -fpe n and -fprm dynamic options. Other applications might need to use the -assume dummy_aliases or -vms options for compatibility reasons. Other options listed in Table 5-2 are primarily for troubleshooting or debugging purposes.

**Table 5-2 Options that Slow Run-Time Performance**
Option Names	Description	For More Information
`-assume dummy_aliases`	Forces the compiler to assume that dummy (formal) arguments to procedures share memory locations with other dummy arguments or with variables shared through use association, host association, or common block use. These program semantics slow performance, so you should specify `-assume dummy_aliases` only for the called subprograms that depend on such aliases. The use of dummy aliases violates the FORTRAN-77 and Fortran 95/90 standards but occurs in some older programs.	Section 5.8.10
`-c`	If you use `-c` when compiling multiple source files, also specify `-o output` to compile many source files together into one object file. Separate compilations prevent certain interprocedure optimizations, such as when using multiple `f90` commands or using `-c` without the `-o output` option.	Section 2.1.7
`-check bounds`	Generates extra code for array bounds checking at run time.	Section 3.18
`-check omp_bindings` (TU*X ONLY)	Provides run-time checking to enforce the binding rules for OpenMP Fortran API (parallel processing) compiler directives inserted in source code.	Section 3.22
`-check overflow`	Generates extra code to check integer calculations for arithmetic overflow at run time. Once the program is debugged, omit this option to reduce executable program size and slightly improve run-time performance.	Section 3.23
`-fpe n` values greater than `-fpe0`	Using `-fpe1` (TU*X ONLY), `-fpe2` (TU*X ONLY), `-fpe3`, or `-fpe4` (TU*X ONLY) (or using the `for_set_fpe` routine to set equivalent exception handling) slows program execution. For programs that specify `-fpe3` or `-fpe4` (TU*X ONLY), the impact on run-time performance can be significant.	Section 3.37
`-fprm dynamic` (TU*X ONLY)	Certain rounding modes and changing the rounding mode can slow program execution slightly.	Section 3.39
`-g`, `-g2`, `-g3`	Generates extra symbol table information in the object file. Specifying `-g` or `-g2` also reduces the default level of optimization to `-O0`.	Section 3.41
`-inline none` `-inline manual`	Prevents the inlining of all procedures (except statement functions).	Section 5.8.5
`-O0`, `-O1`, `-O2`, or `-O3`	Minimizes the optimization level (and types of optimizations). Use during the early stages of program development or when you will use the debugger.	Section 3.62 and Section 5.7
`-synchronous_exceptions`	Generates extra code to associate an arithmetic exception with the instruction that causes it, slowing efficient instruction execution. Use this option only when troubleshooting, such as when identifying the source of an exception.	Section 3.76
`-vms`	Controls certain VMS-related run-time defaults, including alignment. If you specify the `-vms` option, you may need to also specify the `-align records` option to obtain optimal run-time performance.	Section 3.87
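
The interaction between debugging options and the default optimization level is easy to overlook. The following sketch reports the level that would be in effect for a given option list, using only the rules stated in Tables 5-1 and 5-2: the default is -O4, and -g, -g2, or -gen_feedback change the default to -O0. The effective_opt_level helper is hypothetical, and the sketch assumes an explicit -On option overrides either default.

```shell
#!/bin/sh
# Report the optimization level in effect for a given f90 option list.
# effective_opt_level is a hypothetical helper, not part of Compaq Fortran.
effective_opt_level() {
  level=""
  default="-O4"
  for opt in "$@"; do
    case "$opt" in
      -O[0-5]) level="$opt" ;;                # explicit level (assumed to win)
      -g|-g2|-gen_feedback) default="-O0" ;;  # these lower the default
    esac
  done
  echo "${level:-$default}"
}
effective_opt_level -c sub2.f90        # prints -O4
effective_opt_level -g main.f90        # prints -O0
effective_opt_level -g -O2 main.f90    # prints -O2
```

A check like this in a build script can flag a release build that still carries -g without an explicit optimization level.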

For More Information:

On compiling multiple files, see Section 2.1.7.
On minimizing external references, see Section 11.1.1.
