Compaq Fortran
User Manual for
Tru64 UNIX and Linux Alpha Systems




Chapter 5
Performance: Making Programs Run Faster

As you read this chapter, remember that the f90 command invokes the Compaq Fortran compiler on Tru64 UNIX Alpha systems while the fort command invokes the Compaq Fortran compiler on Linux Alpha systems. This chapter uses the f90 command to indicate invoking the Compaq Fortran compiler on both systems, so replace this command with fort if you are working on a Linux Alpha system.

Also, remember that the cc command invokes the Compaq C compiler on Tru64 UNIX Alpha systems while the ccc command invokes the Compaq C compiler on Linux Alpha systems. This chapter uses the cc command to indicate invoking the Compaq C compiler on both systems, so replace this command with ccc if you are working on a Linux Alpha system.
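The following sketch shows one way a build script might choose the command names automatically. It assumes that uname -s reports Linux on Linux Alpha systems and treats any other system as Tru64 UNIX. The function names compaq_fortran_cmd and compaq_c_cmd are hypothetical and are not part of either compiler.

```shell
# Pick the Compaq Fortran and Compaq C command names for a system.
# Assumption: `uname -s` prints "Linux" on Linux Alpha systems; any other
# value is treated as Tru64 UNIX. Both function names are illustrative.
compaq_fortran_cmd() {
    case "${1:-$(uname -s)}" in
        Linux) echo fort ;;
        *)     echo f90 ;;    # the spelling used throughout this chapter
    esac
}

compaq_c_cmd() {
    case "${1:-$(uname -s)}" in
        Linux) echo ccc ;;
        *)     echo cc ;;
    esac
}

FC=$(compaq_fortran_cmd)
CC=$(compaq_c_cmd)
echo "Fortran command: $FC, C command: $CC"
```

Each function accepts an optional system name so that the choice can be checked without running on the target system.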

This chapter discusses the following topics related to improving run-time performance of Compaq Fortran programs:

This chapter does not address the performance and profiling of programs that execute in parallel using the Compaq Parallel Software Environment. For information about performance and profiling of parallel HPF programs, see the Compaq Parallel Software Environment documentation.

5.1 Software Environment and Efficient Compilation

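Before you attempt to analyze and improve program performance, you should prepare your software environment and compile efficiently, as described in the following subsections.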

5.1.1 Install the Latest Version of Compaq Fortran and Performance Products

To ensure that your software development environment can significantly improve the run-time performance of your applications, obtain and install the following optional software products:

For More Information:

About system-wide tuning and suggestions for other performance enhancements on Compaq Tru64 UNIX systems, see the manual Compaq Tru64 UNIX System Tuning and Performance.

5.1.2 Compile Using Multiple Source Files and Appropriate f90 Options

During the earlier stages of program development, you can use incremental compilation with minimal optimization. For example:


% f90 -c -O1 sub2.f90
% f90 -c -O1 sub3.f90
% f90 -o main.out -g -O0 main.f90  sub2.o  sub3.o

During the later stages of program development, you should specify multiple source files together and use an optimization level of at least -O4 on the f90 command line to allow more interprocedure optimizations to occur. For instance, the following command compiles all three source files together using the default level of optimization (-O4):


% f90 -o main.out main.f90  sub2.f90  sub3.f90
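
Whether the default level applies depends on the other options on the command line: as Table 5-1 notes, -g, -g2, or -gen_feedback change the default from -O4 to -O0. The following sketch models that rule for a given option list. The function name effective_opt_level is hypothetical, and the sketch assumes that an explicit -O0 through -O5 option takes precedence over the changed default.

```shell
# Print the optimization level that a list of f90 options would select.
# Illustrative only: this models the rule stated in Table 5-1 and does not
# query the compiler. Assumption: an explicit -On overrides the default.
effective_opt_level() {
    level=-O4                              # documented default
    for arg in "$@"; do
        case "$arg" in
            -g|-g2|-gen_feedback) level=-O0 ;;   # these change the default
        esac
    done
    for arg in "$@"; do
        case "$arg" in
            -O[0-5]) level=$arg ;;         # explicit level; the last one wins
        esac
    done
    printf '%s\n' "$level"
}

effective_opt_level -o main.out main.f90 sub2.f90 sub3.f90
```

For example, effective_opt_level -g main.f90 prints -O0, while effective_opt_level -g -O2 main.f90 prints -O2.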

Compiling multiple source files lets the compiler examine more code for possible optimizations, which results in:

For very large programs, compiling all source files together may not be practical. In such instances, consider compiling source files containing related routines together using multiple f90 commands, rather than compiling source files individually.
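
The two approaches above can be captured in a small helper that prints the commands for either stage. This is a dry-run sketch: it only prints command lines and does not invoke the compiler. The function name print_build_cmds, the mode names dev and final, and the FC variable are hypothetical.

```shell
# Dry-run helper: prints compile commands instead of running them.
# FC, the mode names, and the function name are illustrative assumptions.
FC=${FC:-f90}

print_build_cmds() {
    if [ "$#" -lt 2 ]; then
        echo "usage: print_build_cmds dev|final main.f90 [other.f90 ...]" >&2
        return 1
    fi
    mode=$1; shift
    main=$1; shift                  # remaining arguments: the other sources
    case "$mode" in
        dev)
            # Early development: compile each file separately at -O1,
            # then build the main program with -g -O0 for debugging.
            objs=
            for src in "$@"; do
                echo "$FC -c -O1 $src"
                objs="$objs ${src%.f90}.o"
            done
            echo "$FC -o main.out -g -O0 $main$objs"
            ;;
        final)
            # Later stages: one command, default optimization level (-O4).
            echo "$FC -o main.out $main $*"
            ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1
            ;;
    esac
}

print_build_cmds dev main.f90 sub2.f90 sub3.f90
print_build_cmds final main.f90 sub2.f90 sub3.f90
```

For a very large program, the final mode could be called once per group of related source files with -c added, at the cost of losing optimizations across groups.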

Table 5-1 lists the f90 options that can directly improve run-time performance. Most of these options do not affect the accuracy of the results; others improve run-time performance but can change some numeric results.

Compaq Fortran performs certain optimizations by default unless you specify f90 options that disable them. You can enable or disable additional optimizations using f90 command options.

Table 5-1 Options Related to Run-Time Performance
Option Names | Description | For More Information
-align keyword | Controls whether padding bytes are added between data items within common blocks, derived-type data, and Compaq Fortran record structures to make the data items naturally aligned. | Section 5.3
-architecture keyword | Determines the type of Alpha architecture code instructions to be generated for the program unit being compiled. All Alpha processors implement a core set of instructions; certain processor versions include additional instruction extensions. | Section 3.4
-cord and -feedback file | Uses a feedback file created during a previous compilation by specifying the -gen_feedback option. These options use the feedback file to improve run-time performance, optionally using cord to rearrange procedures. | Section 5.2.3
-fast | Sets the following performance-related options: -align dcommons; -align sequence; -arch host; -assume bigarrays (TU*X ONLY); -assume nozsize (TU*X ONLY); -assume noaccuracy_sensitive (same as -fp_reorder); -math_library fast; -tune host. | See description of each option
-fp_reorder | Allows the compiler to reorder code based on algebraic identities to improve performance, enabling certain optimizations. The numeric results can be slightly different from the default (-no_fp_reorder) because of the way intermediate results are rounded. This slight difference in numeric results is acceptable to most programs. | Section 5.8.9
-gen_feedback | Requests generated code that allows accurate feedback information for subsequent use of the -feedback file option (optionally with cord). Using -gen_feedback changes the default optimization level from -O4 to -O0. | Section 5.2.3
-inline all | Inlines every call that can possibly be inlined while generating correct code. Certain recursive routines are not inlined to prevent infinite loops. | Section 5.8.5
-inline speed | Inlines procedures that will improve run-time performance with a likely significant increase in program size. | Section 5.8.5
-inline size | Inlines procedures that will improve run-time performance without a significant increase in program size. This type of inlining occurs at optimization level -O4 and -O5. | Section 5.8.5
-math_library fast | Requests the use of certain math library routines (used by intrinsic functions) that provide faster speed. Using this option causes a slight loss of accuracy and provides less reliable arithmetic exception checking to get significant performance improvements in those functions. | Section 3.53
-mp (TU*X ONLY) | Enables parallel processing using directed decomposition (directives inserted in source code). This can improve the performance of certain programs running on shared memory multiprocessor systems. | Section 3.56
-O n (-O0 to -O5) | Controls the optimization level and thus the types of optimization performed. The default optimization level is -O4, unless you specify -g2, -g, or -gen_feedback, which changes the default to -O0 (no optimizations). Use -O5 to activate loop transformation optimizations and the software pipelining optimization. | Section 5.7
-om (TU*X ONLY) | Used with the -non_shared option to request certain code optimizations after linking, including nop (No Operation) removal, .lita removal, and reallocation of common symbols. This option also positions the global pointer register so the maximum addresses fall in the global-pointer window. | Section 3.63
-omp (TU*X ONLY) | Enables parallel processing using directed decomposition (directives inserted in source code). This can improve the performance of certain programs running on shared memory multiprocessor systems. | Section 3.64
-p, -p1 | Requests profiling information, which you can use to identify those parts of your program where improving source code efficiency would most likely improve run-time performance. After you modify the appropriate source code, recompile the program and test the run-time performance. | Section 5.2.2
-pg | Requests profiling information for the gprof tool, which you can use to identify those parts of your program where improving source code efficiency would most likely improve run-time performance. After you modify the appropriate source code, recompile the program and test the run-time performance. | Section 5.2.2
-pipeline | Activates the software pipelining optimization (a subset of -O5). | Section 3.66
-speculate keyword (TU*X ONLY) | Enables the speculative execution optimization, a form of instruction scheduling for conditional expressions. | Section 3.74
-transform_loops | Activates a group of loop transformation optimizations (a subset of -O5). | Section 3.79
-tune keyword | Specifies the target processor generation (chip) architecture on which the program will be run, allowing the optimizer to make decisions about instruction tuning optimizations needed to create the most efficient code. Keywords allow specifying one particular Alpha processor generation type, multiple processor generation types, or the processor generation type currently in use during compilation. Regardless of the setting of -tune keyword, the generated code will run correctly on all implementations of the Alpha architecture. | Section 5.8.6
-unroll num | Specifies the number of times a loop is unrolled (num) when specified with optimization level -O3 or higher. If you omit -unroll num, the optimizer determines how many times loops are unrolled. | Section 5.7.4.1
-wsf num and related options (TU*X ONLY) | Specifies that the code generated for this program will allow parallel execution on multiple processors using the Compaq Parallel Software Environment. | Section 3.92 and the Compaq Parallel Software Environment documentation
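
The -fast entry in Table 5-1 bundles several options, two of which (-assume noaccuracy_sensitive and -math_library fast) are described as able to change numeric results. The following sketch prints the individual options that the table lists for -fast, optionally leaving out those two, so that they can be passed explicitly. The function name fast_options and its two arguments are hypothetical. The sketch assumes that uname -s reports Linux on Linux Alpha systems and treats any other system as Tru64 UNIX, and it does not claim that the printed list behaves identically to -fast in every compiler version.

```shell
# Print the options that Table 5-1 lists under -fast.
# Usage: fast_options [system-name] [yes|no]
#   system-name  defaults to `uname -s`; "Linux" omits the TU*X ONLY options
#   yes          omits the two options described as changing numeric results
# The function and its arguments are illustrative, not part of f90.
fast_options() {
    os=${1:-$(uname -s)}
    keep_accuracy=${2:-no}
    opts="-align dcommons -align sequence -arch host"
    # These two -assume keywords are marked TU*X ONLY in the table.
    [ "$os" = "Linux" ] || opts="$opts -assume bigarrays -assume nozsize"
    if [ "$keep_accuracy" != "yes" ]; then
        opts="$opts -assume noaccuracy_sensitive -math_library fast"
    fi
    printf '%s\n' "$opts -tune host"
}

fast_options
```

The printed list could then be placed on the f90 command line in place of -fast.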

Table 5-2 lists options that can slow program performance. Some applications that require floating-point exception handling or rounding might need to use the -fpe n and -fprm dynamic options. Other applications might need to use the -assume dummy_aliases or -vms options for compatibility reasons. Other options listed in Table 5-2 are primarily for troubleshooting or debugging purposes.

Table 5-2 Options that Slow Run-Time Performance
Option Names | Description | For More Information
-assume dummy_aliases | Forces the compiler to assume that dummy (formal) arguments to procedures share memory locations with other dummy arguments or with variables shared through use association, host association, or common block use. These program semantics slow performance, so you should specify -assume dummy_aliases only for the called subprograms that depend on such aliases. The use of dummy aliases violates the FORTRAN-77 and Fortran 95/90 standards but occurs in some older programs. | Section 5.8.10
-c | If you use -c when compiling multiple source files, also specify -o output to compile many source files together into one object file. Separate compilations prevent certain interprocedure optimizations, such as when using multiple f90 commands or using -c without the -o output option. | Section 2.1.7
-check bounds | Generates extra code for array bounds checking at run time. | Section 3.18
-check omp_bindings (TU*X ONLY) | Provides run-time checking to enforce the binding rules for OpenMP Fortran API (parallel processing) compiler directives inserted in source code. | Section 3.22
-check overflow | Generates extra code to check integer calculations for arithmetic overflow at run time. Once the program is debugged, omit this option to reduce executable program size and slightly improve run-time performance. | Section 3.23
-fpe n values greater than -fpe0 | Using -fpe1 (TU*X ONLY), -fpe2 (TU*X ONLY), -fpe3, or -fpe4 (TU*X ONLY) (or using the for_set_fpe routine to set equivalent exception handling) slows program execution. For programs that specify -fpe3 or -fpe4 (TU*X ONLY), the impact on run-time performance can be significant. | Section 3.37
-fprm dynamic (TU*X ONLY) | Certain rounding modes and changing the rounding mode can slow program execution slightly. | Section 3.39
-g, -g2, -g3 | Generates extra symbol table information in the object file. Specifying -g or -g2 also reduces the default level of optimization to -O0. | Section 3.41
-inline none, -inline manual | Prevents the inlining of all procedures (except statement functions). | Section 5.8.5
-O0, -O1, -O2, or -O3 | Minimizes the optimization level (and types of optimizations). Use during the early stages of program development or when you will use the debugger. | Section 3.62 and Section 5.7
-synchronous_exceptions | Generates extra code to associate an arithmetic exception with the instruction that causes it, slowing efficient instruction execution. Use this option only when troubleshooting, such as when identifying the source of an exception. | Section 3.76
-vms | Controls certain VMS-related run-time defaults, including alignment. If you specify the -vms option, you may need to also specify the -align records option to obtain optimal run-time performance. | Section 3.87
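
Before a final build, it can be useful to check a command line for leftover options from Table 5-2. The following sketch reports any such options it finds in its arguments. The function name report_slow_options is hypothetical. The option list is copied from Table 5-2, except that -c is left out because its effect depends on how -o is used. The sketch only matches options spelled exactly as in the table.

```shell
# Report options from Table 5-2 that appear in a list of f90 arguments.
# Illustrative only: this inspects text and does not invoke the compiler.
report_slow_options() {
    line=" $* "          # pad with spaces so each option is space-delimited
    for opt in "-assume dummy_aliases" "-check bounds" "-check omp_bindings" \
               "-check overflow" "-fpe1" "-fpe2" "-fpe3" "-fpe4" \
               "-fprm dynamic" "-g" "-g2" "-g3" "-inline none" \
               "-inline manual" "-O0" "-O1" "-O2" "-O3" \
               "-synchronous_exceptions" "-vms"; do
        case "$line" in
            *" $opt "*) echo "may slow run-time performance: $opt" ;;
        esac
    done
}

report_slow_options -O5 -check bounds -g3 -o main.out main.f90
```

Options are reported in table order, not command-line order. Some reported options, such as -fpe3 or -vms, may still be required by the application, so the output is a prompt for review, not a list of errors.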


