DIGITAL Fortran 90
User Manual for
DIGITAL UNIX Systems

5.7.4 Additional Global Optimizations

To enable additional global optimizations, use -O3 or a higher optimization level (-O4 or -O5). Using -O3 or higher also enables local optimizations (-O1) and global optimizations (-O2).

Additional global optimizations improve run-time speed at the cost of longer compile times and possibly larger code size.

5.7.4.1 Loop Unrolling

At optimization level -O3 or above, DIGITAL Fortran 90 attempts to unroll certain innermost loops, minimizing the number of branches and grouping more instructions together to allow efficient overlapped instruction execution (instruction pipelining). The best candidates for loop unrolling are innermost loops with limited control flow.

As more loops are unrolled, the average size of basic blocks increases. Loop unrolling generates multiple copies of the code for the loop body (loop code iterations) in a manner that allows efficient instruction pipelining.

The loop body is replicated a certain number of times, substituting index expressions. An initialization loop might be created to align the first reference with the main series of loops. A remainder loop might be created for leftover work.
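The replicated body and the remainder loop described above can be pictured with a hand-written sketch. The program name, array names, and trip count are hypothetical, and the code the optimizer actually generates may differ; an initialization loop for alignment is not shown.

```fortran
program remainder_sketch
  implicit none
  integer, parameter :: n = 10          ! hypothetical trip count, not a multiple of 4
  real :: x(n), y_ref(n), y_unrolled(n)
  integer :: i, n_main

  do i = 1, n
    x(i) = real(i)
  end do

  ! Original loop: one element per iteration.
  do i = 1, n
    y_ref(i) = x(i) + 1.0
  end do

  ! Main loop: body replicated four times with substituted index expressions.
  n_main = n - mod(n, 4)                ! last index handled by the unrolled loop (8 here)
  do i = 1, n_main, 4
    y_unrolled(i)     = x(i)     + 1.0
    y_unrolled(i + 1) = x(i + 1) + 1.0
    y_unrolled(i + 2) = x(i + 2) + 1.0
    y_unrolled(i + 3) = x(i + 3) + 1.0
  end do

  ! Remainder loop: leftover iterations (elements 9 and 10 here).
  do i = n_main + 1, n
    y_unrolled(i) = x(i) + 1.0
  end do

  ! Both forms must produce identical results.
  if (all(y_ref == y_unrolled)) then
    print '(A)', 'PASS'
  else
    print '(A)', 'FAIL'
  end if
end program remainder_sketch
```

The unrolled loop executes two branches instead of eight for the first eight elements, which is where the reduction in branch count comes from.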

The number of times a loop is unrolled is determined either by the optimizer or by the -unroll option, which sets the limit for loop unrolling. Unless the user specifies a value, the optimizer unrolls most loops four times, or two times for certain loops (those with a large estimated code size or with branches out of the loop).

Array operations are often represented as a nested series of loops when expanded into instructions. As with DO loops, the innermost loop for the array operation is the best candidate for loop unrolling. For example, the following array operation (once optimized) is represented by nested loops, where the innermost loop is a candidate for loop unrolling:

A(1:100,2:30) = B(1:100,1:29) * 2.0
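As an illustration only, the sketch below writes out one plausible nested-loop form of this array assignment and then unrolls the innermost loop four times by hand. The program name and the extra result arrays are hypothetical; the loops the compiler generates internally may be organized differently. Because the inner trip count (100) is a multiple of 4, no remainder loop is needed here.

```fortran
program unroll_sketch
  implicit none
  real :: a(100,30), a_loops(100,30), a_unrolled(100,30), b(100,30)
  integer :: i, j

  ! Fill B with small whole numbers so every result is exact.
  do j = 1, 30
    do i = 1, 100
      b(i,j) = real(i + 100*j)
    end do
  end do
  a = 0.0
  a_loops = 0.0
  a_unrolled = 0.0

  ! The array assignment from the text.
  a(1:100,2:30) = b(1:100,1:29) * 2.0

  ! An equivalent pair of nested DO loops; the I loop is innermost.
  do j = 2, 30
    do i = 1, 100
      a_loops(i,j) = b(i,j-1) * 2.0
    end do
  end do

  ! The innermost loop unrolled four times by hand.
  do j = 2, 30
    do i = 1, 100, 4
      a_unrolled(i,  j) = b(i,  j-1) * 2.0
      a_unrolled(i+1,j) = b(i+1,j-1) * 2.0
      a_unrolled(i+2,j) = b(i+2,j-1) * 2.0
      a_unrolled(i+3,j) = b(i+3,j-1) * 2.0
    end do
  end do

  ! All three forms must agree element by element.
  if (all(a == a_loops) .and. all(a == a_unrolled)) then
    print '(A)', 'PASS'
  else
    print '(A)', 'FAIL'
  end if
end program unroll_sketch
```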

5.7.4.2 Code Replication to Eliminate Branches

In addition to loop unrolling and other optimizations, the optimizer reduces the number of branches by replicating code. Code replication decreases the number of basic blocks and increases instruction-scheduling opportunities.

Code replication normally occurs when a branch is at the end of a flow of control, such as a routine with multiple, short exit sequences. The code at the exit sequence gets replicated at the various places where a branch to it might occur.

For example, consider the following unoptimized routine and its optimized equivalent that uses code replication (R0 is register 0):

Unoptimized Instructions          Optimized (Replicated) Instructions

    .                                 .
    .                                 .
    .                                 .
    branch to exit1                   move 1 into R0
    .                                 return
    .                                 .
    .                                 .
    branch to exit1                   .
    .                                 move 1 into R0
    .                                 return
    .                                 .
  exit1: move 1 into R0               .
    return                            .
                                      move 1 into R0
                                      return

Similarly, code replication can also occur within a loop that contains a small amount of shared code at the bottom of a loop and a case-type dispatch within the loop. The loop-end test-and-branch code might be replicated at the end of each case to create efficient instruction pipelining within the code for each case.
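Code replication happens at the instruction level, but a source-level analogy may help. In the sketch below, all names are hypothetical: shared_exit mirrors the unoptimized column (each path branches to one exit sequence), and replicated_exit mirrors the optimized column (the exit sequence is copied to each place that branched to it). This is not the compiler's output, only a picture of the idea.

```fortran
program replicate_sketch
  implicit none
  integer :: k, path_a, path_b, status_a, status_b
  logical :: same

  ! Exercise all three paths and compare the two forms.
  same = .true.
  do k = -1, 1
    call shared_exit(k, path_a, status_a)
    call replicated_exit(k, path_b, status_b)
    if (path_a /= path_b .or. status_a /= status_b) same = .false.
  end do

  if (same) then
    print '(A)', 'PASS'
  else
    print '(A)', 'FAIL'
  end if

contains

  ! Each path branches to a single shared exit sequence at label 100.
  subroutine shared_exit(k, path, status)
    integer, intent(in)  :: k
    integer, intent(out) :: path, status
    if (k < 0) then
      path = 1
      go to 100                 ! branch to exit1
    end if
    if (k == 0) then
      path = 2
      go to 100                 ! branch to exit1
    end if
    path = 3
100 status = 1                  ! exit1: move 1 into R0
    return
  end subroutine shared_exit

  ! The exit sequence is copied to every place that branched to it.
  subroutine replicated_exit(k, path, status)
    integer, intent(in)  :: k
    integer, intent(out) :: path, status
    if (k < 0) then
      path = 1
      status = 1                ! replicated exit sequence
      return
    end if
    if (k == 0) then
      path = 2
      status = 1                ! replicated exit sequence
      return
    end if
    path = 3
    status = 1                  ! replicated exit sequence
    return
  end subroutine replicated_exit

end program replicate_sketch
```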

5.7.5 Automatic Inlining

To enable optimizations that perform automatic inlining, use -O4 or a higher optimization level (-O5). Using -O4 also enables local optimizations (-O1), global optimizations (-O2), and additional global optimizations (-O3).

The default is -O4 (unless -g2, -g, or -gen_feedback is specified).

5.7.5.1 Interprocedure Analysis

Compiling multiple source files at optimization level -O4 or higher lets the compiler examine more code for possible optimizations, including multiple program units. This results in:

Inlining more procedures
More complete global data analysis
Reducing the number of external references to be resolved during linking

As more procedures are inlined, the size of the executable program and compile times may increase, but execution time should decrease.

5.7.5.2 Inlining Procedures

Inlining refers to replacing a subprogram reference (such as a CALL statement or function invocation) with the replicated code of the subprogram. As more procedures are inlined, global optimizations often become more effective.
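The following sketch shows the effect of inlining by hand on one call site. The function scale_add and all other names are hypothetical, chosen only for illustration; the optimizer performs the equivalent substitution internally rather than on the source text.

```fortran
program inline_sketch
  implicit none
  real :: x(8), y_call(8), y_inlined(8)
  integer :: i

  do i = 1, 8
    x(i) = real(i)
  end do

  ! Call site as written: one function invocation per iteration.
  do i = 1, 8
    y_call(i) = scale_add(x(i), 2.0)
  end do

  ! The same loop after inlining by hand: the body of SCALE_ADD replaces
  ! the reference, and the constant argument 2.0 is substituted directly.
  do i = 1, 8
    y_inlined(i) = 2.0 * x(i) + 1.0
  end do

  if (all(y_call == y_inlined)) then
    print '(A)', 'PASS'
  else
    print '(A)', 'FAIL'
  end if

contains

  ! Small procedure of the kind that is a typical inlining candidate.
  real function scale_add(v, factor)
    real, intent(in) :: v, factor
    scale_add = factor * v + 1.0
  end function scale_add

end program inline_sketch
```

Once the call is gone, the loop body contains no subprogram reference, so later loop optimizations can treat it like any other simple loop.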

The optimizer inlines small procedures, limiting inlining candidates based on such criteria as:

Estimated size of code
Number of call sites
Use of constant arguments

You can specify:

One of the -Onum options to control the optimization level. For example, specifying -O4 or higher enables interprocedure optimizations. Different -Onum options set different -inline xxxx options; for example, -O4 sets -inline speed.
One of the -inline xxxx options to directly control the inlining of procedures (see Section 5.8.5). For example, -inline speed inlines more procedures than -inline size.

5.7.6 Loop Transformation and Software Pipelining

The -O5 option enables a group of optimizations known as loop transformation optimizations, together with software pipelining and its associated additional software dependence analysis. In certain cases, this improves run-time performance.

The loop transformation optimizations apply to array references within loops and can apply to multiple nested loops. These optimizations can improve the performance of the memory system.

Software pipelining applies instruction scheduling to certain innermost loops, allowing instructions within a loop to "wrap around" and execute in a different iteration of the loop. This can reduce the impact of long-latency operations, resulting in faster loop execution.

Software pipelining also enables the prefetching of data to reduce the impact of cache misses.

For More Information:

On loop transformations, see Section 5.8.1.
On software pipelining, see Section 5.8.2.

5.8 Other Options Related to Optimization

In addition to the -Onum options (discussed in Section 5.7), several other f90 command options can prevent or facilitate improved optimizations.

5.8.1 Loop Transformation

The loop transformation optimizations are enabled by using the -transform_loops option or the -O5 option. Loop transformation attempts to improve performance by rewriting loops to make better use of the memory system. By rewriting loops, the loop transformation optimizations can increase the number of instructions executed, which can degrade the run-time performance of some programs.

To request loop transformation optimizations without software pipelining, do one of the following:

Specify -O5 with -nopipeline (preferred method)
Specify -transform_loops with -O4, -O3, or -O2. This optimization is not performed at optimization levels below -O2.

The loop transformation optimizations apply to array references within loops. These optimizations can improve the performance of the memory system and usually apply to multiple nested loops. The loops chosen for loop transformation optimizations are always counted loops. Counted loops use a variable to count iterations, thereby determining the number of iterations before entering the loop. For example, most DO loops are counted loops.

Conditions that typically prevent the loop transformation optimizations from occurring include subprogram references that are not inlined (such as an external function call), complicated exit conditions, and uncounted loops.

The types of optimizations associated with -transform_loops include the following:

Loop blocking: Can minimize memory system use with multidimensional array elements by completing as many operations as possible on array elements currently in the cache. Also known as loop tiling.
Loop distribution: Moves instructions from one loop into separate, new loops. This can reduce the amount of memory used during one loop so that the remaining memory may fit in the cache. It can also create improved opportunities for loop blocking.
Loop fusion: Combines instructions from two or more adjacent loops that use some of the same memory locations into a single loop. This can avoid the need to load those memory locations into the cache multiple times and improves opportunities for instruction scheduling.
Loop interchange: Changes the nesting order of some or all loops. This can minimize the stride of array element access during loop execution and reduce the number of memory accesses needed. Also known as loop permutation.
Scalar replacement: Replaces the use of an array element with a scalar variable under certain conditions.
Outer loop unrolling: Unrolls the outer loop inside the inner loop under certain conditions to minimize the number of instructions and memory accesses needed. This also improves opportunities for instruction scheduling and scalar replacement.
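Two of the transformations listed above, loop interchange and loop fusion, can be sketched by hand as follows. The program name, array names, and extent are hypothetical, and whether the optimizer applies either transformation to a given loop nest depends on its own analysis.

```fortran
program loop_transform_sketch
  implicit none
  integer, parameter :: n = 64            ! hypothetical array extent
  real :: a(n,n), b(n,n)
  real :: c_ref(n,n), d_ref(n,n), c_new(n,n), d_new(n,n)
  integer :: i, j

  do j = 1, n
    do i = 1, n
      a(i,j) = real(i)
      b(i,j) = real(j)
    end do
  end do

  ! As written: J is innermost, so consecutive iterations touch elements
  ! N apart in memory (Fortran stores arrays column by column), and two
  ! adjacent loop nests each read the same elements of A.
  do i = 1, n
    do j = 1, n
      c_ref(i,j) = a(i,j) + b(i,j)
    end do
  end do
  do i = 1, n
    do j = 1, n
      d_ref(i,j) = a(i,j) * 2.0
    end do
  end do

  ! After interchange (I innermost, unit stride) and fusion (one nest,
  ! so each element of A is read once for both results).
  do j = 1, n
    do i = 1, n
      c_new(i,j) = a(i,j) + b(i,j)
      d_new(i,j) = a(i,j) * 2.0
    end do
  end do

  ! The transformed nest must give the same results as the original.
  if (all(c_ref == c_new) .and. all(d_ref == d_new)) then
    print '(A)', 'PASS'
  else
    print '(A)', 'FAIL'
  end if
end program loop_transform_sketch
```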

For More Information:

On the interaction of command-line options and timing programs compiled with the loop transformation optimizations, see Section 3.75.

5.8.2 Software Pipelining

Software pipelining and additional software dependence analysis are enabled by using the -pipeline option or the -O5 option. In certain cases, software pipelining improves run-time performance.

The software pipelining optimization applies instruction scheduling to certain innermost loops, allowing instructions within a loop to "wrap around" and execute in a different iteration of the loop. This can reduce the impact of long-latency operations, resulting in faster loop execution.

Loop unrolling (enabled at -O3 or above) cannot schedule across iterations of a loop. Because software pipelining can schedule across loop iterations, it can perform more efficient scheduling to eliminate instruction stalls within loops.

For instance, if software dependence analysis of data flow reveals that certain calculations can be done before or after that iteration of the loop, software pipelining reschedules those instructions ahead of or behind that loop iteration, at places where their execution can prevent instruction stalls or otherwise improve performance.
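The rescheduling described above can be imitated at the source level. In this hand-written sketch, with hypothetical names throughout, the load for iteration I+1 is moved into iteration I so that it can overlap the arithmetic for iteration I; a prologue and an epilogue handle the first load and the last computation. The compiler performs this on instructions, not on source statements, and its actual schedule may differ.

```fortran
program pipeline_sketch
  implicit none
  integer, parameter :: n = 16            ! hypothetical trip count
  real :: x(n), y_ref(n), y_pipe(n), t, t_next
  integer :: i

  do i = 1, n
    x(i) = real(i)
  end do

  ! Original loop: each iteration loads X(I) and immediately uses it.
  do i = 1, n
    t = x(i)
    y_ref(i) = t * 3.0 + 1.0
  end do

  ! Hand-pipelined form.
  t_next = x(1)                    ! prologue: load for iteration 1
  do i = 1, n - 1
    t = t_next
    t_next = x(i + 1)              ! load for iteration I+1 ...
    y_pipe(i) = t * 3.0 + 1.0      ! ... alongside the arithmetic for iteration I
  end do
  y_pipe(n) = t_next * 3.0 + 1.0   ! epilogue: arithmetic for the last iteration

  if (all(y_ref == y_pipe)) then
    print '(A)', 'PASS'
  else
    print '(A)', 'FAIL'
  end if
end program pipeline_sketch
```

Note that the pipelined form keeps two values live at once (t and t_next), which hints at why loops that already use many registers may not benefit.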

Software pipelining also enables the prefetching of data to reduce the impact of cache misses.

Software pipelining can be more effective when you combine -pipeline (or -O5) with the appropriate -tune keyword for the target Alpha processor generation (see Section 5.8.6).

To specify software pipelining without loop transformation optimizations, do one of the following:

Specify -O5 with -notransform_loops (preferred method)
Specify -pipeline with -O4, -O3, or -O2. This optimization is not performed at optimization levels below -O2.

For this version of DIGITAL Fortran 90, loops chosen for software pipelining:

Are always innermost loops (those executed the most).
Do not contain branches or procedure calls.
Do not use COMPLEX floating-point data.

By modifying the unrolled loop and inserting instructions as needed before and/or after the unrolled loop, software pipelining generally improves run-time performance, except where the loops contain a large number of instructions with many existing overlapped operations. In this case, software pipelining may not have enough registers available to improve execution performance effectively, and run-time performance with -O5 (or -pipeline) may not improve compared to -O4.

For programs that contain loops that exhaust available registers, longer execution times may result with -O5 or -pipeline. In cases where performance does not improve, consider compiling with the -unroll 1 option along with -O5 or -pipeline, which may improve the effects of software pipelining.

For More Information:

On the interaction of command-line options and timing programs compiled with software pipelining, see Section 3.62.

5.8.3 Setting Multiple Options with the -fast Option

Specifying the -fast option sets the following options:

-align dcommons (see Section 5.3)
-fp_reorder, same as -assume noaccuracy_sensitive (see Section 5.8.9)
-assume bigarrays (see Section 3.87.1)
-assume nozsize (see Section 3.87)
-math_library fast (see Section 3.50)

5.8.4 Controlling Loop Unrolling

You can specify the number of times a loop is unrolled by using the -unroll num option (see Section 3.78).

The -unroll num option can also influence the run-time results of software pipelining optimizations performed when you specify -O5.

Although unrolling loops usually improves run-time performance, the size of the executable program may increase.

For More Information:

On loop unrolling, see Section 5.7.4.1.

5.8.5 Controlling the Inlining of Procedures

To specify the types of procedures to be inlined, use the -inline keyword options. Also, compile multiple source files together and specify an adequate optimization level, such as -O4.

If you omit -noinline and the -inline keyword options, the -Onum optimization level in effect determines the types of procedures that are inlined.

The -inline option keywords are as follows:

-inline none (same as -noinline) inlines statement functions but not other procedures. This type of inlining occurs if you specify -O0 or -O1 and omit -inline keyword options.
-inline manual inlines statement functions but not other procedures. This type of inlining occurs if you specify -O2 or -O3 and omit -inline keyword options.
-inline size inlines statement functions and any procedures that the DIGITAL Fortran 90 optimizer expects will improve run-time performance with no likely significant increase in program size.
-inline speed inlines statement functions and any procedures that the DIGITAL Fortran 90 optimizer expects will improve run-time performance with a likely significant increase in program size. This type of inlining occurs if you specify -O4 or -O5 and omit -inline keyword options.
-inline all inlines every call that can possibly be inlined while generating correct code, including the following:
- Statement functions (always inlined)
- Any procedures that DIGITAL Fortran 90 expects will improve run-time performance with a likely significant increase in program size.
- Any other procedures that can possibly be inlined and generate correct code. Certain recursive routines are not inlined to prevent infinite expansion.

For information on the inlining of other procedures (inlined at optimization level -O4 or higher), see Section 5.7.5.2.

Maximizing the types of procedures that are inlined usually improves run-time performance, but compile-time memory usage and the size of the executable program may increase.

To determine whether using -inline all benefits your particular program, time program execution for the same program compiled with and without -inline all.

5.8.6 Requesting Optimized Code for a Specific Processor Generation

You can specify the types of optimized code to be generated by using the -tune keyword option. Regardless of the specified keyword, the generated code will run correctly on all implementations of the Alpha architecture. Tuning for a specific implementation can improve run-time performance; it is also possible that code tuned for a specific target may run slower on another target.

Specifying the -tune keyword that matches the target processor generation usually improves run-time performance slightly. Unless you request software pipelining, the run-time performance difference for using the wrong keyword (such as using -tune ev4 for an EV5 processor) is usually less than 5%. When you use software pipelining (using -O5) with -tune keyword, the difference can be more than 5%.

The combination of the specified keyword for -tune keyword and the type of processor generation used has no effect on producing the expected correct program results.

The keywords for -tune keyword are as follows:

-tune generic generates and schedules code that will execute well for all types of Alpha processor generations. This provides generally efficient code for those applications that will be run on systems using all types of processor generations (an alternative to providing multiple versions of the application compiled for each processor generation type).
-tune host generates and schedules code optimized for the type of processor generation in use on the system being used for compilation.
-tune ev4 generates and schedules code optimized for the EV4 (21064) processor generation.
-tune ev5 generates and schedules code optimized for the EV5 (21164) processor generation. This processor generation is faster than EV4.

If you omit -tune keyword, -tune generic is used.

5.8.7 Requesting the Speculative Execution Optimization

Speculative execution reduces instruction latency stalls to improve run-time performance for certain programs or routines. Speculative execution evaluates conditional code (including exceptions) and moves instructions that would otherwise be executed conditionally to a position before the test, so they are executed unconditionally.
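A source-level picture of this code motion, under stated assumptions, looks like the sketch below. All names are hypothetical, and the input values are chosen so that the moved operation is always valid. If the moved operation could fault for some operands (an invalid argument or a division by zero, for example), the speculated instruction could raise an exception that the program as written would never raise.

```fortran
program speculate_sketch
  implicit none
  integer, parameter :: n = 8             ! hypothetical array size
  real :: x(n), y_ref(n), y_spec(n), t
  integer :: i

  do i = 1, n
    x(i) = real(i * i)             ! positive values only, so SQRT is always valid
  end do
  y_ref = 0.0
  y_spec = 0.0

  ! As written: the square root is computed only when the test succeeds.
  do i = 1, n
    if (x(i) > 10.0) then
      y_ref(i) = sqrt(x(i))
    end if
  end do

  ! Speculated by hand: SQRT is moved ahead of the test and executed
  ! unconditionally; its result is used only when the test succeeds.
  do i = 1, n
    t = sqrt(x(i))
    if (x(i) > 10.0) then
      y_spec(i) = t
    end if
  end do

  if (all(y_ref == y_spec)) then
    print '(A)', 'PASS'
  else
    print '(A)', 'FAIL'
  end if
end program speculate_sketch
```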

The default, -speculate none, means that the speculative execution code scheduling optimization is not used and exceptions are reported as expected. You can specify -speculate all or -speculate by_routine to request the speculative execution optimization.

Performance improvements may be reduced because the run-time system must dismiss exceptions caused by speculative instructions. For certain programs, longer execution times may result when using the speculative execution optimization. To determine whether using -speculate all or -speculate by_routine benefits your particular program, time program execution with one of these options and compare it with the same program compiled with -speculate none (the default).

Speculative execution does not support some run-time error checking, since exception and signal processing (including SIGSEGV, SIGBUS, and SIGFPE) is conditional. When the program needs to be debugged or while you are testing for errors, use only -speculate none.

For More Information:

On -speculate all or -speculate by_routine and the interaction with other command-line options, see Section 3.70.
