DIGITAL Fortran 90
User Manual for
DIGITAL UNIX Systems

6.3 Decomposing Loops for Parallel Processing

Note

The following sections contain information that applies to both the OpenMP Fortran API and the DIGITAL Fortran parallel compiler directives. The code examples use the OpenMP API directive format.

The term loop decomposition is used to specify the process of dividing the iterations of an iterated DO loop and running them on two or more threads of a shared-memory multi-processor computer system.

To run in parallel, the source code in iterated DO loops must be decomposed by the user, and adequate system resources must be made available. Decomposition is the process of analyzing code for data dependences, dividing up the workload, and ensuring correct results when iterations run concurrently. The only type of decomposition available with DIGITAL Fortran is directed decomposition using a set of parallel compiler directives.

The following sections describe how to decompose loops and how to use the OpenMP Fortran API and the DIGITAL Fortran parallel compiler directives to achieve parallel processing.

6.3.1 Directed Decomposition

When a program is compiled using the -omp or the -mp option, the compiler parses the parallel compiler directives. However, you must transform the source code to resolve any loop-carried dependences and improve run-time performance. ¹

To use directed decomposition effectively, take the following steps:

Identify the loops that benefit most from parallel processing.
- Consider whether another algorithm might achieve more parallelism in general.
- Evaluate any caller or called loops and decompose the most CPU-intensive loops in the application (as long as there are no interfering dependences).
  If a parallel DO loop invokes a subprogram containing another parallel DO loop, only the parallel DO loop of the calling program will be run in parallel. Each of the threads executing the outermost parallel DO loop will execute all of the iterations in the innermost parallel DO loop in a serial, nonparallel fashion.
- Make sure the loop contains enough CPU work to outweigh the parallel-processing startup overhead.
Analyze the loop and resolve dependences as needed (see Section 6.3.1.1). If you cannot resolve loop-carried dependences, you cannot safely decompose the loop.
Make sure the shared or private attributes inside the loop are consistent with corresponding use outside the loop. By default, common blocks and individual variables are shared, except for the loop control variable and variables referenced in a subprogram called from within a parallel loop (in which case they are private by default).
Precede the loop with the PARALLEL directive followed by the DO directive. You can combine the two directives by using the PARALLEL DO directive.
As needed, manually optimize the loop.
Make sure the loop complies with restrictions of the parallel-processing environment.
Without using the -omp option or the -mp option, compile, test, and debug the program.
Using -omp (or -mp ), repeat the previous step.
Evaluate the parallel run:
- If you reach an acceptable level of performance and if the results are correct, stop.
- If the results are inaccurate, analyze the manually decomposed loops for dependences, apply a method to resolve them, and retest the parallel run.
- If performance is inadequate, consider adjusting the run-time environment (see Section 6.3.1.4) or performing other manual optimizations, or consider other alternatives discussed in this manual. Then reenter the cycle by retesting the parallel program.

6.3.1.1 Resolving Dependences Manually

In directed decomposition, you must resolve loop-carried dependences and dependences involving temporary variables to ensure safe parallel execution. Only cycles of dependences are nearly impossible to resolve.

Do one of the following:

Let the loop execute serially (possibly decompose an outer loop level)
Use a lock (CRITICAL) to force the critical section to execute serially
Recode or restructure the loop
Find another algorithm that does not have cycles of dependences

There are several methods for resolving dependences manually:

For dependences on variables used as temporaries, declare them PRIVATE; this effectively makes separate copies of temporary values for each thread.
Recode the loop so that the loop-carried dependence becomes loop independent, with each thread having the involved store and fetch operation contained in a single iteration.
Insert locks (CRITICAL) around the critical section containing the dependence.
Use this technique only for very CPU-intensive loops, when no other method is possible, and for the smallest amount of code possible. The locks extend processing time by making individual threads wait while only one executes the critical region at a time.
Recode loops with cycles of dependences (these are typically linear recurrences).

Resolving Dependences Involving Temporary Variables

Declare temporary variables PRIVATE to resolve dependences involving them. Temporary variables are used in intermediate calculations. If they are used in more than one iteration of a parallel loop, the program can produce incorrect results.

One thread might define a value and another thread use that value instead of the one it defined for a particular iteration. Loop control variables are prime examples of temporary variables that are declared PRIVATE by default within a parallel region. For example:

DO I = 1,100 TVAR = A(I) + 2 D(I) = TVAR + Y(I-1) END DO

As long as certain criteria are met, you can resolve this kind of dependence by declaring the temporary variable (TVAR, in the example) PRIVATE. That way, each thread keeps its own copy of the variable.

For the criteria to be met, the values of the temporary variable must be all of the following:

Defined in each iteration, inside the loop
Meant to be used inside the same iteration that established it
Used nowhere outside the loop unless it is redefined outside the loop before subsequent use

The default for variables in a parallel loop is SHARED, so you must explicitly declare these variables PRIVATE to resolve this kind of dependence.

Resolving Loop-Carried Dependences

You can often resolve loop-carried dependences using one or more of the following loop transformations:

Loop alignment
Code replication
Loop distribution
Restructure the loop into an inner and outer loop

These techniques also resolve dependences that inhibit autodecomposition.

Loop Alignment

Loop alignment offsets memory references in the loop so that the dependence is no longer loop carried. The following example shows a loop that is aligned to resolve the dependence in array A.

Loop with Dependence Aligned Statements

DO I = 2,N A(I) = B(I) C(I) = A(I+1) END DO
C(I-1) = A(I) A(I) = B(I)

Loop with Dependence	Aligned Statements
`DO I = 2,N A(I) = B(I) C(I) = A(I+1) END DO`	`C(I-1) = A(I) A(I) = B(I)`

To compensate for the alignment and achieve the same calculations as the original loop, you probably have to perform one or more of the following:

Change the loop control variable.
Add IF constructs.
Switch the order of the statements (this preserves the relative store-fetch order of the original loop).

Example 6-1 shows two possible forms of the final loop.

Example 6-1 Aligned Loop

! First possible form: !$OMP PARALLEL PRIVATE (I) !$OMP DO DO I = 2,N+1 IF (I .GT. 2) C(I-1) = A(I) IF (I .LE. N) A(I) = B(I) END DO !$OMP END DO !$OMP END PARALLEL ! ! Second possible form; more efficient because the tests are ! performed outside the loop: ! !$OMP PARALLEL !$OMP DO DO I = 3,N C(I-1) = A(I) A(I) = B(I) END DO !$OMP END DO !$OMP END PARALLEL IF (N .GE. 2) A(2) = B(2) C(N) = A(N+1) END IF

Code Replication

When a loop contains a loop-independent dependence as well as a loop-carried dependence, loop alignment alone is usually not adequate. By resolving the loop-carried dependence, you often misalign another dependence. Code replication creates temporary variables that duplicate operations and keep the loop-independent dependences inside each iteration.

In S₂ of the following loop, aligning the A(I-1) reference without code replication would misalign the A(I) reference:

Loop with Multiple Dependences Misaligned Dependence

DO I = 2,100 S ₁ A(I) = B(I) + C(I) S ₂ D(I) = A(I) + A(I-1) END DO
D(I-1) = A(I-1) + A(I) A(I) = B(I) + C(I)

Loop with Multiple Dependences	Misaligned Dependence
`DO I = 2,100 S ₁ A(I) = B(I) + C(I) S ₂ D(I) = A(I) + A(I-1) END DO`	`D(I-1) = A(I-1) + A(I) A(I) = B(I) + C(I)`

Example 6-2 uses code replication to keep the loop-independent dependence inside each iteration. The temporary variable, TA, must be declared PRIVATE.

Example 6-2 Transformed Loop Using Code Replication

!$OMP PARALLEL PRIVATE (I,TA) A(2) = B(2) + C(2) D(2) = A(2) + A(1) !$OMP DO DO I = 3,100 A(I) = B(I) + C(I) TA = B(I-1) + C(I-1) D(I) = A(I) + TA END DO !$OMP END DO !$OMP END PARALLEL

Loop Distribution

Loop distribution allows more parallelism when neither loop alignment nor code replication can resolve the dependences. Loop distribution divides the contents of loops into multiple loops so that dependences cross between two separate loops. The loops run serially in relation to each other, even if they both run in parallel.

The following loop contains multiple dependences that cannot be resolved by either loop alignment or code replication:

DO I = 1,100 S₁ A(I) = A(I-1) + B(I) S₂ C(I) = B(I) - A(I) END DO

Example 6-3 resolves the dependences by distributing the loop. S₂ can run in parallel despite the data recurrence in S₁.

Example 6-3 Distributed Loop

DO I 1,100 S₁ A(I) = A(I-1) + B(I) END DO DO I 1,100 S₂ C(I) = B(I) - A(I) END DO

Restructuring a Loop into an Inner and Outer Nest

Restructuring a loop into an inner and outer loop nest can resolve some recurrences that are used as rapid approximations of a function of the loop control variable. For example, the following loop uses sines and cosines:

THETA = 2.*PI/N DO I=0,N-1 S = SIN(I*THETA) C = COS(I*THETA) . . ! use S and C . END DO

Using a recurrence to approximate the sines and cosines can make the serial loop run faster (with some loss of accuracy), but it prevents the loop from running in parallel:

THETA = 2.*PI/N STH = SIN(THETA) CTH = COS(THETA) S = 0.0 C = 1.0 DO I=0,N-1 . . ! use S and C . S = S*CTH + C*STH C = C*CTH - S*STH END DO

To resolve the dependences, substitute the SIN and COS calls (however, this loses the performance improvement gained from using the recurrence). You can also restructure the loop into an outer parallel loop and an inner serial loop. Each iteration of the outer loop reinitializes the recurrence, and the inner loop uses the value:

!$OMP PARALLEL SHARED (THETA,STH,CTH,LCHUNK) PRIVATE (ISTART,I,S,C) THETA = 2.*PI/N STH = SIN(THETA) CTH = COS(THETA) LCHUNK = (N + NWORKERS()-1) / NWORKERS !$OMP DO DO ISTART = 0,N-1,LCHUNK S = SIN(ISTART*THETA) C = COS(ISTART*THETA) DO I = ISTART, MIN(N,ISTART+LCHUNK-1) . . ! use S and C . S = S*CTH + C*STH C = C*CTH - S*STH END DO END DO !$OMP END DO !$OMP END PARALLEL

Dependences Requiring Locks

When no other method can resolve a dependence, you can put locks around the critical section that contains them. Locks force threads to execute the critical section serially, while allowing the rest of the loop to run in parallel.

However, locks degrade performance because they force the critical section to run serially and increase the overhead. They are best used only when no other technique resolves the dependence, and only in CPU-intensive loops.

To create locks in a loop, enclose the critical section between the CRITICAL and END CRITICAL directives. When a thread executes the CRITICAL directive and the latch variable is open, it takes possession of the latch variable, and other threads must wait to execute the section. The latch variable becomes open when the thread executing the section executes the END CRITICAL directive.

The latch variable is closed when a thread has possession of it and open when the latch variable is free.

In Example 6-4, the statement updating the sum is locked for safe parallel execution of the loop.

Example 6-4 Decomposed Loop Using Locks

INTEGER(4) LCK !$OMP PARALLEL PRIVATE (I,Y) SHARED (LCK,SUM) LCK = 0 . . . !$OMP DO DO I = 1,1000 Y = some_calculation !$OMP CRITICAL (LCK) SUM = SUM + Y !$OMP END CRITICAL (LCK) END DO !$OMP END DO !$OMP END PARALLEL

This particular example is better solved using a reduction clause as shown in Example 6-5.

Example 6-5 Decomposed Loop Using a Reduction Clause

INTEGER(4) LCK !$OMP PARALLEL PRIVATE (I,Y) SHARED (LCK,SUM) LCK = 0 . . . !$OMP DO REDUCTION (SUM) DO I = 1,1000 Y = some_calculation SUM = SUM + Y END DO !$OMP END DO !$OMP END PARALLEL

6.3.1.2 Coding Restrictions

Because iterations in a parallel DO loop execute in an indeterminate order and in different threads, certain constructs in these loops can cause unpredictable run-time behavior.

The following restrictions are flagged:

The loop control variable for a parallel loop must be declared an integer.
Only comment lines and blank lines can exist between a DO directive and the DO loop statement.
The loop body must not contain any RETURN statements.
A loop with a branch (GOTO) into or out of its body, or having an EXIT statement cannot be run in parallel.

The following restrictions are not flagged:

Loop-carried dependences involving shared variables must not exist between iterations of a parallel loop.
Dependences involving private variables must not exist between code within a parallel loop and code executed before entry into or after the completion of the loop.
System services or run-time library routines that change the context of a thread (such as a change in privileges, priority, access mode, or environment variables) must not be called from within a parallel loop.
I/O statements and the control statements PAUSE and STOP must not be used in a routine called at any call level from within a parallel loop.
Private symbols must not be referenced in a SAVE statement in a routine called at any call level from within a parallel loop.
If a dummy argument is referenced within a parallel DO loop, the corresponding actual argument must reside in shared memory.
Random number generators must be used carefully inside parallel loops, because parallel processing affects how numbers are generated.

6.3.1.3 Manual Optimization

To manually optimize structures containing parallel loops:

Interchange loops so that the parallel loop has the most CPU work and the caches can perform efficiently.
Balance the parallel work among threads when it is unusually unbalanced.

Interchanging Loops

The following example shows a case in which an inner loop can run in parallel and an outer loop cannot, because of a loop-carried dependence. The inner loop also has a more effective memory-referencing pattern for parallel processing than the outer loop. By interchanging the loops, more work executes in parallel and the cache can perform more efficiently.

Original Structure Interchanged Structure

!$OMP PARALLEL PRIVATE (J,I) SHARED (A)

!$OMP DO

DO I = 1,100 DO J = 1,300

DO J = 1,300 DO I = 1,100

A(I,J) = A(I+1,J) + 1 A(I,J) = A(I+1,J) + 1

END DO END DO

END DO END DO

!$OMP END DO

!$OMP END PARALLEL

Original Structure	Interchanged Structure
	`!$OMP PARALLEL PRIVATE (J,I) SHARED (A)`
	`!$OMP DO`
`DO I = 1,100`	`DO J = 1,300`
`DO J = 1,300`	`DO I = 1,100`
`A(I,J) = A(I+1,J) + 1`	`A(I,J) = A(I+1,J) + 1`
`END DO`	`END DO`
`END DO`	`END DO`
	`!$OMP END DO`
	`!$OMP END PARALLEL`

Balancing the Workload

On the DO directive, you can specify the SCHEDULE(GUIDED) clause to use guided self-scheduling in manually decomposed loops, which is effective for most loops. However, when the iterations contain a predictably unbalanced workload, you can obtain better performance by manually balancing the workload. To do this, specify the chunk size in the SCHEDULE clause of the DO directive.

In the following loop, it might be very inefficient to divide the iterations into chunks of 50. A chunk size of 25 would probably be much more efficient on a system with two processors, depending on the amount of work being done by the routine SUB.

DO I = 1,100 . . . IF (I .LT. 50) THEN CALL SUB(I) END IF . . . END DO

6.3.1.4 Adjusting the Run-Time Environment

The OpenMP Fortran API and the DIGITAL Fortran parallel compiler directive sets also provide environment variables that adjust the run-time environment in unusual situations.

Regardless of whether you used the -omp or the -mp compiler option, when the compiler needs information supplied by an environment variable, the compiler first looks for an OpenMP Fortran API environment variable and then for a DIGITAL Fortran parallel compiler environment variable. If neither one is found, the compiler uses a default.

The compiler looks for environment variable information in the following situations:

When entering a parallel region, it looks for the number of threads (OMP_NUM_THREADS or MP_THREAD_COUNT),the spin count (MP_SPIN_COUNT), the yield count (MP_YIELD_COUNT), and the stack size (MP_STACK_SIZE).
When entering a DO or PARALLEL DO directive that has RUNTIME specified, it looks at schedule type (OMP_SCHEDULE.
When entering a worksharing directive, it looks at chunk size (MP_CHUNK_SIZE).

The OpenMP Fortran API environment variables are listed in Table 6-4.

.
Table 6-4 OpenMP Fortran API Environment Variables
Environment Variable¹ Interpretation

OMP_SCHEDULE

This variable applies only to DO and PARALLEL DO directives that have the schedule type of RUNTIME. You can set the schedule type and an optional chunk size for these loops at run time. The schedule types are STATIC, DYNAMIC, GUIDED, and RUNTIME.
For directives that have a schedule type other than RUNTIME, this variable is ignored. The compiler default schedule type is STATIC. If the optional chunk size is not set, a chunk size of one is assumed, except for the STATIC schedule type. For this schedule type, the default chunk size is set to the loop iteration space divided by the number of threads applied to the loop.

OMP_NUM_THREADS

Use this environment variable to set the number of threads to use during execution. This number applies unless you explicitly change it by calling the OMP_SET_NUM_THREADS run-time library routine.
When you have enabled dynamic thread adjustment, the value assigned to this environment variable represents the maximum number of threads that can be used. The default value is the number of processors in the current system. For more information about dynamic thread adjustment, see the online release notes.

OMP_DYNAMIC

Use this environment variable to enable or disable dynamic thread adjustment for the execution of parallel regions. When set to TRUE, the number of threads used can be adjusted by the run-time environment to best utilize system resources. When set to FALSE, dynamic adjustment is disabled. The default is FALSE. For more information about dynamic thread adjustment, see the online release notes.

OMP_NESTED

Use this environment variable to enable or disable nested parallelism. When set to TRUE, nested parallelism is enabled. When set to FALSE, it is disabled. The default is FALSE. For more information about nested parallelism, see the online release notes.

**Table 6-4 OpenMP Fortran API Environment Variables**
Environment Variable¹	Interpretation
OMP_SCHEDULE
	This variable applies only to DO and PARALLEL DO directives that have the schedule type of RUNTIME. You can set the schedule type and an optional chunk size for these loops at run time. The schedule types are STATIC, DYNAMIC, GUIDED, and RUNTIME. For directives that have a schedule type other than RUNTIME, this variable is ignored. The compiler default schedule type is STATIC. If the optional chunk size is not set, a chunk size of one is assumed, except for the STATIC schedule type. For this schedule type, the default chunk size is set to the loop iteration space divided by the number of threads applied to the loop.
OMP_NUM_THREADS
	Use this environment variable to set the number of threads to use during execution. This number applies unless you explicitly change it by calling the OMP_SET_NUM_THREADS run-time library routine. When you have enabled dynamic thread adjustment, the value assigned to this environment variable represents the maximum number of threads that can be used. The default value is the number of processors in the current system. For more information about dynamic thread adjustment, see the online release notes.
OMP_DYNAMIC
	Use this environment variable to enable or disable dynamic thread adjustment for the execution of parallel regions. When set to TRUE, the number of threads used can be adjusted by the run-time environment to best utilize system resources. When set to FALSE, dynamic adjustment is disabled. The default is FALSE. For more information about dynamic thread adjustment, see the online release notes.
OMP_NESTED
	Use this environment variable to enable or disable nested parallelism. When set to TRUE, nested parallelism is enabled. When set to FALSE, it is disabled. The default is FALSE. For more information about nested parallelism, see the online release notes.

¹Environment variable names must be in uppercase; the assigned values are not case-sensitive.

The DIGITAL Fortran parallel compiler environment variables are listed in Table 6-5.

Table 6-5 DIGITAL Fortran Environment Variables
Environment Variable Interpretation

MP_THREAD_COUNT

Specifies the number of threads the run-time system is to create. The default is the number of processors available to your process.

MP_CHUNK_SIZE

Specifies the chunk size the run-time system uses when dispatching loop iterations to threads if the program specified the RUNTIME schedule type or specified another schedule type requiring a chunk size, but omitted the chunk size. The default chunk size is 1.

MP_STACK_SIZE

Specifies how many bytes of stack space the runtime system allocates for each thread when creating it. If you specify zero, the runtime system uses the default, which is very small. Therefore, if a program declares any large arrays to be PRIVATE, specify a value large enough to allocate them. If you do not use this environment variable at all, the runtime system allocates 5 MB.

MP_SPIN_COUNT

Specifies how many times the runtime system spins while waiting for a condition to become true. The default is 16,000,000, which is approximately one second of CPU time.

MP_YIELD_COUNT

Specifies how many times the runtime system alternates between calling sched_yield and testing the condition before going to sleep by waiting for a thread condition variable. The default is 10.

**Table 6-5 DIGITAL Fortran Environment Variables**
Environment Variable	Interpretation
MP_THREAD_COUNT
	Specifies the number of threads the run-time system is to create. The default is the number of processors available to your process.
MP_CHUNK_SIZE
	Specifies the chunk size the run-time system uses when dispatching loop iterations to threads if the program specified the RUNTIME schedule type or specified another schedule type requiring a chunk size, but omitted the chunk size. The default chunk size is 1.
MP_STACK_SIZE
	Specifies how many bytes of stack space the runtime system allocates for each thread when creating it. If you specify zero, the runtime system uses the default, which is very small. Therefore, if a program declares any large arrays to be PRIVATE, specify a value large enough to allocate them. If you do not use this environment variable at all, the runtime system allocates 5 MB.
MP_SPIN_COUNT
	Specifies how many times the runtime system spins while waiting for a condition to become true. The default is 16,000,000, which is approximately one second of CPU time.
MP_YIELD_COUNT
	Specifies how many times the runtime system alternates between calling sched_yield and testing the condition before going to sleep by waiting for a thread condition variable. The default is 10.

Note

¹ Another method of supporting parallel processing does not involve iterated DO loops. Instead, it allows large amounts of independent code to be run in parallel using the SECTIONS and SECTION directives.

Contents

Index

DIGITAL Fortran 90User Manual for DIGITAL UNIX Systems