DIGITAL Fortran 90
User Manual for
DIGITAL UNIX Systems

5.4 Using Arrays Efficiently

The following sections discuss these: topics:

Accessing arrays efficiently
Passing arrays efficiently

5.4.1 Accessing Arrays Efficiently

On Alpha systems, many of the array access efficiency techniques described in this section are applied automatically by the DIGITAL Fortran 90 loop transformation optimizations (see Section 5.8.1) or by the KAP for DIGITAL Fortran 90 for DIGITAL UNIX Systems performance preprocessor (described in Section 5.1.1).

Several aspects of array use can improve run-time performance:

The fastest array access occurs when contiguous access to the whole array or most of an array occurs. Perform one or a few array operations that access all of the array or major parts of an array instead of numerous operations on scattered array elements.
Rather than use explicit loops for array access, use elemental array operations, such as the following line that increments all elements of array variable A:

A = A + 1.

When reading or writing an array, use the array name and not a DO loop or an implied DO-loop that specifies each element number. Fortran 90 array syntax allows you to reference a whole array by using its name in an expression. For example:

REAL :: A(100,100) A = 0.0 A = A + 1. ! Increment all elements of A by 1 . . . WRITE (8) A ! Fast whole array use

Similarly, you can use derived-type array structure components, such as:

TYPE X INTEGER A(5) END TYPE X . . . TYPE (X) Z WRITE (8) Z%A ! Fast array structure component use

Make sure multidimensional arrays are referenced using proper array syntax and are traversed in the "natural" ascending order column major for Fortran and Fortran 90. With column-major order, the leftmost subscript varies most rapidly with a stride of one. Whole array access uses column-major order.
Avoid row-major order, as is done by C, where the rightmost subscript varies most rapidly.
For example, consider the nested DO loops that access a two-dimension array with the J loop as the innermost loop:

INTEGER X(3,5), Y(3,5), I, J Y = 0 DO I=1,3 ! I outer loop varies slowest DO J=1,5 ! J inner loop varies fastest X (I,J) = Y(I,J) + 1 ! Inefficient row-major storage order END DO ! (rightmost subscript varies fastest) END DO . . . END PROGRAM

Since J varies the fastest and is the second array subscript in the expression X (I,J), the array is accessed in row-major order.
To make the array accessed in natural column-major order, examine the array algorithm and data being modified.
Using arrays X and Y, the array can be accessed in natural column-major order by changing the nesting order of the DO loops so the innermost loop variable corresponds to the leftmost array dimension:

INTEGER X(3,5), Y(3,5), I, J Y = 0 DO J=1,5 ! J outer loop varies slowest DO I=1,3 ! I inner loop varies fastest X (I,J) = Y(I,J) + 1 ! Efficient column-major storage order END DO ! (leftmost subscript varies fastest) END DO . . . END PROGRAM

The DIGITAL Fortran 90 whole array access (X = Y + 1) uses efficient column major order. However, if the application requires that J vary the fastest or if you cannot modify the loop order without changing the results, consider modifying the application program to use a rearranged order of array dimensions. Program modifications include rearranging the order of:

Dimensions in the declaration of the arrays X(5,3) and Y(5,3)
The assignment of X(J,I) and Y(J,I) within the DO loops
All other references to arrays X and Y

In this case, the original DO loop nesting is used where J is the innermost loop:

INTEGER X(5,3), Y(5,3), I, J Y = 0 DO I=1,3 ! I outer loop varies slowest DO J=1,5 ! J inner loop varies fastest X (J,I) = Y(J,I) + 1 ! Efficient column-major storage order END DO ! (leftmost subscript varies fastest) END DO . . . END PROGRAM

Code written to access multidimensional arrays in row-major order (like C) or random order can often make inefficient use of the CPU memory cache. For more information on using natural storage order during record I/O operations, see Section 5.5.3.

Use the available Fortran 90 array intrinsic procedures rather than create your own.
Whenever possible, use Fortran 90 array intrinsic procedures instead of creating your own routines to accomplish the same task. Fortran 90 array intrinsic procedures are designed for efficient use with the various DIGITAL Fortran 90 run-time components.
Using the standard-conforming array intrinsics can also make your program more portable.
With multidimensional arrays where access to array elements will be noncontiguous, avoid left-most array dimensions that are a power of two (such as 256, 512).
Since the cache sizes are a power of two, array dimensions that are also a power of two may make inefficient use of cache when array access is noncontiguous. If the cache size is an exact multiple of the leftmost dimension, your program will probably make little use of the cache. This does not apply to contiguous sequential access or whole array access.
One work-around is to increase the dimension to allow some unused elements, making the leftmost dimension larger than actually needed. For example, increasing the leftmost dimension of A from 512 to 520 would make better use of cache:
REAL A (512,100) DO I = 2,511 DO J = 2,99 A(I,J)=(A(I+1,J-1) + A(I-1, J+1)) * 0.5 END DO END DO
In this code, array A has a leftmost dimension of 512, a power of two. The innermost loop accesses the rightmost dimension (row major), causing inefficient access. Increasing the leftmost dimension of A to 520 (REAL A (520,100)) allows the loop to provide better performance, but at the expense of some unused elements.
Because loop index variables I and J are used in the calculation, changing the nesting order of the DO loops changes the results.
For More Information:

On arrays and their data declaration statements, see the DIGITAL Fortran Language Reference Manual.

5.4.2 Passing Array Arguments Efficiently

In Fortran 90, there are two general types of array arguments:

Explicit-shape arrays used with FORTRAN 77.
These arrays have a fixed rank and extent that is known at compile time. Other dummy argument (receiving) arrays that are not deferred-shape (such as assumed-size arrays) can be grouped with explicit-shape array arguments.
Deferred-shape arrays introduced with Fortran 90.
Types of deferred-shape arrays include array pointers and allocatable arrays. Assumed-shape array arguments generally follow the rules about passing deferred-shape array arguments.

When passing arrays as arguments, either the starting (base) address of the array or the address of an array descriptor is passed:

When using explicit-shape (or assumed-size) arrays to receive an array, the starting address of the array is passed.
When using deferred-shape or assumed-shape arrays to receive an array, the address of the array descriptor is passed (the compiler creates the array descriptor).

Passing an assumed-shape array or array pointer to an explicit-shape array can slow run-time performance. This is because the compiler needs to create an array temporary for the entire array. The array temporary is created because the passed array may not be contiguous and the receiving (explicit-shape) array requires a contiguous array. When an array temporary is created, the size of the passed array determines whether the impact on slowing run-time performance is slight or severe.

Table 5-3 summarizes what happens with the various combinations of array types. The amount of run-time performance inefficiency depends on the size of the array.

Table 5-3 Output Argument Array Types
Input Arguments Array Types Explicit-Shape Arrays Deferred-Shape and Assumed-Shape Arrays

Explicit-Shape Arrays Very efficient. Does not use an array temporary. Does not pass an array descriptor. Interface block optional. Efficient. Only allowed for assumed-shape arrays (not deferred-shape arrays). Does not use an array temporary. Passes an array descriptor. Requires an interface block.

Deferred-Shape and Assumed-Shape Arrays When passing an allocatable array, very efficient. Does not use an array temporary. Does not pass an array descriptor. Interface block optional.
When not passing an allocatable array, not efficient. Instead use allocatable arrays whenever possible.
Uses an array temporary. Does not pass an array descriptor. Interface block optional.
Efficient. Requires an assumed-shape or array pointer as dummy argument. Does not use an array temporary. Passes an array descriptor. Requires an interface block.

**Table 5-3 Output Argument Array Types**
Input Arguments Array Types	Explicit-Shape Arrays	Deferred-Shape and Assumed-Shape Arrays
Explicit-Shape Arrays	Very efficient. Does not use an array temporary. Does not pass an array descriptor. Interface block optional.	Efficient. Only allowed for assumed-shape arrays (not deferred-shape arrays). Does not use an array temporary. Passes an array descriptor. Requires an interface block.
Deferred-Shape and Assumed-Shape Arrays	When passing an allocatable array, very efficient. Does not use an array temporary. Does not pass an array descriptor. Interface block optional. When not passing an allocatable array, not efficient. Instead use allocatable arrays whenever possible. Uses an array temporary. Does not pass an array descriptor. Interface block optional.	Efficient. Requires an assumed-shape or array pointer as dummy argument. Does not use an array temporary. Passes an array descriptor. Requires an interface block.

5.5 Improving Overall I/O Performance

Improving overall I/O performance can minimize both device I/O and actual CPU time. The techniques listed in this section can greatly improve performance in many applications.

A bottleneck limits the maximum speed of execution by being the slowest process in an executing program. In some programs, I/O is the bottleneck that prevents an improvement in run-time performance. The key to relieving I/O bottlenecks is to reduce the actual amount of CPU and I/O device time involved in I/O. Bottlenecks may be caused by one or more of the following:

A dramatic reduction in CPU time without a corresponding improvement in I/O time results in an I/O bottleneck.
By such coding practices as:
- Unnecessary formatting of data and other CPU-intensive processing
- Unnecessary transfers of intermediate results
- Inefficient transfers of small amounts of data
- Application requirements

Improved coding practices can minimize actual device I/O, as well as the actual CPU time.

DIGITAL offers software solutions to system-wide problems like minimizing device I/O delays (see Section 5.1.1).

5.5.1 Use Unformatted Files Instead of Formatted Files

Use unformatted files whenever possible. Unformatted I/O of numeric data is more efficient and more precise than formatted I/O. Native unformatted data does not need to be modified when transferred and will take up less space on an external file.

Conversely, when writing data to formatted files, formatted data must be converted to character strings for output, less data can transfer in a single operation, and formatted data may lose precision if read back into binary form.

To write the array A(25,25) in the following statements, S₁ is more efficient than S₂:

S₁ WRITE (7) A S₂ WRITE (7,100) A 100 FORMAT (25(' ',25F5.21))

Although formatted data files are more easily ported to other systems, DIGITAL Fortran 90 can convert unformatted data in several formats (see Chapter 10).

5.5.2 Write Whole Arrays or Strings

The general guidelines about array use discussed in Section 5.4 also apply to reading or writing an array with an I/O statement.

To eliminate unnecessary overhead, write whole arrays or strings at one time rather than individual elements at multiple times. Each item in an I/O list generates its own calling sequence. This processing overhead becomes most significant in implied-DO loops. When accessing whole arrays, use the array name (Fortran 90 array syntax) instead of using implied-DO loops.

5.5.3 Write Array Data in the Natural Storage Order

Use the natural ascending storage order whenever possible. This is column-major order, with the leftmost subscript varying fastest and striding by 1 (see Section 5.4). If a program must read or write data in any other order, efficient block moves are inhibited.

If the whole array is not being written, natural storage order is the best order possible.

If you must use an unnatural storage order, in certain cases it might be more efficient to transfer the data to memory and reorder the data before performing the I/O operation.

5.5.4 Use Memory for Intermediate Results

Performance can improve by storing intermediate results in memory rather than storing them in a file on a peripheral device. One situation that may not benefit from using intermediate storage is when there is a disproportionately large amount of data in relation to physical memory on your system. Excessive page faults can dramatically impede virtual memory performance.

If you are primarily concerned with the CPU performance of the system, consider using a memory file system (mfs) virtual disk to hold any files your code reads or writes (see mfs(1)).

5.5.5 Enable Implied-DO Loop Collapsing

DO loop collapsing reduces a major overhead in I/O processing. Normally, each element in an I/O list generates a separate call to the DIGITAL Fortran 90 RTL. The processing overhead of these calls can be most significant in implied-DO loops.

DIGITAL Fortran 90 reduces the number of calls in implied-DO loops by replacing up to seven nested implied-DO loops with a single call to an optimized run-time library I/O routine. The routine can transmit many I/O elements at once.

Loop collapsing can occur in formatted and unformatted I/O, but only if certain conditions are met:

The control variable must be an integer. The control variable cannot be a dummy argument or contained in an EQUIVALENCE or VOLATILE statement. DIGITAL Fortran 90 must be able to determine that the control variable does not change unexpectedly at run time.
The format must not contain a variable format expression.

For More Information:

On VOLATILE attribute and statement, see the DIGITAL Fortran Language Reference Manual.
On loop optimizations, see Section 5.7.

5.5.6 Use of Variable Format Expressions

Variable format expressions (a DIGITAL Fortran extension) are almost as flexible as run-time formatting, but they are more efficient because the compiler can eliminate run-time parsing of the I/O format. Only a small amount of processing and the actual data transfer are required during run time.

On the other hand, run-time formatting can impair performance significantly. For example, in the following statements, S₁ is more efficient than S₂ because the formatting is done once at compile time, not at run time:

S₁ WRITE (6,400) (A(I), I=1,N) 400 FORMAT (1X, <N> F5.2) . . . S₂ WRITE (CHFMT,500) '(1X,',N,'F5.2)' 500 FORMAT (A,I3,A) WRITE (6,FMT=CHFMT) (A(I), I=1,N)

5.5.7 Efficient Use of Record Buffers and Disk I/O

Records being read or written are transferred between the user's program buffers and one or more disk block I/O buffers, which are established when the file is opened by the DIGITAL Fortran 90 RTL. Unless very large records are being read or written, multiple logical records can reside in the disk block I/O buffer when it is written to disk or read from disk, minimizing physical disk I/O.

You can specify the size of the disk block I/O buffer by using the OPEN statement BLOCKSIZE specifier; the default size can be obtained from fstat(2). If you omit the BLOCKSIZE specifier in the OPEN statement, it is set for optimal I/O use with the type of device the file resides on.

The default for BUFFERCOUNT is 1. Any experiments to improve I/O performance should increase the BUFFERCOUNT value and not the BLOCKSIZE value, to increase the amount of data read by each disk I/O.

When writing records, be aware that I/O records are written to unified buffer cache (UBC) system buffers. To request that I/O records be written from program buffers to the UBC system buffers, use the flush library routine (see flush(3f) and Chapter 12). Be aware that calling flush also discards read-ahead data in user buffer.

To request that UBC system buffers be written to disk, use the fsync library routine (see fsync(3f) and Chapter 12).

When UBC buffers are written to disk depends on UBC characteristics on the system, such as the vm-ubcbuffers attribute (see the DIGITAL UNIX System Tuning and Performance guide).

5.5.8 Specify RECL

The sum of the record length (RECL specifier in an OPEN statement) and its overhead is a multiple or divisor of the blocksize, which is device specific. For example, if the BLOCKSIZE is 8192 then RECL might be 24576 (a multiple of 3) or 1024 (a divisor of 8).

The RECL value should fill blocks as close to capacity as possible (but not over capacity). Such values allow efficient moves, with each operation moving as much data as possible; the least amount of space in the block is wasted. Avoid using values larger than the block capacity, because they create very inefficient moves for the excess data only slightly filling a block (allocating extra memory for the buffer and writing partial blocks are inefficient).

The RECL value unit for formatted files is always 1-byte units. For unformatted files, the RECL unit is 4-byte units, unless you specify the -assume byterecl option to request 1-byte units (see Section 3.5).

When porting unformatted data files from non-DIGITAL systems, see Section 10.4.5.

5.5.9 Use the Optimal Record Type

Unless a certain record type is needed for portability reasons (see Section 7.4.3), choose the most efficient type, as follows:

For sequential files of a consistent record size, the fixed-length record type gives the best performance.
For sequential unformatted files when records are not fixed in size, the variable-length record type gives the best performance---particularly for BACKSPACE operations.
For sequential formatted files when records are not fixed in size, the Stream_LF record type gives the best performance.

For More Information:

On DIGITAL Fortran 90 data files and I/O, see Chapter 7.
On OPEN statement specifiers and defaults, see Section 7.5 and the DIGITAL Fortran Language Reference Manual.

5.6 Additional Source Code Guidelines for Run-Time Efficiency

Other source coding guidelines can be implemented to improve run-time performance.

The amount of improvement in run-time performance is related to the number of times a statement is executed. For example, improving an arithmetic expression executed within a loop many times has the potential to improve performance, more than improving a similar expression executed once outside a loop.

5.6.1 Avoid Small Integer and Small Logical Data Items

Avoid using integer or logical data less than 32 bits, because the smallest unit of efficient access on Alpha systems is 32 bits.

Accessing a 16-bit (or 8-bit) data type can result in a sequence of machine instructions to access the data, rather than a single, efficient machine instruction for a 32-bit data item.

To minimize data storage and memory cache misses with arrays, use 32-bit data rather than 64-bit data, unless you require the greater numeric range of 8-byte integers or the greater range and precision of double precision floating-point numbers.

5.6.2 Avoid Mixed Data Type Arithmetic Expressions

Avoid mixing integer and floating-point (REAL) data in the same computation. Expressing all numbers in a floating-point arithmetic expression (assignment statement) as floating-point values eliminates the need to convert data between fixed and floating-point formats. Expressing all numbers in an integer arithmetic expression as integer values also achieves this. This improves run-time performance.

For example, assuming that I and J are both INTEGER variables, expressing a constant number (2.) as an integer value (2) eliminates the need to convert the data:

Original Code: INTEGER I, J
I = J / 2.

Efficient Code: INTEGER I, J
I = J / 2

For applications with numerous floating-point operations, consider using the -fp_reorder option (see Section 5.8.9) if a small difference in the result is acceptable.

You can use different sizes of the same general data type in an expression with minimal or no effect on run-time performance. For example, using REAL, DOUBLE PRECISION, and COMPLEX floating-point numbers in the same floating-point arithmetic expression has minimal or no effect on run-time performance.

Contents

Index