Compaq Fortran
User Manual for
Tru64 UNIX and Linux Alpha Systems



5.5 Improving Overall I/O Performance

Improving overall I/O performance can minimize both device I/O and actual CPU time. The techniques listed in this section can greatly improve performance in many applications.

A bottleneck limits the maximum speed of execution by being the slowest process in an executing program. In some programs, I/O is the bottleneck that prevents an improvement in run-time performance. The key to relieving I/O bottlenecks is to reduce the actual amount of CPU and I/O device time involved in I/O. Bottlenecks may be caused by one or more of the following:

Improved coding practices can minimize actual device I/O, as well as the actual CPU time.

Compaq offers software solutions to system-wide problems like minimizing device I/O delays (see Section 5.1.1).

5.5.1 Use Unformatted Files Instead of Formatted Files

Use unformatted files whenever possible. Unformatted I/O of numeric data is more efficient and more precise than formatted I/O. Native unformatted data does not need to be modified when transferred and will take up less space on an external file.

Conversely, when writing data to formatted files, formatted data must be converted to character strings for output, less data can transfer in a single operation, and formatted data may lose precision if read back into binary form.

To write the array A(25,25) in the following statements, S1 is more efficient than S2:


S1         WRITE (7) A 
 
S2         WRITE (7,100) A 
     100   FORMAT (25(' ',25F5.2)) 

Although formatted data files are more easily ported to other systems, Compaq Fortran can convert unformatted data in several formats (see Chapter 10).
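The precision point can be checked with a small round-trip experiment. The following is a minimal sketch, not taken from the manual: the module name roundtrip_demo, the function names, the use of unit 7 with a scratch file, and the F5.2 edit descriptor are all assumptions chosen for illustration. Each function writes an array, reads it back, and returns the largest difference from the original values.

```fortran
module roundtrip_demo
  implicit none
contains

  ! Unformatted round trip: the binary values are transferred unchanged,
  ! so the values read back are identical to the values written.
  function unformatted_error(a) result(err)
    real, intent(in) :: a(:,:)
    real :: err
    real :: b(size(a,1), size(a,2))
    open (unit=7, form='UNFORMATTED', status='SCRATCH')
    write (7) a
    rewind (7)
    read (7) b
    close (7)
    err = maxval(abs(a - b))
  end function unformatted_error

  ! Formatted round trip through F5.2: each value is converted to a
  ! 5-character string and back, so digits beyond 2 decimals are lost.
  ! (Assumes every value fits in the 5-character field.)
  function formatted_error(a) result(err)
    real, intent(in) :: a(:,:)
    real :: err
    real :: b(size(a,1), size(a,2))
    open (unit=7, form='FORMATTED', status='SCRATCH')
    write (7, 100) a
    rewind (7)
    read (7, 100) b
    close (7)
    err = maxval(abs(a - b))
100 format (25F5.2)
  end function formatted_error

end module roundtrip_demo
```

For data that carries more than two decimal digits, unformatted_error should return zero while formatted_error returns a small nonzero value, which is the loss of precision described above.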

5.5.2 Write Whole Arrays or Strings

The general guidelines about array use discussed in Section 5.4 also apply to reading or writing an array with an I/O statement.

To eliminate unnecessary overhead, write whole arrays or strings at one time rather than individual elements at multiple times. Each item in an I/O list generates its own calling sequence. This processing overhead becomes most significant in implied-DO loops. When accessing whole arrays, use the array name (Fortran 95/90 array syntax) instead of using implied-DO loops.
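As a hedged illustration of the two coding styles, the sketch below writes the same array twice to an unformatted file. The module name whole_array_demo and the subroutine names are hypothetical; the unit is assumed to be already open for unformatted sequential access. Both routines produce the same record contents; they differ only in how many I/O list items the statement presents.

```fortran
module whole_array_demo
  implicit none
contains

  ! Implied-DO form: each A(I,J) is a separate I/O list item
  ! (unless the compiler collapses the loops, see Section 5.5.5).
  subroutine write_by_element(lun, a)
    integer, intent(in) :: lun
    real, intent(in) :: a(:,:)
    integer :: i, j
    write (lun) ((a(i,j), i = 1, size(a,1)), j = 1, size(a,2))
  end subroutine write_by_element

  ! Whole-array form: the array name is the only I/O list item.
  subroutine write_whole(lun, a)
    integer, intent(in) :: lun
    real, intent(in) :: a(:,:)
    write (lun) a
  end subroutine write_whole

end module whole_array_demo
```

The whole-array form is also shorter and does not depend on the loop bounds being kept in step with the array declaration.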

5.5.3 Write Array Data in the Natural Storage Order

Use the natural ascending storage order whenever possible. This is column-major order, with the leftmost subscript varying fastest and striding by 1 (see Section 5.4). If a program must read or write data in any other order, efficient block moves are inhibited.

If the whole array is not being written, natural storage order is the best order possible.

If you must use an unnatural storage order, in certain cases it might be more efficient to transfer the data to memory and reorder the data before performing the I/O operation.
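One way to reorder in memory first is sketched below, assuming the file must contain the array row by row (rightmost subscript varying fastest). The module and subroutine names are hypothetical, and the unit is assumed to be open for unformatted sequential access. The routine builds a transposed copy in memory and then transfers it with a single whole-array write, so the I/O statement itself still uses natural storage order.

```fortran
module storage_order_demo
  implicit none
contains

  ! Write A in row order by transposing in memory, then writing the
  ! transposed copy as one whole array in its natural storage order.
  subroutine write_rows_via_transpose(lun, a)
    integer, intent(in) :: lun
    real, intent(in) :: a(:,:)
    real :: at(size(a,2), size(a,1))   ! temporary: costs extra memory
    at = transpose(a)
    write (lun) at
  end subroutine write_rows_via_transpose

end module storage_order_demo
```

Whether this is faster than writing the elements in row order directly depends on the array size and available memory, so it is worth timing both forms.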

5.5.4 Use Memory for Intermediate Results

Performance can improve by storing intermediate results in memory rather than storing them in a file on a peripheral device. One situation that may not benefit from using intermediate storage is when there is a disproportionately large amount of data in relation to physical memory on your system. Excessive page faults can dramatically impede virtual memory performance.

If you are primarily concerned with the CPU performance of the system, consider using a memory file system (mfs) virtual disk to hold any files your code reads or writes (see mfs(1)).

5.5.5 Enable Implied-DO Loop Collapsing

DO loop collapsing reduces a major overhead in I/O processing. Normally, each element in an I/O list generates a separate call to the Compaq Fortran RTL. The processing overhead of these calls can be most significant in implied-DO loops.

Compaq Fortran reduces the number of calls in implied-DO loops by replacing up to seven nested implied-DO loops with a single call to an optimized run-time library I/O routine. The routine can transmit many I/O elements at once.

Loop collapsing can occur in formatted and unformatted I/O, but only if certain conditions are met:


5.5.6 Use of Variable Format Expressions

Variable format expressions (a Compaq Fortran extension) are almost as flexible as run-time formatting, but they are more efficient because the compiler can eliminate run-time parsing of the I/O format. Only a small amount of processing and the actual data transfer are required during run time.

On the other hand, run-time formatting can impair performance significantly. For example, in the following statements, S1 is more efficient than S2 because the formatting is done once at compile time, not at run time:


S1         WRITE (6,400) (A(I), I=1,N) 
     400   FORMAT (1X, <N> F5.2) 
                         .
                         .
                         .
S2         WRITE (CHFMT,500) '(1X,',N,'F5.2)' 
     500   FORMAT (A,I3,A) 
           WRITE (6,FMT=CHFMT) (A(I), I=1,N) 

5.5.7 Efficient Use of Record Buffers and Disk I/O

Records being read or written are transferred between the user's program buffers and one or more disk block I/O buffers, which are established when the file is opened by the Compaq Fortran RTL. Unless very large records are being read or written, multiple logical records can reside in the disk block I/O buffer when it is written to disk or read from disk, minimizing physical disk I/O.

You can specify the size of the disk block I/O buffer by using the OPEN statement BLOCKSIZE specifier; the default size can be obtained from fstat(2). If you omit the BLOCKSIZE specifier in the OPEN statement, it is set for optimal I/O use with the type of device the file resides on.

The default for BUFFERCOUNT is 1. Any experiments to improve I/O performance should increase the BUFFERCOUNT value and not the BLOCKSIZE value, to increase the amount of data read by each disk I/O.

When writing records, be aware that I/O records are written to unified buffer cache (UBC) system buffers. To request that I/O records be written from program buffers to the UBC system buffers, use the flush library routine (see flush(3f) and Chapter 12). Be aware that calling flush also discards any read-ahead data in the user buffer.

To request that UBC system buffers be written to disk, use the fsync library routine (see fsync(3f) and Chapter 12).

When UBC buffers are written to disk depends on UBC characteristics on the system, such as the vm-ubcbuffers attribute (see the Compaq Tru64 UNIX System Tuning and Performance guide).

The BUFFERED= keyword is part of the OPEN and INQUIRE statements. The default value of BUFFERED='NO' for all input/output causes the runtime library to empty its internal buffer for each WRITE statement. If BUFFERED='YES' and the device is a disk opened for output, then the runtime library fills its internal buffer, possibly requiring many WRITE statements, before emptying it.

If the OPEN statement has BUFFERCOUNT and BLOCKSIZE arguments, then the internal buffer size in bytes is the product of these arguments. If the OPEN statement does not have these arguments, then the default internal buffer size is 8192 bytes. This internal buffer will grow to hold the largest single record, but it will never shrink.

5.5.8 Specify RECL

The sum of the record length (RECL specifier in an OPEN statement) and its overhead should be a multiple or divisor of the block size, which is device specific. For example, if the BLOCKSIZE is 8192, then RECL might be 24576 (3 times the block size) or 1024 (one-eighth of the block size).

The RECL value should fill blocks as close to capacity as possible (but not over capacity). Such values allow efficient moves, with each operation moving as much data as possible and wasting the least amount of space in the block. Avoid values larger than the block capacity: the excess data only slightly fills another block, which makes for very inefficient moves (allocating extra memory for the buffer and writing partial blocks are both inefficient).

For formatted files, the RECL value is always expressed in 1-byte units. For unformatted files, the RECL value is expressed in 4-byte units, unless you specify the -assume byterecl option to request 1-byte units (see Section 3.6).
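The arithmetic behind these guidelines can be sketched as two small helper functions. These helpers are hypothetical and not part of Compaq Fortran: fits_block tests whether a total byte count (record length plus overhead, both supplied by the caller, since the overhead is not stated here) is a multiple or divisor of the block size, and recl_units converts a byte count to the value to pass as RECL for an unformatted file.

```fortran
module recl_demo
  implicit none
contains

  ! True if total_bytes is a multiple or a divisor of blocksize.
  logical function fits_block(total_bytes, blocksize)
    integer, intent(in) :: total_bytes, blocksize
    fits_block = .false.
    if (total_bytes <= 0 .or. blocksize <= 0) return
    fits_block = mod(total_bytes, blocksize) == 0 .or. &
                 mod(blocksize, total_bytes) == 0
  end function fits_block

  ! RECL value for an unformatted file holding nbytes per record:
  ! 4-byte units by default, 1-byte units when byterecl is true
  ! (that is, when compiling with -assume byterecl).
  integer function recl_units(nbytes, byterecl)
    integer, intent(in) :: nbytes
    logical, intent(in) :: byterecl
    if (byterecl) then
      recl_units = nbytes
    else
      recl_units = (nbytes + 3) / 4   ! round up to whole 4-byte units
    end if
  end function recl_units

end module recl_demo
```

With a block size of 8192, fits_block accepts 24576 and 1024 and rejects 5000; a 1024-byte unformatted record corresponds to RECL=256 in the default 4-byte units.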

When porting unformatted data files from non-Compaq systems, see Section 10.4.6.

5.5.9 Use the Optimal Record Type

Unless a certain record type is needed for portability reasons (see Section 7.4.3), choose the most efficient type, as follows:


5.6 Additional Source Code Guidelines for Run-Time Efficiency

Other source coding guidelines can be implemented to improve run-time performance.

The amount of improvement in run-time performance is related to the number of times a statement is executed. For example, improving an arithmetic expression that is executed many times within a loop has more potential to improve performance than improving a similar expression that is executed once outside a loop.

5.6.1 Avoid Small Integer and Small Logical Data Items

Avoid using integer or logical data smaller than 32 bits, because the smallest unit of efficient access on Alpha systems is 32 bits.

Accessing a 16-bit (or 8-bit) data type can result in a sequence of machine instructions to access the data, rather than a single, efficient machine instruction for a 32-bit data item.

To minimize data storage and memory cache misses with arrays, use 32-bit data rather than 64-bit data, unless you require the greater numeric range of 8-byte integers or the greater range and precision of double precision floating-point numbers.

5.6.2 Avoid Mixed Data Type Arithmetic Expressions

Avoid mixing integer and floating-point (REAL) data in the same computation. Expressing all numbers in a floating-point arithmetic expression (assignment statement) as floating-point values eliminates the need to convert data between fixed and floating-point formats. Expressing all numbers in an integer arithmetic expression as integer values also achieves this. This improves run-time performance.

For example, assuming that I and J are both INTEGER variables, expressing a constant number (2.) as an integer value (2) eliminates the need to convert the data:

Original Code:

      INTEGER I, J 
      I = J / 2. 

Efficient Code:

      INTEGER I, J 
      I = J / 2 

For applications with numerous floating-point operations, consider using the -fp_reorder option (see Section 5.8.9) if a small difference in the result is acceptable.

You can use different sizes of the same general data type in an expression with minimal or no effect on run-time performance. For example, using REAL, DOUBLE PRECISION, and COMPLEX floating-point numbers in the same floating-point arithmetic expression has minimal or no effect on run-time performance.

5.6.3 Use Efficient Data Types

In cases where more than one data type can be used for a variable, consider selecting the data types based on the following hierarchy, listed from most to least efficient:

However, keep in mind that in an arithmetic expression, you should avoid mixing integer and floating-point (REAL) data (see Section 5.6.2).

5.6.4 Avoid Using Slow Arithmetic Operators

Before you modify source code to avoid slow arithmetic operators, be aware that optimizations convert many slow arithmetic operators to faster arithmetic operators. For example, the compiler optimizes the expression H=J**2 to be H=J*J.

Consider also whether replacing a slow arithmetic operator with a faster arithmetic operator will change the accuracy of the results or impact the maintainability (readability) of the source code.

Replacing slow arithmetic operators with faster ones should be reserved for critical code areas. The following hierarchy lists the Compaq Fortran arithmetic operators, from fastest to slowest:

5.6.5 Avoid EQUIVALENCE Statement Use

Avoid using EQUIVALENCE statements. EQUIVALENCE statements can:

5.6.6 Use Statement Functions and Internal Subprograms

Whenever the Compaq Fortran compiler has access to the use and definition of a subprogram during compilation, it may choose to inline the subprogram. Using statement functions and internal subprograms maximizes the number of subprogram references that will be inlined, especially when multiple source files are compiled together at optimization level -O4 or higher.

For more information, see Section 5.1.2.
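A minimal sketch of an internal subprogram follows; the names inline_demo, sum_of_squares, and square are hypothetical. Because square is defined inside sum_of_squares, the compiler sees both the definition and every reference in the same compilation, which makes the reference a candidate for inlining. Whether it is actually inlined remains the compiler's decision.

```fortran
module inline_demo
  implicit none
contains

  function sum_of_squares(x) result(total)
    real, intent(in) :: x(:)
    real :: total
    integer :: i
    total = 0.0
    do i = 1, size(x)
      total = total + square(x(i))   ! reference to the internal function
    end do
  contains
    ! Internal subprogram: visible only inside sum_of_squares.
    pure function square(v) result(r)
      real, intent(in) :: v
      real :: r
      r = v*v
    end function square
  end function sum_of_squares

end module inline_demo
```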

5.6.7 Code DO Loops for Efficiency

Minimize the arithmetic operations and other operations in a DO loop whenever possible. Moving unnecessary operations outside the loop will improve performance (for example, when the intermediate nonvarying values within the loop are not needed).
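The following sketch shows one such change by hand, using hypothetical names. In the first routine the expression SQRT(A*A + B*B) does not vary with the loop index but is written inside the loop; in the second it is computed once before the loop. An optimizing compiler may perform this transformation itself, so the manual rewrite matters most where the compiler cannot prove that the value is invariant.

```fortran
module loop_demo
  implicit none
contains

  ! Invariant expression written inside the loop.
  subroutine normalize_in_loop(x, a, b)
    real, intent(inout) :: x(:)
    real, intent(in) :: a, b
    integer :: i
    do i = 1, size(x)
      x(i) = x(i) / sqrt(a*a + b*b)
    end do
  end subroutine normalize_in_loop

  ! Invariant expression moved out of the loop and computed once.
  subroutine normalize_hoisted(x, a, b)
    real, intent(inout) :: x(:)
    real, intent(in) :: a, b
    real :: norm
    integer :: i
    norm = sqrt(a*a + b*b)
    do i = 1, size(x)
      x(i) = x(i) / norm
    end do
  end subroutine normalize_hoisted

end module loop_demo
```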


5.7 Optimization Levels: the -On Option

Compaq Fortran performs many optimizations by default. You do not have to recode your program to use them. However, understanding how optimizations work helps you remove any inhibitors to their successful function.

Generally, Compaq Fortran increases compile time in favor of decreasing run time. If an operation can be performed, eliminated, or simplified at compile time, Compaq Fortran does so, rather than have it done at run time. The time required to compile the program usually increases as more optimizations occur.

The program will likely execute faster when compiled at -O4, but will require more compilation time than if you compile the program at a lower level of optimization.

The size of the object file varies with the optimizations requested. Factors that can increase object file size include an increase of loop unrolling or procedure inlining.

Table 5-4 lists the levels of Compaq Fortran optimization with different -O options. For example: -O0 specifies no selectable optimizations (some optimizations always occur); -O5 specifies all levels of optimizations, including loop transformation and software pipelining.

Table 5-4 Levels of Optimization with Different -On Options

Optimization Type                             -O0  -O1  -O2  -O3  -O4  -O5
Loop transformation and software pipelining                             X
Automatic inlining                                                 X    X
Additional global optimizations                               X    X    X
Global optimizations                                     X    X    X    X
Local (minimal) optimizations                       X    X    X    X    X

The default is -O4 (same as -O). However, if -g2, -g, or -gen_feedback is also specified, the default is -O0 (no optimizations).

In Table 5-4, the following terms are used to describe the levels of optimization (described in detail in Section 5.7.1 to Section 5.7.6):

5.7.1 Optimizations Performed at All Optimization Levels

The following optimizations occur at any optimization level (-O0 through -O5):

5.7.2 Local (Minimal) Optimizations

To enable local optimizations, use -O1 or a higher optimization level (-O2, -O3, -O4, or -O5).

To prevent local optimizations, specify the -O0 option.

5.7.2.1 Common Subexpression Elimination

If the same subexpressions appear in more than one computation and the values do not change between computations, Compaq Fortran computes the result once and replaces the subexpressions with the result itself:


DIMENSION A(25,25), B(25,25) 
A(I,J) = B(I,J) 

Without optimization, these statements can be compiled as follows:


t1 = ((J-1)*25+(I-1))*4 
t2 = ((J-1)*25+(I-1))*4 
A(t1) = B(t2) 

Variables t1 and t2 represent equivalent expressions. Compaq Fortran eliminates this redundancy by producing the following:


t = ((J-1)*25+(I-1))*4 
A(t) = B(t) 

5.7.2.2 Integer Multiplication and Division Expansion

Expansion of multiplication and division refers to bit shifts that allow faster multiplication and division while producing the same result. For example, the integer expression (I*17) can be calculated as I with a 4-bit shift plus the original value of I. This can be expressed using the Compaq Fortran ISHFT intrinsic function:


J1 = I*17 
J2 = ISHFT(I,4) + I     ! equivalent expression for I*17 

The optimizer uses machine code that, like the ISHFT intrinsic function, shifts bits to expand multiplication and division by literals.
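The equivalence can be confirmed at the source level with a short function. This is an illustrative sketch only; the module name shift_demo and the function name times17_shift are hypothetical, and the compiler performs the equivalent transformation in machine code without any source change. The result matches I*17 for values of I that do not overflow the integer range.

```fortran
module shift_demo
  implicit none
contains

  ! I*17 computed as (I shifted left by 4 bits, that is I*16) plus I.
  elemental integer function times17_shift(i)
    integer, intent(in) :: i
    times17_shift = ishft(i, 4) + i
  end function times17_shift

end module shift_demo
```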

