Compaq Fortran
User Manual for
Tru64 UNIX and Linux Alpha Systems

5.1.3 Process Shell Environment and Related Influences on Performance

Certain shell commands and system tuning can improve run-time performance:

Specify adequate process limits and do system tuning.
Especially when compiling or running large programs, check to make sure that process limits are adequate.
With the C shell (csh), use the limit command to display the limits of your process and increase specified limits. For more information, see csh(1).
With the Bourne, Korn, and bash (L*X ONLY) shells, use the ulimit command to display the limits of your process and increase specified limits. For more information, see sh(1) (Bourne shell), ksh(1) (Korn shell), or bash(1) (bash shell) (L*X ONLY).
Your system manager can tune the system for efficient use. For example, to monitor system use during program execution or compilation, a system manager can use vmstat.
For more information on system tuning, see your operating system documentation.
Redirect scrolled text.
For programs that display a lot of text, consider redirecting text that is usually displayed on stdout to a file. Displaying a lot of text will slow down execution; scrolling text in a terminal window on a workstation can cause an I/O bottleneck (increased elapsed time) and use some CPU time.
The following commands show how to run the program more efficiently by redirecting output to a file and then displaying the program output:
% myprog > results.lis
% more results.lis
When compiling a program that contains a substantial amount of C language code, be aware that you can specify most cc options on the f90 command line, including several that can improve performance. You can also compile C code using the cc -c option, and then use the f90 command to compile and link the Compaq Fortran source files with the C language object files.
Recall from Chapter 2 and Chapter 3 that the f90 and cc commands invoke the Compaq Fortran compiler and Compaq C compiler, respectively, on Tru64 UNIX Alpha systems. The corresponding commands on Linux Alpha systems are fort and ccc.

On system tuning and cc options related to performance, see your operating system documentation and the appropriate reference pages.
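
As a rough illustration of the process-limit check described above, the following Bourne-family shell sketch displays the soft and hard values of one limit and then raises the soft value to the hard value. The helper names show_limit and raise_to_hard are hypothetical, and the sketch assumes a shell whose ulimit command accepts the -S and -H options (ksh and bash do). The resource flags that are available, such as -s for stack size, vary by system.

```shell
# Hypothetical helper: print the soft and hard values of one limit.
# $1 is a ulimit resource flag, for example -s (stack) or -d (data segment).
show_limit() {
    printf 'soft=%s hard=%s\n' "$(ulimit -S "$1")" "$(ulimit -H "$1")"
}

# Hypothetical helper: raise the soft limit to the current hard limit.
# Only a privileged user can raise the hard limit itself.
raise_to_hard() {
    ulimit -S "$1" "$(ulimit -H "$1")"
}

show_limit -s
raise_to_hard -s || echo "could not raise the stack limit" >&2
show_limit -s
```

The change applies only to the current shell and the processes it starts, so run it in the same shell session that compiles or runs the program.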

5.2 Analyzing Program Performance

This section describes how you can:

Analyze program performance using shell commands like time (Section 5.2.1)
Analyze program performance using profiling tools prof, gprof, and pixie (TU*X ONLY) (Section 5.2.2)
Use feedback files and optionally cord to provide feedback for a subsequent compilation (Section 5.2.3)

Before you analyze program performance, make sure any errors you might have encountered during the early stages of program development have been corrected.

For information about parallel profiling techniques and the pprof profiler on Tru64 UNIX systems, see the Compaq Parallel Software Environment documentation.

5.2.1 Use the time Command to Measure Performance

Use the time command to provide information about program performance.

Run program timings when other users are not active. Your timing results can be affected by one or more CPU-intensive processes that run while you are doing your timings.

Try to run the program under the same conditions each time to provide the most accurate results, especially when comparing execution times with those of a previous version of the same program. Use the same system (CPU model, amount of memory, version of the operating system, and so on) if possible.

If you do need to change systems, you should measure the time using the same version of the program on both systems, so you know each system's effect on your timings.

For programs that run for less than a few seconds, run several timings to ensure that the results are not misleading. Overhead functions like loading shared libraries might influence short timings considerably.
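
One way to collect several timings of a short-running program is a small shell loop. The following is a minimal sketch, not a standard utility: the function name time_runs is hypothetical, and it assumes a shell such as ksh or bash in which time can prefix a command. The program's standard output is discarded so that screen I/O does not distort the measurements.

```shell
# Hypothetical helper: run a command several times under time.
# Usage: time_runs COUNT command [arguments...]
time_runs() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        printf 'run %d of %d\n' "$i" "$runs" >&2
        # The timing report is written to the shell's standard error.
        time "$@" > /dev/null
        i=$((i + 1))
    done
}

# Example (not executed here): time_runs 5 ./a.out
```

Compare the reported times across runs. The first run often includes one-time costs such as loading shared libraries, so you might choose to disregard it.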

Using the form of the time command that specifies the name of the executable program provides the following:

The elapsed, real, or "wall clock" time, which will be greater than the total charged actual CPU time.
Charged actual CPU time, shown for both system and user execution. The total actual CPU time is the sum of the actual user CPU time and actual system CPU time.

In the following example timings, the sample program being timed displays the following line:

Average of all the numbers is: 4368488960.000000

Using the Bourne shell, the following program timing reports that the program uses 1.19 seconds of total actual CPU time (0.61 seconds in actual CPU time for user program use and 0.58 seconds of actual CPU time for system use) and 2.46 seconds of elapsed time:

$ time a.out
Average of all the numbers is: 4368488960.000000
real 0m2.46s
user 0m0.61s
sys 0m0.58s

Using the C shell, the following program timing reports 1.19 seconds of total actual CPU time (0.61 seconds in actual CPU time for user program use and 0.58 seconds of actual CPU time for system use), about 4 seconds (0:04) of elapsed time, the use of 28% of available CPU time, and other information:

% time a.out
Average of all the numbers is: 4368488960.000000
0.61u 0.58s 0:04 28% 78+424k 9+5io 0pf+0w

Using the bash shell (L*X ONLY), the following program timing reports that the program uses 1.19 seconds of total actual CPU time (0.61 seconds in actual CPU time for user program use and 0.58 seconds of actual CPU time for system use) and 2.46 seconds of elapsed time:

[user@system user]$ time ./a.out
Average of all the numbers is: 4368488960.000000
elapsed 0m2.46s
user 0m0.61s
sys 0m0.58s

Timings that show a large amount of system time may indicate a lot of time spent doing I/O, which might be worth investigating.

If your program displays a lot of text, you can redirect the output from the program on the time command line (see Section 5.1.3). Redirecting output from the program will change the times reported because of reduced screen I/O.

For more information, see time(1).
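
The arithmetic behind these reports (total CPU time is user time plus system time, compared against elapsed time) can be sketched with a short awk filter. The function name cpu_summary is hypothetical, and the sketch assumes timing lines in the Bourne shell form shown above, that is, a label followed by a value such as 0m2.46s.

```shell
# Hypothetical helper: read "real", "user", and "sys" lines from standard input
# and print the total CPU time and its share of the elapsed time.
cpu_summary() {
    awk '
        # Convert a value such as 0m2.46s to seconds.
        function secs(t,    parts) {
            sub(/s$/, "", t)
            split(t, parts, "m")
            return parts[1] * 60 + parts[2]
        }
        $1 == "real" { real = secs($2) }
        $1 == "user" { user = secs($2) }
        $1 == "sys"  { sys  = secs($2) }
        END {
            cpu = user + sys
            printf "cpu=%.2fs elapsed=%.2fs\n", cpu, real
            if (real > 0) printf "cpu/elapsed=%.0f%%\n", 100 * cpu / real
        }
    '
}

# The figures from the Bourne shell example above.
printf 'real 0m2.46s\nuser 0m0.61s\nsys 0m0.58s\n' | cpu_summary
```

For the Bourne shell example, this prints a total of 1.19 seconds of CPU time, which is about 48% of the 2.46 seconds of elapsed time. In bash, for example, you could capture the timing lines while redirecting program output with { time ./a.out > results.lis; } 2> timing.txt and then run cpu_summary < timing.txt.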

In addition to the time command, you might consider modifying the program to call routines within the program to measure execution time. For example:

Compaq Fortran intrinsic procedures, such as SYSTEM_CLOCK, DATE_AND_TIME, and TIME (see the Compaq Fortran Language Reference Manual)
Library routines, such as etime or time (see Section 12.4 or intro(3f)).

5.2.2 Use Profiling Tools

To generate profiling information, use the f90 compiler and the prof, gprof, and pixie (TU*X ONLY) tools.

(TU*X ONLY) If you have installed the Parallel Software Environment (PSE) and need to profile a parallel HPF program, you can use the pprof profiler. For information about parallel profiling techniques and pprof, see the Compaq Parallel Software Environment documentation. The remainder of this section discusses nonparallel profiling.

Profiling identifies areas of code where significant program execution time is spent. Along with the f90 command, use the prof, gprof, and pixie (TU*X ONLY) tools to generate the following profile information:

The CPU time spent in the different routines of the program, or program counter sampling. This type of profiling uses prof.
The manner in which routines are called by other routines, or call graph information. This type of profiling uses gprof.
The execution of basic blocks, called basic block counting. A basic block is a sequence of instructions entered only at the beginning and exited only at the end (no branches). This provides statistics on individual lines of code and is influenced by such optimizations as loop unrolling. This type of profiling uses prof and pixie (TU*X ONLY).
The estimated number of CPU cycles spent for each source line in one or more procedures, or source line CPU cycle use. This type of profiling uses prof and pixie (TU*X ONLY).

Once you have determined those sections of code where most of the program execution time is spent, examine these sections for coding efficiency. Suggested guidelines for improving source code efficiency are provided in Section 5.6.

You can also use the profiler facility provided by the optional DEC FUSE product, which provides an integrated development environment and windowing interface to many Compaq Tru64 UNIX program development facilities (see the DEC FUSE Handbook).

5.2.2.1 Program Counter Sampling (prof)

To obtain program counter sampling data, perform the following steps:

Use the f90 command option -p to compile and link the program:
% f90 -p -O3 -o profsample profsample.f90
If you specify the -c option to prevent linking, you must specify the -p option when you link the program:
% f90 -c -O3 profsample.f90
% f90 -p -O3 -o profsample profsample.o
Consider specifying optimization level -O3 or -inline manual to minimize the inlining of procedures. Once inlined, procedures are not listed as separate routines but as part of the routine into which they have been inlined. Allowing full inlining would result in program counter sampling for a small number of (usually) large routines. This might not help you locate areas of the program where significant program execution time is spent.
Execute the profiled program:
% profsample
During program execution, profiling data is written to a profile data file, whose default name is mon.out. You can execute the program multiple times to generate multiple profile data files, which can be averaged. Use the PROFDIR environment variable to request a different profile data file name.
Run the prof command, which formats the profiling data and displays it in a readable format:
% prof profsample mon.out

You can limit the report created by prof by using prof command options, such as -only, -exclude, or -quit.

For example, if you only want reports on procedures calc_max and calc_min, you could use the following command line to read the profile data file named mon.out:

% prof -only calc_max -only calc_min profsample

The time spent in particular areas of code is reported by prof in the form of a percentage of the total CPU time spent by the program. To reduce the size of the report, you can either:

Request that only certain procedures be included (by using the -only option).
Exclude certain procedures (by using the -exclude option).

When you use the -only or -exclude options, the percentages are still based on all procedures of the application. To obtain percentages calculated by prof that are based on only those procedures included in the report, use the -Only and -Exclude options (use an uppercase initial letter in the option name).

You can use the -quit option to reduce the amount of information reported. For example, the following command prints information on only the five most time-consuming procedures:

% prof -quit 5 profsample

The following command limits information only to those procedures using 10% or more of the total execution time:

% prof -quit 10% profsample

For more information on prof, see prof(1) and the Compaq Tru64 UNIX Programmer's Guide.

5.2.2.2 Call Graph Sampling (gprof)

To obtain call graph information, use the gprof tool. Perform the following steps:

Use the f90 command option -pg when you compile and link the program:
% f90 -pg -O3 -o profsample profsample.f90
If you specify the -c option to prevent linking, you must specify the -pg option both when you compile and when you link the program:
% f90 -pg -c -O3 profsample.f90
% f90 -pg -O3 -o profsample profsample.o
Execute the profiled program:
% profsample
During execution, profiling data is saved to the file gmon.out, unless the environment variable PROFDIR is set.
Run the formatting program gprof:
% gprof profsample gmon.out

The output produced by gprof includes:

Call graph profile
Timing profile (similar to that produced by prof)
Index

For more information on using gprof and its output, see the Compaq Tru64 UNIX Programmer's Guide.

5.2.2.3 Basic Block Counting (pixie and prof)

To obtain basic block counting information, perform the following steps:

Compile and link the program without the -p option:
% f90 -O3 -o profsample profsample.f90
Consider specifying optimization level -O3 or -inline manual to minimize the inlining of procedures (once inlined, procedures are not listed as separate routines but as part of the routine into which they are inlined).
Run the profiling command pixie (TU*X ONLY):
% atom -tool pixie profsample
The pixie command creates (TU*X ONLY):
- A program named profsample.pixie that is equivalent to profsample but contains additional code for counting the execution of each basic block.
- A file named profsample.Addrs, which contains the address of each basic block.
Execute the profiled program profsample.pixie generated by pixie:
% profsample.pixie
This program creates the file profsample.Counts, which contains the basic block counts.
Run prof with the -pixie option to extract and display information from the profsample.Addrs and profsample.Counts files:
% prof -pixie profsample
When you specify the -pixie option (TU*X ONLY), the prof command searches for files with a suffix of .Addrs and .Counts (in this case profsample.Addrs and profsample.Counts).
You can reduce the amount of information in the report created by prof by using the -only, -exclude, -quit, and related options.

To create multiple profile data files, run the program multiple times.

For more information on prof, gprof, and pixie (TU*X ONLY), see prof(1), gprof(1), pixie(1), and the Compaq Tru64 UNIX Programmer's Guide.

5.2.2.4 Source Line CPU Cycle Use (prof and pixie)

You use the same files created by the pixie command (see Section 5.2.2.3) for basic block counting to estimate the number of CPU cycles used to execute each source file line.

To view a report of the number of CPU cycles estimated for each source file line, use the following options with the prof command:

The -pixie (TU*X ONLY) option is required to obtain source line information.
The -heavy option prints an entry for each source code line, including the number of CPU cycles used by that line. Entries are sorted in descending order of CPU cycles and should be limited by using the prof command options that limit the report size, such as -quit, -only, or -exclude.
The -lines option requests source line information, but in the order in which the lines occur in the program (not sorted in descending order of CPU cycles).

Depending on the level of optimization chosen, certain source lines might be optimized away.

The CPU cycle use estimates are based primarily on the instruction type and its operands and do not include memory effects such as cache misses or translation buffer fills.

For example, the following command sequence uses:

The f90 and pixie (TU*X ONLY) commands to create the necessary files.
The prof command to request source line CPU cycle use information for the procedure named calc_max (-only option), sorted in descending order of CPU cycles (-heavy option):

% f90 -o profsample profsample.f90
% atom -tool pixie profsample
% profsample.pixie
% prof -pixie -heavy -only calc_max profsample

5.2.3 Creating and Using Feedback Files and Optionally cord

You can create a feedback file by using a series of commands. Once created, you can specify a feedback file in a subsequent compilation with the f90 command option -feedback. You can also request that cord use the feedback file to rearrange procedures, by specifying the -cord option on the f90 command line.

To create the feedback file, complete these steps:

Compile and link the program. Omit the -p option, but specify the -gen_feedback option:
% f90 -o profsample -gen_feedback profsample.f90
The -gen_feedback option changes the default optimization level to -O0.
To include libraries in the profiling output, specify -non_shared .
Execute the profiling command pixie (TU*X ONLY):
% pixie profsample
The pixie command creates:
- A program named profsample.pixie that is equivalent to profsample but contains additional code for counting the execution of each basic block.
- A file named profsample.Addrs, which contains the address of each basic block.
Execute the profiled program profsample.pixie generated by pixie:
% profsample.pixie
This program creates the file profsample.Counts, which contains the basic block counts.
Run prof with the -pixie and -feedback options:
% prof -pixie -feedback profsample.feedback profsample
This prof command creates the feedback file profsample.feedback.

You can use the feedback file as input to the f90 compiler:

% f90 -feedback profsample.feedback -o profsample profsample.f90

The feedback file provides the compiler with actual execution information, which the compiler can use to improve such optimizations as inlining function calls.

Specify the desired optimization level (-On option) for the f90 command with the -feedback name option (in this example the default is -O4).

You can use the feedback file as input to the f90 compiler and cord, as follows:

% f90 -cord -feedback profsample.feedback -o profsample profsample.f90

The -cord option invokes cord, which reorders the procedures in an executable program to improve program execution, using the information in the specified feedback file. Specify the desired optimization level (-On option) for the f90 command with the -feedback name option (in this example -O4).
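
The feedback steps in this section can be collected into one Bourne-family shell function. This is only a sketch: the function name feedback_build and the FEEDBACK_RUN variable are hypothetical, and the commands are the Tru64 UNIX commands shown above. By default the function only lists the commands. It runs them when FEEDBACK_RUN is set to 1.

```shell
# Hypothetical helper: list (or run) the feedback-file command sequence.
# Usage: feedback_build PROGRAM SOURCE_FILE
feedback_build() {
    prog=$1
    src=$2
    # To also reorder procedures, -cord could be added to the last f90 command.
    for step in \
        "f90 -o $prog -gen_feedback $src" \
        "pixie $prog" \
        "./$prog.pixie" \
        "prof -pixie -feedback $prog.feedback $prog" \
        "f90 -feedback $prog.feedback -o $prog $src"
    do
        echo "$step"
        if [ "${FEEDBACK_RUN:-0}" = 1 ]; then
            $step || return 1    # stop at the first step that fails
        fi
    done
}

feedback_build profsample profsample.f90
```

Listing the commands first lets you review the sequence before running it. The sketch runs the instrumented program as ./profsample.pixie so that it does not depend on the current directory being in PATH.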

5.2.4 Atom Toolkit

(TU*X ONLY) The Atom toolkit includes a programmable instrumentation tool and several prepackaged tools. The prepackaged tools include:

hiprof
Produces a flat profile of an application that shows the execution time spent in a given procedure, and a hierarchical profile that shows the execution time spent in a given procedure and all of its descendants.
pixie
Produces a profile of an application, by procedure, source line, or instruction. It partitions the application into basic blocks and counts the number of times each basic block is executed.
third
Performs memory access checks and detects memory leaks in an application.

To invoke atom tools, use the following general command syntax:

% atom -tool tool-name ...

For more information, see the Compaq Tru64 UNIX Programmer's Guide, atom(1), hiprof(5), pixie(5), and third(5).

Atom does not work on programs built with the -om option.

5.3 Data Alignment Considerations

For optimal performance on Alpha systems, make sure your data is aligned naturally.

A natural boundary is a memory address that is a multiple of the data item's size (data type sizes are described in Table 9-1). For example, a REAL (KIND=8) data item aligned on natural boundaries has an address that is a multiple of 8. An array is aligned on natural boundaries if all of its elements are.

All data items whose starting address is on a natural boundary are naturally aligned. Data not aligned on a natural boundary is called unaligned data.

Although the Compaq Fortran compiler naturally aligns individual data items when it can, certain Compaq Fortran statements (such as EQUIVALENCE) can cause data items to become unaligned (see Section 5.3.1).

Although you can use the f90 command -align keyword options to ensure naturally aligned data, you should check and consider reordering data declarations of data items within common blocks and structures. Within each common block, derived type, or record structure, carefully specify the order and sizes of data declarations to ensure naturally aligned data. Start with the largest size numeric items first, followed by smaller size numeric items, and then nonnumeric (character) data.
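
To see why declaration order matters, the following shell sketch computes the offset of each numeric item in a block and reports whether that offset is a multiple of the item's size. The function name check_layout is hypothetical. The sketch assumes that the block starts on a naturally aligned address and that the items are stored one after another with no padding inserted, which is a simplification of what the compiler does under the various -align settings.

```shell
# Hypothetical helper: given numeric item sizes in bytes, in declaration order,
# print each item's offset when the items are stored back to back.
check_layout() {
    offset=0
    for size in "$@"; do
        if [ $((offset % size)) -eq 0 ]; then
            state=aligned
        else
            state=unaligned
        fi
        echo "size=$size offset=$offset $state"
        offset=$((offset + size))
    done
}

echo "4-byte, 8-byte, then 2-byte item:"
check_layout 4 8 2
echo "same items, largest first:"
check_layout 8 4 2
```

In the first ordering the 8-byte item lands at offset 4 and is reported as unaligned. With the largest item first, every item falls on a natural boundary.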
