Luiz DeRose
Advanced Computing Technology Center
IBM Research
laderose@us.ibm.com
Phone: +1-904-945-2828
Fax: +1-914-945-4269
Version 2.4.2 – July 8, 2002
LICENSE TERMS:
The Hardware Performance Monitor (HPM) Toolkit is distributed under a nontransferable, nonexclusive, and revocable license. The HPM software is provided "AS IS". IBM MAKES NO WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IBM has no obligation to defend or indemnify against any claim of infringement, including, but not limited to, patents, copyright, trade secret, or intellectual property rights of any kind. IBM is under no obligation to maintain, correct, or otherwise support this software. IBM does not represent that the HPM Toolkit will be made generally available. IBM does not represent that any software made generally available will be similar to or compatible with the HPM Toolkit.
1. The HPM Toolkit
2. HPMCOUNT
3. LIBHPM
3.1. Functions
3.2. Output
3.2.1. Overhead and Measurement Error Issues
3.3. Examples of Use
3.3.1. C and C++
3.3.2. Fortran
3.3.3. Multi-threaded Program Instrumentation Issues
3.3.4. Compiling and Linking
4. Summary of Environment Flags and Files
5. HPMVIZ
6. Derived Metrics
7. Release History
The HPM Toolkit was developed for performance measurement of applications running on IBM systems (Power 3 and Power 4). It consists of hpmcount, a utility for program-level measurement; libhpm, an instrumentation library for measuring code sections; and hpmviz, a graphical tool for visualizing the performance files generated by libhpm.
On AIX 4.3.3, the HPM Toolkit requires the PMAPI kernel extensions to be loaded.
For more information on the resource utilization statistics, please refer to the “getrusage” man pages.
Usage:
Sequential programs:
> hpmcount [-o <filename>] [-n] [-s <set>] [-g <group>] [-e ev[,ev]*] <program>
Parallel programs (MPI):
> poe hpmcount [-o <filename>] [-n] [-s <set>] [-g <group>] [-e ev[,ev]*] <program>
or:
> hpmcount [-h] [-c] [-l]
where:
<program> is the program to be executed.
-h displays a help message.
-c lists the events available on each counter.
-l lists all groups (POWER 4 only).
-o <filename> generates an output file: <filename>.<pid>. For parallel programs, this flag creates one file for each process. By default, the output goes to stdout.
-n suppresses output to stdout. This flag is only active when the -o flag is used.
-e ev0,ev1,... (POWER 3 only) is a list of event numbers, separated by commas, where ev<i> corresponds to the event selected for counter <i>.
-g <group> (POWER 4 only).
Valid groups are from 0 to 60. The description of groups is available in /usr/pmapi/lib/POWER4.gps. The default group is 60. Groups considered interesting for application performance analysis are:
· 60, for counts of cycles, instructions, and FP operations (including divides, FMA, loads, and stores).
· 59, for counts of cycles, instructions, TLB misses, loads, stores, and L1 misses.
· 5, for counts of loads from L2, L3, and memory.
· 58, for counts of cycles, instructions, loads from L3, and loads from memory.
· 53, for counts of cycles, instructions, fixed-point operations, and FP operations (includes divides, SQRT, FMA, and FMOV or FEST).
-s <set> selects a predefined set of events.
On Power 4 systems, -s is the same as -g.
On Power 3 systems, the available sets are:
Event set 1 (def.) | Event set 2       | Event set 3           | Event set 4
Cycles             | Cycles            | Cycles                | Cycles
Inst. completed    | Inst. completed   | Loads dispatched      | Inst. dispatched
TLB misses         | TLB misses        | L1 load misses        | Inst. completed
Stores completed   | Stores dispatched | L2 misses             | Cycles w/ 0 inst. completed
Loads completed    | L1 store misses   | Stores dispatched     | I cache misses
FPU0 ops           | Loads dispatched  | L2 store misses       | FXU0 ops
FPU1 ops           | L1 load misses    | Number of write backs | FXU1 ops
FMAs executed      | LSU idle          | LSU idle              | FXU2 ops
Notice that parallel programs generate output for each task. Thus, if the “-o” flag is not used, it is recommended to set the environment variable MP_LABELIO to YES, so that each line of output can be correlated with the corresponding task. Another option is to set the environment variable MP_STDOUTMODE to one of the task IDs (e.g., 0), to discard the output from the other tasks; in this case, only the output from the selected task will appear in stdout.
Also notice that sequential programs compiled with one of the “mp”-prefixed compilers (e.g., mpxlf) are MPI programs and need to be executed as “poe hpmcount … program”.
Libhpm supports multiple instrumentation sections, nested instrumentation, and multiple activations of each instrumented section. When instrumentation is nested, exclusive duration is generated for the outer sections. Average and standard deviation are provided when an instrumented section is activated multiple times.
Libhpm supports OpenMP and threaded applications; in this case, the thread-safe version of the library (libhpm_r) should be used. Also, 64-bit applications can be linked with the 64-bit versions of the library (libhpm64 and libhpm64_r).
Notice that libhpm collects information and performs summarization during run-time. Thus, there could be a considerable overhead if instrumentation sections are inserted inside inner loops.
Libhpm uses the same set of hardware counter events used by hpmcount. The event set to be used can be selected via the environment variable HPM_EVENT_SET.
On Power 4 systems, HPM_EVENT_SET should be set to a group from 0 to 60. The default is group 60. The description of groups is available in /usr/pmapi/lib/POWER4.gps. Groups considered interesting for application performance analysis are:
· 60, for counts of cycles, instructions, and FP operations (including divides, FMA, loads, and stores).
· 59, for counts of cycles, instructions, TLB misses, loads, stores, and L1 misses.
· 5, for counts of loads from L2, L3, and memory.
· 58, for counts of cycles, instructions, loads from L3, and loads from memory.
· 53, for counts of cycles, instructions, fixed-point operations, and FP operations (includes divides, SQRT, FMA, and FMOV or FEST).
On Power 3 systems, HPM_EVENT_SET can be set to a value between 1 and 4. The default is 1. The four event sets on the Power3 are:
HPM_EVENT_SET 1    | HPM_EVENT_SET 2   | HPM_EVENT_SET 3       | HPM_EVENT_SET 4
Cycles             | Cycles            | Cycles                | Cycles
Inst. completed    | Inst. completed   | Loads dispatched      | Inst. dispatched
TLB misses         | TLB misses        | L1 load misses        | Inst. completed
Stores completed   | Stores dispatched | L2 misses             | Cycles w/ 0 inst. completed
Loads completed    | L1 store misses   | Stores dispatched     | I cache misses
FPU0 ops           | Loads dispatched  | L2 store misses       | FXU0 ops
FPU1 ops           | L1 load misses    | Number of write backs | FXU1 ops
FMAs executed      | LSU idle          | LSU idle              | FXU2 ops
The following instrumentation functions are provided (C and Fortran versions):

C                             Fortran
hpmInit( taskID, progName )   f_hpminit( taskID, progName )
hpmStart( instID, label )     f_hpmstart( instID, label )
hpmStop( instID )             f_hpmstop( instID )
hpmTstart( instID, label )    f_hpmtstart( instID, label )
hpmTstop( instID )            f_hpmtstop( instID )
hpmTerminate( taskID )        f_hpmterminate( taskID )
A summary report for each task will be written by default in the file: perfhpm<taskID>.<pid>. Additionally, a set of performance files named hpm<taskID>_<progName>_<pid>.viz will be generated to be used as input for hpmviz. The generation of the “.viz” file can be avoided with the environment flag: HPM_VIZ_OUTPUT = FALSE.
Users can define the output file name with the environment flag HPM_OUTPUT_NAME. Libhpm will still add the extensions _<taskID>.hpm and _<taskID>.viz for the performance and visualization files, respectively. Using this environment flag, one can, for example, set up the output file name to include the date and time. For example, using ksh:
MYDATE=$(date +"%Y%m%d:%H%M%S")
export HPM_OUTPUT_NAME=myprogram_$MYDATE
In this example, the output file for task 27 will have the name: myprogram_yyyymmdd:HHMMSS_0027.hpm
3.2.1. Overhead and Measurement Error Issues
Any software instrumentation is expected to incur some overhead. Since it is not possible to eliminate the overhead, the goal was to minimize it. In the HPM Toolkit, most of the overhead is due to time measurement, which unfortunately tends to be an expensive operation on most systems. A second source of overhead is the run-time accumulation and storage of performance data. Notice that libhpm collects information and performs summarization during run-time. Hence, there could be considerable overhead if instrumentation sections are inserted inside inner loops.
Several issues were considered in order to reduce measurement error. First, most of the library operations are executed before starting the counters, when returning control to the program, or after stopping the counters, when the program calls a “stop” function. However, even at the library level, a few operations must be executed within the counting process, for example, releasing a lock.

Second, since timing collection and capture of hardware counter information are two distinct operations, the order of these operations had to be set; basically, it had to be decided between timing the counters or counting the timer. Since the cost of timing is about one order of magnitude higher than the cost of counting, the timer call precedes the PMAPI call that starts the counters in the HPM “start” function, while the first two operations executed by the HPM “stop” function are stopping the counters, followed by calling the timer function. Thus, there is a small error in the time measurement, but minimal error in the counting process.

Finally, to access and read the counters, the library calls lower-level routines from the operating system. Hence, there are always some instructions executed by the kernel that are accounted as part of the program. In order to compensate for this measurement error, the HPM Toolkit uses the hardware counters during the initialization and finalization of the library to estimate the cost of one call to the start and stop functions. This estimated overhead is subtracted from the values obtained on each instrumented code section. With this approach, the measurement error becomes close to zero. However, since this is a statistical approximation, in some situations this approach fails. In that case, the following message is printed on stderr: “WARNING: Measurement error for <event name> not removed”, which indicates that the estimated overhead was not subtracted from the measured values.
One can deactivate the procedure that attempts to remove measurement errors by setting the environment variable: HPM_WITH_MEASUREMENT_ERROR to TRUE (1).
declaration:
#include "libhpm.h"
use:
hpmInit( taskID, "my program" );
hpmStart( 1, "outer call" );
do_work();
hpmStart( 2, "computing meaning of life" );
do_more_work();
hpmStop( 2 );
hpmStop( 1 );
hpmTerminate( taskID );
The syntax for C and C++ is the same. However, the include files are different, since the libhpm routines must be declared as having extern "C" linkage in C++.
Fortran programs should call the functions with prefix “f_”. Also, notice that the following declaration is required on all source files that have instrumentation calls.
declaration:
#include "f_hpm.h"
use:
call f_hpminit( taskID, "my program" )
call f_hpmstart( 1, "Do Loop" )
do …
   call do_work()
   call f_hpmstart( 5, "computing meaning of life" )
   call do_more_work()
   call f_hpmstop( 5 )
end do
call f_hpmstop( 1 )
call f_hpmterminate( taskID )
3.3.3. Multi-Threaded Program Instrumentation Issues
When placing instrumentation inside of parallel regions, one should use a different ID number for each thread, as shown in the following Fortran example:
!$OMP PARALLEL
!$OMP&PRIVATE (instID)
      instID = 30 + omp_get_thread_num()
      call f_hpmtstart( instID, "computing meaning of life" )
!$OMP DO
      do ...
         call do_work()
      end do
      call f_hpmtstop( instID )
!$OMP END PARALLEL
Notice that the functions hpmTstart and hpmTstop are required for threaded programs. Also, the parameter instID should always be a variable or a number; it cannot be an expression. This is because the include file contains a set of "define" statements that are used during the pre-processing phase that collects line numbers and file names. Finally, notice that the library accepts the use of the same instID for different threads; however, the counters will be accumulated across all instances with the same instID.
In order to use libhpm, one should add libpmapi.a, libhpm.a (or libhpm_r.a), and libm.a to the link step:
HPM_DIR = <<<ENTER HPM Home directory>>>
HPM_INC = -I$(HPM_DIR)/include
HPM_LIB = -L$(HPM_DIR)/lib -lhpm_r -lpmapi -lm
FFLAGS  = -qsuffix=cpp=f <<<Other Flags>>>

my.x : my.f
	$(FF) $(HPM_INC) $(FFLAGS) my.f $(HPM_LIB) -o my.x
The flag “-qsuffix=cpp=f” is only required for the compilation of Fortran programs with extension “.f”.
HPM_MEM_LATENCY    400
HPM_L3_LATENCY     102
HPM_L35_LATENCY    150
HPM_L2_LATENCY      12
HPM_L25_LATENCY     72
HPM_L275_LATENCY   108
HPM_TLB_LATENCY    700
HPM_EVENT_SET        5
On Power 3 systems, users can also specify an event set with the file libHPMevents. This file takes precedence over the environment variable. Each line in the file specifies one event from the hardware counters; only one event from each counter can be used (the Power3 (630) has 8 counters, the 604e has 4 counters). Each line should contain the counter number, the event number, the event name, and a description, with the name and description each terminated by “#”.
libHPMevents example:
3 1  PM_CYC# Cycles#
4 5  PM_FPU0_CMPL# FPU 0 instructions#
1 35 PM_FPU1_CMPL# FPU 1 instructions#
0 5  PM_IC_MISS# I cache misses#
2 5  PM_LD_MISS_L1# Load misses in L1#
7 0  PM_TLB_MISS# TLB misses#
5 5  PM_CBR_DISP# Branches#
6 3  PM_MPRED_BR# Mispredicted branches#
There are some consistency checks for this file, but in general the user is expected to know enough about the hardware counters to create and use this file.
Usage:
> hpmviz [<performance files>]
Hpmviz takes as input the performance files (“.viz”) generated by libhpm. If the performance files are not provided on the command line, hpmviz will display a dialog box for user input. Users can select a single file by left-clicking on a file name, or multiple files by using the <Shift> and/or <Ctrl> keys. The <Shift> key allows the selection of a range of files (from the last one selected to the current selection), while the <Ctrl> key allows the selection of multiple files in any order.
The main window of the hpmviz graphical user interface is divided into two panes. The left pane displays, for each instrumented section identified by its label, the inclusive duration (i.e., the total wall-clock time executing the corresponding code region), the exclusive duration (i.e., the wall-clock time of the instrumented code region, excluding the time from inner instrumented regions), and the count. The instrumented sections are sorted by “Label”. Left-clicking on any of the column headers will sort the data in the corresponding column; the first click sorts in ascending order, while the second sorts in descending order.
Right-clicking on an instrumentation section brings up a “metrics” window displaying the node ID, thread ID, count, exclusive duration, inclusive duration, and the derived hardware metrics. This window can be closed by typing <Ctrl>W or by clicking the “Close” button.

There are also two menu options in the metrics window: Metrics Options and Precision. The “Metrics Options” menu brings up a list that allows the user to select the metrics to be displayed; clicking on the top of this list will make it into an “X Windows” dialog box. The “Precision” menu allows the user to indicate to hpmviz the precision used when running the program (double or single).

Some values in the metrics display may be highlighted in red, indicating that the metric value is below a threshold value in a predefined range of average values for the metric. Similarly, a number in light gray indicates that the metric value is above such a threshold. Notice that some of the predefined ranges depend on the precision used in the program. The default precision assumed is "double", but the user can change it to "single" with the menu option described above. Any of the columns in the metrics display can be sorted by clicking the corresponding header; the first click sorts the values in ascending order, while the second sorts in descending order.
Left-clicking on an instrumentation section in the main window brings up the corresponding section of the source code, highlighted, in the right pane. If the corresponding source file is not available in the directory where hpmviz is being executed, a dialog box will be displayed so the user can select the source file. At the top of the source code pane there is a set of tabs, one for each instrumented module; the user can select a module to be displayed by clicking on the corresponding tab.
The “File” menu options provided in the main window allow one to open a new set of performance files, close the current data, close all data, or quit hpmviz. The “open data”, “close data”, and “quit” operations can also be selected with the keys <Ctrl>O, <Ctrl>C, and <Ctrl>Q, respectively. The “open” command will bring up the dialog box for the selection of the performance files.
In addition to presenting the raw counter data, the HPM Toolkit also computes derived metrics, depending on the hardware events that are selected to be counted. The following derived metrics are supported:
User time = Cycles / Processor frequency
User time / Wall clock time
Instructions completed / Cycles
0.000001 * Instructions completed / Wall clock time
Instructions completed / Instructions cache misses
100 * Instructions completed / Instructions dispatched
100 * Zero instructions completed / Cycles
Total LS = Loads + Stores
100 * LSU idle / Cycles
Instructions completed / Total LS
Loads / Load misses in L1
Load misses in L1 / Load misses in L2
Stores / Store misses in L1
Loads / Master generated load op not retried
Loads / TLB misses
Total LS / TLB misses
Total LS / (Load misses in L1 + Store misses in L1)
Total LS / (Load misses in L2 + Store misses in L2)
100 * ( 1 - ( (Load misses in L1 + Store misses in L1) / Total LS ) )
100 * ( 1 - ( (Load misses in L2 + Store misses in L2) / Total L1 misses ) )
Power3: 0.000001 * (L2 misses + Write backs) * Cache Line Size
Power4: 0.000001 * Data loaded from memory * 512
Memory traffic / Wall clock time
100 * Snoop hit occurred / Snoop requests
( FPU 0 + FPU 1 ) / Cycles
Power3: flip = FPU 0 instructions + FPU 1 instructions + FMAs executed
Power4: flip = FPU 0 instructions + FPU 1 instructions + FMAs executed - FPU stores
Mflip/s = 0.000001 * flip / Wall clock time
wflip = flip + (HPM_DIV_WEIGHT - 1) * Divides
MWflip/s = 0.000001 * wflip / Wall clock time
flip / Total LS
100 * FMAs executed * 2 / flip
FXU 0 instructions + FXU 1 Instructions + FXU 2 Instructions
100 * Mispredicted branches / Branches
100 * TLB misses / Cycles
User estimated TLB Miss latency * TLB Misses / Processor frequency
§ Percentage of loads from memory per cycle:
100 * Total loads from memory / Cycles
§ Estimated latency from loads from memory:
Memory latency * loads from memory / Processor frequency
§ Total loads from L3 (L3 loads):
Data loaded from L3 + Data loaded from L3.5
§ L3 traffic:
L3 loads * Cache Line Size * 0.000001
§ L3 bandwidth:
L3 traffic / wall clock time
§ L3 Load miss rate:
Total loads from memory / (total loads from L3 + total loads from memory)
§ Percentage of L3 loads per cycle:
100 * L3 loads / Cycles
§ Estimated latency from loads from L3:
( (L3 latency * L3 data loads) + (L3.5 latency * L3.5 data loads) ) / Processor frequency
or
Average L3 latency * Total loads from L3 / Processor frequency
§ Total loads from L2 (L2 loads):
Sum (data loaded from (L2, L2.5(shared), L2.5(mod), L2.75(shared), and L2.75(mod))).
§ L2 traffic:
L2 loads * Cache Line Size * 0.000001
§ L2 bandwidth:
L2 traffic / wall clock time
§ L2 Load miss rate:
(loads from memory + L3 loads) / (L2 loads + L3 loads + loads from memory)
§ Percentage of L2 loads per cycle:
100 * L2 loads / Cycles
§ Estimated latency from loads from L2:
( (L2 lat. * L2 loads) + (L2.5 lat. * L2.5 loads) + (L2.75 lat. * L2.75 loads) ) / Processor frequency
or
Average L2 latency * Total loads from L2 / Processor frequency
Version 2.4.2 (07/08/2002)
Version 2.4.1 (04/28/2002)
Version 2.3.1 (11/01/2001)
Version 2.2.3: (02/09/2001)
Version 2.2.1:
Version 2.1:
Version 1.1: