Performance-Monitoring Counters Library, for Intel/AMD Processors and Linux
This example introduces the rabbit command-line options --compare m,n and
--invert i,j, and an experiment to estimate the Pentium Pro/II/III
micro-operation processing rate.
Download this example: uops, uops.awk, foo.c
Command-line options
--compare m,n
--invert i,j
Pentium Pro/II/III and Athlon only
m, n = 8-bit unsigned integer, default 0; "counter mask".
i, j = 1-bit flag, default 0; "invert flag".
Notes
In the default case (m = 0), the event counters are incremented by the
number of events in each cycle. A non-zero counter mask acts as a threshold:
the counter is incremented by 1 in each cycle in which the number of events
reaches the threshold (invert flag 0) or falls short of it (invert flag 1).
Formally, event counter 0 is incremented by this method:
    e = events that occurred in this cycle;
    if (m == 0)
      { counter_0 += e; }
    else
      { if (i == 0)
          { if (e >= m) { counter_0 += 1; } }
        else
          { if (e < m) { counter_0 += 1; } }
      }
and similarly for counter 1 using n and j. On the Athlon, four counters
are available.
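
The update rule above can be tried out on a made-up trace. The following is a minimal Python sketch, not part of rabbit; the function names and the eight-cycle trace are hypothetical and exist only to illustrate how the counter mask and invert flag change what is counted.

```python
def counter_increment(events, mask=0, invert=0):
    """Amount added to a counter in one cycle, following the rule above."""
    if mask == 0:
        return events                      # default: count every event
    if invert == 0:
        return 1 if events >= mask else 0  # threshold reached
    return 1 if events < mask else 0       # threshold not reached


def run_counter(events_per_cycle, mask=0, invert=0):
    """Accumulate a counter over a sequence of per-cycle event counts."""
    return sum(counter_increment(e, mask, invert) for e in events_per_cycle)


# Hypothetical trace: events observed in each of eight cycles.
trace = [0, 3, 1, 2, 3, 0, 1, 3]

print("events counted:       ", run_counter(trace))
print("cycles with >= 2:     ", run_counter(trace, mask=2, invert=0))
print("cycles with <  2:     ", run_counter(trace, mask=2, invert=1))
```

For any non-zero mask, the inverted and non-inverted counts together cover every cycle exactly once, which is the property the experiment below relies on.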
As a shorthand when using a non-zero counter mask, the symbols 'lt',
'le', 'gt', 'ge' can be used according to these examples:
--compare lt2 equivalent to --compare 2 --invert 1
--compare le2 equivalent to --compare 3 --invert 1
--compare ge2 equivalent to --compare 2 --invert 0
--compare gt2 equivalent to --compare 3 --invert 0
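
The four shorthand forms follow one pattern, which can be written down as a small translation function. This Python sketch is illustrative only: expand_shorthand is a hypothetical helper, not rabbit's own parser, and rejecting forms that would yield a mask outside 1..255 is an assumption based on the 8-bit, non-zero counter mask described above.

```python
import re


def expand_shorthand(spec):
    """Translate 'lt2', 'le2', 'ge2' or 'gt2' into a (compare, invert) pair."""
    match = re.fullmatch(r"(lt|le|ge|gt)(\d+)", spec)
    if match is None:
        raise ValueError(f"not a shorthand comparison: {spec!r}")
    op, value = match.group(1), int(match.group(2))
    # 'le N' means 'lt N+1', and 'gt N' means 'ge N+1'.
    compare = value + 1 if op in ("le", "gt") else value
    # 'lt' and 'le' count cycles below the threshold, so they set the invert flag.
    invert = 1 if op in ("lt", "le") else 0
    if not 1 <= compare <= 255:
        # Assumption: the shorthand is only meaningful for a non-zero 8-bit mask.
        raise ValueError(f"counter mask {compare} outside 1..255")
    return compare, invert


for spec in ("lt2", "le2", "ge2", "gt2"):
    compare, invert = expand_shorthand(spec)
    print(f"--compare {spec}  ->  --compare {compare} --invert {invert}")
```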
Example - Pentium Pro micro-operations
The processor decomposes x86 instructions into micro-operations; the
instruction decoder can produce up to 6 uops per cycle, the execution units
can consume up to 5 uops/cycle, and the retirement unit can retire (accept
as completed) up to 3 uops/cycle. The number of uops/instruction is
typically 1 for integer instructions, 1-4 for floating-point instructions,
or more as obtained from the microcode instruction sequencer. Load and
store are 1 and 2 uops, respectively.
First, select the events needed to investigate uops retired per cycle.
h% rabbit --e 121,194 -d
0x79 121 cycles processor is not halted
0xc2 194 instruction decode and retire unit, micro-operations retired
Since only two events are used, and we are only looking for an average, we
do not need to sample frequently. The test uses a 200 MHz Pentium Pro, and
a test program 'foo'. 'foo' does integer calculations, while 'foo f' and
'foo d' do single and double precision floating-point calculations.
h% rabbit -s 1 --e 121,194 foo
...
Event Events Events/sec
---------------------------------- ---------------- ----------------
0x79 121 cpu_clk_unhalted 30066740927 200000023.29
0xc2 194 uops_retired 50040030583 332859730.50
The average number of uops retired/cycle for this program (foo) is thus
50040030583 / 30066740927 = 1.664. Now we ask, how often do we retire 0, 1,
2 or 3 uops/cycle?
h% rabbit -s 1 --e 194 --compare le0,gt0 foo
...
Event Events Events/sec
---------------------------------- ---------------- ----------------
0xc2 194 uops_retired 4381857244 29149528.45
0xc2 194 uops_retired 25682833187 170850494.44
h% rabbit -s 1 --e 194 --compare le1,gt1 foo
...
Event Events Events/sec
---------------------------------- ---------------- ----------------
0xc2 194 uops_retired 14258801992 94851503.51
0xc2 194 uops_retired 15806728109 105148520.02
First, check the data. In every cycle exactly one of the two counters is
incremented, so each pair of rates should sum to the cycle rate of about
200000023 per second:
29149528.45 + 170850494.44 = 200000022.89, and
94851503.51 + 105148520.02 = 200000023.53,
as expected. So, about 47.4% of the cycles (94851503.51 / 200000023.53)
retire at most one micro-operation. This percentage is subject to some
variation, because it includes all system activity during the run of foo,
and the program itself varies slightly from run to run.
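
The consistency check and the percentage can be reproduced with a few lines of Python. This is a sketch using the Events/sec figures copied from the rabbit output above; the function names and the relative tolerance of 1e-6 are arbitrary choices made for illustration.

```python
# Events/sec figures copied from the rabbit runs above.
CYCLE_RATE = 200000023.29                # cpu_clk_unhalted
PAIRS = {
    0: (29149528.45, 170850494.44),      # --compare le0,gt0
    1: (94851503.51, 105148520.02),      # --compare le1,gt1
}


def pair_is_consistent(le_rate, gt_rate, cycle_rate, tolerance=1e-6):
    """True if the 'le' and 'gt' rates sum to the cycle rate within a relative tolerance."""
    return abs((le_rate + gt_rate) - cycle_rate) / cycle_rate <= tolerance


def fraction_at_most(le_rate, gt_rate):
    """Fraction of cycles in which the threshold was not exceeded."""
    return le_rate / (le_rate + gt_rate)


for k, (le_rate, gt_rate) in PAIRS.items():
    ok = pair_is_consistent(le_rate, gt_rate, CYCLE_RATE)
    share = 100 * fraction_at_most(le_rate, gt_rate)
    print(f"at most {k} uops: sum = {le_rate + gt_rate:.2f}, "
          f"consistent = {ok}, share of cycles = {share:.1f}%")
```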
h% rabbit -s 1 --e 194 --compare le2,gt2 foo
...
Event Events Events/sec
---------------------------------- ---------------- ----------------
0xc2 194 uops_retired 21503833770 143051213.05
0xc2 194 uops_retired 8560694505 56948809.53
h% rabbit -s 1 --e 194 --compare le3,gt3 foo
...
Event Events Events/sec
---------------------------------- ---------------- ----------------
0xc2 194 uops_retired 30067703726 200000022.54
0xc2 194 uops_retired 0 0.00
h% rabbit -s 1 --e 194 --compare le4,gt4 foo
...
Event Events Events/sec
---------------------------------- ---------------- ----------------
0xc2 194 uops_retired 30062615214 200000022.91
0xc2 194 uops_retired 0 0.00
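From the runs above, the rate of cycles in which at most k uops were retired is
at most 0 uops retired: 29149528.45 cycles/sec
at most 1 uop retired: 94851503.51 cycles/sec
at most 2 uops retired: 143051213.05 cycles/sec
at most 3 uops retired: 200000022.54 cycles/sec
The gt3 and gt4 counts are zero, consistent with the retirement limit of
3 uops/cycle. Taking successive differences, we derive the rate of cycles in
which exactly k uops were retired:
0 uops retired: 29149528.45 cycles/sec = 14.57% of cycles
1 uop retired: 65701975.06 cycles/sec = 32.85% of cycles
2 uops retired: 48199709.54 cycles/sec = 24.10% of cycles
3 uops retired: 56948809.49 cycles/sec = 28.47% of cycles
The weighted average, 0(0.1457) + 1(0.3285) + 2(0.2410) + 3(0.2847) = 1.665
uops retired / cycle, agrees with the previous average of 1.664 to within the
run-to-run variation.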
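
The derivation above amounts to differencing a cumulative distribution and taking a weighted mean. The Python sketch below repeats it with the measured rates. It is a simplification, not a port of the awk script that follows: it normalizes by the "at most 3" rate, which is valid here only because no cycle retired more than 3 uops, and the function names are hypothetical.

```python
def exact_fractions(at_most_rates):
    """Convert rates of cycles retiring at most k uops (k = 0, 1, ...)
    into fractions of cycles retiring exactly k uops."""
    total = at_most_rates[-1]            # assumes the last entry covers every cycle
    cumulative = [rate / total for rate in at_most_rates]
    fractions = [cumulative[0]]
    for k in range(1, len(cumulative)):
        fractions.append(cumulative[k] - cumulative[k - 1])
    return fractions


def average_uops(fractions):
    """Weighted mean of uops retired per cycle."""
    return sum(k * f for k, f in enumerate(fractions))


# Cycles/sec retiring at most 0, 1, 2, 3 uops, copied from the runs above.
AT_MOST = [29149528.45, 94851503.51, 143051213.05, 200000022.54]

fractions = exact_fractions(AT_MOST)
for k, f in enumerate(fractions):
    print(f"{k} uops per cycle, {100 * f:5.2f}%")
print(f"summary: {average_uops(fractions):.4f} average uops per cycle")
```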
The following shell and awk scripts (uops and uops.awk) help to automate this
calculation.
#!/bin/sh
# uops: measure the distribution of uops retired per cycle with rabbit
prog="$*"
trials="0 1 2 3"
if [ -z "$prog" ] || [ "$prog" = "-h" ]
then
echo "Testing Intel Pentium Pro/II/III micro-operations"
echo "usage: uops prog [args]"
echo " intermediate output to uops.user, uops.system, uops.both"
echo " summary to stdout and uops.report"
exit 1
fi
# run the program once without measurement
$prog 1> /dev/null
echo "Testing Intel Pentium Pro/II/III micro-operations" > uops.report
# printf replaces 'echo -e', which is not portable across /bin/sh implementations
printf '%s\n\nuser and system modes combined\n' "$prog" >> uops.report
rabbit -s 1 --e 121,194 $prog 2> uops.both 1> /dev/null
for i in $trials
do
rabbit -s 1 --e 194 --compare le$i,gt$i $prog 2>> uops.both 1> /dev/null
done
awk -f uops.awk uops.both >> uops.report
printf '\nuser mode only\n' >> uops.report
rabbit -s 1 --u 1 --o 0 --e 121,194 $prog 2> uops.user 1> /dev/null
for i in $trials
do
rabbit -s 1 --u 1 --o 0 --e 194 --compare le$i,gt$i $prog 2>> uops.user 1> /dev/null
done
awk -f uops.awk uops.user >> uops.report
printf '\nsystem mode only\n' >> uops.report
rabbit -s 1 --u 0 --o 1 --e 121,194 $prog 2> uops.system 1> /dev/null
for i in $trials
do
rabbit -s 1 --u 0 --o 1 --e 194 --compare le$i,gt$i $prog 2>> uops.system 1> /dev/null
done
awk -f uops.awk uops.system >> uops.report
cat uops.report

# uops.awk: summarize the rabbit output collected by the uops script
$0 == "--------------------------- performance counters ---------------------------" {
ready = 1
next
}
ready == 0 { next }
$1 == "0x79" {
cpu_clk_unhalted = $4
need = 1
i = 0
next
}
$1 == "0xc2" && need == 1 {
uops_retired = $4
if (cpu_clk_unhalted > 0) {
printf "average uops = %.4f per cycle over %d cycles\n", \
uops_retired/cpu_clk_unhalted, cpu_clk_unhalted
} else {
printf "clock rate not positive\n"
exit
}
need = 2
next
}
$1 == "0xc2" && need == 2 { s[i] = $4; need = 3; next }
$1 == "0xc2" && need == 3 { t[i] = $4; need = 2; i++; next }
END {
for (j = 0; j < i; j++) { u[j] = s[j]/(s[j] + t[j]) }
v[0] = u[0]
for (j = 1; j < i; j++) { v[j] = u[j] - u[j-1] }
a = 0
for (j = 0; j < i; j++) {
printf "%d uops per cycle, %5.2f%%\n", j, 100 * v[j]
a = a + j*v[j]
}
printf "summary: %.4f average uops per cycle\n", a
}
Author: Don Heller, dheller@scl.ameslab.gov
Last revised: 2 August 2000