Performance-Monitoring Counters Library, for Intel/AMD Processors and Linux
This example introduces
   rabbit command-line options --compare m,n --invert i,j
   an experiment to estimate the Pentium Pro/II/III micro-operation processing rate

Download this example: uops, uops.awk, foo.c


Command-line options

   --compare m,n
   --invert i,j

Pentium Pro/II/III and Athlon only.

   m, n = 8-bit unsigned integer, default 0; "counter mask".
   i, j = 1-bit flag, default 0; "invert flag".
Notes

In the default case (m = 0), the event counters are incremented by the number of events in each cycle. A non-zero counter mask sets a threshold instead: the counter is incremented by 1 in each cycle where the number of events reaches the threshold or, if the invert flag is set, in each cycle where it does not reach the threshold. Formally, event counter 0 is incremented by this method:

   e = events that occurred in this cycle;
   if (m == 0) {
      counter_0 += e;
   } else {
      if (i == 0) {
         if (e >= m) { counter_0 += 1; }
      } else {
         if (e < m) { counter_0 += 1; }
      }
   }

and similarly for counter 1 using n and j. On the Athlon, four counters are available.

As a shorthand when using a non-zero counter mask, the symbols 'lt', 'le', 'gt', 'ge' can be used according to these examples:

   --compare lt2   equivalent to   --compare 2 --invert 1
   --compare le2   equivalent to   --compare 3 --invert 1
   --compare ge2   equivalent to   --compare 2 --invert 0
   --compare gt2   equivalent to   --compare 3 --invert 0
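The counting rule and the shorthand table above can be tried out with a small simulation. The following is only a sketch in Python, chosen for illustration; the function names counter_increment and expand_compare are hypothetical and are not part of rabbit. It follows the pseudocode and the four shorthand examples literally.

```python
import re

def counter_increment(e, m, i):
    """Amount added to an event counter in one cycle.

    e: number of events that occurred in this cycle
    m: counter mask (0 means no thresholding)
    i: invert flag (0 or 1)
    """
    if m == 0:
        return e
    if i == 0:
        return 1 if e >= m else 0
    return 1 if e < m else 0

def expand_compare(symbol):
    """Translate a shorthand such as 'le2' into a (mask, invert) pair."""
    match = re.fullmatch(r"(lt|le|ge|gt)(\d+)", symbol)
    if match is None:
        raise ValueError(f"unrecognized shorthand: {symbol!r}")
    op, k = match.group(1), int(match.group(2))
    # Note: 'lt0' and 'ge0' would give mask 0, which disables thresholding;
    # how rabbit itself treats those two cases is not stated in this document.
    if op == "lt":
        return k, 1        # e < k
    if op == "le":
        return k + 1, 1    # e < k + 1
    if op == "ge":
        return k, 0        # e >= k
    return k + 1, 0        # gt: e >= k + 1

# Made-up per-cycle event counts, used only to exercise the rule.
cycles = [0, 1, 3, 2, 0]
for symbol in ("le1", "gt1"):
    m, i = expand_compare(symbol)
    total = sum(counter_increment(e, m, i) for e in cycles)
    print(symbol, "->", (m, i), "counter =", total)
```

Because le1 and gt1 select complementary sets of cycles, the two counters add up to the number of cycles in the sample.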
Example - Pentium Pro micro-operations

The processor decomposes x86 instructions into micro-operations; the instruction decoder can produce up to 6 uops per cycle, the execution units can consume up to 5 uops/cycle, and the retirement unit can retire (accept as completed) up to 3 uops/cycle. The number of uops/instruction is typically 1 for integer instructions, 1-4 for floating-point instructions, or more as obtained from the microcode instruction sequencer. Load and store are 1 and 2 uops, respectively.

First, select the events to investigate uops retired per cycle.

   h% rabbit --e 121,194 -d
   0x79  121  cycles processor is not halted
   0xc2  194  instruction decode and retire unit, micro-operations retired

Since only two events are used, and we are only looking for an average, we do not need to sample frequently. The test uses a 200 MHz Pentium Pro, and a test program 'foo'. 'foo' does integer calculations, while 'foo f' and 'foo d' do single and double precision floating-point calculations.

   h% rabbit -s 1 --e 121,194 foo
   ...
   Event                                        Events       Events/sec
   ---------------------------------- ---------------- ----------------
   0x79  121  cpu_clk_unhalted             30066740927     200000023.29
   0xc2  194  uops_retired                 50040030583     332859730.50

The average number of uops retired/cycle for this program (foo) is thus 50040030583 / 30066740927 = 1.664.

Now we ask, how often do we retire 0, 1, 2 or 3 uops/cycle?

   h% rabbit -s 1 --e 194 --compare le0,gt0 foo
   ...
   Event                                        Events       Events/sec
   ---------------------------------- ---------------- ----------------
   0xc2  194  uops_retired                  4381857244      29149528.45
   0xc2  194  uops_retired                 25682833187     170850494.44

   h% rabbit -s 1 --e 194 --compare le1,gt1 foo
   ...
   Event                                        Events       Events/sec
   ---------------------------------- ---------------- ----------------
   0xc2  194  uops_retired                 14258801992      94851503.51
   0xc2  194  uops_retired                 15806728109     105148520.02

First, check the data: 29149528.45 + 170850494.44 = 200000022.89, and 94851503.51 + 105148520.02 = 200000023.53, as expected.
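The arithmetic above can be reproduced with a few lines of Python. This is only an illustrative sketch; the variable names are made up here, and the numbers are copied from the rabbit output shown above.

```python
# Event counts from the run with events 121 and 194.
cpu_clk_unhalted = 30066740927
uops_retired = 50040030583

average_uops_per_cycle = uops_retired / cpu_clk_unhalted
print(f"average uops retired per cycle: {average_uops_per_cycle:.3f}")

# Each --compare lek,gtk run splits the cycles into two complementary
# classes, so the two Events/sec values should add up to the clock rate
# (about 200 MHz on the test machine).
pairs = {
    "le0,gt0": (29149528.45, 170850494.44),
    "le1,gt1": (94851503.51, 105148520.02),
}
for name, (le_rate, gt_rate) in pairs.items():
    print(f"{name}: {le_rate + gt_rate:.2f} cycles/sec")
```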
So, about 47.4% of the cycles (94851503.51 / 200000023.53) retire at most one micro-operation. This percentage is subject to some variation, as it measures all system activity during the course of the program foo, and the program itself has some small variations.

   h% rabbit -s 1 --e 194 --compare le2,gt2 foo
   ...
   Event                                        Events       Events/sec
   ---------------------------------- ---------------- ----------------
   0xc2  194  uops_retired                 21503833770     143051213.05
   0xc2  194  uops_retired                  8560694505      56948809.53

   h% rabbit -s 1 --e 194 --compare le3,gt3 foo
   ...
   Event                                        Events       Events/sec
   ---------------------------------- ---------------- ----------------
   0xc2  194  uops_retired                 30067703726     200000022.54
   0xc2  194  uops_retired                           0             0.00

   h% rabbit -s 1 --e 194 --compare le4,gt4 foo
   ...
   Event                                        Events       Events/sec
   ---------------------------------- ---------------- ----------------
   0xc2  194  uops_retired                 30062615214     200000022.91
   0xc2  194  uops_retired                           0             0.00

From the table (each figure is a rate of cycles per second, since with a counter mask the counter is incremented once per qualifying cycle)

   le 0 uops retired / sec:   29149528.45
   le 1 uops retired / sec:   94851503.51
   le 2 uops retired / sec:  143051213.05
   le 3 uops retired / sec:  200000022.54

we derive, by subtracting each line from the next,

   0 uops retired / sec:  29149528.45 = 14.57% of cycles
   1 uops retired / sec:  65701975.06 = 32.85% of cycles
   2 uops retired / sec:  48199709.54 = 24.10% of cycles
   3 uops retired / sec:  56948809.49 = 28.47% of cycles

which agrees with the previous average of 1.664 uops retired / cycle.

The following shell and awk scripts (uops and uops.awk) help to automate this calculation.
#!/bin/sh
# uops: measure the distribution of uops retired per cycle for a program

prog="$*"
trials="0 1 2 3"

if [ "X$prog" = "X" -o "$prog" = "-h" ]
then
   echo "Testing Intel Pentium Pro/II/III micro-operations"
   echo "usage: uops prog [args]"
   echo " intermediate output to uops.user, uops.system, uops.both"
   echo " summary to stdout and uops.report"
   exit 1
fi

# run the program once without counters before measuring
$prog 1> /dev/null

echo "Testing Intel Pentium Pro/II/III micro-operations" > uops.report

# user and system modes combined
echo -e "$prog\n\nuser and system modes combined" >> uops.report
rabbit -s 1 --e 121,194 $prog 2> uops.both 1> /dev/null
for i in $trials
do
   rabbit -s 1 --e 194 --compare le$i,gt$i $prog 2>> uops.both 1> /dev/null
done
awk -f uops.awk uops.both >> uops.report

# user mode only
echo -e "\nuser mode only" >> uops.report
rabbit -s 1 --u 1 --o 0 --e 121,194 $prog 2> uops.user 1> /dev/null
for i in $trials
do
   rabbit -s 1 --u 1 --o 0 --e 194 --compare le$i,gt$i $prog 2>> uops.user 1> /dev/null
done
awk -f uops.awk uops.user >> uops.report

# system mode only
echo -e "\nsystem mode only" >> uops.report
rabbit -s 1 --u 0 --o 1 --e 121,194 $prog 2> uops.system 1> /dev/null
for i in $trials
do
   rabbit -s 1 --u 0 --o 1 --e 194 --compare le$i,gt$i $prog 2>> uops.system 1> /dev/null
done
awk -f uops.awk uops.system >> uops.report

cat uops.report
# uops.awk: summarize the rabbit output collected by the uops script

# skip everything before the performance counter section
$0 == "--------------------------- performance counters ---------------------------" {
   ready = 1
   next
}

ready == 0 { next }

# first run: cycle count
$1 == "0x79" {
   cpu_clk_unhalted = $4
   need = 1
   i = 0
   next
}

# first run: total uops retired
$1 == "0xc2" && need == 1 {
   uops_retired = $4
   if (cpu_clk_unhalted > 0) {
      printf "average uops = %.4f per cycle over %d cycles\n", \
         uops_retired/cpu_clk_unhalted, cpu_clk_unhalted
   } else {
      printf "clock rate not positive\n"
      exit
   }
   need = 2
   next
}

# later runs: s[i] = cycles with at most i uops, t[i] = cycles with more than i
$1 == "0xc2" && need == 2 { s[i] = $4; need = 3; next }
$1 == "0xc2" && need == 3 { t[i] = $4; need = 2; i++; next }

END {
   # u[j] = fraction of cycles retiring at most j uops
   for (j = 0; j < i; j++) { u[j] = s[j]/(s[j] + t[j]) }
   # v[j] = fraction of cycles retiring exactly j uops
   v[0] = u[0]
   for (j = 1; j < i; j++) { v[j] = u[j] - u[j-1] }
   a = 0
   for (j = 0; j < i; j++) {
      printf "%d uops per cycle, %5.2f%%\n", j, 100 * v[j]
      a = a + j*v[j]
   }
   printf "summary: %.4f average uops per cycle\n", a
}
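For readers who want to check the END block of uops.awk by hand, its calculation can be sketched in Python as follows. This is an illustration only, not part of the library; the function name uops_distribution is hypothetical, and the event counts are the ones from the four --compare runs (k = 0 to 3) shown earlier.

```python
def uops_distribution(pairs):
    """Mirror the END block of uops.awk.

    pairs[k] = (s, t), where s counts cycles retiring at most k uops
    and t counts cycles retiring more than k uops.
    Returns (fractions, average).
    """
    # u[j] in uops.awk: fraction of cycles retiring at most j uops
    cumulative = [s / (s + t) for s, t in pairs]
    # v[j] in uops.awk: fraction of cycles retiring exactly j uops
    fractions = [cumulative[0]] + [
        cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))
    ]
    average = sum(j * v for j, v in enumerate(fractions))
    return fractions, average

# (le k, gt k) event counts copied from the rabbit output above.
counts = [
    (4381857244, 25682833187),
    (14258801992, 15806728109),
    (21503833770, 8560694505),
    (30067703726, 0),
]

fractions, average = uops_distribution(counts)
for k, v in enumerate(fractions):
    print(f"{k} uops per cycle, {100 * v:5.2f}%")
print(f"summary: {average:.4f} average uops per cycle")
```

Each run is normalized by its own cycle total (s + t), as in uops.awk, so small run-to-run differences in the cycle count do not distort the fractions.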

Author: Don Heller, dheller@scl.ameslab.gov
Last revised: 2 August 2000