Performance Counters Library - command-line options

Performance-Monitoring Counters Library, for Intel/AMD Processors and Linux

This example introduces
   rabbit command-line options --compare m,n --invert i,j
   an experiment to estimate the Pentium Pro/II/III micro-operation processing rate

Download this example: uops, uops.awk, foo.c


Command-line options

--compare m,n
--invert i,j

  Pentium Pro/II/III and Athlon only
  m, n = 8-bit unsigned integer, default 0; "counter mask".
  i, j = 1-bit flag, default 0; "invert flag".


Notes

  In the default case (m = 0), the event counters are incremented by the
  number of events in each cycle.  These options set a threshold value; the
  counter is then incremented by 1 in each cycle in which the number of events
  reaches the threshold (invert flag 0) or falls short of it (invert flag 1).

  Formally, event counter 0 is incremented by this method:

      e = events that occurred in this cycle;
      if (m == 0)
        { counter_0 += e; }
      else
        { if (i == 0)
            { if (e >= m) { counter_0 += 1; } }
          else
            { if (e <  m) { counter_0 += 1; } }
        }

  and similarly for counter 1 using n and j.  On the Athlon, four counters
  are available.
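
  As a rough illustration of the rule above, the following Python sketch
  applies it to a short sequence of per-cycle event counts.  The names
  update_counter and run, and the sample data, are assumptions made for this
  illustration; they are not part of the library or of rabbit.

```python
def update_counter(counter, e, mask, invert):
    """Return the counter value after one cycle in which e events occurred.

    mask   -- the --compare value (m or n), 0..255
    invert -- the --invert flag (i or j), 0 or 1
    """
    if mask == 0:
        return counter + e                         # default: count every event
    if invert == 0:
        return counter + (1 if e >= mask else 0)   # threshold reached
    return counter + (1 if e < mask else 0)        # threshold not reached


def run(events_per_cycle, mask=0, invert=0):
    """Apply update_counter to each cycle in turn, starting from zero."""
    counter = 0
    for e in events_per_cycle:
        counter = update_counter(counter, e, mask, invert)
    return counter


# Hypothetical per-cycle event counts, for illustration only.
cycles = [0, 3, 1, 2, 0, 3]
total_events = run(cycles)                 # all events, default mode
busy = run(cycles, mask=1, invert=0)       # cycles with at least one event
idle = run(cycles, mask=1, invert=1)       # cycles with no events
```

  With the same mask, the invert = 0 and invert = 1 counts are complementary:
  every cycle is counted by exactly one of them.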

  As a shorthand when using a non-zero counter mask, the symbols 'lt',
  'le', 'gt', 'ge' can be used according to these examples:

      --compare lt2      equivalent to      --compare 2 --invert 1
      --compare le2      equivalent to      --compare 3 --invert 1
      --compare ge2      equivalent to      --compare 2 --invert 0
      --compare gt2      equivalent to      --compare 3 --invert 0
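
  The four equivalences follow from the fact that the hardware test is
  either e >= mask or e < mask.  A minimal sketch of the translation, assuming
  a two-letter symbol followed by a decimal integer (the function name
  expand_compare is hypothetical, and this is not rabbit's own parser):

```python
def expand_compare(spec):
    """Translate a shorthand such as 'le2' into a (mask, invert) pair."""
    op, value = spec[:2], int(spec[2:])
    if op == "ge":
        mask, invert = value, 0        # e >= value
    elif op == "gt":
        mask, invert = value + 1, 0    # e >  value  is  e >= value + 1
    elif op == "lt":
        mask, invert = value, 1        # e <  value
    elif op == "le":
        mask, invert = value + 1, 1    # e <= value  is  e <  value + 1
    else:
        raise ValueError("unknown comparison: " + spec)
    # The shorthand applies to a non-zero, 8-bit counter mask.
    if not 1 <= mask <= 255:
        raise ValueError("counter mask out of range: " + spec)
    return mask, invert
```

  For example, expand_compare("le0") gives mask 1 with the invert flag set,
  which counts the cycles in which no event occurred.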


Example - Pentium Pro micro-operations

  The processor decomposes x86 instructions into micro-operations; the
  instruction decoder can produce up to 6 uops per cycle, the execution units
  can consume up to 5 uops/cycle, and the retirement unit can retire (accept
  as completed) up to 3 uops/cycle.  The number of uops/instruction is
  typically 1 for integer instructions, 1-4 for floating-point instructions,
  or more as obtained from the microcode instruction sequencer.  Load and
  store are 1 and 2 uops, respectively.

  First, select the events to investigate uops retired per cycle.

    h% rabbit --e 121,194 -d
    0x79 121  cycles processor is not halted
    0xc2 194  instruction decode and retire unit, micro-operations retired

  Since only two events are used, and we are only looking for an average, we
  do not need to sample frequently.  The test uses a 200 MHz Pentium Pro, and
  a test program 'foo'.  'foo' does integer calculations, while 'foo f' and
  'foo d' do single and double precision floating-point calculations.

    h% rabbit -s 1 --e 121,194 foo
    ...

    Event                                           Events          Events/sec
    ----------------------------------    ----------------    ----------------
    0x79 121  cpu_clk_unhalted                 30066740927        200000023.29
    0xc2 194  uops_retired                     50040030583        332859730.50

  The average number of uops retired/cycle for this program (foo) is thus
  50040030583 / 30066740927 = 1.664.  Now we ask, how often do we retire 0, 1,
  2 or 3 uops/cycle?

    h% rabbit -s 1 --e 194 --compare le0,gt0 foo
    ...

    Event                                           Events          Events/sec
    ----------------------------------    ----------------    ----------------
    0xc2 194  uops_retired                      4381857244         29149528.45
    0xc2 194  uops_retired                     25682833187        170850494.44

    h% rabbit -s 1 --e 194 --compare le1,gt1 foo
    ...

    Event                                           Events          Events/sec
    ----------------------------------    ----------------    ----------------
    0xc2 194  uops_retired                     14258801992         94851503.51
    0xc2 194  uops_retired                     15806728109        105148520.02

  First, check the data:
    29149528.45 + 170850494.44 = 200000022.89, and
    94851503.51 + 105148520.02 = 200000023.53,
  as expected: in each run the two conditions are complementary, so every
  cycle is counted by exactly one of the two counters and the two rates sum
  to the clock rate.  So, about 47.4% of the cycles
  (94851503.51 / 200000023.53) retire at most one micro-operation.  This
  percentage is subject to some variation, as it measures all system activity
  during the course of the program foo, and the program itself has some small
  variations.

    h% rabbit -s 1 --e 194 --compare le2,gt2 foo
    ...

    Event                                           Events          Events/sec
    ----------------------------------    ----------------    ----------------
    0xc2 194  uops_retired                     21503833770        143051213.05
    0xc2 194  uops_retired                      8560694505         56948809.53


    h% rabbit -s 1 --e 194 --compare le3,gt3 foo
    ...

    Event                                           Events          Events/sec
    ----------------------------------    ----------------    ----------------
    0xc2 194  uops_retired                     30067703726        200000022.54
    0xc2 194  uops_retired                               0                0.00

    h% rabbit -s 1 --e 194 --compare le4,gt4 foo
    ...

    Event                                           Events          Events/sec
    ----------------------------------    ----------------    ----------------
    0xc2 194  uops_retired                     30062615214        200000022.91
    0xc2 194  uops_retired                               0                0.00

  From the runs above, the rate of cycles retiring at most k uops is
    le 0 uops retired, cycles / sec:   29149528.45
    le 1 uops retired, cycles / sec:   94851503.51
    le 2 uops retired, cycles / sec:  143051213.05
    le 3 uops retired, cycles / sec:  200000022.54
  Subtracting each row from the next, we derive the rate of cycles retiring
  exactly k uops:
    0 uops retired, cycles / sec:  29149528.45  =  14.57% of cycles
    1 uops retired, cycles / sec:  65701975.06  =  32.85% of cycles
    2 uops retired, cycles / sec:  48199709.54  =  24.10% of cycles
    3 uops retired, cycles / sec:  56948809.49  =  28.47% of cycles
  which agrees with the previous average of 1.664 uops retired / cycle.
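
  The derivation can be checked with a few lines of Python.  This is only a
  sketch using the four "le" rates quoted above; the function name
  uops_distribution is made up for this illustration, and, as a
  simplification, it normalizes by the le 3 rate instead of by each run's
  own total.

```python
def uops_distribution(at_most):
    """Given at_most[k] = rate of cycles retiring at most k uops, return
    (fractions, average), where fractions[k] is the share of cycles
    retiring exactly k uops and average is the mean uops per cycle."""
    total = at_most[-1]    # the last threshold covers every cycle
    exactly = [at_most[0]] + [b - a for a, b in zip(at_most, at_most[1:])]
    fractions = [x / total for x in exactly]
    average = sum(k * f for k, f in enumerate(fractions))
    return fractions, average


# "le k" rates in cycles/sec, copied from the runs above.
le_rates = [29149528.45, 94851503.51, 143051213.05, 200000022.54]
fractions, average = uops_distribution(le_rates)
for k, f in enumerate(fractions):
    print("%d uops per cycle, %5.2f%%" % (k, 100 * f))
print("summary: %.4f average uops per cycle" % average)
```

  The weighted average comes out near 1.665 uops per cycle, close to the
  1.664 measured directly; the small difference is consistent with the
  run-to-run variation noted earlier.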

  The following shell and awk scripts (uops and uops.awk) help to automate this
  calculation.  The shell script uops is listed first, followed by uops.awk.


#!/bin/sh

prog="$*"
trials="0 1 2 3"

if [ -z "$prog" ] || [ "$prog" = "-h" ]
then
  echo "Testing Intel Pentium Pro/II/III micro-operations"
  echo "usage: uops prog [args]"
  echo "  intermediate output to uops.user, uops.system, uops.both"
  echo "  summary to stdout and uops.report"
  exit 1
fi

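# Run the program once without counters; its output is discarded.
$prog 1> /dev/null

echo "Testing Intel Pentium Pro/II/III micro-operations" > uops.report
# printf instead of echo -e: under #!/bin/sh, echo -e is not portable.
printf '%s\n\nuser and system modes combined\n' "$prog" >> uops.report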

rabbit -s 1 --e 121,194 $prog 2> uops.both 1> /dev/null
for i in $trials
do
  rabbit -s 1 --e 194 --compare le$i,gt$i $prog 2>> uops.both 1> /dev/null
done

awk -f uops.awk uops.both >> uops.report

printf '\nuser mode only\n' >> uops.report

rabbit -s 1 --u 1 --o 0 --e 121,194 $prog 2> uops.user 1> /dev/null
for i in $trials
do
  rabbit -s 1 --u 1 --o 0 --e 194 --compare le$i,gt$i $prog 2>> uops.user 1> /dev/null
done

awk -f uops.awk uops.user >> uops.report

printf '\nsystem mode only\n' >> uops.report

rabbit -s 1 --u 0 --o 1 --e 121,194 $prog 2> uops.system 1> /dev/null
for i in $trials
do
  rabbit -s 1 --u 0 --o 1 --e 194 --compare le$i,gt$i $prog 2>> uops.system 1> /dev/null
done

awk -f uops.awk uops.system >> uops.report

cat uops.report


$0 == "---------------------------   performance counters   ---------------------------" {
  ready = 1
  next
}

ready == 0 { next }

$1 == "0x79" {
  cpu_clk_unhalted = $4
  need = 1
  i = 0
  next
}

$1 == "0xc2" && need == 1 {
  uops_retired = $4
  if (cpu_clk_unhalted > 0) {
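    # %.0f rather than %d: cycle counts exceed 2^31, which some awk
    # implementations cannot print with %d.
    printf "average uops = %.4f per cycle over %.0f cycles\n", \
           uops_retired/cpu_clk_unhalted, cpu_clk_unhalted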
  } else {
    printf "clock rate not positive\n"
    exit
  }
  need = 2
  next
}

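# Each threshold run prints two 0xc2 lines: first the "le i" count (s[i]),
# then the "gt i" count (t[i]).
$1 == "0xc2" && need == 2 { s[i] = $4; need = 3; next }

$1 == "0xc2" && need == 3 { t[i] = $4; need = 2; i++; next }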

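END {
  # No threshold runs were parsed (or the clock check failed): no summary.
  if (i == 0) { exit }
  # u[j] = fraction of cycles retiring at most j uops.
  for (j = 0; j < i; j++) { u[j] = s[j]/(s[j] + t[j]) }
  # v[j] = fraction of cycles retiring exactly j uops.
  v[0] = u[0]
  for (j = 1; j < i; j++) { v[j] = u[j] - u[j-1] }
  a = 0
  for (j = 0; j < i; j++) {
    printf "%d uops per cycle, %5.2f%%\n", j, 100 * v[j]
    a = a + j*v[j]
  }
  printf "summary: %.4f average uops per cycle\n", a
}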

Author: Don Heller, dheller@scl.ameslab.gov
Last revised: 2 August 2000