|
Scope and Motivation |
Scope of This Tutorial:
Application Related Factors:
Hardware Related Factors:
Software Related Factors:
Performance Considerations and Strategies |
Procedure | % CPU Time |
---|---|
main()
procedure1() procedure2() procedure3() |
13%
17% 20% 50% |
A 20% increase in the performance of procedure3() results in a 10% performance increase overall.
A 20% increase in the performance of main() results in only a 2.6% performance increase overall.
Timers |
Timer | Usage | Wallclock / CPU Time | Resolution | Languages | Portable? |
---|---|---|---|---|---|
time | shell / script | both | 1/100th second | any | yes |
timex | shell / script | both | 1/100th second | any | yes |
gettimeofday | subroutine | wallclock | microsecond | C/C++ | yes |
read_real_time | subroutine | wallclock | nanosecond | C/C++ | no |
rtc | subroutine | wallclock | microsecond | Fortran | no |
irtc | subroutine | wallclock | nanosecond | Fortran | no |
dtime_ | subroutine | CPU | 1/100th second | Fortran | no |
etime_ | subroutine | CPU | 1/100th second | Fortran | no |
mclock | subroutine | CPU | 1/100th second | Fortran | no |
timef | subroutine | wallclock | millisecond | Fortran | no |
MPI_Wtime | subroutine | wallclock | microsecond | C/C++, Fortran | yes |
AIX Trace Facility | shell / script / subroutine | wallclock | microsecond | any | no |
1 2 3 4 5 6 7 8 1.150u 0.020s 0:01.76 66.4% 15+3981k 24+10io 0pf+0w |
Explanation:
real 0m2.58s user 0m1.14s sys 0m0.03sExplanation:
Note that much of the timex command's output is NOT described in the timex man page. Some of it may be understood from reading the sar command man page. |
Example Codes (C and Fortran)
#include "mpi.h" int main(argc,argv) int argc; char *argv[]; { int n=0,m; double start,end,resolution; MPI_Init(&argc,&argv); start = MPI_Wtime(); /* start time */ for (m=0;m<2000000;m++) n = n + m; end = MPI_Wtime(); /* end time */ resolution = MPI_Wtick(); printf("Wallclock times(secs): start= %f end= %f \n",start,end); printf("elapsed= %e resolution= %e\n", end-start,resolution); MPI_Finalize (); } |
Example Code
PROGRAM RTC_TIME REAL(8) A, B, rtc A = rtc() DO M = 1,2000000 N = N + M END DO B = rtc() PRINT *, 'Seconds elapsed: ',B - A END |
Example Code
PROGRAM IRTC_TIME INTEGER(8) A, B, irtc A = irtc() DO M = 1,2000000 N = N + M END DO B = irtc() PRINT *, 'Nanoseconds elapsed: ',B - A END |
Example Code
PROGRAM DTIME_TIME REAL(4) DELTA, dtime_ TYPE TB_TYPE SEQUENCE REAL(4) USRTIME REAL(4) SYSTIME END TYPE TYPE (TB_TYPE) DTIME_STRUCT DELTA = dtime_(DTIME_STRUCT) DO M = 1,2000000 N = N + M END DO DELTA = dtime_(DTIME_STRUCT) PRINT *, 'User time: ',DTIME_STRUCT%USRTIME, 'seconds' PRINT *, 'System time: ',DTIME_STRUCT%SYSTIME, 'seconds' PRINT *, 'Elapsed time: ',DELTA, 'seconds' END |
Example Code
PROGRAM ETIME_TIME REAL(4) ELAPSED, etime_ TYPE TB_TYPE SEQUENCE REAL(4) USRTIME REAL(4) SYSTIME END TYPE TYPE (TB_TYPE) ETIME_STRUCT DO M = 1,2000000 N = N + M END DO ELAPSED = etime_(ETIME_STRUCT) PRINT *, 'User time: ',ETIME_STRUCT%USRTIME, 'seconds' PRINT *, 'System time: ',ETIME_STRUCT%SYSTIME, 'seconds' PRINT *, 'Elapsed time: ',ELAPSED, 'seconds' END |
Example Code
PROGRAM MCLOCK_TIME INTEGER(4) T1, T2, mclock REAL(4) SECONDS T1 = mclock() DO M = 1,2000000 N = N + M END DO T2 = mclock() SECONDS = REAL(T2 - T1) / 100 PRINT *, 'Elapsed CPU time: ',SECONDS,'seconds' END |
Example Code
PROGRAM TIMEF_TIME REAL(8) ELAPSED, timef ELAPSED = timef() DO M = 1,2000000 N = N + M END DO ELAPSED = timef() PRINT *, 'Elapsed time: ',ELAPSED,'milliseconds' END |
Example:
Thread create=465, thread terminate=467, wait lock=46d.
rm trcfile
trace -a -o trcfile -j 465,467,46d; dotprod; trcstop
trcrpt -o trace.report -O pid=on trcfile
Sample AIX Trace Facilty output |
Fri Jul 19 00:02:05 2002 System: AIX frost067 Node: 5 Machine: 006008594C00 Internet Address: 86092F43 134.9.47.67 The system contains 16 cpus, of which 16 were traced. Buffering: Kernel Heap This is from a 32-bit kernel. Tracing only these hooks, 465,467,46d trace -a -o trcfile -j 465,467,46d ID PID I ELAPSED_SEC DELTA_MSEC APPL SYSCALL KERNEL INTERRUPT 001 70640 0.000000000 0.000000 TRACE ON channel 0 Fri Jul 19 00:02: 05 2002 465 -1 0.000482511 0.482511 thread_create: pid=7825 0 tid=226787 priority=60 policy=0 46D 85616 0.000495484 0.012973 wait_on_lock: pid=85616 tid=238427 lockaddr=5819F0 465 -1 0.027897260 27.401776 thread_create: pid=7731 8 tid=146769 priority=61 policy=0 467 77318 0.028225717 0.328457 thread_terminate: pid=7 7318 tid=146769 465 -1 0.028235827 0.010110 thread_create: pid=7731 8 tid=206215 priority=61 policy=0 46D -1 0.028251016 0.015189 wait_on_lock: pid=77318 tid=238429 lockaddr=5819F0 465 -1 0.028633968 0.382952 thread_create: pid=7731 8 tid=151151 priority=61 policy=0 467 77318 0.028795141 0.161173 thread_terminate: pid=7 7318 tid=206215 467 77318 0.028861449 0.066308 thread_terminate: pid=7 7318 tid=151151 465 -1 0.028876390 0.014941 thread_create: pid=7731 8 tid=233391 priority=61 policy=0 46D -1 0.028886636 0.010246 wait_on_lock: pid=77318 tid=238429 lockaddr=5819F0 467 77318 0.029303598 0.416962 thread_terminate: pid=7 7318 tid=233391 002 -1 0.055766970 26.463372 TRACE OFF channel 0000 Fri Jul 19 00:02:05 2002 |
AIX System Commands |
Sample output |
% lscfg INSTALLED RESOURCE LIST The following resources are installed on the machine. +/- = Added or deleted from Resource List. * = Diagnostic support not available. Model Architecture: chrp Model Implementation: Multiple Processor, PCI bus + sys0 00-00 System Object + sysplanar0 00-00 System Planar + mem0 00-00 Memory + proc0 00-00 Processor + L2cache0 00-00 L2 Cache * pmc0 00-00 Power Management Controller + proc1 00-01 Processor + proc2 00-02 Processor + proc3 00-03 Processor + proc4 00-04 Processor + proc5 00-05 Processor + proc6 00-06 Processor + proc7 00-07 Processor + proc8 00-08 Processor + proc9 00-09 Processor + proc10 00-10 Processor + proc11 00-11 Processor + proc12 00-12 Processor + proc13 00-13 Processor + proc14 00-14 Processor + proc15 00-15 Processor * pci0 00-c8000000 PCI Bus * isa0 10-58 ISA Bus + fda0 01-D1 Standard I/O Diskette Adapter + sa0 01-S1 Standard I/O Serial Port + tty0 01-S1-00-00 Asynchronous Terminal + sa1 01-S2 Standard I/O Serial Port + tty1 01-S2-00-00 Asynchronous Terminal + sa2 01-S3 Standard I/O Serial Port + tty2 01-S3-00-00 Asynchronous Terminal + scsi0 10-60 Wide/Fast-20 SCSI I/O Controller + hdisk0 10-60-00-0,0 Other SCSI Disk Drive + hdisk1 10-60-00-1,0 Other SCSI Disk Drive + ent0 10-68 IBM 10/100 Mbps Ethernet PCI Adapter (23100020) * pci1 00-c8100000 PCI Bus * pci2 00-d8000000 PCI Bus * pci3 00-d8100000 PCI Bus + ent1 J0-58 Gigabit Ethernet-SX PCI Adapter (14100401) * pci4 00-d8200000 PCI Bus * pci5 00-d8300000 PCI Bus + ent2 L0-58 Gigabit Ethernet-SX PCI Adapter (14100401) * pci6 00-d8400000 PCI Bus * pci7 00-d8500000 PCI Bus + ent3 N0-58 Gigabit Ethernet-SX PCI Adapter (14100401) * pci8 00-d8600000 PCI Bus * pci9 00-d8700000 PCI Bus + ent4 Q0-58 Gigabit Ethernet-SX PCI Adapter (14100401) * pci10 00-d8800000 PCI Bus * pci11 00-d8900000 PCI Bus + ent5 S0-58 Gigabit Ethernet-SX PCI Adapter (14100401) * pci12 00-d8a00000 PCI Bus * pci13 00-d8b00000 PCI Bus + ent6 U0-58 Gigabit Ethernet-SX PCI Adapter (14100401) + css0 00-e0000000 SP Switch2 Communications Adapter + css1 00-e8000000 SP Switch2 Communications Adapter |
Sample output |
% lsattr -EH -l sys0 lsattr -EH -l sys0 attribute value description use r_settable keylock normal State of system keylock at boot time Fal se maxbuf 20 Maximum number of pages in block I/O BUFFER CACHE Tru e maxmbuf 0 Maximum Kbytes of real memory allowed for MBUFS Tru e maxuproc 256 Maximum number of PROCESSES allowed per user Tru e autorestart false Automatically REBOOT system after a crash Tru e iostat false Continuously maintain DISK I/O history Tru e realmem 16777216 Amount of usable physical memory in Kbytes Fal se conslogin enable System Console Login Fal se fwversion IBM,NI01261 Firmware version and revision levels Fal se maxpout 0 HIGH water mark for pending write I/Os per file Tru e minpout 0 LOW water mark for pending write I/Os per file Tru e fullcore true Enable full CORE dump Tru e pre430core false Use pre-430 style CORE dump Tru e ncargs 6 ARG/ENV list size in 4K byte blocks Tru e rtasversion 1 Open Firmware RTAS version Fal se modelname IBM,9076-N81 Machine name Fal se systemid IBM,011600859 Hardware system identifier Fal se boottype disk N/A Fal se SW_dist_intr false Enable SW distribution of interrupts Tru e cpuguard disable CPU Guard Tru e frequency 125000000 System Bus Frequency Fal se |
Sample output |
% lsps -a Page Space Physical Volume Volume Group Size %Used Active Auto Type paging00 hdisk1 rootvg 12032MB 1 yes yes lv hd6 hdisk0 rootvg 12032MB 1 yes yes lv |
vmstat [options] n [m]n = interval in seconds
Sample output |
% vmstat 2 3 kthr memory page faults cpu ----- ----------- ------------------------ ------------ ----------- r b avm fre re pi po fr sr cy in sy cs us sy id wa 2 1 504038 3489832 0 0 0 0 0 0 2760 8705 1730 4 3 92 0 1 0 503080 3490975 0 0 0 0 0 0 5676 8619 4277 0 15 85 0 0 0 503073 3490982 0 0 0 0 0 0 1937 2707 232 0 0 99 0* 1 page = 4096 bytes |
r = kernel threads placed on the run queue
b = kernel threads blocked avm = active virtual memory pages fre = free memory pages re= pager input/output list pi = page-ins from paging space | po = page-outs to paging space
fr = pages freed sr = pages scanned cy = scan cycles in = device interrupts sy = system calls | cs = context switches
us = % user cpu utilization sy = % system cpu utilization id = % cpu idle time wa = % cpu time waiting on I/O |
netstat [options] nn = interval in seconds
Sample output |
% netstat -i - shows state of all configured interfaces Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll en0 1500 link#2 0.4.ac.ec.11.e 319849 0 2787430 0 0 en0 1500 134.9.46 efrost067 319849 0 2787430 0 0 en1 1500 link#3 0.4.ac.7c.d6.3a 60416480 0 119342413 0 0 en1 1500 134.9.45 frost067-ge0.llnl 60416480 0 119342413 0 0 en2 1500 link#4 0.4.ac.7c.d3.1d 111043029 0 131259660 3 0 en2 1500 134.9.6 frost067-ge1.llnl 111043029 0 131259660 3 0 en3 9000 link#5 0.4.ac.7c.d6.43 12557524 0 14013408 0 0 en3 9000 134.9.60 frost067-ge2.llnl 12557524 0 14013408 0 0 en4 9000 link#6 0.4.ac.7c.d5.37 12555933 0 13775758 0 0 en4 9000 134.9.61 frost067-ge3.llnl 12555933 0 13775758 0 0 en5 9000 link#7 0.4.ac.7c.d5.36 12711192 0 13896158 0 0 en5 9000 134.9.62 frost067-ge4.llnl 12711192 0 13896158 0 0 en6 9000 link#8 0.4.ac.7c.d5.21 12408381 0 12936330 0 0 en6 9000 134.9.63 frost067-ge5.llnl 12408381 0 12936330 0 0 css0 65504 link#9 20162383 0 47045300 0 0 css0 65504 134.9.48 frost067-css0 20162383 0 47045300 0 0 css1 65504 link#10 20155508 0 47041708 0 0 css1 65504 134.9.49 frost067-css1 20155508 0 47041708 0 0 ml0 65504 link#11 0 0 75604311 4367 0 ml0 65504 134.9.47 frost067 0 0 75604311 4367 0 ml0 65504 134.9.47 frostgw 0 0 75604311 4367 0 lo0 16896 link#1 1902914 0 1903901 0 0 lo0 16896 127 localhost 1902914 0 1903901 0 0 lo0 16896 ::1 1902914 0 1903901 0 0 |
iostat [options] n [m]n = interval in seconds
Sample output |
% iostat 10 3 tty: tin tout avg-cpu: % user % sys % idle % iowait 1.9 569.4 4.1 3.4 92.2 0.3 Disks: % tm_act Kbps tps Kb_read Kb_wrtn hdisk0 1.2 79.3 6.2 497100 9818957 hdisk1 0.4 56.7 1.8 48840 7328280 tty: tin tout avg-cpu: % user % sys % idle % iowait 0.3 1183.6 0.4 0.2 99.3 0.0 Disks: % tm_act Kbps tps Kb_read Kb_wrtn hdisk0 0.0 1.6 0.1 0 16 hdisk1 0.0 1.6 0.1 0 16 tty: tin tout avg-cpu: % user % sys % idle % iowait 0.1 418.2 0.1 0.7 99.2 0.0 Disks: % tm_act Kbps tps Kb_read Kb_wrtn hdisk0 3.9 67.7 13.0 0 677 hdisk1 0.2 10.4 1.5 0 104 |
Sample output |
% ps ux USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND userjoe 41728 62.9 0.0 352 396 pts/15 A 18:39:14 5:02 a.out userjoe 41728 22.3 33.0 412 478 pts/15 A 18:39:14 5:02 b.out userjoe 41502 0.0 0.0 696 828 pts/15 A 17:25:27 0:03 -tcsh userjoe 13334 0.0 0.0 200 252 pts/15 A 18:40:14 0:00 ps ux |
Profilers |
Sample prof output (abbreviated) |
Name %Time Seconds Cumsecs #Calls msec/call .fft 51.8 0.59 0.59 1024 0.576 .main 40.4 0.46 1.05 1 460. .bit_reverse 7.9 0.09 1.14 1024 0.088 .cos 0.0 0.00 1.14 256 0.00 .sin 0.0 0.00 1.14 256 0.00 .catopen 0.0 0.00 1.14 1 0. .setlocale 0.0 0.00 1.14 1 0. ._doprnt 0.0 0.00 1.14 7 0. ._flsbuf 0.0 0.00 1.14 11 0.0 ._xflsbuf 0.0 0.00 1.14 7 0. ._wrtchk 0.0 0.00 1.14 1 0. ._findbuf 0.0 0.00 1.14 1 0. ._xwrite 0.0 0.00 1.14 7 0. .free 0.0 0.00 1.14 2 0. .free_y 0.0 0.00 1.14 2 0. .write 0.0 0.00 1.14 7 0. .exit 0.0 0.00 1.14 1 0. .memchr 0.0 0.00 1.14 19 0.0 .atoi 0.0 0.00 1.14 1 0. .__nl_langinfo_std 0.0 0.00 1.14 4 0. .gettimeofday 0.0 0.00 1.14 8 0. .printf 0.0 0.00 1.14 7 0. |
prof -m mon.out.0 prof -m mon.out.0 mon.out.1 mon.out.2 prof -m mon.out.*
Note: If you view more than one mon.out file at a time, the results are averaged into a single report. |
gprof myprog gmon.out.0 gprof myprog gmon.out.0 gmon.out.1 gmon.out.2 gprof myprog gmon.out.*
Note: If you view more than one gmon.out file at a time, the results are summed into a single report. |
Additional Notes About prof and gprof:
xlc_r myprog.f -o myprog
tprof -vx myprog
-rw------- 1 blaise blaise 10078 Jul 20 12:06 __ldmap -rw------- 1 blaise blaise 1808 Jul 20 12:06 __prof.all -rw------- 1 blaise blaise 0 Jul 20 12:06 __tmp.j -rw------- 1 blaise blaise 1451 Jul 20 12:06 __tmp.k -rw------- 1 blaise blaise 0 Jul 20 12:06 __tmp.o -rw------- 1 blaise blaise 0 Jul 20 12:06 __tmp.s -rw------- 1 blaise blaise 0 Jul 20 12:06 __tmp.u -rw------- 1 blaise blaise 89982 Jul 20 12:06 __trc_rpt2
Trace is done now open: check_unix_file failure: Permission denied * Samples from __trc_rpt2 * Reached second section of __trc_rpt2
Sample tprof output |
Process PID TID Total Kernel User Shared Other ======= === === ===== ====== ==== ====== ===== prime 99578 223099 1233 837 396 0 0 wait 516 517 1231 1231 0 0 0 wait 2322 2323 1216 1216 0 0 0 wait 3870 3871 1215 1215 0 0 0 wait 1548 1549 1213 1213 0 0 0 wait 3612 3613 1209 1209 0 0 0 wait 4386 4387 1208 1208 0 0 0 wait 2580 2581 1197 1197 0 0 0 [snip - big chunk of long report ommitted] /usr/sbin/entstat.phxent 76924 250719 1 1 0 0 0 /usr/sbin/entstat.phxent 76926 250721 1 1 0 0 0 /usr/sbin/entstat.gxent 76928 250723 1 1 0 0 0 /usr/sbin/entstat.gxent 76930 250725 1 1 0 0 0 /usr/sbin/entstat.gxent 66452 262251 1 1 0 0 0 /usr/sbin/entstat.gxent 66454 262253 1 1 0 0 0 /usr/sbin/entstat.gxent 66456 262255 1 1 0 0 0 /usr/sbin/entstat.gxent 66460 262259 1 1 0 0 0 uname 111540 144913 1 1 0 0 0 netstat 44478 145855 1 1 0 0 0 /bin/stty 41944 197711 1 1 0 0 0 execerror 13234 248433 1 1 0 0 0 ======= === === ===== ====== ==== ====== ===== Total 19771 19286 408 77 0 Process FREQ Total Kernel User Shared Other ======= === ===== ====== ==== ====== ===== wait 16 17738 17738 0 0 0 prime 1 1233 837 396 0 0 netperfbin/netperf.AIX 1 568 568 0 0 0 LoadL_negotiator 4 41 8 0 33 0 pspd 1 27 1 1 25 0 sh 14 23 20 1 2 0 mld 1 20 17 2 1 0 ps 2 19 10 0 9 0 expect 5 18 14 2 2 0 netmon.AIX 6 10 9 1 0 0 netstat 9 9 9 0 0 0 /usr/sbin/entstat.gxent 9 9 9 0 0 0 date 9 9 8 0 1 0 /usr/sbin/entstat 7 8 8 0 0 0 acctd 1 6 4 2 0 0 tprof 1 5 2 3 0 0 ping 2 4 2 0 2 0 sshd1 1 3 3 0 0 0 hats_nim 2 2 2 0 0 0 /usr/lpp/csd/bin/statvsd 2 2 2 0 0 0 PID.99578 1 2 1 0 1 0 |
Sample output |
AIX Monitor v2.1.9llnl 21apr2002: frost067.pacific.llnlThu Jul 18 23:51:44 2002 Uptime: 1 days, 11:00 Users: 1 of 51 active 51 remote 501:00 sleep time CPU: User 0.3% Sys 0.1% Wait 0.0% Idle 99.6% Refresh: 10.00 s 0% 25% 50% 75% 100% Runnable (Swap-in) processes 0.10 (0.00) load average: 0.24, 0.18, 0.23 FOR NON-COMMERCIAL USE ONLY see http://www.mesa.nl/monitor Memory Real Virtual Paging (4kB) Process events File/TTY-IO free 13388 MB N/A 0.0 pgfaults 220 pswitch 0 iget procs 1958 MB N/A 0.0 pgin 1237 syscall 8 namei files 1035 MB 0.0 pgout 48 read 0 dirblk total 16383 MB N/A 0.0 pgsin 21 write 13519 readch IO (kB/s) read write busy% 0.0 pgsout 0 fork 6787 writech hdisk0 0.0 0.0 0 0 exec 0 ttyrawch hdisk1 0.0 0.0 0 Client Server NFS/s 0 rcvint 0 ttycanch 0.5 0.0 calls 0 xmtint 26 ttyoutch 0.0 0.0 retry 0 mdmint 0.0 0.0 getattr MESA Consulting 0.0 0.0 lookup Netw read write kB/s Unix and Internet 0.0 0.0 read en0 0.2 0.1 Technology Specialists 0.0 0.0 write en1 1.4 1.7 http://www.mesa.nl 0.0 0.0 other en2 0.1 0.7 en3 0.0 0.0 en4 0.0 0.0 en5 0.0 0.0 en6 0.0 0.0 css0 7.0 12.9 css1 5.6 7.7 ml0 0.0 7.7 lo0 0.0 0.0 |
Sample output for monitor command with -smp option |
Load averages: 0.22, 0.17, 0.21 frost067.pacific.llnThu Jul 18 23:56:06 2002 Cpu states: 4.0% user 3.3% system 0.0% wait 92.7% idle For non-commercial Logged on: 52 users 2 active 52 remote 495:01 sleep time use only! Real memory: 1966.6M procs 1024.3M files 13392.8M free 16383.6M total Virtual memory: N/A used N/A free N/A total CPU USER KERN WAIT IDLE% PSW SYSCALL WRITE READ WRITEkb READkb #0 2 2 0 94 47 367 65 48 92.97 106.72 #1 1 2 0 96 37 290 14 49 2.26 41.24 #2 3 3 0 92 56 3827 20 83 43.87 224.47 #3 4 2 0 93 22 709 6 158 2.00 198.02 #4 4 2 0 92 42 481 42 58 126.00 173.23 #5 5 4 0 89 38 7159 47 105 50.89 346.71 #6 1 2 0 96 47 767 1 157 0.89 211.42 #7 1 4 0 93 48 895 4 66 2.90 173.56 #8 12 3 0 83 27 836 12 215 2.69 590.16 #9 3 3 0 93 61 957 56 65 52.87 165.28 #10 6 3 0 90 52 602 5 129 2.53 357.57 #11 3 4 0 92 35 3737 20 63 44.11 192.94 #12 3 1 0 95 50 307 98 51 134.06 93.54 #13 3 5 0 91 34 3659 95 37 94.71 29.31 #14 3 1 0 94 15 370 90 108 94.77 71.00 #15 3 3 0 93 41 7194 7 59 0.90 162.28 SUM 4 3 0 92 659 32166 588 1459 748.40 3137.45 PID USER PRI NICE SIZE RES STAT TIME CPU% COMMAND 774 root 255 21 12k 73M run 32:57:01 102.4/93.7 Kernel (wait) 2064 root 255 21 12k 73M run 32:52:13 100.8/93.5 Kernel (wait) 4128 root 255 21 12k 73M run 32:39:51 100.8/92.9 Kernel (wait) 3612 root 255 21 12k 73M run 32:35:47 100.8/92.7 Kernel (wait) 516 root 255 21 12k 73M run 33:15:45 99.2/94.6 Kernel (wait) 4386 root 255 21 12k 73M run 32:51:00 99.2/93.4 Kernel (wait) 2322 root 255 21 12k 73M run 32:49:28 99.2/93.3 Kernel (wait) 1290 root 255 21 12k 73M run 32:55:52 97.6/93.6 Kernel (wait) 1032 root 255 21 12k 73M run 32:53:48 97.6/93.5 Kernel (wait) 2838 root 255 21 12k 73M run 32:48:12 97.6/93.3 Kernel (wait) 1548 root 255 21 12k 73M run 32:56:02 97.6/93.7 Kernel (wait) 3354 root 255 21 12k 73M run 32:51:38 97.6/93.4 Kernel (wait) 3870 root 255 21 12k 73M run 32:40:09 96.0/92.9 Kernel (wait) 3096 root 255 21 12k 73M run 32:44:59 94.4/93.1 Kernel (wait) 1806 root 255 21 12k 73M run 31:54:16 94.4/90.7 Kernel (wait) 2580 root 255 21 12k 73M run 32:55:12 88.0/93.6 Kernel (wait) 53960 root 60 0 32M 32M run 1:17:39 12.8/ 3.7 LoadL_negotiator 74622 yetter 60 0 9884k 11M Frun 1:31 3.2/ 0.2 xemacs-21.1.14 52242 root 56 -4 11M 11M run 41:00 1.6/ 2.0 pspd 16268 root 40 -20 400k 420k run 24:21 0.0/ 1.2 mld 23240 root 60 0 771k 812k run 1:20 0.0/ 0.1 sh 50850 root 29 21 8019k 8136k run 7:16 0.0/ 0.3 acctd 12918 root 60 0 2983k 3116k run 0:30 0.0/ 0.0 snmpd 78200 blaise 61 0 5464k 5496k Frun 0:00 0.0/ 0.7 monitor115240 jdh |
Overview:
Example Displays and Reports:
Opening display of the entire call tree. Initial display is fully zoomed-out, demonstrating the main library clusters, calling arc summaries, and overall routine control flow. | |
Unclustered functions view. Same as previous display, but functions are not grouped within their library cluster. | |
Zoom-in view of main program. The smaller "Overview Window" allows easy positioning of desired view area. Zoomed-in area provides function details. | |
Collapsed library clusters. Zoomed-out view. Demonstrates how undesired information can be removed/summarized. |
Flat Profile Report. Similar to the same report produced by the gprof utility. Lists functions according to cpu usage. | |
Source Code Report showing accumulated ticks for code statement lines. | |
Call Graph Profile. Similar to the same report produced by by the gprof utility. | |
Library Statistics. Shows execution statistics for libraries called |
Using xprofiler:
Note: when you compile and link separately, you must use the -pg option with both the compile and link commands.
xprofiler | Starts without a file loaded. Must load file by using xprofiler's File pull-down menu. |
xprofiler myprog gmon.out | Loads the serial program with it's stat file |
xprofiler myprog gmon.out.N | Loads parallel program with selected stat file |
xprofiler myprog gmon.out.* | Loads parallel program with combined/merged stat files |
Note that there are also several command line flags available to define certain xprofiler characteristics and behaviors.
|
Important note: it is often necessary to uncluster functions and zoom-in to get to important detailed information. It is also usually useful to collapse/hide library information that isn't needed (like system libs).
Documentation:
IBM Parallel Environment for AIX: Operation and Use Volume 2 Tools Reference. Complete product documentation. |
mpcc_r -g -o myprog myprog.c -L/usr/local/tools/mpiP/lib -lmpiP -lbfd -liberty -lintl -lm |
Environment Information Section |
@ mpiP @ Command : sphot @ Version : 0.9 @ Build date : Mar 8 2001, 16:22:46 @ Start time : 2001 04 11 16:04:23 @ Stop time : 2001 04 11 16:04:51 @ Number of tasks : 4 @ Collector Rank : 0 @ Collector PID : 30088 @ Event Buffer Size : 100000 @ Final Trace Dir : . @ Local Trace Dir : /usr/tmp @ Task Map : 0 blue333.pacific.llnl.gov 0 @ Task Map : 1 blue334.pacific.llnl.gov 0 @ Task Map : 2 blue335.pacific.llnl.gov 0 @ Task Map : 3 blue336.pacific.llnl.gov 0 |
MPI Time Per Task |
------------------------------------------------------------------------------- @--- MPI Time (seconds) ------------------------------------------------------- ------------------------------------------------------------------------------- Task AppTime MPITime MPI% 0 27.9 7.18 25.73 1 27.9 7.5 26.89 2 27.9 7.78 27.90 3 27.9 7.73 27.72 * 112 30.2 27.06 |
Callsites |
------------------------------------------------------------------------------- @--- Callsites: 38 ------------------------------------------------------------ ------------------------------------------------------------------------------- ID MPICall ParentFunction Filename Line PC 1 Barrier copyglob copyglob.f 65 10000b9c 2 Barrier copypriv@OL@1 copypriv.f 195 10001cd4 3 Barrier copypriv@OL@2 copypriv.f 237 1000213c 4 Barrier copypriv@OL@3 copypriv.f 279 10002624 5 Barrier copypriv@OL@4 copypriv.f 324 10002b04 6 Barrier sphot sphot.f 269 10008f2c 7 Bcast rdopac rdopac.f 49 10008638 8 Comm_rank copyglob copyglob.f 13 100003a8 9 Comm_rank copypriv copypriv.f 75 10000c38 10 Comm_rank genxsec genxsec.f 37 1000503c 11 Comm_rank rdinput rdinput.f 17 100071d4 12 Comm_rank rdopac rdopac.f 29 1000806c 13 Comm_rank sphot sphot.f 67 10008a04 14 Comm_size copyglob copyglob.f 12 10000390 15 Comm_size copypriv copypriv.f 76 10000c50 16 Comm_size sphot sphot.f 68 10008a1c 17 Finalize sphot sphot.f 279 10008f60 18 Init sphot sphot.f 66 100089e8 19 Irecv copyglob copyglob.f 47 10000a70 20 Irecv copypriv@OL@1 copypriv.f 162 1000184c 21 Irecv copypriv@OL@2 copypriv.f 205 10001db4 22 Irecv copypriv@OL@3 copypriv.f 254 1000221c 23 Irecv copypriv@OL@4 copypriv.f 297 10002704 24 Irecv sphot@OL@1@OL@2 sphot.f 201 1000974c 25 Reduce sphot sphot.f 258 10008eac 26 Reduce sphot sphot.f 262 10008eec 27 Send copyglob copyglob.f 60 10000b78 28 Send copypriv@OL@1 copypriv.f 191 10001cb8 29 Send copypriv@OL@2 copypriv.f 233 10002120 30 Send copypriv@OL@3 copypriv.f 275 10002608 31 Send copypriv@OL@4 copypriv.f 320 10002ae8 32 Send sphot@OL@1@OL@3 sphot.f 234 100099d8 33 Waitall copyglob copyglob.f 51 10000ac4 34 Waitall copypriv@OL@1 copypriv.f 168 10001928 35 Waitall copypriv@OL@2 copypriv.f 212 10001e94 36 Waitall copypriv@OL@3 copypriv.f 260 100022fc 37 Waitall copypriv@OL@4 copypriv.f 304 100027e4 38 Waitall sphot@OL@1@OL@2 sphot.f 206 100097b4 |
Aggregate Times of Top 20 Callsites |
------------------------------------------------------------------------------- @--- Aggregate Time (top twenty, descending, milliseconds) -------------------- ------------------------------------------------------------------------------- Call Site Time App% MPI% Bcast 7 1.54e+04 13.79 50.95 Barrier 1 1.42e+04 12.73 47.03 Barrier 2 563 0.50 1.87 Waitall 34 25.7 0.02 0.09 Reduce 25 7.4 0.01 0.02 Barrier 5 2.54 0.00 0.01 Barrier 6 1.55 0.00 0.01 Barrier 4 1.44 0.00 0.00 Comm_rank 13 1.22 0.00 0.00 Barrier 3 1.01 0.00 0.00 Comm_rank 9 0.967 0.00 0.00 Send 27 0.755 0.00 0.00 Send 31 0.694 0.00 0.00 Waitall 37 0.42 0.00 0.00 Send 28 0.336 0.00 0.00 Waitall 35 0.21 0.00 0.00 Waitall 36 0.202 0.00 0.00 Waitall 38 0.2 0.00 0.00 Irecv 19 0.2 0.00 0.00 Reduce 26 0.185 0.00 0.00 Waitall 33 0.161 0.00 0.00 Comm_size 15 0.132 0.00 0.00 |
Callsite Statistics |
------------------------------------------------------------------------------- @--- Callsite statistics (all, milliseconds): 102 ----------------------------- ------------------------------------------------------------------------------- Name Site Rank Count Max Mean Min App% MPI% Barrier 1 0 1 0.087 0.087 0.087 0.00 0.00 Barrier 1 1 1 12.7 12.7 12.7 0.05 0.17 Barrier 1 2 1 7.09e+03 7.09e+03 7.09e+03 25.44 91.17 Barrier 1 3 1 7.09e+03 7.09e+03 7.09e+03 25.44 91.75 Barrier 1 * 4 7.09e+03 3.55e+03 0.087 12.73 47.03 Barrier 2 0 1 0.12 0.12 0.12 0.00 0.00 Barrier 2 1 1 0.29 0.29 0.29 0.00 0.00 Barrier 2 2 1 307 307 307 1.10 3.95 Barrier 2 3 1 255 255 255 0.92 3.31 Barrier 2 * 4 307 141 0.12 0.50 1.87 ..... [snip]..... Send 31 1 1 0.169 0.169 0.169 0.00 0.00 Send 31 2 1 0.341 0.341 0.341 0.00 0.00 Send 31 3 1 0.184 0.184 0.184 0.00 0.00 Send 31 * 3 0.341 0.231 0.169 0.00 0.00 Send 32 1 1 0.027 0.027 0.027 0.00 0.00 Send 32 2 1 0.042 0.042 0.042 0.00 0.00 Send 32 3 1 0.043 0.043 0.043 0.00 0.00 Send 32 * 3 0.043 0.0373 0.027 0.00 0.00 Waitall 33 0 1 0.161 0.161 0.161 0.00 0.00 Waitall 33 * 1 0.161 0.161 0.161 0.00 0.00 Waitall 34 0 1 25.7 25.7 25.7 0.09 0.36 Waitall 34 * 1 25.7 25.7 25.7 0.02 0.09 Waitall 35 0 1 0.21 0.21 0.21 0.00 0.00 Waitall 35 * 1 0.21 0.21 0.21 0.00 0.00 Waitall 36 0 1 0.202 0.202 0.202 0.00 0.00 Waitall 36 * 1 0.202 0.202 0.202 0.00 0.00 Waitall 37 0 1 0.42 0.42 0.42 0.00 0.01 Waitall 37 * 1 0.42 0.42 0.42 0.00 0.00 Waitall 38 0 1 0.2 0.2 0.2 0.00 0.00 Waitall 38 * 1 0.2 0.2 0.2 0.00 0.00 |
Performance Analysis Tools |
Event Set 1 | Event Set 2 | Event Set 3 | Event Set 4 |
---|---|---|---|
Cycles | Cycles | Cycles | Cycles |
Inst. Completed | Inst. Completed | Inst. Completed | TLB Misses |
TLB Misses | TLB Misses | Inst. Cache Misses | Loads Completed |
Stores Completed | Stores Dispatched | FXU0 Operations | Stores Completed |
Loads Completed | L1 Store Misses | FXU1 Operations | L2 Load Misses |
FPU0 Operations | Loads Dispatched | FXU2 Operations | L2 Store Misses |
FPU1 Operations | L1 Load Misses | FPU0 Operations | Branches |
FMAs Executed | Load/Store Unit Idle | FPU1 Operations | Branches Mispredicted |
Using hpmcount:
hpmcount [-h] [-o <file name>] [-s <set>] [-e ev[,ev]*] programwhere:
-h
|
Prints help message |
-o <filename> |
Sets the output file name. For parallel jobs, each process gets a unique output file. |
-s <set> |
Defines one of four default event sets to measure. Event set 1 is the default. |
-e ev[,ev]* |
Manually defines a list of hardware counters to monitor. (Only for the truly adventurous.) |
Sample hpmcount output |
% hpmcount a.out -s 1 Running pipe()... s= 80200000000.0000000 Running unroll()... s= 51840000160.0000000 Running strength()... s= 84147.0984807570785 Running block()... c(N,N) = 67239936.0000000000 adding counter 5 event 12 Cycles adding counter 0 event 1 Instructions completed adding counter 7 event 0 TLB misses adding counter 2 event 9 Stores completed adding counter 3 event 5 Loads completed adding counter 4 event 5 FPU 0 instructions adding counter 1 event 35 FPU 1 instructions adding counter 6 event 9 FMAs executed hpmcount (V 2.3.1) summary Total execution time (wall clock time): 85.366748 seconds ######## Resource Usage Statistics ######## Total amount of time in user mode : 75.190000 seconds Total amount of time in system mode : 0.090000 seconds Maximum resident set size : 7648 Kbytes Average shared memory use in text segment : 120432 Kbytes*sec Average unshared memory use in data segment : 26017908 Kbytes*sec Number of page faults without I/O activity : 1912 Number of page faults with I/O activity : 3 Number of times process was swapped out : 0 Number of times file system performed INPUT : 0 Number of times file system performed OUTPUT : 0 Number of IPC messages sent : 0 Number of IPC messages received : 0 Number of signals delivered : 0 Number of voluntary context switches : 47 Number of involuntary context switches : 1321 ####### End of Resource Statistics ######## PM_CYC (Cycles) : 27845219686 PM_INST_CMPL (Instructions completed) : 20517718852 PM_TLB_MISS (TLB misses) : 138272209 PM_ST_CMPL (Stores completed) : 1722228911 PM_LD_CMPL (Loads completed) : 6770324612 PM_FPU0_CMPL (FPU 0 instructions) : 1485346931 PM_FPU1_CMPL (FPU 1 instructions) : 200209450 PM_EXEC_FMA (FMAs executed) : 1259225176 Utilization rate : 86.968 % Avg number of loads per TLB miss : 48.964 Load and store operations : 8492.554 M Instructions per load/store : 2.416 MIPS : 240.348 Instructions per cycle : 0.737 HW Float points instructions per Cycle : 0.061 Floating point instructions + FMAs : 2944.782 M Float point instructions + FMA rate : 34.496 Mflip/s FMA percentage : 85.522 % Computation intensity : 0.347 |
perfhpm[taskID].[pid]
which contains a text version of the
performance data
hpm[taskID]_[progName]_[pid].viz
which contains data for
the hpmviz program.
For example, a 4 task run of an executable called "swim" produces"
-rw------- 1 blaise blaise 31840 Jul 18 18:38 hpm0000_swim_53912.viz -rw------- 1 blaise blaise 31832 Jul 18 18:38 hpm0001_swim_53042.viz -rw------- 1 blaise blaise 31840 Jul 18 18:38 hpm0002_swim_51346.viz -rw------- 1 blaise blaise 31840 Jul 18 18:38 hpm0003_swim_56720.viz -rw------- 1 blaise blaise 52662 Jul 18 18:38 perfhpm0000.53912 -rw------- 1 blaise blaise 52633 Jul 18 18:38 perfhpm0001.53042 -rw------- 1 blaise blaise 52684 Jul 18 18:38 perfhpm0002.51346 -rw------- 1 blaise blaise 52673 Jul 18 18:38 perfhpm0003.56720
Example libhpm report |
hpmviz hpm0003_swim_56720.viz
pct - graphical mode.
Example here
pct -c - command line
pct -c -s - command line reading commands from a script file
pvt - starts the GUI
pvt filename(s) - starts GUI with one or more input files.
Example here
pvt -c - starts the command line interface
pvt -c filename(s) - starts command line interface
with one or more input files
IBM Parallel Environment for AIX: Operation and Use Volume 2 Tools Reference. Complete product documentation. |
Overview:
Example Displays:
Performance Toolbox monitor view 1 | |
Performance Toolbox monitor view 2: | |
Performance Toolbox 3D monitor view 1: multiple hosts | |
Performance Toolbox 3D monitor view 2: single host |
Using DPCL:
Documentation:
mpi_trace:
MPIMap:
And More...
This completes the tutorial.
Please complete the online evaluation form - unless you are doing the exercise, in which case please complete it at the end of the exercise. |
Where would you like to go now?
References and More Information |