The information presented herein is a series of simple MPI performance measurements on a selection of parallel systems in the Open Computing Facility at Lawrence Livermore National Laboratory. Additional earlier MPI performance measurements (Simple MPI Performance Measurements I and Simple MPI Performance Measurements II) are also available.
Three benchmarks are used to make these MPI performance measurements: LBW, ATA, and STS.
The LBW benchmark attempts to measure the point-to-point message passing latency and bandwidth. The test uses two MPI processes that repeatedly exchange messages. The time to send/receive a small message is a measure of the latency in the message passing system. Exchanging large messages is used to determine the bandwidth of the message passing system. The size of the message is user-specifiable, so the test can be used to profile the message passing bandwidth as a function of message size.
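The page does not spell out how the timed exchange loop (listed in the Algorithms section) is reduced to the reported latency and bandwidth figures; the short Fortran sketch below shows one plausible reduction. The variable names and sample values are illustrative assumptions, not taken from the benchmark source.

! Illustrative sketch: reducing an LBW-style round-trip timing to
! per-message latency and bandwidth.  The sample values are assumed.
program lbw_reduce
   implicit none
   integer :: num_times, buff_size
   double precision :: t0, t1, per_msg, latency_usec, bw_mbps

   num_times = 1000          ! number of round trips (-n)
   buff_size = 1048576       ! message size in bytes (-b)
   t0 = 0.0d0                ! stand-ins for the MPI_Wtime() readings
   t1 = 2.1d0

   ! t1 - t0 covers num_times round trips, i.e. 2*num_times messages
   per_msg      = (t1 - t0) / (2.0d0 * dble(num_times))
   latency_usec = per_msg * 1.0d6
   bw_mbps      = dble(buff_size) / per_msg / 1.0d6

   print *, 'per-message time (usec) =', latency_usec
   print *, 'bandwidth (MB/s)        =', bw_mbps
end program lbw_reduce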
The LBW command line is:
lbw -n num_times [-B|-L] [-b buff_size] [-s sync/async] [-h] [-a]
where
The ATA benchmark performs a sequence of calls to the MPI_Alltoall() library routine. The MPI_Alltoall() routine exchanges a data buffer with all the other MPI processes. This routine can generate considerable message traffic and is meant to model an MPI application that is message-passing intensive. Both bandwidth and latency test types are supported, with a user-specifiable buffer size for either.
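For reference, the sketch below is a minimal, self-contained example of the kind of repeated MPI_Alltoall() exchange the ATA test exercises. The buffer size, data type, and repetition count are illustrative assumptions and are not the benchmark's actual settings.

! Minimal MPI_Alltoall() exchange in the spirit of the ATA benchmark.
! Buffer size, data type, and repetition count are assumed values.
program ata_sketch
   use mpi
   implicit none
   integer, parameter :: nwords = 1024       ! words sent to each rank
   integer :: ierr, nprocs, my_rank, i, num_times
   double precision :: t0, t1
   double precision, allocatable :: out_buf(:), in_buf(:)

   call MPI_Init(ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)

   allocate(out_buf(nwords*nprocs), in_buf(nwords*nprocs))
   out_buf = dble(my_rank)
   num_times = 100

   t0 = MPI_Wtime()
   do i = 1, num_times
      ! each rank sends nwords doubles to, and receives nwords
      ! doubles from, every rank (itself included)
      call MPI_Alltoall(out_buf, nwords, MPI_DOUBLE_PRECISION, &
                        in_buf,  nwords, MPI_DOUBLE_PRECISION, &
                        MPI_COMM_WORLD, ierr)
   end do
   t1 = MPI_Wtime()

   if (my_rank == 0) print *, 'time per MPI_Alltoall (s) =', (t1 - t0) / num_times
   call MPI_Finalize(ierr)
end program ata_sketch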
The ATA command line is:
ata -n num_times [-B|-L] [-b buff_size] [-h] [-a]
where
The STS benchmark emulates the message passing sequences that occur in a number of large scientific and engineering applications that rely on a domain decomposition approach to parallel execution.
Each MPI process sends and receives a set of randomly sized messages. The selection of message passing pairs is made randomly. The total number of source-destination pairs is determined from the product of the total number of processes and the "average" number of messages handled by each process; the latter value is user-specifiable via a command-line option. The basic idea is to set up a relatively sparse collection of message passing pairs, as compared to the full set of communicating pairs used in the ATA test.
Provision is also made to support a double-sided style of message passing where the list of source-destination pairs is doubled in length by including pairs constructed by reversing the source and destination process ranks of the original (single-sided) list. Each such reversed pair is assigned a message length that is randomly selected, i.e., (usually) different from the message size of the original pair.
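As an illustration of the pair-list construction described above, the sketch below fills the iface_list/iface_cnt arrays (the layout used in the Algorithms section) with randomly chosen source-destination pairs and message sizes, and optionally appends the reversed pairs for the double-sided style. The routine name, the message-size bound, and the treatment of self or duplicate pairs are assumptions rather than the benchmark's actual code.

! Illustrative construction of a sparse source-destination pair list
! in the style of the STS benchmark.  The size bound MAX_WORDS and the
! routine name build_pairs are assumptions; self pairs and duplicates
! are not filtered here, although the real benchmark presumably
! handles them.
subroutine build_pairs(nprocs, num_msgs, double_sided, &
                       iface_list, iface_cnt, num_ifaces)
   implicit none
   integer, parameter :: SRC = 1, DST = 2, MAX_WORDS = 65536
   integer, intent(in)  :: nprocs, num_msgs
   logical, intent(in)  :: double_sided
   integer, intent(out) :: iface_list(2,*), iface_cnt(*), num_ifaces
   integer :: inx, nsingle
   real :: r(3)

   ! total pairs = (number of processes) * (average messages per process)
   nsingle = nprocs * num_msgs
   do inx = 1, nsingle
      call random_number(r)
      iface_list(SRC, inx) = int(r(1) * nprocs)        ! ranks 0 .. nprocs-1
      iface_list(DST, inx) = int(r(2) * nprocs)
      iface_cnt(inx)       = 1 + int(r(3) * MAX_WORDS) ! random message length
   end do
   num_ifaces = nsingle

   if (double_sided) then
      ! double-sided style: append each pair reversed, with its own
      ! randomly selected message length
      do inx = 1, nsingle
         call random_number(r)
         iface_list(SRC, nsingle + inx) = iface_list(DST, inx)
         iface_list(DST, nsingle + inx) = iface_list(SRC, inx)
         iface_cnt(nsingle + inx)       = 1 + int(r(1) * MAX_WORDS)
      end do
      num_ifaces = 2 * nsingle
   end if
end subroutine build_pairs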
The STS command line is:
sts -n num_times -m num_msgs [-S|-D] [-s sync|async] [-r rand_seed] [-v] [-a] [-h]
where
MPI benchmark results are provided for the following systems: MCR, ALC, Thunder, Frost, BG/L, uP, Purple, and Gauss.
See the Algorithms section for descriptions of the actual algorithms used in the benchmarks. See the Summary section for a comparison of the MPI performance of all the machines tested.
The MCR system is a tightly coupled cluster for use by the Multiprogrammatic and Institutional Computing (M&IC) community. MCR has 1,152 nodes, each with two 2.4-GHz Pentium 4 Xeon processors and 4 GB of memory. MCR runs the LLNL CHAOS software environment.
Because each node has two processors, the LBW was run for both intranode and internode cases. Otherwise, the node/processor selections use both processors on each node.
The test was run in November 2004. MCR test conditions were 2.4.21-p4smp-75chaos GNU/Linux operating system and Intel ifort Fortran compiler for 32-bit applications, Version 8.0.
[Figure: LBW bandwidth versus buffer size for MCR.]
The ASC Linux Cluster (ALC) system provides computing cycles for ASC Alliance users and unclassified ASC code development. The system contains 960 nodes, each with dual 2.4 GHz Xeon (Prestonia) processors and 4 GB of memory. ALC runs the LLNL CHAOS software environment.
Because each node has two processors, the LBW was run for both intranode and internode cases. Otherwise, the node/processor selections use both processors on each node.
The test was run in November 2004. ALC test conditions were 924 nodes, 2.4.21-p4smp-75chaos GNU/Linux operating system, and Intel ifort Fortran compiler for 32-bit applications, Version 8.0.
[Figure: LBW bandwidth versus buffer size for ALC.]
The Thunder system provides computing cycles for the Multiprogrammatic and Institutional Computing (M&IC) community. The system is a Linux cluster containing 1024 nodes, each with four 1.4-GHz Itanium2 (Tiger4) processors and 8 GB of memory. Thunder runs the LLNL CHAOS software environment.
Because each node has four processors, the LBW was run for both intranode and internode cases. Otherwise, the node/processor selections use all processors on each node.
The test was run in March 2005. Thunder test conditions were 1024 nodes, 2.4.21-ia64-79.1chaos GNU/Linux operating system, and Intel ifort Fortran compiler for Itanium-based applications, Version 8.1.
[Figure: LBW bandwidth versus buffer size for Thunder.]
While setting up the benchmark jobs used to measure the Thunder data presented above, it was noticed that the measured MPI performance appeared to depend non-trivially on how nodes were loaded with MPI processes. To better assess this apparent dependence, a set of jobs was run that kept the total number of MPI processes constant but varied the placement of those processes across nodes. The following two tables show the results for the 16 MPI process case with different numbers of processes per node.
[Tables: 16 MPI processes with varying numbers of processes per node.]
The Frost system is an IBM SP cluster with 68 nodes. Each node consists of 16 IBM Power3 CPUs and 16 GB of shared memory. Frost provides computing resources for the Advanced Simulation and Computing (ASC) program.
Because each node has 16 processors, the LBW benchmark was run for both intranode and internode cases. Otherwise, the node/processor selections use all processors on each node. For example, the 32 process run of the STS benchmark used two (16 processor) nodes.
The test was run in March 2005. Frost test conditions were 68 nodes, IBM AIX 5.2 operating system, and IBM XLF Fortran compiler, Version 8.01.001.007.
[Figure: LBW bandwidth versus buffer size for Frost.]
While setting up the benchmark jobs used to measure the Frost data presented above, it was noticed that the measured MPI performance appeared to depend non-trivially on how nodes were loaded with MPI processes. To better assess this apparent dependence, a set of jobs was run that kept the total number of MPI processes constant but varied the placement of those processes across nodes. The following two tables show the results for the 16 MPI process case with different numbers of processes per node.
[Tables: 16 MPI processes with varying numbers of processes per node.]
The IBM Bluegene/L (BG/L) system has 64 nodes, each with 512 IBM 700-MHz PPC 440 processors having 512 MB of DDR memory. The system architecture is described in the Bluegene/L Web pages; the system is a major computing resource for the Advanced Simulation and Computing (ASC) Program.
The test was run in July 2005. BGL test conditions were 64 nodes, Linux bgl1 2.6.5-7.155-pseries64-3llnl operating system, IBM XL Fortran Advanced Edition V9.1 for Linux. All runs were made in co-processor mode so that the second processor was not used by the benchmark.
[Figure: LBW bandwidth versus buffer size for BG/L.]
The BGL STS data are not available because many (but not all) of the scaling runs hit some sort of resource limit:
RVZ: cannot allocate unexpected buffer
BE_MPI (Info) : IO - Listening thread terminated
This occurred for process counts as low as 16, and the behavior would change merely by modifying the random number seed and thus the message passing buffer sizes used among the MPI processes. Also, this failure happened only for the single-sided communication style. Unfortunately, there was insufficient time to delve further into this problem before the BG/L system entered its last integration cycle.
The uP system is an unclassified portion of the full Purple system consisting of 109 nodes, each with eight IBM 1.9-GHz Power5 processors and 32 GB of shared memory. uP and Purple provide computing resources for the Advanced Simulation and Computing (ASC) program.
Because each node has 8 processors, the LBW benchmark was run for both intranode and internode cases. Otherwise, the node/processor selections were done so as to use all of the processors on each node. For example, the 32-process run of the STS benchmark used 4 (8 processor) nodes.
The test was run in July 2005. uP test conditions were 108 nodes, IBM AIX 5.2.0.0 operating system, IBM XLF Fortran compiler, version 8.01.001.007.
[Figure: LBW bandwidth versus buffer size for uP.]
The CPU interference data for the 8 process runs for both the ATA and STS benchmarks are presented below.
[Tables: ATA and STS CPU interference data for the 8-process runs.]
The Purple system contains 1353 SMP nodes, with each node containing eight Power5 1.9-GHz CPUs and 32 GB of shared memory. Purple provides computing resources for the ASC Program.
Because each node has 8 processors, the LBW benchmark was run for both intranode and internode cases. Otherwise, the node/processor selections were done so as to use all of the processors on each node. For example, the 32-process run of the STS benchmark used four (8 processor) nodes.
Note that, because of necessary limitations in test time, the usual five-fold datasets were reduced to one or two repeated runs, and not all possible process counts were explored.
The test was run in March 2006. Purple test conditions were 1353 nodes, IBM AIX 5.3 ML4 operating system, IBM XLF Fortran compiler, version 9.01.000.003.
[Figure: LBW bandwidth versus buffer size for Purple.]
The CPU interference data for the 8 process runs for both the ATA and STS benchmarks are presented below.
[Tables: ATA and STS CPU interference data for the 8-process runs.]
The Gauss system, the visualization engine that supports LLNL's BG/L system, contains 256 dual-processor AMD Opteron nodes, each with two 2.4-GHz Opteron 250 CPUs and 12 GB of shared memory, and a Voltaire InfiniBand interconnect.
Because each node has 2 processors, the LBW benchmark was run for both intranode and internode cases. Otherwise, the node/processor selections were done so as to use all of the processors on each node. For example, the 32-process run of the STS benchmark used 16 (2 processor) nodes.
The test was run in May 2006. Gauss test conditions were 256 nodes, Linux 2.6.9-38 Chaos x86_64 operating system, PathScale EKOPath Compiler Suite Version 2.1, POSIX thread model, and GNU gcc version 3.3.1 (PathScale 2.1 Driver).
[Figure: LBW bandwidth versus buffer size for Gauss.]
[Summary tables: MPI performance comparison for all of the machines tested.]
The basic message passing algorithm for each MPI benchmark test is listed below.
* LBW exchange loop, synchronous (blocking) style
      t0 = MPI_Wtime()
      do i = 1, num_times
         if (my_rank .eq. SRC_RANK) then
            call MPI_Send (out_buf, msg_size, ..., DEST_RANK, ...)
            call MPI_Recv (in_buf, msg_size, ..., DEST_RANK, ...)
         else
            call MPI_Recv (in_buf, msg_size, ..., SRC_RANK, ...)
            call MPI_Send (out_buf, msg_size, ..., SRC_RANK, ...)
         endif
      enddo
      t1 = MPI_Wtime()
* LBW exchange loop, asynchronous (nonblocking receive) style
      t0 = MPI_Wtime()
      do i = 1, num_times
         if (my_rank .eq. SRC_RANK) then
            call MPI_Irecv (in_buf, msg_size, ..., DEST_RANK, ...)
            call MPI_Send (out_buf, msg_size, ..., DEST_RANK, ...)
            call MPI_Wait (...)
         else
            call MPI_Irecv (in_buf, msg_size, ..., SRC_RANK, ...)
            call MPI_Send (out_buf, msg_size, ..., SRC_RANK, ...)
            call MPI_Wait (...)
         endif
      enddo
      t1 = MPI_Wtime()
* ATA exchange loop
      t1 = MPI_Wtime()
      do i = 1, num_times / 2
         call MPI_Alltoall (inbuf, ..., outbuf, ...)
         call MPI_Alltoall (outbuf, ..., inbuf, ...)
      end do
      t2 = MPI_Wtime()
* STS driver loop
      t1 = MPI_Wtime()
      do inx = 1, num_times
         call MPI_Sometosome (num_ifaces, iface_list, iface_cnt, ...)
      enddo
      t2 = MPI_Wtime()
The iface_list array contains num_ifaces source-destination pairs with the message lengths in the iface_cnt array. The two message passing styles in the MPI_Sometosome() routine are illustrated below.
* Issue irecvs for all of this rank's receives.
      do inx = 1, num_ifaces
         src_rank  = iface_list(SRC, inx)
         dest_rank = iface_list(DST, inx)
         count     = iface_cnt(inx)
         if (my_rank .eq. dest_rank) then
            call MPI_Irecv (recv_buffer, count, ..., src_rank, ...)
         endif
      enddo
* Issue sends for all of this rank's sends.
      do inx = 1, num_ifaces
         src_rank  = iface_list(SRC, inx)
         dest_rank = iface_list(DST, inx)
         count     = iface_cnt(inx)
         if (my_rank .eq. src_rank) then
            call MPI_Send (send_buffer, count, ..., dest_rank, ...)
         endif
      enddo
* Wait for all incoming messages.
      call MPI_Waitall (...)
* Issue send/recv pairs for each interface.
      do inx = 1, num_ifaces
         src_rank  = iface_list(SRC, inx)
         dest_rank = iface_list(DST, inx)
         count     = iface_cnt(inx)
         if (my_rank .eq. src_rank) then
            call MPI_Send (send_buffer, count, ..., dest_rank, ...)
         endif
         if (my_rank .eq. dest_rank) then
            call MPI_Recv (recv_buffer, count, ..., src_rank, ...)
         endif
      enddo
Last modified June 5, 2006
UCRL-WEB-218462