The information presented herein is a series of simple MPI performance measurements on a selection of parallel systems in the Open Computing Facility at Lawrence Livermore National Laboratory. Additional earlier MPI performance measurements (Simple MPI Performance Measurements I and Simple MPI Performance Measurements II) are also available.
Three benchmarks are used to make these MPI performance measurements: LBW, ATA, and STS.
The LBW benchmark attempts to measure the point-to-point message passing latency and bandwidth. The test uses two MPI processes that repeatedly exchange messages. The time to send/receive a small message is a measure of the latency in the message passing system. Exchanging large messages is used to determine the bandwidth of the message passing system. The size of the message is user-specifiable, so the test can be used to profile the message passing bandwidth as a function of message size.
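The page does not spell out how the timed exchange loop (listed in the Algorithms section) is reduced to the reported latency and bandwidth figures; the short Fortran sketch below shows one plausible reduction. The variable names and sample values are illustrative assumptions, not taken from the benchmark source.

! Illustrative sketch: reducing an LBW-style round-trip timing to
! per-message latency and bandwidth.  The sample values are assumed.
program lbw_reduce
   implicit none
   integer :: num_times, buff_size
   double precision :: t0, t1, per_msg, latency_usec, bw_mbps

   num_times = 1000          ! number of round trips (-n)
   buff_size = 1048576       ! message size in bytes (-b)
   t0 = 0.0d0                ! stand-ins for the MPI_Wtime() readings
   t1 = 2.1d0

   ! t1 - t0 covers num_times round trips, i.e. 2*num_times messages
   per_msg      = (t1 - t0) / (2.0d0 * dble(num_times))
   latency_usec = per_msg * 1.0d6
   bw_mbps      = dble(buff_size) / per_msg / 1.0d6

   print *, 'per-message time (usec) =', latency_usec
   print *, 'bandwidth (MB/s)        =', bw_mbps
end program lbw_reduce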
The LBW command line is:
lbw -n num_times [-B|-L] [-b buff_size] [-s sync/async] [-h] [-a]
where
The ATA benchmark performs a sequence of calls to the MPI_Alltoall() library routine. The MPI_Alltoall() routine exchanges a data buffer with all the other MPI processes. This routine can generate considerable message traffic and is meant to model an MPI application that is message-passing intensive. Both bandwidth and latency test types are supported, with a user-specifiable buffer size for either.
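For reference, the sketch below is a minimal, self-contained example of the kind of repeated MPI_Alltoall() exchange the ATA test exercises. The buffer size, data type, and repetition count are illustrative assumptions and are not the benchmark's actual settings.

! Minimal MPI_Alltoall() exchange in the spirit of the ATA benchmark.
! Buffer size, data type, and repetition count are assumed values.
program ata_sketch
   use mpi
   implicit none
   integer, parameter :: nwords = 1024       ! words sent to each rank
   integer :: ierr, nprocs, my_rank, i, num_times
   double precision :: t0, t1
   double precision, allocatable :: out_buf(:), in_buf(:)

   call MPI_Init(ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)

   allocate(out_buf(nwords*nprocs), in_buf(nwords*nprocs))
   out_buf = dble(my_rank)
   num_times = 100

   t0 = MPI_Wtime()
   do i = 1, num_times
      ! each rank sends nwords doubles to, and receives nwords
      ! doubles from, every rank (itself included)
      call MPI_Alltoall(out_buf, nwords, MPI_DOUBLE_PRECISION, &
                        in_buf,  nwords, MPI_DOUBLE_PRECISION, &
                        MPI_COMM_WORLD, ierr)
   end do
   t1 = MPI_Wtime()

   if (my_rank == 0) print *, 'time per MPI_Alltoall (s) =', (t1 - t0) / num_times
   call MPI_Finalize(ierr)
end program ata_sketch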
The ATA command line is:
ata -n num_times [-B|-L] [-b buff_size] [-h] [-a]
where
The STS benchmark emulates the message passing sequences that occur in a number of large scientific and engineering applications that rely on a domain decomposition approach to parallel execution.
Each MPI process sends and receives a set of randomly sized messages. The selection of message passing pairs is made randomly. The total number of source-destination pairs is determined from the product of the total number of processes and the "average" number of messages handled by each process; the latter value is user-specifiable via a command-line option. The basic idea is to set up a relatively sparse collection of message passing pairs, as compared to the full set of communicating pairs used in the ATA test.
Provision is also made to support a double-sided style of message passing where the list of source-destination pairs is doubled in length by including pairs constructed by reversing the source and destination process ranks of the original (single-sided) list. Each such reversed pair is assigned a message length that is randomly selected, i.e., (usually) different from the message size of the original pair.
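As an illustration of the pair-list construction described above, the sketch below fills the iface_list/iface_cnt arrays (the layout used in the Algorithms section) with randomly chosen source-destination pairs and message sizes, and optionally appends the reversed pairs for the double-sided style. The routine name, the message-size bound, and the treatment of self or duplicate pairs are assumptions rather than the benchmark's actual code.

! Illustrative construction of a sparse source-destination pair list
! in the style of the STS benchmark.  The size bound MAX_WORDS and the
! routine name build_pairs are assumptions; self pairs and duplicates
! are not filtered here, although the real benchmark presumably
! handles them.
subroutine build_pairs(nprocs, num_msgs, double_sided, &
                       iface_list, iface_cnt, num_ifaces)
   implicit none
   integer, parameter :: SRC = 1, DST = 2, MAX_WORDS = 65536
   integer, intent(in)  :: nprocs, num_msgs
   logical, intent(in)  :: double_sided
   integer, intent(out) :: iface_list(2,*), iface_cnt(*), num_ifaces
   integer :: inx, nsingle
   real :: r(3)

   ! total pairs = (number of processes) * (average messages per process)
   nsingle = nprocs * num_msgs
   do inx = 1, nsingle
      call random_number(r)
      iface_list(SRC, inx) = int(r(1) * nprocs)        ! ranks 0 .. nprocs-1
      iface_list(DST, inx) = int(r(2) * nprocs)
      iface_cnt(inx)       = 1 + int(r(3) * MAX_WORDS) ! random message length
   end do
   num_ifaces = nsingle

   if (double_sided) then
      ! double-sided style: append each pair reversed, with its own
      ! randomly selected message length
      do inx = 1, nsingle
         call random_number(r)
         iface_list(SRC, nsingle + inx) = iface_list(DST, inx)
         iface_list(DST, nsingle + inx) = iface_list(SRC, inx)
         iface_cnt(nsingle + inx)       = 1 + int(r(1) * MAX_WORDS)
      end do
      num_ifaces = 2 * nsingle
   end if
end subroutine build_pairs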
The STS command line is:
sts -n num_times -m num_msgs [-S|-D] [-s sync|async] [-r rand_seed] [-v] [-a] [-h]
where
MPI benchmark results are provided for the following systems: MCR, ALC, Thunder, Frost, BG/L, uP, Purple, and Gauss.
See the Algorithms section for descriptions of the actual algorithms used in the benchmarks. See the Summary section for a comparison of the MPI performance of all the machines tested.
The MCR system is a tightly coupled cluster for use by the Multiprogrammatic and Institutional Computing (M&IC) community. MCR has 1,152 nodes, each with two 2.4-GHz Pentium 4 Xeon processors and 4 GB of memory. MCR runs the LLNL CHAOS software environment.
Because each node has two processors, the LBW was run for both intranode and internode cases. Otherwise, the node/processor selections use both processors on each node.
The test was run in November 2004. MCR test conditions were 2.4.21-p4smp-75chaos GNU/Linux operating system and Intel ifort Fortran compiler for 32-bit applications, Version 8.0.
[Figure: LBW bandwidth versus buffer size for MCR.]
The ASC Linux Cluster (ALC) system provides computing cycles for ASC Alliance users and unclassified ASC code development. The system contains 960 nodes, each with dual 2.4 GHz Xeon (Prestonia) processors and 4 GB of memory. ALC runs the LLNL CHAOS software environment.
Because each node has two processors, the LBW was run for both intranode and internode cases. Otherwise, the node/processor selections use both processors on each node.
The test was run in November 2004. ALC test conditions were 924 nodes, 2.4.21-p4smp-75chaos GNU/Linux operating system, and Intel ifort Fortran compiler for 32-bit applications, Version 8.0.
[Figure: LBW bandwidth versus buffer size for ALC.]
The Thunder system provides computing cycles for the Multiprogrammatic and Institutional Computing (M&IC) community. The system is a Linux cluster containing 1024 nodes, each with four 1.4-GHz Itanium2 (Tiger4) processors and 8 GB of memory. Thunder runs the LLNL CHAOS software environment.
Because each node has four processors, the LBW was run for both intranode and internode cases. Otherwise, the node/processor selections use all processors on each node.
The test was run in March 2005. Thunder test conditions were 1024 nodes, 2.4.21-ia64-79.1chaos GNU/Linux operating system, and Intel ifort Fortran compiler for Itanium-based applications, Version 8.1.
[Figure: LBW bandwidth versus buffer size for Thunder.]
While setting up the benchmark jobs used to measure the Thunder data presented above, it was noticed that the measured MPI performance appeared to depend non-trivially on how nodes were loaded with MPI processes. To better assess this apparent dependence, a set of jobs was run that kept the total number of MPI processes constant but varied the placement of those processes across nodes. The following two tables show the results for the 16 MPI process case with different numbers of processes per node.
[Tables: 16 MPI processes with varying numbers of processes per node.]
The Frost system is an IBM SP cluster with 68 nodes. Each node consists of 16 IBM Power3 CPUs and 16 GB of shared memory. Frost provides computing resources for the Advanced Simulation and Computing (ASC) program.
Because each node has 16 processors, the LBW benchmark was run for both intranode and internode cases. Otherwise, the node/processor selections use all processors on each node. For example, the 32 process run of the STS benchmark used two (16 processor) nodes.
The test was run in March 2005. Frost test conditions were 68 nodes, IBM AIX 5.2 operating system, and IBM XLF Fortran compiler, Version 8.01.001.007.
[Figure: LBW bandwidth versus buffer size for Frost.]
While setting up the benchmark jobs used to measure the Frost data presented above, it was noticed that the measured MPI performance appeared to depend non-trivially on how nodes were loaded with MPI processes. To better assess this apparent dependence, a set of jobs was run that kept the total number of MPI processes constant but varied the placement of those processes across nodes. The following two tables show the results for the 16 MPI process case with different numbers of processes per node.
[Tables: 16 MPI processes with varying numbers of processes per node.]
The IBM Bluegene/L (BG/L) system has 64 nodes, each with 512 IBM 700-MHz PPC 440 processors having 512 MB of DDR memory. The system architecture is described in the Bluegene/L Web pages; the system is a major computing resource for the Advanced Simulation and Computing (ASC) Program.
The test was run in July 2005. BGL test conditions were 64 nodes, Linux bgl1 2.6.5-7.155-pseries64-3llnl operating system, IBM XL Fortran Advanced Edition V9.1 for Linux. All runs were made in co-processor mode so that the second processor was not used by the benchmark.
[Figure: LBW bandwidth versus buffer size for BG/L.]
The BGL STS data are not available because many (but not all) of the scaling runs hit some sort of resource limit:
RVZ: cannot allocate unexpected buffer
BE_MPI (Info) : IO - Listening thread terminated
This occurred for process counts as low as 16, and the behavior would change merely by modifying the random number seed and thus the message passing buffer sizes used among the MPI processes. Also, this failure happened only for the single-sided communication style. Unfortunately, there was insufficient time to delve further into this problem before the BG/L system entered its last integration cycle.
The uP system is an unclassified portion of the full Purple system consisting of 109 nodes, each with eight IBM 1.9-GHz Power5 processors and 32 GB of shared memory. uP and Purple provide computing resources for the Advanced Simulation and Computing (ASC) program.
Because each node has 8 processors, the LBW benchmark was run for both intranode and internode cases. Otherwise, the node/processor selections were done so as to use all of the processors on each node. For example, the 32-process run of the STS benchmark used 4 (8 processor) nodes.
The test was run in July 2005. uP test conditions were 108 nodes, IBM AIX 5.2.0.0 operating system, IBM XLF Fortran compiler, version 8.01.001.007.
[Figure: LBW bandwidth versus buffer size for uP.]
The CPU interference data for the 8 process runs for both the ATA and STS benchmarks are presented below.
[Tables: ATA and STS CPU interference data for the 8-process runs.]
The Purple system contains 1353 SMP nodes, with each node containing eight Power5 1.9-GHz CPUs and 32 GB of shared memory. Purple provides computing resources for the ASC Program.
Because each node has 8 processors, the LBW benchmark was run for both intranode and internode cases. Otherwise, the node/processor selections were done so as to use all of the processors on each node. For example, the 32-process run of the STS benchmark used four (8 processor) nodes.
Note that, because of necessary limitations in test time, the usual five-fold datasets were reduced to one or two repeated runs, and not all possible process counts were explored.
The test was run in March 2006. Purple test conditions were 1353 nodes, IBM AIX 5.3 ML4 operating system, IBM XLF Fortran compiler, version 9.01.000.003.
[Figure: LBW bandwidth versus buffer size for Purple.]
The CPU interference data for the 8 process runs for both the ATA and STS benchmarks are presented below.
[Tables: ATA and STS CPU interference data for the 8-process runs.]
The Gauss system, the visualization engine that supports LLNL's BG/L system, contains 256 dual-processor AMD Opteron nodes, each with two 2.4-GHz Opteron 250 CPUs and 12 GB of shared memory, and a Voltaire InfiniBand interconnect.
Because each node has 2 processors, the LBW benchmark was run for both intranode and internode cases. Otherwise, the node/processor selections were done so as to use all of the processors on each node. For example, the 32-process run of the STS benchmark used 16 (2 processor) nodes.
The test was run in May 2006. Gauss test conditions were 256 nodes, Linux 2.6.9-38 Chaos x86_64 operating system, PathScale EKOPath Compiler Suite Version 2.1, POSIX thread model, and GNU gcc version 3.3.1 (PathScale 2.1 Driver).
[Figure: LBW bandwidth versus buffer size for Gauss.]
[Summary tables: MPI performance comparison for all of the machines tested.]
The basic message passing algorithm for each MPI benchmark test is listed below.
* LBW exchange loop, synchronous (blocking) style
      t0 = MPI_Wtime()
      do i = 1, num_times
         if (my_rank .eq. SRC_RANK) then
            call MPI_Send (out_buf, msg_size, ..., DEST_RANK, ...)
            call MPI_Recv (in_buf, msg_size, ..., DEST_RANK, ...)
         else
            call MPI_Recv (in_buf, msg_size, ..., SRC_RANK, ...)
            call MPI_Send (out_buf, msg_size, ..., SRC_RANK, ...)
         endif
      enddo
      t1 = MPI_Wtime()
* LBW exchange loop, asynchronous (nonblocking receive) style
      t0 = MPI_Wtime()
      do i = 1, num_times
         if (my_rank .eq. SRC_RANK) then
            call MPI_Irecv (in_buf, msg_size, ..., DEST_RANK, ...)
            call MPI_Send (out_buf, msg_size, ..., DEST_RANK, ...)
            call MPI_Wait (...)
         else
            call MPI_Irecv (in_buf, msg_size, ..., SRC_RANK, ...)
            call MPI_Send (out_buf, msg_size, ..., SRC_RANK, ...)
            call MPI_Wait (...)
         endif
      enddo
      t1 = MPI_Wtime()
* ATA exchange loop
      t1 = MPI_Wtime()
      do i = 1, num_times / 2
         call MPI_Alltoall (inbuf, ..., outbuf, ...)
         call MPI_Alltoall (outbuf, ..., inbuf, ...)
      end do
      t2 = MPI_Wtime()
* STS driver loop
      t1 = MPI_Wtime()
      do inx = 1, num_times
         call MPI_Sometosome (num_ifaces, iface_list, iface_cnt, ...)
      enddo
      t2 = MPI_Wtime()
The iface_list array contains num_ifaces source-destination pairs with the message lengths in the iface_cnt array. The two message passing styles in the MPI_Sometosome() routine are illustrated below.
* Issue irecvs for all of this rank's receives.
      do inx = 1, num_ifaces
         src_rank  = iface_list(SRC, inx)
         dest_rank = iface_list(DST, inx)
         count     = iface_cnt(inx)
         if (my_rank .eq. dest_rank) then
            call MPI_Irecv (recv_buffer, count, ..., src_rank, ...)
         endif
      enddo
* Issue sends for all of this rank's sends.
      do inx = 1, num_ifaces
         src_rank  = iface_list(SRC, inx)
         dest_rank = iface_list(DST, inx)
         count     = iface_cnt(inx)
         if (my_rank .eq. src_rank) then
            call MPI_Send (send_buffer, count, ..., dest_rank, ...)
         endif
      enddo
* Wait for all incoming messages.
      call MPI_Waitall (...)
* Issue send/recv pairs for each interface.
      do inx = 1, num_ifaces
         src_rank  = iface_list(SRC, inx)
         dest_rank = iface_list(DST, inx)
         count     = iface_cnt(inx)
         if (my_rank .eq. src_rank) then
            call MPI_Send (send_buffer, count, ..., dest_rank, ...)
         endif
         if (my_rank .eq. dest_rank) then
            call MPI_Recv (recv_buffer, count, ..., src_rank, ...)
         endif
      enddo
Last modified June 5, 2006
UCRL-WEB-218462