
MPI Performance Measurements



Introduction

This page presents a series of simple MPI performance measurements made on a selection of parallel systems in the Open Computing Facility at Lawrence Livermore National Laboratory. Additional, earlier MPI performance measurements (Simple MPI Performance Measurements I and Simple MPI Performance Measurements II) are also available.

Three benchmarks are used to make these MPI performance measurements: LBW (latency and bandwidth), ATA (all-to-all), and STS (some-to-some). Each is described in the next section.


Benchmarks

LBW—Latency and Bandwidth Test

The LBW benchmark attempts to measure the point-to-point message passing latency and bandwidth. The test uses two MPI processes that repeatedly exchange messages. The time to send/receive a small message is a measure of the latency in the message passing system. Exchanging large messages is used to determine the bandwidth of the message passing system. The size of the message is user-specifiable, so the test can be used to profile the message passing bandwidth as a function of message size.

The LBW command line is:

lbw -n num_times [-B|-L] [-b buff_size] [-s sync/async] [-h] [-a]

where

-n num_times
number of times to repeat the send-recv message passing sequence.
-B|-L
select bandwidth (-B) or latency (-L) test; default is the bandwidth test.
-b buff_size
message buffer size in bytes.
-s sync/async
synchronous/asynchronous message passing style; sync is the default.
-h
print usage line.
-a
print out info line for each MPI process.
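
As a concrete, hedged illustration of the ping-pong pattern that LBW times (invoked, for example, as "lbw -n 1000 -B -b 1000000"; the option values here are only examples), the self-contained sketch below exchanges a fixed-size message between ranks 0 and 1 and derives latency and bandwidth from the round-trip time. The buffer size, repetition count, and normalizations (half a round trip for latency; two transfers per pass for bandwidth) are illustrative assumptions, not LBW's actual definitions.

    ! Hedged sketch of an LBW-style ping-pong; not the LBW source itself.
    program pingpong
       use mpi
       implicit none
       integer, parameter :: msg_size  = 1000000   ! bytes (analogue of -b buff_size)
       integer, parameter :: num_times = 100       ! analogue of -n num_times
       integer :: ierr, my_rank, i, status(MPI_STATUS_SIZE)
       character :: out_buf(msg_size), in_buf(msg_size)
       double precision :: t0, t1, elapsed

       call MPI_Init (ierr)
       call MPI_Comm_rank (MPI_COMM_WORLD, my_rank, ierr)
       out_buf = 'x'

       t0 = MPI_Wtime()
       do i = 1, num_times
          if (my_rank .eq. 0) then
             call MPI_Send (out_buf, msg_size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, ierr)
             call MPI_Recv (in_buf,  msg_size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, status, ierr)
          else if (my_rank .eq. 1) then
             call MPI_Recv (in_buf,  msg_size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, status, ierr)
             call MPI_Send (out_buf, msg_size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, ierr)
          endif
       enddo
       t1 = MPI_Wtime()
       elapsed = t1 - t0

       if (my_rank .eq. 0) then
          ! Assumed normalizations: latency ~ half of one round trip;
          ! bandwidth ~ bytes moved per process per second.
          print *, 'latency (us)    ~', 1.0d6 * elapsed / (2 * num_times)
          print *, 'bandwidth (B/s) ~', 2.0d0 * num_times * msg_size / elapsed
       endif
       call MPI_Finalize (ierr)
    end program pingpong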

ATA—All-To-All Test

The ATA benchmark performs a sequence of calls to the MPI_Alltoall() library routine, in which each MPI process exchanges a data buffer with every other MPI process. This routine can generate considerable message traffic and is meant to model a message-passing-intensive MPI application. Both bandwidth and latency test types are supported, with a user-specifiable buffer size for either.

The ATA command line is:

ata -n num_times [-B|-L] [-b buff_size] [-h] [-a]

where

-n num_times
number of times to repeat the ATA message passing sequence.
-B|-L
bandwidth (-B) or latency (-L) test selection; default is bandwidth test.
-b buff_size
buffer size in bytes.
-h
print usage line to <stdout>.
-a
print out info line for each MPI process.
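
For reference, a minimal, self-contained MPI_Alltoall() timing loop in the spirit of ATA is sketched below. The buffer size, repetition count, and the per-process bandwidth normalization are illustrative assumptions, not the benchmark's actual settings.

    ! Hedged sketch of an ATA-style MPI_Alltoall timing loop; not the ATA source itself.
    program alltoall_sketch
       use mpi
       implicit none
       integer, parameter :: buff_size = 100000    ! bytes sent to each other rank (analogue of -b)
       integer, parameter :: num_times = 10        ! analogue of -n
       integer :: ierr, nprocs, my_rank, i
       character, allocatable :: sendbuf(:), recvbuf(:)
       double precision :: t0, t1

       call MPI_Init (ierr)
       call MPI_Comm_size (MPI_COMM_WORLD, nprocs, ierr)
       call MPI_Comm_rank (MPI_COMM_WORLD, my_rank, ierr)

       allocate (sendbuf(buff_size * nprocs), recvbuf(buff_size * nprocs))
       sendbuf = 'x'

       t0 = MPI_Wtime()
       do i = 1, num_times
          call MPI_Alltoall (sendbuf, buff_size, MPI_BYTE, &
                             recvbuf, buff_size, MPI_BYTE, MPI_COMM_WORLD, ierr)
       enddo
       t1 = MPI_Wtime()

       if (my_rank .eq. 0) then
          ! Assumed metric: bytes each process sends to the other nprocs-1 ranks per second.
          print *, 'per-process bandwidth (B/s) ~', &
                   dble(num_times) * dble(buff_size) * (nprocs - 1) / (t1 - t0)
       endif
       call MPI_Finalize (ierr)
    end program alltoall_sketch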

STS—Some-To-Some Test

The STS benchmark emulates the message passing sequences that occur in a number of large scientific and engineering applications that rely on a domain decomposition approach to parallel execution.

Each MPI process sends and receives a set of randomly sized messages, and the message passing pairs themselves are selected randomly. The total number of source-destination pairs is the product of the total number of processes and the "average" number of messages handled by each process; the latter value is user-specifiable via a command-line option. The basic idea is to set up a relatively sparse collection of message passing pairs, in contrast to the full set of communicating pairs exercised by the ATA test.

Provision is also made for a double-sided style of message passing, in which the list of source-destination pairs is doubled in length by adding pairs formed by reversing the source and destination ranks of each original (single-sided) pair. Each reversed pair is assigned its own randomly selected message length, which is therefore (usually) different from that of the original pair.
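
To make the pairing scheme concrete, the hedged sketch below builds such a source-destination list; every rank would construct (or be handed) the same list. The routine and array names, the random-length range, and the use of the Fortran intrinsic random_number() are illustrative assumptions, not the actual STS implementation.

    ! Hedged sketch of building an STS-style source-destination pair list (illustrative only).
    subroutine build_pairs (nprocs, num_msgs, double_sided, npairs, src, dst, cnt)
       implicit none
       integer, intent(in)  :: nprocs, num_msgs
       logical, intent(in)  :: double_sided
       integer, intent(out) :: npairs
       integer, intent(out) :: src(*), dst(*), cnt(*)
       integer, parameter   :: max_bytes = 1000000   ! illustrative upper bound on message size
       integer :: i, nsingle
       real    :: r(3)

       nsingle = nprocs * num_msgs      ! total pairs = processes * "average" messages per process
       do i = 1, nsingle
          call random_number (r)
          src(i) = int(r(1) * nprocs)           ! random source rank in 0..nprocs-1
          dst(i) = int(r(2) * nprocs)           ! random destination rank
          cnt(i) = 1 + int(r(3) * max_bytes)    ! random message length in bytes
          ! (the real benchmark presumably rejects src(i) == dst(i); skipped here for brevity)
       enddo
       npairs = nsingle

       if (double_sided) then
          ! Double-sided: append each pair reversed, with a freshly chosen random length.
          do i = 1, nsingle
             call random_number (r)
             src(nsingle + i) = dst(i)
             dst(nsingle + i) = src(i)
             cnt(nsingle + i) = 1 + int(r(1) * max_bytes)
          enddo
          npairs = 2 * nsingle
       endif
    end subroutine build_pairs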

The STS command line is:

sts -n num_times -m num_msgs [-S|-D] [-s sync|async] [-r rand_seed] [-v] [-a] [-h]

where

-n num_times
number of times to repeat the SomeToSome operation.
-m num_msgs
average number of messages sent by each MPI process.
-S|-D
single (-S) or double (-D) sided message passing.
-s sync/async
synchronous/asynchronous message passing style; async is the default.
-r rand_seed
optional random number seed (4 byte integer).
-v
verbose flag for listing of the intermediate interconnection map and buffer sizes.
-h
print usage line.
-a
print out info line for each MPI process.



General Test Conditions



Results

MPI benchmark results are provided for the following systems: MCR, ALC, Thunder, Frost, BlueGene/L, uP, Purple, and Gauss.

See the Algorithms section for descriptions of the actual algorithms used in the benchmarks. See the Summary section for a comparison of the MPI performance of all the machines tested.


MCR

The MCR system is a tightly coupled cluster for use by the Multiprogrammatic and Institutional Computing (M&IC) community. MCR has 1,152 nodes, each with two 2.4-GHz Pentium 4 Xeon processors and 4 GB of memory. MCR runs the LLNL CHAOS software environment.

Because each node has two processors, the LBW was run for both intranode and internode cases. Otherwise, the node/processor selections use both processors on each node.

The test was run in November 2004. MCR test conditions were 2.4.21-p4smp-75chaos GNU/Linux operating system and Intel ifort Fortran compiler for 32-bit applications, Version 8.0.

MCR LBW Test

Bandwidth (10^6 bytes/s/process)    Latency (µs)
Intranode    Internode              Intranode    Internode
291.7        320.7                  8.2          7.6

Message Size    Bandwidth (10^6 bytes/s/process)
(bytes)         Intranode    Internode
40 4.82 5.37
400 23.8 31.4
1000 57.6 67.8
10000 376 233
50000 555 298
100000 430 309
500000 292 319
1000000 282 321
2500000 281 321
5000000 281 322
10000000 280 322
15000000 284 322
20000000 287 322
MCR ATA Test

Processes    Bandwidth (10^6 bytes/s/process)    Latency (ms)
2 274.7 2.1
4 161.8 22.2
8 87.8 47.5
16 72.5 117
24 67.5 197
32 60.4 310
48 61.4 472
64 54.6 745
96 55.6 1246
128 51.6 1879
MCR STS Test*

             Single-Sided                             Double-Sided
Processes    Bandwidth                Per-Pass        Bandwidth                Per-Pass
             (10^6 bytes/s/process)   Run Time (ms)   (10^6 bytes/s/process)   Run Time (ms)
2 171.5 3.8 162.9 11.0
4 85.6 10.8 93.3 16.6
8 49.0 17.2 59.9 27.6
16 43.4 19.2 40.2 39.0
32 37.7 21.4 38.1 41.1
64 31.4 23.7 29.5 51.9
128 NA NA NA NA
* num_msgs=6

LBW bandwidth versus buffer size for MCR.



ALC

The ASC Linux Cluster (ALC) system provides computing cycles for ASC Alliance users and unclassified ASC code development. The system contains 960 nodes, each with dual 2.4 GHz Xeon (Prestonia) processors and 4 GB of memory. ALC runs the LLNL CHAOS software environment.

Because each node has two processors, the LBW was run for both intranode and internode cases. Otherwise, the node/processor selections use both processors on each node.

The test was run in November 2004. ALC test conditions were 924 nodes, 2.4.21-p4smp-75chaos GNU/Linux operating system, and Intel ifort Fortran compiler for 32-bit applications, Version 8.0.

ALC LBW Test

Bandwidth (10^6 bytes/s/process)    Latency (µs)
Intranode    Internode              Intranode    Internode
322.4        324.2                  6.8          6.0

Message Size    Bandwidth (10^6 bytes/s/process)
(bytes)         Intranode    Internode
40 5.74 6.62
400 24.5 36.3
1000 60.3 74
10000 398 245
50000 593 305
100000 517 314
500000 330 323
1000000 323 324
2500000 320 325
5000000 320 325
10000000 320 325
15000000 319 325
20000000 310 325
ALC ATA Test

Processes    Bandwidth (10^6 bytes/s/process)    Latency (ms)
2 293.0 2.4
4 154.7 19.7
8 85.5 41.3
16 69.4 108
24 63.3 260.3
32 63.3 260
48 60.9 428
64 59.1 605
96 57.1 997
128 53.9 1503
192 53.8 1721
256 51.1 3924
ALC STS Test*

             Single-Sided                             Double-Sided
Processes    Bandwidth                Per-Pass        Bandwidth                Per-Pass
             (10^6 bytes/s/process)   Run Time (ms)   (10^6 bytes/s/process)   Run Time (ms)
2 193 3.4 184 9.7
4 88.3 10.4 96.8 16.0
8 49.3 17.1 60.1 27.5
16 42.7 19.5 41.6 37.7
32 37.2 21.7 37.5 41.8
64 33.9 21.9 32.2 47.6
128 28.1 26.7 27.4 55.9
256 24 31.6 26.4 57.9
* num_msgs=6

LBW bandwidth versus buffer size for ALC.



Thunder

The Thunder system provides computing cycles for the Multiprogrammatic and Institutional Computing (M&IC) community. The system is a Linux cluster containing 1,024 nodes, each with four 1.4-GHz Itanium (Tiger4) processors and 8 GB of memory. Thunder runs the LLNL CHAOS software environment.

Because each node has four processors, the LBW was run for both intranode and internode cases. Otherwise, the node/processor selections use all processors on each node.

The test was run in March 2005. Thunder test conditions were 1024 nodes, 2.4.21-ia64-79.1chaos GNU/Linux operating system, and Intel ifort Fortran compiler for Itanium-based applications, Version 8.1.

Thunder LBW Test

Bandwidth (10^6 bytes/s/process)    Latency (µs)
Intranode    Internode              Intranode    Internode
1754         893                    2.7          3.3

Message Size    Bandwidth (10^6 bytes/s/process)
(bytes)         Intranode    Internode
40 14.5 12.3
400 107 61.6
1000 244 129
10000 636 594
50000 1078 812
100000 1163 851
500000 1660 888
1000000 1770 891
2500000 847 894
5000000 757 895
10000000 747 896
15000000 743 895
20000000 741 897
Thunder ATA Test

Processes    Bandwidth (10^6 bytes/s/process)    Latency (ms)
2 1651 7.9
4 754 46.3
8 210 244
16 115 806
24 98.9 1211
32 95.5 1935
48 89.9 2884
64 87.7 4110
96 85.9 5203
128 84 9732
192 83.5 8702
256 82.9 19694
Thunder STS Test*

             Single-Sided                             Double-Sided
Processes    Bandwidth                Per-Pass        Bandwidth                Per-Pass
             (10^6 bytes/s/process)   Run Time (ms)   (10^6 bytes/s/process)   Run Time (ms)
2 613 1.1 484 3.7
4 367 2.5 246 6.3
8 150 5.6 161 10.3
16 76.1 10.9 84.2 18.6
32 61.3 13.1 67.7 23.1
64 60 12.4 56.7 27
128 53.6 14 50.5 30.3
256 48.7 15.6 50.4 30.3
* num_msgs=6

LBW bandwidth versus buffer size for Thunder.

While setting up the benchmark jobs used to measure the Thunder data presented above, a non-trivial dependence of the measured MPI performance on the loading of nodes with MPI processes was noticed. To better assess this dependence, a set of jobs was run that kept the total number of MPI processes constant while varying the placement of those processes across nodes. The following two tables show the results for the 16-MPI-process case with different numbers of processes per node.
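
One way to confirm how MPI ranks are actually laid out across nodes when interpreting tables like the two below (the benchmarks' -a option prints an info line for each MPI process, although its exact contents are not documented here) is to query the processor name, as in this minimal, hedged sketch:

    ! Hedged sketch: report which node each MPI rank is running on.
    program where_am_i
       use mpi
       implicit none
       integer :: ierr, my_rank, namelen
       character(len=MPI_MAX_PROCESSOR_NAME) :: nodename

       call MPI_Init (ierr)
       call MPI_Comm_rank (MPI_COMM_WORLD, my_rank, ierr)
       call MPI_Get_processor_name (nodename, namelen, ierr)
       print *, 'rank', my_rank, ' runs on node ', nodename(1:namelen)
       call MPI_Finalize (ierr)
    end program where_am_i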

Thunder ATA (16 processes total)

Processes per Node    Bandwidth (10^6 bytes/s/process)    Latency (ms)
1 361 176
2 185 365
4 115 722
Thunder STS (16 processes total)

Processes per Node    Bandwidth (10^6 bytes/s/process)    Run Time (ms)
1 157 5.3
2 113 7.4
4 75.3 11.1



Frost

The Frost system is an IBM SP cluster with 68 nodes. Each node consists of 16 IBM Power3 CPUs and 16 GB of shared memory. Frost provides computing resources for the Advanced Simulation and Computing (ASC) program.

Because each node has 16 processors, the LBW benchmark was run for both intranode and internode cases. Otherwise, the node/processor selections use all processors on each node. For example, the 32 process run of the STS benchmark used two (16 processor) nodes.

The test was run in March 2005. Frost test conditions were 68 nodes, IBM AIX 5.2 operating system, and IBM XLF Fortran compiler, Version 8.01.001.007.

Frost LBW Test

Bandwidth (10^6 bytes/s/process)    Latency (µs)
Intranode    Internode              Intranode    Internode
465          358                    9.3          22.9

Message Size    Bandwidth (10^6 bytes/s/process)
(bytes)         Intranode    Internode
40 4.3 1.7
400 39 12.7
1000 86.3 25.8
10000 135 68.4
50000 225 200
100000 333 261
500000 467 347
1000000 464 359
2500000 584 343
5000000 497 321
10000000 479 317
15000000 393 318
20000000 425 318
Frost ATA Test

Processes    Bandwidth (10^6 bytes/s/process)    Latency (ms)
2 377 23.7
4 229 41.4
8 157 62.2
16 121 89.1
24 68.5 218
32 51.3 280
48 35.1 407
64 40.7 1184
96 38.2 1744
128 37.3 1967
192 36.3 2454
256 35.5 2933
Frost STS Test*

             Single-Sided                             Double-Sided
Processes    Bandwidth                Per-Pass        Bandwidth                Per-Pass
             (10^6 bytes/s/process)   Run Time (ms)   (10^6 bytes/s/process)   Run Time (ms)
2 193 4.3 186 8.6
4 111 6.4 100 15.4
8 81.2 10.8 72.9 21
16 47.6 17.6 46.3 36.3
32 35.3 20.2 36.7 42.3
64 24.7 30.9 24.2 64
128 21.9 35.5 19.6 80.5
256 18.9 41.5 18.7 83.5
* num_msgs=6

LBW bandwidth versus buffer size for Frost.

While setting up the benchmark jobs used to measure the Frost data presented above, a non-trivial dependence of the measured MPI performance on the loading of nodes with MPI processes was noticed. To better assess this dependence, a set of jobs was run that kept the total number of MPI processes constant while varying the placement of those processes across nodes. The following two tables show the results for the 16-MPI-process case with different numbers of processes per node.

Frost ATA (16 processes total)

Processes per Node    Bandwidth (10^6 bytes/s/process)    Latency (ms)
1 156 151
2 153 151
4 130 149
8 96 150
16 121 88.3
Frost STS (16 processes total)

Processes per Node    Bandwidth (10^6 bytes/s/process)    Run Time (ms)
1 47.2 17.8
2 47.1 17.8
4 47.1 17.8
8 47.4 17.7
16 55 15.3



BlueGene/L

The IBM BlueGene/L (BG/L) system has 64K nodes, each with two IBM 700-MHz PowerPC 440 processors and 512 MB of DDR memory. The system architecture is described in the BlueGene/L Web pages; the system is a major computing resource for the Advanced Simulation and Computing (ASC) Program.

The test was run in July 2005. BG/L test conditions were 64K nodes, the Linux bgl1 2.6.5-7.155-pseries64-3llnl operating system, and IBM XL Fortran Advanced Edition V9.1 for Linux. All runs were made in co-processor mode, so the second processor on each node was not used by the benchmark.

BG/L LBW Test

Bandwidth (10^6 bytes/s/process)    Latency (µs)
154                                 3.4

Message Size    Bandwidth (10^6 bytes/s/process)
(bytes)
40 11.8
400 59.4
1000 88.8
10000 145
50000 150
100000 152
500000 154
1000000 154
2500000 154
5000000 154
10000000 154
15000000 154
20000000 154
BG/L ATA Test

Processes    Bandwidth (10^6 bytes/s/process)    Latency (ms)
2 167 38
4 128 43
8 135 51
16 139 84
24 140 89
32 135 139
48 118 197
64 142 186
96 114 364
128 135 480
192 132 717
256 126 980
512 132 2.2
1024 58.5 6.6
2048 29.2 16
4096 28.7 37
8192 30.5 83
16384 29 184

LBW bandwidth versus buffer size for BG/L.

The BG/L STS data are not available because many (but not all) of the scaling runs hit some sort of resource limit:

RVZ: cannot allocate unexpected buffer BE_MPI (Info) : IO - Listening thread terminated

This occurred for process counts as low as 16, and the behavior would change merely by modifying the random number seed and thus the message passing buffer sizes used for communication among the MPI processes. Also, this failure happened only for the single-sided communication style. Unfortunately, there was insufficient time to investigate the problem further before the BG/L system entered its last integration cycle.



uP

The uP system is an unclassified portion of the full Purple system; it consists of 109 nodes, each with eight 1.9-GHz IBM Power5 processors and 32 GB of shared memory. uP and Purple provide computing resources for the Advanced Simulation and Computing (ASC) program.

Because each node has 8 processors, the LBW benchmark was run for both intranode and internode cases. Otherwise, the node/processor selections were done so as to use all of the processors on each node. For example, the 32-process run of the STS benchmark used 4 (8 processor) nodes.

The test was run in July 2005. uP test conditions were 108 nodes, IBM AIX 5.2.0.0 operating system, IBM XLF Fortran compiler, version 8.01.001.007.

uP LBW Test

Bandwidth (10^6 bytes/s/process)    Latency (µs)
Intranode    Internode              Intranode    Internode
4125         1734                   2.3          4.8

Message Size    Bandwidth (10^6 bytes/s/process)
(bytes)         Intranode    Internode
40 17.9 8.2
400 168 62
1000 376 130
10000 1806 659
50000 2273 1029
100000 3045 1308
500000 3965 1674
1000000 5463 1733
2500000 5481 1773
5000000 5652 1792
10000000 5685 1800
15000000 4952 1801
20000000 3784 1796
uP ATA Test

Processes    Bandwidth (10^6 bytes/s/process)    Latency (ms)
2 2865 4.9
4 2295 8.2
8 1652 13
16 273 33
24 209 51
32 192 56
48 165 73
64 160 82
96 141 136
128 116 177
192 108 269
256 94.6 324
512 69.9 808
uP STS Test*

             Single-Sided                             Double-Sided
Processes    Bandwidth                Per-Pass        Bandwidth                Per-Pass
             (10^6 bytes/s/process)   Run Time (ms)   (10^6 bytes/s/process)   Run Time (ms)
2 1560 0.5 1552 1.0
4 919 0.9 848 1.8
8 691 1.3 621 2.5
16 200 4.2 229 4.2
32 111 6.4 111 14.0
64 88.9 8.6 85.8 18.0
128 78.1 9.9 74.4 21.2
256 65.9 11.9 65 23.9
512 117 6.7 110 14.2
* num_msgs=6


LBW bandwidth versus buffer size for uP.

The node-loading (CPU interference) data for 8-process runs of both the ATA and STS benchmarks, with varying numbers of processes per node, are presented below.

uP ATA (8 processes total)

Processes per Node    Bandwidth (10^6 bytes/s/process)    Latency (ms)
1 1062 22.5
2 693 21
4 514 16.7
8 1470 7.8
uP STS (8 processes total)

Processes per Node    Bandwidth (10^6 bytes/s/process)    Run Time (ms)
1 328 2.7
2 296 3
4 367 2.4
8 689 1.3



Purple

The Purple system contains 1353 SMP nodes, with each node containing eight Power5 1.9-GHz CPUs and 32 GB of shared memory. Purple provides computing resources for the ASC Program.

Because each node has 8 processors, the LBW benchmark was run for both intranode and internode cases. Otherwise, the node/processor selections were done so as to use all of the processors on each node. For example, the 32-process run of the STS benchmark used four (8 processor) nodes.

Note that because of necessary limitations on test time, the usual five-fold datasets were reduced to one or two runs, and not all possible process counts were explored.

The test was run in March 2006. Purple test conditions were 1353 nodes, IBM AIX 5.3 ML4 operating system, IBM XLF Fortran compiler, version 9.01.000.003.

Purple LBW Test

Bandwidth (10^6 bytes/s/process)    Latency (µs)
Intranode    Internode              Intranode    Internode
6381         3063                   1.8          4.4

Message Size    Bandwidth (10^6 bytes/s/process)
(bytes)         Intranode    Internode
40 22.9 8.84
400 163 60
1000 373 126
10000 1748 641
50000 2345 1014
100000 3905 1294
500000 6501 2632
1000000 6375 3018
2500000 6397 3465
5000000 6735 3412
10000000 6679 3376
15000000 6887 3368
20000000 6046 3453
Purple ATA Test

Processes    Bandwidth (10^6 bytes/s/process)    Latency (ms)
2 3550 4.2
4 2155 5.7
8 1312 7.8
16 269 27
24 204 43
32 192 48
64 153 77
96 132 132
128 133 172
192 111 250
256 112 306
512 86.5 692
1024 80.8 1727
2048 72.2 3775
4096* 43.4 9996
8192* 24.6 12689
* The buffer length was reduced to 10000 bytes to fit into the memory on each node.
Purple STS Test*

             Single-Sided                             Double-Sided
Processes    Bandwidth                Per-Pass        Bandwidth                Per-Pass
             (10^6 bytes/s/process)   Run Time (ms)   (10^6 bytes/s/process)   Run Time (ms)
2 2025 0.4 1857 0.9
4 1290 0.5 1038 1.5
8 942 0.9 791 1.9
16 271 3.1 302 5.6
32 151 4.7 165 9.4
64 130 5.9 125 12.4
128 116 6.7 111 14.2
256 111 7.0 104 14.9
512 165 4.8 146 10.7
1024 94 8.3 87.1 18.1
2048 80.7 9.7 94.3 16.7
4096 89.9 8.7 94.3 17.2
8192 86.1 9.1 89.1 17.7
* num_msgs=6


LBW bandwidth versus buffer size for Purple.

The node-loading (CPU interference) data for 8-process runs of both the ATA and STS benchmarks, with varying numbers of processes per node, are presented below.

Purple ATA (8 processes total)

Processes per Node    Bandwidth (10^6 bytes/s/process)    Latency (ms)
1 1062 23.3
2 693 21
4 514 16.9
8 1309 7.9
Purple STS (8 processes total)

Processes per Node    Bandwidth (10^6 bytes/s/process)    Run Time (ms)
1 438 2
2 402 2.2
4 522 1.7
8 951 0.9



Gauss

The Gauss system, the visualization engine that supports LLNL's BG/L system, contains 256 nodes, each with two 2.4-GHz AMD Opteron 250 processors and 12 GB of shared memory, connected by a Voltaire InfiniBand interconnect.

Because each node has 2 processors, the LBW benchmark was run for both intranode and internode cases. Otherwise, the node/processor selections were done so as to use all of the processors on each node. For example, the 32-process run of the STS benchmark used 16 (2 processor) nodes.

The test was run in May 2006. Gauss test conditions were 256 nodes, the Linux 2.6.9-38 Chaos x86_64 operating system, the PathScale EKOPath Compiler Suite Version 2.1 (POSIX thread model), and GNU gcc version 3.3.1 (PathScale 2.1 Driver).

Gauss LBW Test

Bandwidth (10^6 bytes/s/process)    Latency (µs)
Intranode    Internode              Intranode    Internode
1361         947                    0.7          4

Message Size    Bandwidth (10^6 bytes/s/process)
(bytes)         Intranode    Internode
40 60.5 10
400 390 71.1
1000 665 139
10000 1100 378
50000 1088 736
100000 1097 833
500000 1253 933
1000000 1290 948
2500000 1305 956
5000000 1318 960
10000000 1318 961
15000000 1321 962
20000000 1324 962
Gauss ATA Test

Processes    Bandwidth (10^6 bytes/s/process)    Latency (ms)
2 1604 1.6
4 563 14
8 381 25
16 325 41
24 378 59
32 243 99
48 211 70
64 171 122
96 169 339
128 152 448
192 147 573
256 141 711
Gauss STS Test*

             Single-Sided                             Double-Sided
Processes    Bandwidth                Per-Pass        Bandwidth                Per-Pass
             (10^6 bytes/s/process)   Run Time (ms)   (10^6 bytes/s/process)   Run Time (ms)
2 406 1.3 405 3.5
4 215 1.4 222 5.9
8 113 3.2 120 10.4
16 81.1 3.8 80.1 17.2
32 85.5 6.4 73.1 20.4
64 84.5 7.5 62.6 24.4
128 76.1 9.4 62.3 24.8
256 71.9 10.2 55.7 27.7
* num_msgs=6


LBW bandwidth versus buffer size for Gauss.



Summary

LBW

System                 Bandwidth (10^6 bytes/s/process)    Latency (µs)
MCR (intra/inter) 292/321 8.2/7.6
ALC (intra/inter) 322/324 6.8/6
Thunder (intra/inter) 1754/893 2.7/3.3
Frost (intra/inter) 465/358 9.3/22.9
BG/L 154 3.4
uP (intra/inter) 4125/1734 2.3/4.8
Purple (intra/inter) 6381/3063 1.8/4.5
Gauss (intra/inter) 1362/947 0.7/4.0
ATA*

System     Bandwidth (10^6 bytes/s/process)
MCR 162
ALC 155
Thunder 754
Frost 229
BG/L 128
uP 2295
Purple 2155
Gauss 381
* 4 processes
STS*

System     Bandwidth (10^6 bytes/s/process)    Per-Pass Run Time (ms)
MCR 43.4 19.2
ALC 42.7 19.5
Thunder 42.7 19.5
Frost 47.6 17.6
BG/L NA NA
uP 200 4.2
Purple 271 3.1
Gauss 81.1 3.8
* 16 processes; num_msgs=6



Algorithms

The basic message passing algorithm for each MPI benchmark test is listed below.

LBW
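
*   Synchronous style: each pass is a blocking send/receive round trip between the two ranks.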

    t0 = MPI_Wtime()
    do i = 1, num_times
       if (my_rank .eq. SRC_RANK) then
          call MPI_Send (out_buf, msg_size, ..., DEST_RANK, ...)
          call MPI_Recv (in_buf, msg_size, ..., DEST_RANK, ...)
       else
          call MPI_Recv (in_buf, msg_size, ..., SRC_RANK, ...)
          call MPI_Send (out_buf, msg_size, ..., SRC_RANK, ...)
       endif
    enddo
    t1 = MPI_Wtime()
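
*   Asynchronous style: post a nonblocking receive, then send, then wait for the receive to complete.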
    t0 = MPI_Wtime()
    do i = 1, num_times
       if (my_rank .eq. SRC_RANK) then
          call MPI_Irecv (in_buf, msg_size, ..., DEST_RANK, ...)
          call MPI_Send (out_buf, msg_size, ..., DEST_RANK, ...)
          call MPI_Wait (...)
       else
          call MPI_Irecv (in_buf, msg_size, ..., SRC_RANK, ...)
          call MPI_Send (out_buf, msg_size, ..., SRC_RANK, ...)
          call MPI_Wait (...)
       endif
    enddo
    t1 = MPI_Wtime()

ATA
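
*   Each pass performs two all-to-all exchanges, swapping the roles of the two buffers.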

    t1 = MPI_Wtime()
    do i = 1, num_times / 2
       call MPI_Alltoall (inbuf, ..., outbuf, ...)
       call MPI_Alltoall (outbuf, ..., inbuf, ...)
    end do
    t2 = MPI_Wtime()

STS

    t1 = MPI_Wtime()
    do inx = 1, num_times
       call MPI_Sometosome (num_ifaces, iface_list, iface_cnt ...)
    enddo
    t2 = MPI_Wtime()

The iface_list array contains num_ifaces source-destination pairs, with the corresponding message lengths in the iface_cnt array. The two message passing styles used by the MPI_Sometosome() routine (asynchronous and synchronous) are illustrated below.
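
*   Asynchronous style: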

*   Issue irecvs for all of this rank's receives.

    do inx = 1, num_ifaces
       src_rank  = iface_list(SRC, inx)
       dest_rank = iface_list(DST, inx)
       count     = iface_cnt(inx)
       if (my_rank .eq. dest_rank) then
          call MPI_Irecv (recv_buffer, count, ...,
                          src_rank, ...)
       endif
    enddo

*   Issue sends for all of this rank's sends.

    do inx = 1, num_ifaces
       src_rank  = iface_list(SRC, inx)
       dest_rank = iface_list(DST, inx)
       count     = iface_cnt(inx)
       if (my_rank .eq. src_rank) then
          call MPI_Send (send_buffer, count, ..., 
                         dest_rank, ...)
       endif
    enddo

*   Wait for all incoming messages.

    call MPI_Waitall (...)
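
*   Synchronous style: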
*   Issue send/recv pairs for each interface.

    do inx = 1, num_ifaces
       src_rank  = iface_list(SRC, inx)
       dest_rank = iface_list(DST, inx)
       count     = iface_cnt(inx)
       if (my_rank .eq. src_rank) then
          call MPI_Send (send_buffer, count, ..., 
                         dest_rank, ...)
       endif
       if (my_rank .eq. dest_rank) then
          call MPI_Recv (recv_buffer, count, ..., 
                         src_rank, ...)
       endif
    enddo




Last modified June 5, 2006
UCRL-WEB-218462