In this section we present a brief overview of how MPI is implemented on the IBM SP, followed by details on how the implementation performs at high concurrency and guidance on mitigating several common problems encountered when scaling up codes.
To demonstrate scaling bottlenecks in point-to-point switch bandwidth, an experiment similar to the memory bandwidth experiment above can be done. Bandwidth across the switch is measured as a function of message size between two different nodes in the cluster. Each measurement involves two processes, one on each node, which act as both sender and receiver (MPI round-trip bandwidth is reported). The number of such pairs running concurrently is increased from 1 to 16 in order to quantify the effect of contention on switch bandwidth.
Two such experiments were done: the first on the single-link adapter css0 and the second on the multi-link adapter csss.
(Figures: round-trip bandwidth vs. message size for css0 and csss.)
While for a single pair of tasks the two adapters perform roughly the same, the csss adapter shows better performance (less contention) when the number of concurrently communicating tasks increases. Currently the csss or multi-link adapter does not support striping of individual messages across the physical adapters. For this reason a single stream sees no additional bandwidth by using the multi-link adapter, but multiple streams will see a better aggregate bandwidth.
The csss adapter is used by default on seaborg and most application developers should use it since it is most often the best route to the switch. At earlier software levels there were some constraints on the use of csss which have been removed.
Not all MPI communication goes across the switch. For communications between tasks that reside on the same node it is much more efficient to route MPI messages through shared memory buffers. The environment variable "MP_SHARED_MEMORY=YES" is set by default to enable this. An experiment similar to that done above for css0 and csss is presented below, in which intra- and inter-node bandwidths for pairs of MPI tasks are compared.
(Figures: intra- and inter-node bandwidth and latency for pairs of MPI tasks.)
Shared memory messages in the range of 1 MB in length (~10^5 doubles) show optimal bandwidth.
Likewise, it can be seen that the latency of messages is much lower on a node than through the switch. This is to be expected, but it is useful to keep in mind that rapid exchanges of small messages may benefit from being kept on node, or possibly aggregated before being sent off node.
That the distribution of latencies is much tighter for shared memory than for switch-routed MPI makes sense as well. The number of compute resources, and in particular shared compute resources, involved in an MPI message sent over the switch is much greater than for shared memory messaging on a dedicated node. For instance, GPFS, which we'll discuss in the next section, uses the same switch for its data. As a result, deviations from optimal latency are expected based on contention for this resource.
The software stack for MPI on the SP is, as of this writing (06/03), as follows:
API | Library | Description |
---|---|---|
MPI | libmpi.a | The MPI API itself, implementing MPI1 and MPI2 (except dynamic processes) |
MPCI | libmpci.a | A point-to-point API in which the MPI library is written. |
PIPES | libmpci.a | Low level message buffer system. |
HAL | libhal | Hardware abstraction layer (IP or US) |
The above libraries are available in 32- and 64-bit, thread-safe and non-thread-safe, as well as internet protocol (IP) and user space (US) versions. Thread-safe compilation (via the "_r" compilers) is required for the 64-bit MPI library. US and IP differ in the paths that messages traverse between MPI processes.
seaborg node 1 | seaborg node 2 | ||
---|---|---|---|
IP | user code -> kernel -> adapter -> | switch fabric | -> adapter -> kernel -> user code |
US | user code -> adapter -> | switch fabric | -> adapter -> user code |
User space communication allows your program to talk directly to the switch adapter, avoiding kernel involvement. The user space MPI library is the default since its performance (latency and bandwidth) far exceeds that of the IP version. In the preceding and following examples user space MPI is understood unless specifically noted otherwise.
LAPI is also available on the SP, but will not be treated in this work where we focus on the scaling of MPI codes.
MPI based programs are run via the Parallel Environment (PE) most often using poe. Programs compiled with the PE compilers, e.g., mpxlf, mpcc, mpxlf90 have their main entry point replaced with a new entry point which handles node allocation and process setup on all the nodes involved in the parallel job. A large number of environment variables ("MP_*") and command line options affect the way that PE starts the job. A few relevant to scaling which NERSC sets by default for all users are listed below:
Variable | Effect |
---|---|
MP_SHARED_MEMORY=YES | Intranode messages through memory: faster, less switch traffic |
MP_EUIDEVICE=csss | Use both switch adapters: more switch bandwidth |
MP_EUILIB=us | User space MPI: faster MPI messaging |
MP_RETRANSMIT_INTERVAL=50000 | How often to retry dropped packets: empirically best setting |
The user's program code is then called after all PE startup work is complete. The time required to complete this startup is examined below.
Often the first step toward doing production work at a higher level of parallelism involves trying short-running test codes with more tasks. One of the first things that a seaborg user may notice when examining scaling this way is an increased sluggishness of program startup and termination.
Therefore, in order to properly gauge the timings from such scaling tests it is important to take into account how job startup time scales. Below, results are shown for the time taken to initiate poe and complete MPI_Init() for a varying number of tasks.
Since this is a one-time cost, its overall impact on user codes is minimal except for very short running jobs. A parallel application based on several short running parallel job steps would be impacted in proportion to its MPI concurrency.
It is, however, useful to know the time scale for parallel job startup. Especially for applications running at large concurrency it may be important to wait a minute or so before concluding that the job is in trouble. This startup window, while waiting for the first line of output from the program, is often highly scrutinized by users running at a higher concurrency for the first time. Before deciding that the job has failed or is hung, make sure to wait for this startup to complete.
For longer running production jobs this issue is less important.
Aside from the fixed amount of memory required to build the PIPEs,
PIPE memory per task = 2*MP_PIPE_SIZE*(# of MPI tasks total - 1)
various MPI functions require internal temporary buffers. The size of these buffers is not documented, but it is seen to vary with message buffer size, MPI type, and between the various MPI functions themselves. For example, MPI collective and reduction operations tend to require greater internal buffer space than point-to-point MPI functions.
Characterizing the amount of memory required for a given message size by each MPI function call, while potentially useful, would probably provide too much information to be easily digested. The need for temporary storage by certain routines can be inferred from the algorithmic specifics of an MPI call.
From the perspective of someone writing MPI applications, the important question here is "How much memory can I use in my code and how much do I have to set aside for MPI?"
As a somewhat reduced approach to answering this question, a microbenchmark code mpimem.c was written which exercises some of the most often used MPI functions (MPI_Barrier, MPI_Bcast, MPI_Alltoall, MPI_Allreduce, MPI_Reduce) and measures by way of system calls (e.g., ps, sbrk, getrusage) the memory used by the process as a function of problem/message size. Memory used by the process which is not allocated by the user code is considered part of the memory used internally by MPI.
The following plot shows the measurement described above for a 1024 way job:
(Figure: MPI memory usage vs. message size for a 1024-way job.)
For large concurrency jobs, internal memory usage by MPI can become a significant issue leading to unexpected job failures.
IBM's MPI implementation allows throttling down the size of the PIPEs via the PE environment variable MP_PIPE_SIZE. The default setting of 64 KB per PIPE can be decreased to either 32 KB or 16 KB. The memory savings from decreasing MP_PIPE_SIZE are shown in the following graphs:
(Figures: memory savings from smaller MP_PIPE_SIZE settings for 256, 512, and 1024 tasks.)
Since there are potential performance penalties when using smaller MPI buffers it is recommended that the use of MP_PIPE_SIZE=32 or 16 should be viewed as a workaround for codes running out of memory rather than a standard programming practice for large concurrency jobs.
One of the major impediments to scaling up the concurrency of scientific codes is latency in synchronization. This is a well understood aspect of coordinating parallel operation of several resources when each may or may not be ready for work at a given time. As the number of resources increases the frequency and/or duration of interruptions must decrease in order to maintain the same efficiency.
Unfortunately for clusters of independently scheduled PEs the frequency and duration of interruptions are roughly fixed, being intensive properties of the cluster's building blocks. The cumulative time required to schedule and complete collective work across the cluster is extensive and scales with the concurrency of the parallel job. It is largely for this reason that we focus on mitigating the impacts of synchronization at high concurrency.
The graph below shows the time required to complete a synchronizing MPI_Barrier call for a varying number of tasks. Four jobs were run on 8, 32, 64, and 128 nodes respectively. Within each job the timings of MPI_Barrier were measured for MPI communicators of varying size.
(Figure: MPI_Barrier scaling; source code comm.c.)
The impact of concurrency on synchronization times is dramatic, spanning four orders of magnitude. The IBM SP is a cluster of independently scheduled instances of AIX; for this reason it grows increasingly unlikely that at a given moment all the nodes involved will be ready to complete a synchronizing operation.
Advice on mitigating the impact of high concurrency global synchronization.
Aside from the intrinsic delays (due to kernel and OS background activity, etc.) in scheduling a large number of tasks, additional delays can be introduced through unequal partitioning of work, which will further increase the delay when a synchronizing section of the code is reached.
To illustrate the impact of load imbalance on parallel execution consider a hypothetical code running on 4 tasks. The code proceeds through iterations composed of synchronizing, doing work and doing I/O. The time to solution is negatively impacted if the time spent doing work shows wide variance. For such an application the impact of load imbalance scales dramatically with concurrency, as it introduces more serial time into an Amdahl like model for parallel efficiency.
(Figures: wallclock impact of load balance for the hypothetical 4-task code.)
In practice the MPI portion of the code could be in either or both of the I/O or Sync parts of the code. While load imbalance can occur in these parts of a code (particularly if the I/O involves disk activity), the most important path to enforcing load balance is evenly partitioning the problem space across the PEs.
Equally distributing problem domains across a processor topology can be made easier through libraries such as Metis which can produce optimal partitions. Scalapack and other parallel libraries have functionality to decompose problems across PEs. It is generally a good idea to check for load imbalance by, e.g., looking at the distribution of HPM statistics when preparing production runs for a new problem or at a new concurrency.
One may also choose to adopt a partitioning scheme which is suboptimal in some sense if it makes load balancing easier. E.g., a hyperslab decomposition might be chosen over a block cyclic one if it leads to more manageable and predictable load balance. The tradeoff here is between time lost to load imbalance and time lost to the algorithm itself.
It's difficult to provide a taxonomy describing the level of synchronization required by MPI library calls. This is due to the multitude of ways they can be called and the fact that the implementation may choose certain synchronization rules at run time (e.g., eager vs. rendezvous protocols). However, generally speaking it is possible to categorize certain commonly used MPI calls by their overall level of synchronization based on when tasks are allowed to leave an MPI function call.
By MPI collectives we mean MPI routines which all tasks in a communicator call, and which either require all tasks to enter the call before any may leave it, or allow tasks to leave as soon as their part in the operation is complete.
MPI calls which are known to suffer from scaling performance problems are certain MPI collective calls (e.g., MPI_Alltoall and MPI_Allreduce operations). The reasons for this are more complicated than for a simple MPI_Barrier, since not only synchronization, but switch bandwidth and the algorithmic specifics of the MPI calls come into play. Aside from some degree of synchronization, collective MPI calls involve some amount of work in relation to the size of messages being communicated or reduced.
Below, the time to complete an MPI_Alltoall and MPI_Allreduce for varying message sizes is shown:
(Figures: time to complete MPI_Alltoall and MPI_Allreduce, and the synchronization penalty, for 256, 512, 1024, and 2048 tasks.)
MPI_Bcast, which is not fully synchronizing, does not show as large or as varied timings. Though a broadcast and an all-reduce are quite different, there are cases where not all tasks need the reduced data, or do not need it at the same time, and a set of broadcasts can replace a more highly synchronizing collective.
(Figure: MPI_Bcast timings.)
One aspect of the scaling of MPI collectives which is not depicted above is the variability in timings. Highly synchronizing calls of any sort not only show greater average delays at higher concurrency, but also show greater relative variability in the length of delays.
Some of the performance shortcomings of the MPI implementation on the SP arise from (or are exacerbated by) certain generalities in the MPI standard. In the interest of standards compliance certain general cases must be taken into account, which leads to worse performance than one might expect from a rough estimate. In general, use the most contiguous representation possible when creating MPI_Types.
If the user code/application in question does not make use of MPI_Types or other higher level MPI constructs, e.g., the message data are all vectors of atomic types (double, int, real*8, etc.), it has been demonstrated that better performing (non-standards-compliant) variants of MPI collectives may be written.
IBM's ACTC provides such an MPI variant, Turbo MP. Likewise users have presented their own hand coded MPI collective routines at various conferences. It is important to realize that this version of the MPI library is neither standards compliant nor a supported IBM software product. In evaluating such approaches the user must weigh the tradeoffs between standards compliance, portability, and performance. As of this writing, this library requires both MPI and LAPI, so in general codes will be limited to 8 tasks per node in order to accommodate 2 switch windows per task.
Crucial to the scaling efficiency of most codes is reducing the frequency and duration of synchronizing MPI collectives. Global reductions are often unavoidable from an algorithmic perspective, but their impact on code performance may often be mitigated by consolidating the global reduction into the smallest set of exchanges.
**More synchronizing** (two scalar MPI_Allreduce calls inside the loop):

```fortran
      sump=0.0
      do i=1,np
        sum=zero
        do j=1,nz
          jj=j+rank*nz
          sum=sum+phi(i,j,ip)*psi(jj,n,il2)
        continue
        sumr=real(sum)
        sumi=aimag(sum)
        call mpi_allreduce(sumr,sumrt,1,mpi_real8,
     1       mpi_sum,mpi_comm_world,ierr)
        sumr=sumrt
        call mpi_allreduce(sumi,sumit,1,mpi_real8,
     1       mpi_sum,mpi_comm_world,ierr)
        sumi=sumit
        sump=sump+hzz*(sumr**2+sumi**2)
      continue
```

**Less synchronizing** (a single vector MPI_Reduce followed by one MPI_Bcast):

```fortran
      do i=1,np
        sum = zero
        sumpit(i) = zero
        sumpi(i) = zero
        do j=1,nz
          jj=j+rank*nz
          sum=sum+phi(i,j,ip)*chi(jj,n,il2)
        continue
        sumpi(i) = sum
      continue
      call MPI_Reduce(sumpi,sumpit,np,mpi_complex16,
     1     MPI_SUM,0,mpi_comm_world,ierr)
      if(rank .eq. 0) then
        sump=0.0
        do i=1,np
          sump = sump +
     1         (real(sumpit(i))**2 + aimag(sumpit(i))**2)*hzz
        enddo
      endif
      call MPI_Bcast(sump,1,mpi_complex16,
     1     0,mpi_comm_world,ierr)
```
Comparing timings and usage of globally called collectives is much simpler than surveying the space of possible pairwise communications. The pattern of messages in a code using point to point messages will depend closely on the problem being solved and the best we can do here is provide some general information and a few specific examples.
Here we focus on nearest neighbor exchanges, which are common in solving PDEs on regular grids. An examination based on the dimensionality of the grid (and therefore the number of neighbors) along with message size is presented. The information below, along with information about the switch, should be sufficient to answer most questions about point-to-point communication on seaborg. If you have further questions feel free to contact NERSC Consultants.
A common communication pattern in scientific codes is halo or N-nearest neighbor communication. While the ordering and number of steps will depend on the calculation at hand, there are certain worst case patterns that application programmers will almost always want to avoid.
The following cartoons and fragments from a real code contain two nearest neighbor exchanges that show very different levels of synchronization. The second approach outperforms the first by 40% on 128 tasks.
**Serialized send/receive chain:**

```fortran
      if(rank.ne.size-1) then
        call mpi_send(mesh(1,ny),nx,mpi_real8,
     1       rank+1,1,mpi_comm_world,ierr)
        call mpi_recv(back,nx,mpi_real8,
     1       mpi_any_source,2,mpi_comm_world,istat,ierr)
      end if
      if(rank.ne.0) then
        call mpi_recv(front,nx,mpi_real8,
     1       mpi_any_source,1,mpi_comm_world,istat,ierr)
        call mpi_send(mesh(1,1),nx,mpi_real8,
     1       rank-1,2,mpi_comm_world,ierr)
      end if
```

**Paired exchange via MPI_Sendrecv:**

```fortran
      rankf = rank+1
      rankb = rank-1
      if(rank.eq.nproc-1) rankf = MPI_PROC_NULL
      if(rank.eq.0)       rankb = MPI_PROC_NULL
      call mpi_sendrecv(mesh(1,ny),nx,mpi_real8,
     1     rankf,1,front,nx,mpi_real8,
     2     rankb,1,mpi_comm_world,istat,ierr)
      call mpi_sendrecv(mesh(1,1),nx,mpi_real8,
     1     rankb,2,back,nx,mpi_real8,
     2     rankf,2,mpi_comm_world,istat,ierr)
```
Choose messaging strategies that wait as long as required (but not longer) to initiate the communication needed for the computation. The first example above includes more delays (through blocking MPI) than are required for each node to exchange boundary data with its neighbors.
Another common approach to domain decomposed exchanges is to post MPI_Irecvs and MPI_Isends, and then wait for the communication to complete. This asynchronous approach imposes a minimal amount of blocking and exposes opportunities for overlap of communication and computation. In practice on seaborg, the benefit from the reduced blocking is greater than that from the overlap.
Below we compare five strategies that differ in their degree of synchronization. The meshes are set up such that each task has a neighboring rank, neigh[dim][{0,1}], where dim is 0, 1, 2 (for one, two, and three dimensional meshes) and {0,1} indicates the direction along each dimension.
The message exchange strategies and a simple mnemonic tag for each are:
**BSBR** (MPI_Send, MPI_Recv) — blocking sends and receives, ordered by rank parity:

```c
if(rank%2) {
  MPI_Send(obuf+0*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm);
  MPI_Recv(ibuf+0*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm, s+0);
  MPI_Send(obuf+1*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm);
  MPI_Recv(ibuf+1*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm, s+0);
} else {
  MPI_Recv(ibuf+0*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm, s+0);
  MPI_Send(obuf+0*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm);
  MPI_Recv(ibuf+1*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm, s+0);
  MPI_Send(obuf+1*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm);
}
```

**SERE** (MPI_Sendrecv) — paired exchanges:

```c
MPI_Sendrecv(obuf+0*bytes,bytes,MPI_BYTE,neigh[0][0],0,
             ibuf+0*bytes,bytes,MPI_BYTE,neigh[0][1],0,comm,s+0);
MPI_Sendrecv(obuf+1*bytes,bytes,MPI_BYTE,neigh[0][1],0,
             ibuf+1*bytes,bytes,MPI_BYTE,neigh[0][0],0,comm,s+1);
```

**ISBR** (MPI_Isend, MPI_Recv) — nonblocking sends, blocking receives:

```c
MPI_Isend(obuf+0*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm, r+0);
MPI_Isend(obuf+1*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm, r+1);
MPI_Recv(ibuf+0*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm, s+0);
MPI_Recv(ibuf+1*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm, s+1);
```

**ISIR** (MPI_Isend, MPI_Irecv) — sends posted before receives:

```c
MPI_Isend(obuf+0*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm, r+0);
MPI_Isend(obuf+1*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm, r+1);
MPI_Irecv(ibuf+0*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm, r+2);
MPI_Irecv(ibuf+1*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm, r+3);
MPI_Waitall(4,r,s);
```

**IRIS** (MPI_Isend, MPI_Irecv) — receives posted before sends:

```c
MPI_Irecv(ibuf+0*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm, r+0);
MPI_Irecv(ibuf+1*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm, r+1);
MPI_Isend(obuf+0*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm, r+2);
MPI_Isend(obuf+1*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm, r+3);
MPI_Waitall(4,r,s);
```
The ibuf and obuf are non-overlapping receive and send buffers. The boundary conditions are periodic.
The time to complete an exchange as a function of message size is shown below.
(Figures: exchange time vs. message size for 1D ring, 2D mesh, and 3D mesh topologies, at low and high concurrency, with the asynchronous speedup.)
This experiment shows that the benefit from asynchronous point-to-point messaging becomes significant at large concurrencies, even more so when the topology of the problem includes many point-to-point pairs. For seaborg the timings generally follow the trend:
Isend/Irecv < Isend/Recv < Sendrecv < Send/Recv
Avoiding synchronization insofar as possible can be of great benefit to the scaling performance of a parallel code. The way in which a problem is coded has a direct impact on the amount and degree of synchronization that happens between tasks. Some amount of synchronization, either partial or global, is intrinsic to most algorithms; other, unnecessary synchronization is often the result of a mismatch between the order in which events are specified (in the code) to occur and the order in which they must occur algorithmically. The structure of MPI makes it easy to unintentionally introduce extraneous points of synchronization which, if removed, may benefit the scaling performance of the code.