IBM SP Parallel Scaling Overview - MPI

MPI Scaling

Abstract

In this section we will present a brief overview of how MPI is implemented on the IBM SP, followed by details on how the implementation performs at high concurrency, along with information on how to mitigate several common problems encountered when scaling up codes.

Colony Switch

The network fabric which joins the compute nodes on seaborg is the SP Colony Switch II. Each NightHawk II node has two switch adapters (css0 and css1) which connect directly to the node's memory bus. These adapters interface to a three level switch.

In order to demonstrate scaling bottlenecks in point to point switch bandwidth, an experiment similar to the memory bandwidth measurements above can be performed. Bandwidth across the switch as a function of message size is measured between two different nodes in the cluster. Each measurement involves two processes, one on each node, which act as both sender and receiver (MPI roundtrip bandwidth is reported). The number of such pairs running concurrently is increased from 1 to 16 in order to quantitatively demonstrate the scaling properties of contention and switch bandwidth.
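
As a rough sketch of what each pair in such a measurement does (an illustrative stand-in, not the code used to produce the plots below; the pairing scheme, iteration count, and 1 MB message size are arbitrary choices here), a roundtrip bandwidth test looks like:

    /* sketch: roundtrip (ping-pong) bandwidth per task pair;
       assumes an even number of tasks, paired rank i with rank i+np/2 */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, np;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);
        int partner = (rank < np/2) ? rank + np/2 : rank - np/2;
        int nbytes  = 1 << 20;                       /* 1 MB message */
        char *buf   = malloc(nbytes);
        MPI_Status st;
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < 100; i++) {
            if (rank < np/2) {                       /* initiating side of the pair */
                MPI_Send(buf, nbytes, MPI_BYTE, partner, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_BYTE, partner, 0, MPI_COMM_WORLD, &st);
            } else {                                 /* echo side of the pair */
                MPI_Recv(buf, nbytes, MPI_BYTE, partner, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, nbytes, MPI_BYTE, partner, 0, MPI_COMM_WORLD);
            }
        }
        double dt = MPI_Wtime() - t0;
        if (rank == 0)
            printf("roundtrip bandwidth per pair ~ %g MB/s\n",
                   2.0 * 100 * nbytes / dt / 1.0e6);
        free(buf);
        MPI_Finalize();
        return 0;
    }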

Two such experiments were done; the first on a single adapter css0 and the second on the multi-link adapter csss.

[Figures: MPI roundtrip bandwidth vs. message size for 1 to 16 concurrent task pairs, using the css0 and csss adapters]

While for a single pair of tasks the two adapters perform roughly the same, the csss adapter shows better performance (less contention) when the number of concurrently communicating tasks increases. Currently the csss or multi-link adapter does not support striping of individual messages across the physical adapters. For this reason a single stream sees no additional bandwidth by using the multi-link adapter, but multiple streams will see a better aggregate bandwidth.

The csss adapter is used by default on seaborg and most application developers should use it since it is most often the best route to the switch. At earlier software levels there were some constraints on the use of csss which have been removed.

Not all MPI communication goes across the switch. For communication between tasks that reside on the same node it is much more efficient to route MPI messages through shared memory buffers. The environment variable "MP_SHARED_MEMORY=YES" is set by default to allow this. An experiment similar to that done above for css0 and csss is presented below, where intra- and inter-node bandwidths for pairs of MPI tasks are compared.

Intra and Inter Node Messaging
source: pingpongn.c; latency samples collected over a 33 hour period

Shared memory messages in the range of 1MB in length (10^5 doubles) show optimal bandwidth.

Likewise it can be seen that the latency of messages is much lower on a node than through the switch. This is to be expected, but it is useful to keep in mind that rapid exchanges of small messages may benefit from being kept on node or possibly aggregated before being sent off node.
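
As an illustration of the aggregation idea (the function names and the value of K below are hypothetical, not taken from any particular code), several small values bound for the same off-node task can be combined into a single message so the per-message latency is paid once:

    /* sketch: aggregate K small values into one message instead of
       sending them individually, paying the off-node latency once */
    #include <mpi.h>

    #define K 64                          /* hypothetical number of small values */

    void send_individually(double *vals, int dest, MPI_Comm comm)
    {
        for (int i = 0; i < K; i++)       /* K messages: K latency costs */
            MPI_Send(&vals[i], 1, MPI_DOUBLE, dest, i, comm);
    }

    void send_aggregated(double *vals, int dest, MPI_Comm comm)
    {
        /* one message: one latency cost, same payload */
        MPI_Send(vals, K, MPI_DOUBLE, dest, 0, comm);
    }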

That the distribution of latencies is much tighter for shared memory than for switch routed MPI makes sense as well. The number of compute resources, and in particular shared compute resources, involved in an MPI message sent over the switch is much greater than for shared memory messaging on a dedicated node. For instance, GPFS, which we'll discuss in the next section, uses the same switch for its data. As a result, deviations from optimal latency are expected based on contention for this resource.

IBM's MPI implementation

The software stack for MPI on the SP is, as of this writing (06/03), as follows:

API     Library     Description
MPI     libmpi.a    The MPI API itself, implementing MPI1 and MPI2 (except dynamic processes)
MPCI    libmpci.a   A point to point API in which the MPI library is written
PIPES   libmpci.a   Low level message buffer system
HAL     libhal      Hardware abstraction layer (IP or US)

The above libraries are available in 32 and 64 bit, thread-safe and non-thread-safe, as well as internet protocol (IP) and user space (US) versions. Thread-safe compilation (via the "_r" compilers) is required for the 64 bit MPI library. US and IP differ in the paths that messages traverse between MPI processes.

  seaborg node 1   seaborg node 2
IP user code -> kernel -> adapter -> switch fabric -> adapter -> kernel -> user code
US user code -> adapter -> switch fabric -> adapter -> user code

User space communication allows your program to talk directly to the switch adapter avoiding kernel interruption. The user space MPI library is the default as its performance (latency and bandwidth) far exceeds that of the IP version. In the preceding and following examples user space MPI is understood unless specifically noted otherwise.

LAPI is also available on the SP, but will not be treated in this work where we focus on the scaling of MPI codes.

MPI based programs are run via the Parallel Environment (PE) most often using poe. Programs compiled with the PE compilers, e.g., mpxlf, mpcc, mpxlf90 have their main entry point replaced with a new entry point which handles node allocation and process setup on all the nodes involved in the parallel job. A large number of environment variables ("MP_*") and command line options affect the way that PE starts the job. A few relevant to scaling which NERSC sets by default for all users are listed below:

MP_SHARED_MEMORY=YES           Intranode messages go through shared memory: faster, less switch traffic
MP_EUIDEVICE=csss              Use both switch adapters: more switch bandwidth
MP_EUILIB=us                   User space MPI: faster MPI messaging
MP_RETRANSMIT_INTERVAL=50000   How often to retry dropped packets: empirically the best setting
Other environment variable settings that impact application scaling performance are touched on in the following sections.

The user's program code is then called after all PE startup work is complete. The time required to complete this startup is examined below.

Job Startup

Often the first step toward doing production work at a higher level of parallelism involves trying short running test codes with more tasks. One of the first things that a seaborg user may notice when examining scaling this way is increased sluggishness of program startup and termination.

Therefore, in order to properly gauge the timings from such scaling tests, it is important to take into account how job startup time scales. Results are shown below for the time taken to initiate poe and complete MPI_Init() for a varying number of tasks.
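
To get a rough feel for this in your own jobs, one can timestamp the entry to main() and the return from MPI_Init() (a sketch only, not the NERSC benchmark; it misses the poe and PE startup work that happens before the user's main() is entered):

    /* sketch: per-task view of startup cost (excludes poe/PE work done
       before user main() begins executing) */
    #include <mpi.h>
    #include <stdio.h>
    #include <sys/time.h>

    int main(int argc, char **argv)
    {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);                 /* entering user main() */
        MPI_Init(&argc, &argv);
        gettimeofday(&t1, NULL);                 /* MPI_Init() complete */
        double dt = (t1.tv_sec - t0.tv_sec) + 1.0e-6*(t1.tv_usec - t0.tv_usec);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        double dtmax;
        MPI_Reduce(&dt, &dtmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("slowest task spent %g s in MPI_Init()\n", dtmax);
        MPI_Finalize();
        return 0;
    }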

Since this is a one-time cost, its overall impact on user codes is minimal except for very short running jobs. A parallel application based on several short running parallel job steps would be impacted in proportion to its MPI concurrency.

It is however useful to know the time scale for parallel job startup. Especially for applications running at large concurrency it may be important to wait a minute or so before concluding that the job is in trouble. This small time window, while waiting for the first line of output from the program, is often highly scrutinized by users when running at a higher concurrency for the first time. Before deciding the job has failed or is hung make sure to wait for this startup to complete.

For longer running production jobs this issue is less important.

MPI Memory Usage

Aside from the fixed amount of memory required to build the PIPEs,

PIPE memory per task = 2*MP_PIPE_SIZE*(# of MPI tasks total - 1)

(by this formula, a 1024-way job with the default 64 KB pipes reserves roughly 2*64 KB*1023, or about 128 MB, per task), various MPI functions require internal temporary buffers. The size of these buffers is not documented, but it is seen to vary with message buffer size, MPI_Type, and between the various MPI functions themselves. E.g., MPI collective and reduction operations tend to require greater internal buffer space than point to point MPI functions.

Characterizing the memory required by each MPI function call for a given message size, while potentially useful, would probably provide too much information to be easily digested. The need for temporary storage by certain routines can be inferred from the algorithmic specifics of an MPI call.

From the perspective of someone writing MPI applications, the important question here is "How much memory can I use in my code and how much do I have to set aside for MPI?"

To answer this question in a reduced way, a microkernel code mpimem.c was written which exercises some of the most often used MPI functions (MPI_Barrier, MPI_Bcast, MPI_Alltoall, MPI_Allreduce, MPI_Reduce) and measures, by way of system calls (e.g., ps, sbrk, getrusage), the memory used by the process as a function of problem/message size. Memory used by the process but not allocated by the user code is considered part of the memory used internally by MPI.
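
A stripped-down version of the idea (a sketch in the spirit of mpimem.c, not its actual source; the sbrk-based heap probe and the message sizes are assumptions made here) looks like:

    /* sketch: watch heap growth across an MPI_Alltoall of increasing size;
       growth not attributable to user allocations is MPI-internal */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static long heap_top(void) { return (long)sbrk(0); }   /* current program break */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, np;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);
        for (int n = 16; n <= 1024; n *= 4) {               /* doubles per destination */
            double *snd = malloc((size_t)n * np * sizeof(double));
            double *rcv = malloc((size_t)n * np * sizeof(double));
            long before = heap_top();                       /* after user allocations */
            MPI_Alltoall(snd, n, MPI_DOUBLE, rcv, n, MPI_DOUBLE, MPI_COMM_WORLD);
            long after = heap_top();                        /* growth during the call */
            if (rank == 0)
                printf("%d doubles per destination: heap grew %ld bytes during MPI_Alltoall\n",
                       n, after - before);
            free(snd); free(rcv);
        }
        MPI_Finalize();
        return 0;
    }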

The following plot shows the measurement described above for a 1024 way job:

MPI Memory Usage
source: mpimem.c

For large concurrency jobs, internal memory usage by MPI can become a significant issue, leading to unexpected job failures.

IBM's MPI implementation allows throttling down the size of the PIPEs via the PE environment variable MP_PIPE_SIZE. The default setting of 64 KB per PIPE can be decreased to either 32 KB or 16 KB. The memory savings from decreasing MP_PIPE_SIZE are shown in the following graphs:

[Figures: per-task MPI memory usage vs. MP_PIPE_SIZE for 256, 512, and 1024 task jobs]

Since there are potential performance penalties when using smaller MPI buffers, MP_PIPE_SIZE=32 or 16 should be viewed as a workaround for codes running out of memory rather than as a standard programming practice for large concurrency jobs.

Synchronization

One of the major impediments to scaling up the concurrency of scientific codes is latency in synchronization. This is a well understood aspect of coordinating parallel operation of several resources when each may or may not be ready for work at a given time. As the number of resources increases the frequency and/or duration of interruptions must decrease in order to maintain the same efficiency.

Unfortunately for clusters of independently scheduled PEs the frequency and duration of interruptions are roughly fixed, being intensive properties of the cluster's building blocks. The cumulative time required to schedule and complete collective work across the cluster is extensive and scales with the concurrency of the parallel job. It is largely for this reason that we focus on mitigating the impacts of synchronization at high concurrency.

The graph below shows the time required to complete a synchronizing MPI_Barrier call for a varying number of tasks. Four jobs were run on 8, 32, 64, and 128 nodes respectively. Within each job the timings of MPI_Barrier were measured for MPI communicators of varying size.

MPI_Barrier Scaling
source: comm.c
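
The measurement can be sketched as follows (a simplified stand-in for comm.c, which is not reproduced here; the repetition count and power-of-two communicator sizes are arbitrary choices):

    /* sketch: time MPI_Barrier on sub-communicators of increasing size */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, np;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);
        for (int n = 2; n <= np; n *= 2) {
            MPI_Comm sub;
            int color = (rank < n) ? 0 : MPI_UNDEFINED;   /* first n tasks join */
            MPI_Comm_split(MPI_COMM_WORLD, color, rank, &sub);
            if (color == 0) {
                MPI_Barrier(sub);                         /* warm-up */
                double t0 = MPI_Wtime();
                for (int i = 0; i < 100; i++)
                    MPI_Barrier(sub);
                double dt = (MPI_Wtime() - t0) / 100.0;
                if (rank == 0)
                    printf("%6d tasks: %g s per MPI_Barrier\n", n, dt);
                MPI_Comm_free(&sub);
            }
            MPI_Barrier(MPI_COMM_WORLD);                  /* keep idle tasks in step */
        }
        MPI_Finalize();
        return 0;
    }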

The impact of concurrency on synchronization times is dramatic, spanning four orders of magnitude. Because the IBM SP is a cluster of independently scheduled instances of AIX, it grows increasingly unlikely that at a given moment all the nodes involved will be ready to complete a synchronizing operation.

The sections that follow offer advice on mitigating the impact of high concurrency global synchronization.

Load Balancing

Aside from the intrinsic delays (due to kernel and OS background activity, etc.) in scheduling a large number of tasks, additional delays can be introduced through unequal partitioning of work, which further increases the wait when a synchronizing section of the code is reached.

To illustrate the impact of load imbalance on parallel execution consider a hypothetical code running on 4 tasks. The code proceeds through iterations composed of synchronizing, doing work and doing I/O. The time to solution is negatively impacted if the time spent doing work shows wide variance. For such an application the impact of load imbalance scales dramatically with concurrency, as it introduces more serial time into an Amdahl like model for parallel efficiency.
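
A back-of-the-envelope version of that argument, with notation introduced here rather than taken from the original figures: if task p of P spends time t_p in the work phase of an iteration, the iteration cannot finish before the slowest task does, so (in LaTeX form)

    T_{\mathrm{iter}} \approx \max_p t_p + T_{\mathrm{sync}} + T_{\mathrm{I/O}},
    \qquad
    \mathrm{efficiency} \approx \frac{\frac{1}{P}\sum_p t_p}{\max_p t_p}

Any widening of the spread between the average and the slowest task's work time shows up directly as lost wallclock, and with more tasks the maximum of the t_p tends to drift further from the average.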

[Figure: Wallclock Impact of Load Balance, comparing imbalanced and balanced FLOP work; the yellow region shows time lost to load imbalance]

In practice the MPI portion of the code could be in either or both of the I/O or Sync parts of the code. While load imbalance can occur in these parts of a code (particularly if the I/O involves disk activity), the most important path to enforcing load balance is evenly partitioning the problem space across the PEs.

Equally distributing problem domains across a processor topology can be made easier through libraries such as Metis which can produce optimal partitions. Scalapack and other parallel libraries have functionality to decompose problems across PEs. It is generally a good idea to check for load imbalance by, e.g., looking at the distribution of HPM statistics when preparing production runs for a new problem or at a new concurrency.
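
One simple way to make such a check from within the code itself (a sketch; the lopsided work loop below is synthetic, just to give each task something to time) is to time the work phase on every task and reduce the extremes to rank 0:

    /* sketch: reduce min/avg/max of each task's work-phase time to rank 0 */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, np;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        /* synthetic, deliberately imbalanced "work" phase */
        double t0 = MPI_Wtime(), x = 0.0;
        long iters = 1000000L * (1 + rank % 4);
        for (long i = 0; i < iters; i++)
            x += 1.0 / (double)(i + 1);

        double t = MPI_Wtime() - t0, tmin, tmax, tsum;
        MPI_Reduce(&t, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
        MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&t, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("work time min/avg/max = %g / %g / %g s  (checksum %g)\n",
                   tmin, tsum / np, tmax, x);
        MPI_Finalize();
        return 0;
    }

A large gap between the minimum and maximum work times is a direct measure of the wallclock that will be lost at the next synchronizing section.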

One may also choose to adopt a partitioning scheme which is suboptimal in some sense if it makes load balance easier. E.g., a hyperslab decomposition might be chosen over a block cyclic one if it leads to more manageable and predictable load balance. The tradeoff here is between time lost to load imbalance versus time lost to the algorithm itself.


MPI Collectives

It's difficult to provide a taxonomy which describes the level of synchronization required by MPI library calls. This is due to the multitude of ways they can be called and the fact that the implementation may choose certain synchronization rules at run time (e.g., eager vs. rendezvous protocols). However, generally speaking it is possible to categorize certain commonly used MPI calls by their overall level of synchronization based on when tasks are allowed to leave an MPI function call.

By MPI collectives we mean MPI routines which all tasks in a communicator must call; they differ in whether a task may leave the call as soon as its own contribution is complete or must wait until all tasks have participated.

The MPI calls which are known to suffer from scaling performance problems are certain MPI collective calls (e.g., MPI_Alltoall and MPI_Allreduce operations). The reasons for this are more complicated than for a simple MPI_Barrier, since not only synchronization but also switch bandwidth and the algorithmic specifics of the MPI calls come into play. Aside from some degree of synchronization, collective MPI calls involve some amount of work in relation to the size of messages being communicated or reduced.

Below the time to complete an MPI_Alltoall and MPI_Allreduce for varying message sizes is shown:

[Figures: time to complete MPI_Alltoall and MPI_Allreduce vs. message size, and the synchronization penalty, for 256, 512, 1024, and 2048 tasks]

MPI_Bcast, which is not fully synchronizing, does not show as large or as varied timings. Though a broadcast and an all-reduce are quite different, there are cases where not all tasks need the reduced data, or do not need it at the same time, and a set of broadcasts can replace a more highly synchronizing collective.

MPI_Bcast
source: mpimem.c

One aspect of the scaling of MPI collectives which is not depicted above is the variability in timings. Highly synchronizing calls of any sort not only show greater average delays at higher concurrency, but also show greater relative variability in the length of delays.

Some of the performance shortcomings of the MPI implementation on the SP arise from (or are exacerbated by) certain generalities in the MPI standard. In the interest of standards compliance certain general cases must be taken into account which lead to worse performance than one might expect from a rough estimate. In general, use the most contiguous representation possible when creating MPI_Types.
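
As an illustration of the contiguity advice (a sketch with hypothetical array shapes and function names), consider two ways of sending one column of a row-major nx-by-ny array: a strided derived type forces the library to handle a scattered layout, while copying into a contiguous scratch buffer presents the simplest possible description of the data:

    #include <mpi.h>
    #include <stdlib.h>

    /* strided: one element per row, stride of ny doubles between them */
    void send_column_strided(double *a, int nx, int ny, int col,
                             int dest, MPI_Comm comm)
    {
        MPI_Datatype column;
        MPI_Type_vector(nx, 1, ny, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);
        MPI_Send(&a[col], 1, column, dest, 0, comm);
        MPI_Type_free(&column);
    }

    /* contiguous: copy the column into a scratch buffer and send that */
    void send_column_packed(double *a, int nx, int ny, int col,
                            int dest, MPI_Comm comm)
    {
        double *tmp = malloc(nx * sizeof(double));
        for (int i = 0; i < nx; i++)
            tmp[i] = a[i*ny + col];
        MPI_Send(tmp, nx, MPI_DOUBLE, dest, 0, comm);
        free(tmp);
    }

Whether the extra copy pays off depends on the column length and how often the exchange is repeated; the point is simply that the more contiguous the description handed to the library, the less general machinery it has to invoke.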

If the user code/application in question does not make use of MPI_Types or other higher level MPI constructs, e.g., the message data are all vectors of atomic types (double, int, real*8, etc.), it has been demonstrated that better performing (non standards compliant) variants of MPI collectives may be written.

IBM's ACTC provides such an MPI variant, Turbo MP. Likewise users have presented their own hand coded MPI collective routines at various conferences. It is important to realize that this version of the MPI library is neither standards compliant nor a supported IBM software product. In evaluating such approaches the user must weigh the tradeoffs between standards compliance, portability, and performance. As of this writing, this library requires both MPI and LAPI, so in general codes will be limited to 8 tasks per node in order to accommodate 2 switch windows per task.

Example: Reducing MPI_Allreduce

Crucial to the scaling efficiency of most codes is reducing the frequency and duration of synchronizing MPI collectives. Global reductions are often unavoidable from an algorithmic perspective, but the impact on the code performance may often be mitigated by consolidating the global reduction to the smallest set of exchanges.

More Synchronizing:
      sump=0.0
      do i=1,np
       sum=zero
       do j=1,nz
        jj=j+rank*nz
        sum=sum+phi(i,j,ip)*psi(jj,n,il2)
       enddo
       sumr=real(sum)
       sumi=aimag(sum)
       call mpi_allreduce(sumr,sumrt,1,mpi_real8,
     1 mpi_sum,mpi_comm_world,ierr)
       sumr=sumrt
       call mpi_allreduce(sumi,sumit,1,mpi_real8,
     1 mpi_sum,mpi_comm_world,ierr)
       sumi=sumit
       sump=sump+hzz*(sumr**2+sumi**2)
      enddo

Less Synchronizing:

       do i=1,np
        sum = zero
        sumpit(i) = zero
        sumpi(i) = zero
        do j=1,nz
         jj=j+rank*nz
         sum=sum+phi(i,j,ip)*chi(jj,n,il2)
        enddo
        sumpi(i) = sum
       enddo
 
 
 
       call MPI_Reduce(sumpi,sumpit,np,mpi_complex16,
     1            MPI_SUM,0,mpi_comm_world,ierr)
       if(rank .eq. 0) then
        sump=0.0
        do i=1,np
         sump = sump +
     1  (real(sumpit(i))**2 + aimag(sumpit(i))**2)*hzz
        enddo
       endif
       call MPI_Bcast(sump,1,mpi_real8,
     1  0,mpi_comm_world,ierr)


MPI Point to Point

Comparing timings and usage of globally called collectives is much simpler than surveying the space of possible pairwise communications. The pattern of messages in a code using point to point messages will depend closely on the problem being solved and the best we can do here is provide some general information and a few specific examples.

Here we focus on nearest neighbor exchanges, which are common to solving PDEs on regular grids. An examination based on dimensionality of the grid (and therefore number of neighbors) along with message size is presented. The information below, along with information about the switch, should be sufficient to answer most questions about point to point communication on seaborg. If you have further questions feel free to contact NERSC Consultants.

Halo Communication I - Synchronous

A common communication pattern in scientific codes is halo or N-nearest neighbor communication. While the ordering and number of steps will depend on the calculation at hand, there are certain worst case patterns that application programmers will almost always want to avoid.

The following cartoons and fragments from a real code contain two nearest neighbor exchanges that show very different levels of synchronization. The second approach outperforms the first by 40% on 128 tasks.

More Synchronizing:

 if(rank.ne.size-1) then
  call mpi_send(mesh(1,ny),nx,mpi_real8, &
       rank+1,1,mpi_comm_world,ierr)
  call mpi_recv(back,nx,mpi_real8, &
       mpi_any_source,2,mpi_comm_world,istat,ierr)
 end if
 if(rank.ne.0) then
  call mpi_recv(front,nx,mpi_real8, &
       mpi_any_source,1,mpi_comm_world,istat,ierr)
  call mpi_send(mesh(1,1),nx,mpi_real8, &
       rank-1,2,mpi_comm_world,ierr)
 end if

Less Synchronizing:

 rankf = rank+1
 rankb = rank-1
 if(rank.eq.nproc-1) rankf = MPI_PROC_NULL
 if(rank.eq.0) rankb = MPI_PROC_NULL
 call mpi_sendrecv(mesh(1,ny),nx,mpi_real8, &
      rankf,1,front,nx,mpi_real8, &
      rankb,1,mpi_comm_world,istat,ierr)
 call mpi_sendrecv(mesh(1,1),nx,mpi_real8, &
      rankb,2,back,nx,mpi_real8, &
      rankf,2,mpi_comm_world,istat,ierr)

Choose messaging strategies that wait only as long as required (but no longer) before initiating the communication needed for the computation. The first example above includes more delays (through blocking MPI) than are required for each node to exchange boundary data with its neighbors.


Halo Communication II - Asynchronous

Another common approach to domain decomposed exchanges is to post MPI_Irecvs, MPI_Isends, and then wait for the communication to complete. This asynchronous approach imposes a minimal amount of blocking and exposes opportunity for overlap of communication and computation. In practice on seaborg, the benefit from the former is greater than for the latter.

Below we will compare five strategies that differ in their degree of synchronization. The meshes are set up such that each node has a neighboring rank, neigh[dim][{0,1}], where dim is 0, 1, or 2 (for one-, two-, and three-dimensional meshes) and {0,1} indicates the direction along each dimension.

The message exchange strategies and a simple mnemonic tag for each are:

tag MPI Calls code
BSBR MPI_Send, MPI_Recv
 
    if(rank%2) {
     MPI_Send(obuf+0*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm);
     MPI_Recv(ibuf+0*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm, s+0);
     MPI_Send(obuf+1*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm);
     MPI_Recv(ibuf+1*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm, s+0);
    } else {
     MPI_Recv(ibuf+0*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm, s+0);
     MPI_Send(obuf+0*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm);
     MPI_Recv(ibuf+1*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm, s+0);
     MPI_Send(obuf+1*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm);
    }
SERE MPI_Sendrecv
 
    MPI_Sendrecv(obuf+0*bytes,bytes,MPI_BYTE,neigh[0][0],0,
                 ibuf+0*bytes,bytes,MPI_BYTE,neigh[0][1],0,comm,s+0);
    MPI_Sendrecv(obuf+1*bytes,bytes,MPI_BYTE,neigh[0][1],0,
                 ibuf+1*bytes,bytes,MPI_BYTE,neigh[0][0],0,comm,s+1);
ISBR MPI_Isend, MPI_Recv
 
   MPI_Isend(obuf+0*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm,r+0);
   MPI_Isend(obuf+1*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm,r+1);
   MPI_Recv(ibuf+0*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm, s+0);
   MPI_Recv(ibuf+1*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm, s+1);
   MPI_Waitall(2,r,s);
ISIR MPI_Isend, MPI_Irecv
 
    MPI_Isend(obuf+0*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm, r+0);
    MPI_Isend(obuf+1*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm, r+1);
    MPI_Irecv(ibuf+0*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm, r+2);
    MPI_Irecv(ibuf+1*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm, r+3);
    MPI_Waitall(4,r,s);
IRIS MPI_Irecv, MPI_Isend
 
    MPI_Irecv(ibuf+0*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm, r+0);
    MPI_Irecv(ibuf+1*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm, r+1);
    MPI_Isend(obuf+0*bytes, bytes, MPI_BYTE, neigh[0][0], 0, comm, r+2);
    MPI_Isend(obuf+1*bytes, bytes, MPI_BYTE, neigh[0][1], 0, comm, r+3);
    MPI_Waitall(4,r,s);

The ibuf and obuf are non overlapping receive and send buffers. The boundary conditions are periodic.

The time to complete an exchange as a function of message size is shown below.

[Figures: exchange time vs. message size at low and high concurrency, and asynchronous speedup, for 1D ring, 2D mesh, and 3D mesh topologies]

This experiment shows that the benefit from asynchronous point to point messaging becomes significant at large concurrencies, even more so when the topology of the problem includes many point to point pairs. For seaborg the timings generally follow the trend:

Isend/Irecv < Isend/Recv < Sendrecv < Send/Recv

Avoiding Synchronization

Avoiding synchronization insofar as possible can be of great benefit to the scaling performance of a parallel code. The way in which a problem is coded has a direct impact on the amount and degree of synchronization that happens between tasks. Some amount of synchronization, either partial or global, is intrinsic to most algorithms; other, unnecessary synchronization is often the result of a mismatch between the order in which events are specified (in the code) to occur and the order in which they must occur algorithmically. The structure of MPI makes it easy to unintentionally introduce extraneous points of synchronization which, if removed, may benefit the scaling performance of the code.