The ASCI sPPM Benchmark Code

Code Description

A. General Description

The sPPM benchmark solves a 3D gas dynamics problem on a uniform Cartesian mesh using a simplified version of the PPM (Piecewise Parabolic Method) code; hence the "s" for simplified. The code is written to simultaneously exploit explicit threads for shared memory multiprocessing and domain decomposition with message passing for distributed parallelism. It represents the current state of ongoing research that has demonstrated good processor performance, excellent multithreaded efficiency, and excellent message passing parallel speedups, all at the same time.

The hydrodynamics algorithm involves a split scheme of X, Y, and Z Lagrangian and remap steps that are computed as three separate passes or sweeps through the mesh per timestep, each time sweeping in the appropriate direction with the appropriate operator. Each such sweep through the mesh requires approximately 680 FLOPs to update all of the state variables for each real mesh cell. Message passing is used to update ghost cells with data from neighboring domains three times per timestep and occurs just before each of the X, Y, and Z sweeps. Multiple threads are used to manipulate data and update pencils of cells in parallel.

The sPPM code is of interest as an efficient SMP cluster implementation of numerical methods relevant to DOE's Stockpile Stewardship Program. One goal of the ASCI Purple strategy is to demonstrate a sustained 35 TeraFLOP/s execution rate for the sPPM code.

B. Implementation

The sPPM code is written in Fortran 77 with some C routines. All of the Fortran source files must be preprocessed first by the m4 macro preprocessor and then by the cpp preprocessor (see the Makefile). The I/O related routines are all in C and have been grouped together in the file c_io.c; there may be some system dependencies in this part of the code. For direct use of POSIX threads, the multithreaded support routines are all in C and can be found in the cpthreads_sppm_XXX.c files as described below. The timer routine is in C (timers.c) and may require changes for specific systems; it is the same routine used by several other ASCI benchmarks.

Multithreaded support is provided in two flavors: POSIX threads and SGI "sproc". The Fortran interface accommodates both environments with a single calling syntax and was tested on Compaq, SGI, IBM, IA64 Linux, and SUN Solaris systems. Compare cpthreads_sppm_SGI.c and cpthreads_sppm_DEC.c for the details of the implementation on each system. The cpthreads_sppm_DEC.c and cpthreads_sppm_POSIX.c files are identical. In all cases, additional slave threads are created that, along with the parent process, communicate through shared static memory declared in Fortran with either SAVE or COMMON (e.g., indx, ddd, ipencil, ...).

The POSIX flavor creates additional threads (2, ..., nthreads). The SGI version spawns extra copies of the parent process via the "sproc" routine, which is similar but not identical to the standard Unix "fork" call; "sproc" differs in that it causes all static (i.e., not stack based) storage to be implicitly shared between processes on a single SMP node. These slave threads loop endlessly in the routine control (see runhyd3.m4). Synchronization is accomplished by testing shared variables and, in some cases, by use of Unix System V semaphore calls (see cpthreads_sppm_XXX.c). Parallel loops are self-scheduled using a shared integer index via the PLOOP macro in cliches.h. On the SGI, a fast, vendor-supplied atomic fetch-and-add (test_and_add) is used; on the DEC system, a POSIX mutex implementation is used.
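As an illustration of this self-scheduling scheme, here is a minimal C sketch of a mutex-protected fetch-and-add doling out loop iterations to threads. The names (next_index, fetch_and_add, do_work) are hypothetical; the real code is generated by the PLOOP/PLOOPEND macros in cliches.h together with the routines in cpthreads_sppm_XXX.c.

      #include <pthread.h>

      /* Hypothetical sketch of a mutex-based fetch-and-add, as used on
         systems without a vendor-supplied atomic test_and_add. */
      static pthread_mutex_t loop_lock = PTHREAD_MUTEX_INITIALIZER;
      static int next_index = 0;             /* shared loop counter */

      static int fetch_and_add(int *counter)
      {
          int i;
          pthread_mutex_lock(&loop_lock);
          i = (*counter)++;
          pthread_mutex_unlock(&loop_lock);
          return i;
      }

      /* Each thread repeatedly claims the next unprocessed iteration
         until the shared counter passes the end of the loop. */
      void self_scheduled_loop(int nwork, void (*do_work)(int))
      {
          int i;
          while ((i = fetch_and_add(&next_index)) < nwork)
              do_work(i);
      }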

As an alternative, the basic method used for POSIX threads is now also available in the form of OpenMP directives. See the Makefile for the flags needed to use OpenMP rather than explicit thread library calls.

All message passing is done using standard MPI calls. Only a small set of MPI routines are actually required:

     MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize, MPI_Barrier,

     MPI_Irecv, MPI_Isend, MPI_Wait, MPI_Allreduce, MPI_Cancel

All threads participate in assembling and disassembling the message arrays, but only thread 1 ever calls the MPI routines. Although non-blocking operations are used, the communications do not overlap the main X, Y, and Z sweep computations. Complete ghost cell updates occur three times per timestep (once before each of the X, Y, and Z sweeps). Once per double timestep, a global reduction is used to compute the Courant number.
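For illustration, a minimal sketch of this communication pattern (using only the MPI calls listed above) might look like the following. The buffer and routine names are hypothetical; the real message assembly and exchange are generated by the BDRYS macro in bdrys.m4.

      #include <mpi.h>

      /* Hypothetical sketch: exchange one face of ghost cell data with a
         neighboring domain using non-blocking sends and receives. */
      void exchange_face(double *sendbuf, double *recvbuf, int count,
                         int neighbor, MPI_Comm comm)
      {
          MPI_Request rreq, sreq;
          MPI_Status  status;

          MPI_Irecv(recvbuf, count, MPI_DOUBLE, neighbor, 0, comm, &rreq);
          MPI_Isend(sendbuf, count, MPI_DOUBLE, neighbor, 0, comm, &sreq);
          MPI_Wait(&rreq, &status);
          MPI_Wait(&sreq, &status);
      }

      /* Once per double timestep: reduce the local Courant numbers to a
         global value used to pick the next timestep.  (The reduction
         operation shown here is an assumption; see runhyd3.m4.) */
      double global_courant(double local_courant, MPI_Comm comm)
      {
          double global_value;
          MPI_Allreduce(&local_courant, &global_value, 1, MPI_DOUBLE,
                        MPI_MAX, comm);
          return global_value;
      }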

The solution progresses in double timesteps. Each double timestep involves the following 13 steps (see runhyd and control in runhyd3.m4):

  1. X bdrys
  2. X sweep
  3. Y bdrys
  4. Y sweep
  5. Z bdrys
  6. Z sweep
  7. Z bdrys
  8. Z sweep
  9. Y bdrys
  10. Y sweep
  11. X bdrys
  12. X sweep
  13. Courant

Ghost cells are updated for all six faces of each subdomain during the "bdrys" steps. Real boundary conditions are imposed where there is no neighboring subdomain. The giant m4 BDRYS macro in bdrys.m4 creates three nearly identical routines for the X, Y, and Z ghost cell updates. All threads help assemble and disassemble the message arrays, but only thread 1 calls the MPI routines. During each "sweep" step, all cells are updated using the appropriate directionally split sPPM algorithm. These updates use a scheme of pencils and 1D strip updates; the entire sPPM physics is confined to a single, direction-independent 1D strip update routine, sppm (see sppm.m4). A final global reduction determines the size of the next timestep based on the Courant condition.
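The step ordering above can be summarized in the following sketch; the routine names are hypothetical stand-ins for the code generated in runhyd3.m4.

      /* Hypothetical sketch of one double timestep (steps 1-13 above). */
      void bdrys_x(void);  void sweep_x(void);
      void bdrys_y(void);  void sweep_y(void);
      void bdrys_z(void);  void sweep_z(void);
      void courant(void);

      void double_timestep(void)
      {
          bdrys_x();  sweep_x();      /* steps  1- 2 */
          bdrys_y();  sweep_y();      /* steps  3- 4 */
          bdrys_z();  sweep_z();      /* steps  5- 6 */
          bdrys_z();  sweep_z();      /* steps  7- 8 */
          bdrys_y();  sweep_y();      /* steps  9-10 */
          bdrys_x();  sweep_x();      /* steps 11-12 */
          courant();                  /* step  13    */
      }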

The giant m4 CALCHYD macro in runhyd3.m4 creates three nearly identical routines for the X, Y, and Z sweeps that work as follows. The 3D mesh is logically mapped into pencils of 8x8xN cells with the long length, N, aligned along the sweep direction and the transverse directions tiled into 8x8 chunks. These logical pencils are then processed in a natural 2D order: Y then Z for X sweeps, X then Z for Y sweeps, and X then Y for Z sweeps. Each pencil is updated as a single unit with 64 calls to the sppm routine, which updates a single 1D strip of N cells. The sppm "stencil" requires cells +/- nbdy=5 cells away in the principal direction and +/- nbdy2=2 cells away in the transverse directions.

In principle, a pencil update involves three steps. First, buffer arrays (rrho, pp, uux, uuy, and uuz) are loaded with "old" values from the full mesh data array (ddd). This data copy de-interleaves the data, which is stored interleaved in the ddd array for cache line efficiency, and transposes it from the natural XYZ order to the appropriate sweep order. Second, the sppm routine is called to perform a 1D strip update on each of the lines in the pencil; the "new" values returned by sppm (ronu, prnu, uxnu, uynu, and uznu) are stored into an interleaved temporary array (dddnu). Lastly, when the entire pencil is completed, the "new" values (dddnu) are copied and transposed back into the main mesh data array (ddd).
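A sketch of this three-step pattern, with hypothetical routine names standing in for the code generated by the CALCHYD macro, is:

      /* Hypothetical sketch of a single pencil update (8x8xN cells). */
      void load_old_values(void);      /* ddd -> rrho, pp, uux, uuy, uuz */
      void sppm_strip(int j, int k);   /* 1D strip update -> dddnu       */
      void store_new_values(void);     /* dddnu -> ddd                   */

      void update_pencil(void)
      {
          int j, k;

          load_old_values();           /* step 1: copy and transpose in  */

          for (k = 0; k < 8; k++)      /* step 2: 64 strip updates       */
              for (j = 0; j < 8; j++)
                  sppm_strip(j, k);

          store_new_values();          /* step 3: copy and transpose out */
      }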

The actual multithreaded pencil implementation in sPPM is not quite this simple. It uses circular buffering of columns of pencils, both so that no data ever needs to be read from the ddd array more than once and to reduce memory size, and it uses a two-column read-ahead strategy to ensure that each thread will have all necessary data available when it is time to update a given pencil. Basically, the load step (the first step in the previous paragraph) is done two columns ahead of the other two steps, and columns of the buffer arrays are reused as soon as the "old" data they contain will never be needed again. A single, two-column extended pencil loop doles out work to each thread. The important parameters that control this implementation are in buffers.h. The two-column default minimizes memory and works well as long as the number of threads is less than the number of pencils in a single column.
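The circular reuse of buffer columns amounts to simple modular indexing. A sketch, with hypothetical names and with the buffer depth and read-ahead distance taken from the example below (the real parameters are in buffers.h), is:

      /* Hypothetical sketch of the two-column read-ahead. */
      #define NCOLBUF    4     /* columns held in the circular buffer    */
      #define READ_AHEAD 2     /* columns pre-loaded ahead of the update */

      void preload_column(int column, int slot);
      void update_column(int column, int slot);

      void process_column(int column, int ncolumns)
      {
          int load_col = column + READ_AHEAD;

          /* Pre-load two columns ahead, overwriting the buffer slot
             whose "old" data will never be needed again. */
          if (load_col < ncolumns)
              preload_column(load_col, load_col % NCOLBUF);

          /* Update every pencil in this column from the buffers. */
          update_column(column, column % NCOLBUF);
      }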

In the example shown below, we assume a two-column read-ahead and a circular buffer containing four columns of pencils. A thread has just been assigned to update pencil 14. The first step is to check that the data for pencil 14 and all of its neighbors are available; because of the nbdy2 reach of the sppm "stencil," pencil 14 requires "old" data values from pencils 9, 10, 11, 13, 14, 15, 17, 18, and 19. A check of the ipencil flags for pencils 1, 2, 3, 5, 6, 7, 9, 10, and 11 guarantees that all of the needed data is present, because those pencils, when they were processed, pre-loaded the pencils we want. The thread then pre-loads the "old" ddd data for pencil 22 (including 2 ghost strips) into the circular buffers, overwriting some of the data for pencils 6 and 10. Since pencils 9, 10, and 11 have already been updated, only two strips of pencil 10 need to be preserved. Finally, the thread updates all 64 1D strips in pencil 14 one at a time and stores the "new" values into the main ddd array. This logic runs simultaneously across multiple threads, each assigned to update a different pencil.

                  Multi-threaded Pencil Updates
                (sweep direction is into the page)


      /        /        /        /        /        /        /
     /________/________/________/________/________/________/
     |        |        |        |        |        |        |
     |    4   |    8   |   12   |   16   |   20   |   24   |
     |        |        |        |        |        |        |
     |        |        |        |        |        |        |
     |________|________|________|________|________|________|
     |        |        |        |        |        |        |
     |    3   |    7   |   11   |   15   |   19   |   23   |
     |        |        |        |        |        |        |
     | check  |  check |  check |        |        |        |
     |________|________|________|________|________|________|
     |        |        |        |        |        |        |
     |    2   |    6   |   10   |   14   |   18   |   22   |
     |        |        |        |        |        |        |
     | check  |  check |  check | update |        |pre-load|
     |________|________|________|________|________|________|
     |        |        |        |        |        |        |
     |    1   |    5   |   9    |   13   |   17   |   21   |
     |        |        |        |        |        |        | /
     | check  |  check |  check |        |        |        |/
     |________|________|________|________|________|________/


                          Buffer Usage
                  (for the 2,6,10,14,18,22 row)

                   | | |        |        |        |
      before       |2|2|   8    |    8   |    8   |
                   | | |        |        |        |

                              | |        |        |        | |
      after                   |2|   8    |    8   |    8   |2|
                              | |        |        |        | |

C. Problem Description

The sPPM problem involves a shock propagating through a gas with a density discontinuity. The coordinates are -1:1 in x and y, and 0:zmax in z, where zmax depends on the overall aspect ratio prescribed. A plane shock traveling up the +z axis encounters a density discontinuity, at which the gas becomes denser. The shock is carefully designed to be simple, though strong (about Mach 5). The gas initially has density 0.1 ahead of the shock; over 5dz at the discontinuity, it changes to 1.0.

The density discontinuity surface is initially perturbed in +z via 3 sinusoidal and 2 image components:

      cos ( 3pi/2 * sqrt(x^2+y^2) )

      cos ( 3pi/2 * x*10 )

      cos ( 3pi/2 * y*12 )

      ppmimage (optional)

      teximage (optional)

with a scale factor applied to each. See the routine setup in file main.m4. The two image perturbations are optional (they depend on the presence of byte image files named "ppmimage" and "teximage") and are not required for the problem.
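Assuming the components are simply scaled and summed (the exact combination and the scale factors are set in the routine setup in main.m4; a1, a2, and a3 below are just placeholders for those factors), the sinusoidal part of the initial interface displacement has the form

      dz(x,y) = a1*cos( 3pi/2 * sqrt(x^2+y^2) )
              + a2*cos( 3pi/2 * x*10 )
              + a3*cos( 3pi/2 * y*12 )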

For a qualitative idea of what is being computed, here are some volume renderings made of a 128-cubed simulation.


Ln(Density), time = 0.0
This initial state has been perturbed by "ppmimage128" and by a random noise texture image. The color mapping puts Ln(Density)=0 at the center of the scale, which is true of the other two Ln(Density) images as well.


Ln(Density), time = 0.5
Here a cutting plane has cut away part of the departing shock wave and exposes the center of the cube.


Ln(Density), time = 0.9


Vorticity magnitude, time = 0.6, center 1/3 of the cube


Density Laplacian, time = 0.6
This is an image of del^2(Density). The grayscale is centered, with zero at midscale.


Building Issues

A. Thread Issues

An attempt has been made to minimize the places where the thread interface is exposed. There is also an m4 switch, THREADED, that can generate a version of sPPM that does not spawn any threads and relies on MPI alone. Use of this non-threaded version is discouraged on SMPs, because it is not as efficient an implementation; it may, however, be useful as a first step when porting.

If OpenMP is not being used, there are three basic places to look in the source files for the SMP thread control:

  1. The thread initialization and spawning, located in main.m4 and in the routine spawn in runhyd3.m4. The THREADED switch will guide you to these locations; the non-threaded version simply comments these areas out.
  2. The actual thread control, done with five macros in the cliches.h file. PLOOP and PLOOPEND generate code for a fetch-and-add based do-across loop; these revert to a simple Fortran "do while" loop in the non-threaded version. In addition, there are macros for two barrier routines, which are no-ops in the non-threaded version, and the IWAIT macro, a spin wait for the ipencil=1 test. These five macros should be replaced by the appropriate equivalents for the new platform.
  3. The code for the routines referenced in the cliches.h file, found in the file cpthreads_sppm_XXX.c (XXX is your system). This should be replaced with the appropriate routines, if any, that your macros need.
If OpenMP is in use (i.e., the OPENMP switch is set), the explicit threads code is removed by the preprocessor and the compiler is expected to process the OpenMP directives and OpenMP conditional statements. There is a single parallel region in main.m4, and the code itself divides the work among the threads rather than using the OpenMP work-sharing constructs. This is a somewhat unusual use of OpenMP, but it is a result of having done the POSIX threads version first. Note that setting OPENMP also requires you to turn on whatever Fortran compiler switches are needed to activate OpenMP compilation.
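A minimal sketch of this style, with hypothetical names, is shown below: a single parallel region is opened and each thread picks its own share of the work instead of relying on an OpenMP work-sharing construct (the real code self-schedules pencils through a shared counter, as described earlier).

      #include <omp.h>

      void update_pencil_by_index(int ipencil);   /* hypothetical */

      /* One parallel region; the code, not OpenMP, divides the work. */
      void run_parallel(int npencils)
      {
          #pragma omp parallel
          {
              int tid      = omp_get_thread_num();
              int nthreads = omp_get_num_threads();
              int i;

              for (i = tid; i < npencils; i += nthreads)
                  update_pencil_by_index(i);
          }
      }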

B. Problem Size Determination

The problem size must be determined before sPPM is built, because the code uses static data arrays and assumes a static MPI domain decomposition. You must determine the size of the problem subdomain on each node manually and edit the entries in the file iq.h as described below. The information in iq.h, along with the node layout in the input file, determines the total problem size. For example, for this layout in inputdeck

	nodelayout 2 2 1

and this subdomain definition in iq.h

	define(IQX,192)

	define(IQY,192)

	define(IQZ,384)

	define(IQ,384)

each node updates a 192x192x384 (x,y,z) subdomain, and the total problem size is 384 by 384 by 384 (x,y,z). The IQ parameter must be set at least as big as the maximum of the IQX, IQY, and IQZ parameters, as it determines array sizes in the 1D strip update sppm routine. Each subdomain is updated by "nthreads" threads, as set in inputdeck.

C. Makefile Issues

The REAL switch controls single/double precision selection. For double precision set -DREAL=double. Note that "double" also requires you to turn on a Fortran compiler switch that will cause all Fortran REAL storage (implicitly or explicitly declared) to be taken as if it were declared double precision (i.e., 64-bit). The THREADED and OPENMP switches activate the multithreaded versions of the code as described earlier. Other switches exist simply to aid porting (e.g., F2C for Fortran to C call linkage in c_io.c).

To generate a Makefile for a new platform:

When you "make" sppm, note that the executable is actually put in the subdirectory called run. There are two different lines in Makefile: run/sppm has a link-line that expects cpthreads_sppm_XXX.o (for the threaded version), and run/sppm1 has a link-line that does not use cpthreads_sppm_XXX.o (for the non-threaded version).

Three additional m4 options, which can be set in the MOPT line of the makefile, control restart and graphics files:

-DDUMPS=1 enables the dumping of restart files. With this set, the code will always dump at the end of the run; if checkpoint is set in the inputdeck, it will also dump every "checkpoint" steps.

-DBOBOUT=1 enables the writing of "brick-of-bytes" image files and compressed image files. With this set, the code will always write before starting the first double timestep; if the dtdump variables are set in the inputdeck, it will also write at those specified time intervals.

-DNOCHDIR=1 directs the writing of restart dumps and graphics files to the current directory rather than requiring a prepared structure of subdirectories (0, 1, etc.) in the run directory. -DNOCHDIR also needs to be set in COPT as a conditional compile variable for the .c files. This behavior is modified by the gpfsdir variable in the inputdeck.

D. Further Build Options

Two further m4 options may only be adjusted in main.m4. The default values FILEIO=0 and TTYIO=1 send the summary program output to unit 6 (standard out). You can instead, or in addition, write to a file ("output" on unit 7) by defining FILEIO=1.

Alternative coding of the hydro kernel (the sppm.m4 source file) can be selected through the m4 defines (ifsupr, ifcvmg, ifinln, and ifcray) found in the sppm.h file. Note also that changes outside sppm.m4 can affect it; for example, the IQ parameter in iq.h can affect the array sizes in these routines.


Files in this Distribution

sppm.readme.html

              This file

sppm.readme.bm.html

              Details of use of this code for the current benchmark

sppm.tar      This file contains all the files listed below. 

Makefile      preprocessor/compile/link commands 

arrays.h      State variables and coord. mesh 

bdrys.m4      Ghost cell updates and boundary conditions 

buffers.h     Parameters used by CALCHYD pencil logic 

c_io.c        I/O related routines 

cliches.h     Thread macros 

constants.h   Common constants 

cpthreads_sppm_XXX.c 

              System dependent calls for system XXX containing thread support routines 

iq.h          Definition of problem subdomain size 

main.m4       Most initialization and driver 

msgcom.h      Common shared variables 

params.m4     Constants, most derived automatically from iq.h 

runhyd3.m4    Contains double timestep and pencil logic 

sppm.h        Switches to select alternate code in sppm.m4 

sppm.m4       The hydro kernel sppm itself - 1D strip update 

timers.c      Timer routine 


run/ Runtime subdirectory 

inputdeck     Sample input file 

0/            Directory for storing node-0 files 

1/            Directory for storing node-1 files 

2/            Directory for storing node-2 files 

3/            Directory for storing node-3 files 


Sample_Outputs/ Sample program outputs 


stdout.xmpi.yomp
              Standard output from a run with y OpenMP threads on each of x MPI processes

Execution Issues

The code is executed from a directory that contains a copy of the input file named "inputdeck" and pre-existing subdirectories named for each of the MPI process ranks (i.e., 0, 1, ..., N). The program attempts to change the current directory to one named for the MPI node rank; for example, a two node run would execute "cd 0" on the rank 0 process and "cd 1" on the other. This is necessary for keeping output from different nodes separate. It can also be used to optimize I/O by symbolically linking these subdirectories to local disks physically attached to each node. These node subdirectories are required for multinode runs, in which the output, restart, and graphics dumps will be written by each node with the same file name; there is no other provision for keeping them separate.
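The directory change amounts to something like the following sketch (hypothetical; the actual code lives in c_io.c and is affected by the NOCHDIR and gpfsdir settings):

      #include <stdio.h>
      #include <unistd.h>
      #include <mpi.h>

      /* Each MPI task moves into a subdirectory named after its rank so
         that per-node output, restart, and graphics files do not collide. */
      void enter_rank_directory(void)
      {
          char dirname[32];
          int  rank;

          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          snprintf(dirname, sizeof dirname, "%d", rank);
          if (chdir(dirname) != 0)
              perror("chdir to rank subdirectory failed");
      }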

The total problem size is determined by the subdomain settings in iq.h when the code was built and by the nodelayout line in inputdeck when the code is run. The subdomain settings are described in Problem Size Determination. Total problem size is described in About the Data.


Timing Issues

The code is extensively instrumented. The individual steps in each double timestep are timed. Each double timestep has both "delta" and cumulative timings. Processing and I/O for each restart and graphics dump is timed. Code totals are also printed. Node 0 prints all of this information to stdout. Also, every node prints its own timing information to its own output file (in the node's subdirectory). The CPU/wall ratio gives an indication of multi-threaded efficiency. The graphics image I/O time includes necessary reformatting and scaling to write "bricks of bytes."


Storage Issues

All data arrays are statically dimensioned. Most are in COMMON for shared memory access. By far the largest is the ddd array in common block /vectr3/ (in arrays.h), which holds 5 physical state variables for each mesh point; its dimensions are extended by nbdy=5 ghost cells on each side. The /scratch/ common block is defined six different ways by the CALCHYD and BDRYS m4 macros and provides a large reusable scratch space to minimize memory requirements. The CALCHYD routines use /scratch/ for circular update buffers; the BDRYS routines use it for message arrays. All threads are synchronized on entry to and exit from the BDRYS routines, which makes this reuse possible. Make sure that the linker/loader allocates the maximum of these six cases for /scratch/.


About the Data

The problem size is determined by the subdomain settings in iq.h and by the node layout in the input file, i.e., by the selected static domain decomposition. The number of timesteps is controlled by the "nstop" variable in the input file.

npx * npy * npz = total number of MPI tasks

npx * npy * npz * nthreads = total number of processors to be used

For example, with iq.h setting a 192^3 subdomain and nodelayout 2 2 2 and nthreads 4, you would have a structure of eight subdomains, each updated by four threads, with a total problem size of 384 x 384 x 384.

The input file must be named "inputdeck" and be in the current working directory. It contains lines of keyword strings and value pairs, the order of which is unimportant. The keyword strings and their descriptions are listed below, with indication of their values for the benchmark runs.


Keyword      Value(s)          Description
------------------------------------------------------------

nodelayout   npx npy npz       MPI node geometry

nthreads     nthreads          Number of threads on each sppm node

dtime        dtime             Initial timestep

dtdump       dtdump dtcdump    Time between bob dumps; time between
                               compressed dumps

checkpoint   ndump             Timestep cycles between restart dumps
                               (should be even)

gpfsdir      directory path    Path to parallel file system for dumps
                               (ending with /)

                               (If gpfsdir is set, the run/0 /1
                               structure is not used, regardless of the
                               setting of -DNOCHDIR in the Makefile.
                               The v and s bob graphics files are
                               written to the current directory, and
                               the d bob graphics and c compressed
                               graphics files to the head of the
                               gpfsdir.)

stoptime     nstop tstop       Stop at simulation time tstop or at
                               timestep nstop, whichever occurs first

restart                        Mere presence will cause the reloading
                               of a checkpoint file,
                               rs_<step #>_<task #>. In this case,
                               dtime is ignored if it occurs in the
                               inputdeck.
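As an illustration of the format (keyword followed by its values, order unimportant), a hypothetical inputdeck for a 2x2x2 node run with four threads per node might contain lines such as the following; the numerical values are arbitrary examples, not benchmark settings.

      nodelayout  2 2 2
      nthreads    4
      dtime       0.02
      stoptime    100 1.0
      checkpoint  20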

Expected Results

Sample output files are provided in the Sample_Outputs subdirectory for your reference in determining the correctness of answers.

We have run on SGI-IA64 (Linux), SGI (Irix), IBM (AIX), Compaq Sierra AlphaServer (Tru64 Unix) and SUN (Solaris) systems.

The critical numbers you should duplicate are the Courant numbers and energies in each timestep print. They should be duplicated exactly, because they are only printed to six significant digits. We have seen consistent results from all machines to date.


Last modified on February 8, 2002

For information about this page contact:
John Engle,
jengle@llnl.gov

UCRL-MI-144211