The ASCI sPPM Benchmark Code

Benchmark-specific Instructions and Constraints


Building Considerations

The problem size per process is established in iq.h. The layout of processes, number of threads, and number of time steps to run are established in run/inputdeck. The rest of the build is controlled by the Makefile.

For most of this benchmark, the code is to be run in double precision (64 bit) on a 192x192x192 problem per process, with no dumps or other output files being written, and it is to run for 20 timesteps. Any threading is to be done with OpenMP. The exception is the final ("Showtime") run, which may, if desired, use pthreads rather than OpenMP, single precision rather than double precision, and a different problem size.
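
The per-process problem size is controlled through the IQ parameter in iq.h (see also the Optimization Constraints section). The exact contents and syntax of iq.h in the distribution may differ from this hypothetical fragment, but for the required runs the setting should correspond to the 192x192x192 per-process problem:

c     Hypothetical fragment; take the exact form from the distributed iq.h.
      parameter ( IQ = 192 )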

To build and run the code within the above general constraints, configure the Makefile as described below.

In the Makefile you should also establish the system type (SYS), the preprocessors, compilers, and loader to use, and the options to be passed to these components. In particular, note that it may be necessary to modify timers.c in order to use appropriate timer routines for your system.
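
As a sketch of the kind of change that may be needed in timers.c, a wall-clock routine based on gettimeofday could look like the following; the routine name, any Fortran-callable wrapper, and the units are assumptions and must be matched to the interface actually used by the distributed timers.c:

/* Hypothetical wall-clock timer sketch.  Match the routine names,
   Fortran calling conventions, and units used by the distributed timers.c. */
#include <stddef.h>
#include <sys/time.h>

double wall_seconds(void)          /* hypothetical routine name */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}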

To set up for builds that are to use OpenMP, the Makefile should include "THMODE= -DTHREADED=1 -DOPENMP=1". OMPOPT should include any options needed by the Fortran compiler and/or loader to compile and load OpenMP code, COMPOPT should include any C compiler options needed for OpenMP, and THLD can be used for any additional loader options needed for OpenMP. The number of threads to use is determined at execution time by nthreads in the input deck. For builds that do not use OpenMP (or pthreads), set "THMODE= -DTHREADED=0".
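
For example, on a system whose compilers accept a -fopenmp style option (the flag names here are compiler-specific assumptions; only the THMODE value is prescribed above), the relevant Makefile lines might look like this sketch:

THMODE  = -DTHREADED=1 -DOPENMP=1
OMPOPT  = -fopenmp    # Fortran compiler/loader OpenMP option (compiler-specific)
COMPOPT = -fopenmp    # C compiler OpenMP option (compiler-specific)
THLD    =             # any additional loader options needed for OpenMP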

To set up for builds that are to use MPI, the Makefile should include "CPPOPT= -DMPI", and INCDIR, LIBDIR, and LIBS should be set as necessary to indicate the needed include paths, library paths, and libraries. The layout of nodes is established in the input deck at execution time. For builds that do not use MPI, set "CPPOPT= -DNOMPI".
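
A corresponding sketch for an MPI build is shown below; the include and library paths are placeholders for the locations of your MPI installation, and the library name varies by implementation:

CPPOPT = -DMPI
INCDIR = -I/path/to/mpi/include    # placeholder include path for your MPI installation
LIBDIR = -L/path/to/mpi/lib        # placeholder library path for your MPI installation
LIBS   = -lmpi                     # library name varies by MPI implementation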


Optimization Constraints

Performance optimizations to the sPPM source are encouraged as long as they don't specialize the functionality. However, the hydro kernel (i.e., the sppm.m4 source file) must be used UNMODIFIED except for the insertion of compiler directives. Alternative coding can be selected through the m4 defines (ifsupr, ifcvmg, ifinln, and ifcray) found in the sppm.h file. Also note that changes outside of sppm.m4 can affect it. For example, it is possible to use the IQ parameter in iq.h to affect the array sizes in these routines. The goal is to emphasize higher-level optimizations and/or compiler technology instead of relying on hand-tweaked specialized code.
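
As an illustration only, selecting one of these alternatives in sppm.h amounts to an m4 define along the following lines; the exact quoting conventions and the meaning of each define's value should be taken from the sppm.h shipped with the code:

define(ifcvmg, 1)    dnl hypothetical example; consult the distributed sppm.h for the actual form and value semantics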

The benchmark must use MPI message passing between nodes. The boundary communications in the BDRYS routines (in bdrys.m4) must not be overlapped with the directional sweeps in the CALCHYD routines (in runhyd3.m4), because we want to measure communication separately; however, overlapped communication and computation will be allowed for the final TeraFLOP demonstration.


Execution and Timing Issues

The wall-clock time required to execute sPPM should be reported. The minimum hardware memory required to run each problem should also be reported.

The code is extensively instrumented. The individual steps in each double timestep are timed. Each double timestep has both "delta" and cumulative timings. Processing and I/O for each restart and graphics dump is timed. Code totals are also printed. Node 0 prints all of this information to stdout. Also, every node prints its own timing information to its own output file (in the node's subdirectory). The cpu/wall ratio gives an indication of multithreaded efficiency. The graphics image I/O time includes necessary reformatting and scaling to write "bricks of bytes."


Required Problems

The required problems are divided into three categories based on the type of parallelism. A fourth category is available for you to show off whatever configuration you consider to be to your advantage. The problems, along with the settings needed in run/inputdeck to set them up, are as follows:

OpenMP Tests
1 OpenMP thread on one SMP node. Set "nodelayout 1 1 1" and "nthreads 1"
8 OpenMP threads on one SMP node. Set "nodelayout 1 1 1" and "nthreads 8"
All available processors on one SMP node with OpenMP threads. Set "nodelayout 1 1 1" and set nthreads to the number of processors on a node.

 

MPI Tests
1 MPI process. Set "nodelayout 1 1 1"
64 MPI processes. Set nodelayout so that the product of the three values is 64; for example, "nodelayout 4 4 4"
256 MPI processes over at least two SMP nodes. Set nodelayout so that the product of the three values is 256 (for example, "nodelayout 4 8 8") and execute in a way that ensures that at least two SMP nodes are used.

 

Combined OpenMP and MPI tests
256 processors over at least two SMP nodes. Set nodelayout and nthreads so that the product of the four values is 256; for example, "nodelayout 4 4 1" and "nthreads 16". Note that nthreads must be greater than 1 and less than 256.
All available processors over all available SMP nodes. Set nodelayout and nthreads as appropriate. Note that this should be more than one MPI process and more than one thread.

 

Showtime
A job in whatever configuration best shows off the system. (This is an optional run.) If you wish to use pthreads directly rather than OpenMP, set THMODE in the Makefile to -DTHREADED=1. You may need to provide a cpthreads_sppm_*.c file if an appropriate version is not available.

Note that for the first group the code should be compiled with OpenMP and without MPI, for the second group the code should be compiled with MPI and without OpenMP, and for the third group the code should be compiled with both OpenMP and MPI. In general, you should not need to recompile or reload for the runs within a particular group. Just establish the node layout (for MPI runs) and/or the number of threads (for OpenMP runs) in run/inputdeck. If you have limitations that preclude you from running the specified sizes, you should run what you can and explain the deviation.

 

Example "inputdeck" for the Benchmark Problems

nodelayout    4 4 1
nthreads      16
dtime         3.0e-04
dtdump        1.2e-03   5.0e-03
checkpoint    16
stoptime      20   1.0
(optional use of gpfsdir)

Only the nodelayout (npx, npy, and npz), nthreads, and (possibly) stoptime should be changed to run the benchmark problems.
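
For example, the 8-thread OpenMP test from the first group would differ from the deck above only in its first two lines:

nodelayout    1 1 1
nthreads      8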


Expected Results

Sample output files are provided in the Sample_Outputs subdirectory for your reference in determining correctness of answers.

We have run on SGI IA-64 (Linux), SGI (IRIX), IBM (AIX), Compaq Sierra AlphaServer (Tru64 UNIX), and Sun (Solaris) systems.

The critical numbers you should duplicate are the Courant numbers and energies in each timestep print. They should be duplicated exactly, because they are only printed to 6 significant digits. We have seen consistent results from all machines to date.


Release and Modification Record

Modified 02/27/01 as follows:

Modified 02/8/02 as follows:


Last modified on February 8, 2002

For information about this page contact:
John Engle,
jengle@llnl.gov

UCRL-MI-144211