IRS: Implicit Radiation Solver 1.4 Benchmark Runs

UCRL-CODE-2001-010

NOTICE 1

This work was produced at the University of California, Lawrence Livermore National Laboratory (UC LLNL) under contract no. W-7405-ENG-48 (Contract 48) between the U.S. Department of Energy (DOE) and The Regents of the University of California (University) for the operation of UC LLNL. The rights of the Federal Government are reserved under Contract 48 subject to the restrictions agreed upon by the DOE and University as allowed under DOE Acquisition Letter 97-1.

DISCLAIMER

This work was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the University of California nor any of their employees, makes any warranty, express or implied, or assumes any liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately-owned rights. Reference herein to any specific commercial products, process, or service by trade name, trademark, manufacturer or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or the University of California, and shall not be used for advertising or product endorsement purposes.

NOTIFICATION OF COMMERCIAL USE

Commercialization of this product is prohibited without notifying the Department of Energy (DOE) or Lawrence Livermore National Laboratory (LLNL).

Privacy & Legal Notice

Running the Code
Expected Results
Interpreting Results

Running the Code

The code should be run in the following configurations.

Sequential
Parallel using OpenMP Threads
Parallel using MPI
Parallel using MPI and OpenMP Threads

In addition, to test scalability, several runs of configurations 3 and 4 will be run.

The decks used in the following tests may be found in ~/irs-1.4/decks

Each run will generate three files which should be saved. These will be variations on the following names.

irs.tmr - detailed timing information
irs.ult - curves to be plotted
irshsp - general run information and high level timing information

The "irs" in the above names will be different for each of the tests.

Running Sequential

This should be the first shakedown test. It should detect any obvious errors in the executable, such as mis-matched libraries or run-time errors inherent in the code.

All versions of this code should be able to be run sequentially. That is, versions compiled with MPI and/or OpenMP threads should run sequentially as well as in parallel mode.

Using input deck zrad.0001, run the code sequentially as follows

        irs -k seq zrad.0001

It may help to run each test in a separate directory, so that all created files will be saved and not overwritten by subsequent runs.

If the code ran successfully, something similar to the following will be printed to the screen at the end of the run.

     
        BENCHMARK microseconds per zone-iteration = 2.1199106478034
 
        wall        time used: 9.107000e+01 seconds
        total   cpu time used: 8.918000e+01 seconds
        physics cpu time used: 8.892000e+01 seconds

This BENCHMARK score should be printed to stdout (the screen) and also to a file named "seqhsp".

Running Parallel using OpenMP Threads

If the code was built to take advantage of OpenMP threads, then the following two runs should be made. The number of OpenMP threads spawned is usually controlled by an environment variable such as OMP_NUM_THREADS. For these tests, set this value to at least 4.

        irs -k omp_seq zrad.0008.seq

        irs -k omp_thr zrad.0008.seq -threads

The 1st run should run the code without threads. The second run will run the same file with threads.

Running Parallel using MPI

In general, MPI Parallel runs of codes are started by something akin to the following.

        mpirun -np 4 code.to.run arg1 arg2 arg3

This may be different on your platform. However, assuming this syntax, the following all-MPI tests may be run.

These tests exercise the code in MPI parallel mode using from 8 to 8000 processors.

mpirun -np    8  irs -k mpi.0008 zrad.0008

mpirun -np   64  irs -k mpi.0064 zrad.0064

mpirun -np  216  irs -k mpi.0216 zrad.0216

mpirun -np  512  irs -k mpi.0512 zrad.0512

mpirun -np 1000  irs -k mpi.1000 zrad.1000

mpirun -np 1728  irs -k mpi.1728 zrad.1728

mpirun -np 2744  irs -k mpi.2744 zrad.2744

mpirun -np 4096  irs -k mpi.4096 zrad.4096

mpirun -np 5832  irs -k mpi.5832 zrad.5832

mpirun -np 8000  irs -k mpi.8000 zrad.8000

Running Parallel using MPI and OpenMP Threads

This set of tests run the code in a mode which uses MPI and OpenMP threads. As in the previous simpler threaded case, the mechanism used to specify the number of threads is assumed to be the environment variable OMP_NUM_THREADS.

Since the architecture of the machines this code will run on is unknown, the runs which best exercise the machine are unknown. Decks are provided to test the following number of CPU's: 8, 64, 216, 512, 1000, 1728, 2744, 4096, 5832, 8000.

The number of MPI tasks and Threads needed to make use of the processors can be varied. The following runs show variations for 8 and 64 processor runs. Use corresponding arguments for the larger number of runs.

Utilizes 8 CPUs: 4 MPI Processes : 2 Threads each

export OMP_NUM_THREADS=2

mpirun -np 4 irs -k mpi.0004mpi.002thr zrad.0008 -threads

Utilizes 8 CPUs: 2 MPI Processes : 4 Threads each

export OMP_NUM_THREADS=4

mpirun -np 2 irs -k mpi.0002mpi.004thr zrad.0008 -threads

Utilizes 64 CPUs: 32 MPI Processes : 2 Threads each

export OMP_NUM_THREADS=2

mpirun -np 32 irs -k mpi.0032mpi.002thr zrad.0064 -threads

Utilizes 64 CPUs: 16 MPI Processes : 4 Threads each

export OMP_NUM_THREADS=4

mpirun -np 16 irs -k mpi.0016mpi.004thr zrad.0064 -threads

Utilizes 64 CPUs: 8 MPI Processes : 8 Threads each

export OMP_NUM_THREADS=8

mpirun -np 8 irs -k mpi.0008mpi.008thr zrad.0064 -threads

Utilizes 64 CPUs: 4 MPI Processes : 16 Threads each

export OMP_NUM_THREADS=16

mpirun -np 4 irs -k mpi.0004mpi.016thr zrad.0064 -threads

Utilizes 64 CPUs: 2 MPI Processes : 32 Threads each

export OMP_NUM_THREADS=32

mpirun -np 2 irs -k mpi.0002mpi.032thr zrad.0064 -threads

Expected Results

Each run will generate three files which should be saved. These will be variations on the following names.

irs.tmr - detailed timing information
irs.ult - curves to be plotted
irshsp - general run information and high level timing information

The "-k xxx" option in each of the above runs determines what the prefix of each file will be. For instance, the sequential run

        irs -k seq zrad.0008.seq

Interpreting Results

The top of the 'seq.tmr' file contains information which can be checked to verify the run. Important information is --

Number of MPI Processes used
Threading ON or OFF
Average per-cpu MFLOP / WALL SEC
Aggregate all-cpu MFLOP / WALL SEC

As the code is run on more processors, the .tmr file can be checked to ensure that

The number of MPI Processes is increasing
The Aggregate all-cpu MFLOP / WALL SEC is increasing

Sample listings from several runs is listed below, with the important information listed in bold.

Sample .tmr data from Sequential run

Function Timers Analysis Across Processors
Num of MPI Processes : 1
Num of Domains       : 8
Threading            : OFF
Sorted by Max Wall Seconds

                                        PROCESSOR      MFLOP /                                                    
                  ROUTINE               OR DOMAIN     WALL SEC        FLOPS     CPU SECS    WALL SECS    
------------------------- ---------- ------------ ------------ ------------ ------------ ------------ 
xirs (G)                   Aggregate          N/A 4.955942e+01 1.830046e+11 3.693100e+03 3.692630e+03 
                             Average          N/A 4.955942e+01 1.830046e+11 3.693100e+03 3.692630e+03 
                    Wall Seconds Max    Proc    0 4.955942e+01 1.830046e+11 3.693100e+03 3.692630e+03 
                    Wall Seconds Min    Proc    0 4.955942e+01 1.830046e+11 3.693100e+03 3.692630e+03

Sample .tmr data from MPI parallel run

Function Timers Analysis Across Processors
Num of MPI Processes : 2
Num of Domains       : 8
Threading            : OFF
Sorted by Max Wall Seconds

                                        PROCESSOR      MFLOP /                                                    
                  ROUTINE               OR DOMAIN     WALL SEC        FLOPS     CPU SECS    WALL SECS   
------------------------- ---------- ------------ ------------ ------------ ------------ ------------ 
xirs (G)                   Aggregate          N/A 8.965630e+01 1.830046e+11 2.023910e+03 2.041180e+03 
                             Average          N/A 4.482815e+01 9.150232e+10 2.023910e+03 2.041180e+03 
                    Wall Seconds Max    Proc    1 4.482815e+01 9.150232e+10 2.023910e+03 2.041180e+03 
                    Wall Seconds Min    Proc    0 4.482837e+01 9.150232e+10 2.013380e+03 2.041170e+03

Sample .tmr data from Threaded parallel run

Function Timers Analysis Across Processors
Num of MPI Processes : 1
Num of Domains       : 8
Threading            : ON
Sorted by Max Wall Seconds

                                        PROCESSOR      MFLOP /                                                    
                  ROUTINE               OR DOMAIN     WALL SEC        FLOPS     CPU SECS    WALL SECS   
------------------------- ---------- ------------ ------------ ------------ ------------ ------------ 
xirs (G)                   Aggregate          N/A 8.150262e+01 1.830043e+11 1.780673e+08 2.245380e+03 
                             Average          N/A 8.150262e+01 1.830043e+11 1.780673e+08 2.245380e+03 
                    Wall Seconds Max    Proc    0 8.150262e+01 1.830043e+11 1.780673e+08 2.245380e+03 
                    Wall Seconds Min    Proc    0 8.150262e+01 1.830043e+11 1.780673e+08 2.245380e+03