Sphinx, an integrated parallel microbenchmark suite, consists of a harness for running performance tests and extensive tests of MPI, Pthreads and OpenMP. It was adapted from the Special Karlsruhe MPI (SKaMPI) benchmark suite [1] by Bronis R. de Supinski and other members of the PSE/ASDE project, including John May and Bor Chan. LLNL adaptations include extensive tests of the Pthreads interface [2] and the on-going integration of the LLNL OpenMP Performance Suite. In addition, several new MPI tests have been added, primarily focused on the performance of collective operations; these include the first widely available tests that accurately measure the operation latency of fan-out collective operations such as MPI_Bcast [3]. Sphinx was a significant aspect of the ASCI PSE Milepost Run Time Systems Performance Testing; a paper on this effort should be available in the near future. The entire suite is implemented in C and has been run on a wide variety of platforms.
The Sphinx test harness provides a flexible mechanism for running performance tests. The action being measured, such as a message pingpong, is accessed through a function pointer. Different threads or tasks can execute different functions, which supports the measurement of highly complex parallel actions. The harness also provides a flexible mechanism for varying parameters, or independent variables, of the action. The harness times repeated calls (iterations in Sphinx terminology) of the action. Sphinx measures several timings (or repetitions) and outputs their arithmetic mean for a given set of action parameters. Timing stops when the standard deviation of the repetitions is less than a user-determined percentage of their mean, provided a minimum number of repetitions have been measured. Since this cut-off may never be achieved, the harness guarantees test termination through a user-specified maximum number of repetitions. Results that do not achieve the cut-off are annotated "UNSETTLED" in the output. Sphinx includes an optional facility to correct for harness overhead, including that of the function call to the action routine. Results that have anomalous corrections are annotated "UNRELIABLE" in the output. Anomalous corrections include the overhead exceeding the measured value; complex tests, such as the accurate tests of the operation latency of fan-out collectives, can have more complicated anomalous-correction conditions.
Sphinx is highly portable. The primary portability issue involves the use of processor bindings in the Pthreads tests. The mechanism for binding threads to processors varies widely between platforms, both in interface and in capabilities. Sphinx includes a module that abstracts most of these issues; however, care should be exercised, as several platforms have failed to exhibit appropriate binding behavior. Another possible portability snag is that the code currently must be linked with both MPI and Pthreads libraries; future changes will support a "no Pthreads" version, and a "no MPI" version may also be implemented. The current Makefile mechanism is another portability concern; it will probably be replaced by an autoconf script in the future.
Sphinx is implemented in C. It includes tests of three popular parallelization mechanisms: MPI, Pthreads and OpenMP. The OpenMP tests are excluded by default; they are included only if the macro _HAVE_OMP_ is defined. Generally, the choice of which tests are available in the executable is decided by the makefile target chosen.
The code tests MPI, Pthreads and OpenMP. The MPI tests cover the full range of MPI collective communications and include pingpong tests of a variety of MPI send and receive combinations; other MPI tests are planned. Tests of most Pthreads functions are included, as are tests of all OpenMP constructs plus the auxiliary OpenMP locking functions.
SPHINX DISTRIBUTION FILES
FILE | DESCRIPTION |
README | brief text description of the code |
pdpta_pthreads.fm.ps | PDPTA'99 paper discussing the Pthreads tests |
hpdc.2col.color.fm.ps | HPDC 8 paper discussing the accurate tests for fan-out MPI collective operations |
autodist.c | code for the main part of the test harness, including automatic data point generation |
autodist.h | header file for autodist.c functions |
automeasure.c | code for determining if a measurement is complete and computing the basic result |
automeasure.h | header file for automeasure.c functions |
bind.c | code to make the bind interface work correctly on Sun platforms (resolves a CPU numbering issue) |
col_test1.h | header file for MPI collective tests |
data_list.c | code for result list functions |
mw.c | code for MPI master/worker tests (SKaMPI artifact; use at your own risk) |
mw_test1.h | header file for MPI master/worker tests |
p2p_test1.h | header file for MPI point-to-point tests |
pattern.h | header file for test patterns |
pqtypes.c | code for priority queue functions |
pqtypes.h | header file for priority queue functions |
simple_test1.h | header file for Pthreads tests |
sphinx.c | code for the main function |
sphinx.h | primary header file |
sphinx_any.h | header file for debugging and utility functions |
sphinx_aux.h | header file for extern variables |
sphinx_aux_test.c | code for auxiliary measurement mechanism functions |
sphinx_aux_test.h | header file for auxiliary measurement mechanism functions |
sphinx_call.c | code for setting up independent variable values in the measurement structure |
sphinx_call.h | header file for sphinx_call.c functions |
sphinx_col.c | code for MPI collective tests |
sphinx_error.c | code for error functions |
sphinx_error.h | header file for error functions |
sphinx_mem.c | code for memory allocation and message buffer set-up functions |
sphinx_mem.h | header file for memory allocation and message buffer set-up functions |
sphinx_omp.c | code for OpenMP tests |
sphinx_omp.h | header file for OpenMP tests |
sphinx_p2p.c | code for MPI point-to-point tests |
sphinx_params.c | code for reading the input file and setting up parameters and tests |
sphinx_params.h | header file for sphinx_params.c functions |
sphinx_post.c | code for post-processing (SKaMPI artifact; use at your own risk) |
sphinx_post.h | header file for sphinx_post.c functions |
sphinx_simple.c | code for the simple pattern and tests |
sphinx_threads.c | code for Pthreads tests |
sphinx_threads.h | header file for processor binding routines needed for the Pthreads tests |
sphinx_tools.c | code for utility functions |
sphinx_tools.h | header file for utility functions |
yieldtest.c | code for a simple stand-alone test that determines whether thread scheduling is round robin when threads are bound to the same CPU and call sched_yield |
sphinx_defaults | text file showing default values for Sphinx input file parameters |
test.sphinx | sample input file for MPI tests |
test.sphinx.old | sample old-style input file for MPI tests |
test.sphinx.threads | sample input file for Pthreads tests |
test.sphinx.threads.old | sample old-style input file for Pthreads tests |
test.sphinx.root.acker | sample input file for accurate fan-out MPI collective tests |
test.sphinx.pdpta.dec test.sphinx.pdpta.dec.timeslice test.sphinx.pdpta.ibm test.sphinx.pdpta.sgi test.sphinx.pdpta.sun | input files corresponding to the PDPTA'99 paper |
test.sphinx.col.15per.white.MPICH.scale.new test.sphinx.col.15per.white.scale test.sphinx.col.15per.white.scale.new test.sphinx.col.15per.white.scale.shmem.new test.sphinx.col.15per.white.scale.shmem.new.fillin test.sphinx.col.16per.white.MPICH.scale.new test.sphinx.col.16per.white.scale test.sphinx.col.16per.white.scale.new test.sphinx.col.16per.white.scale.shmem.new test.sphinx.col.1per.blue test.sphinx.col.1per.blue.new test.sphinx.col.1per.snow test.sphinx.col.1per.snow.MPICH test.sphinx.col.1per.snow.MPICH.new test.sphinx.col.1per.snow.new test.sphinx.col.1per.white.MPICH.scale.new test.sphinx.col.1per.white.scale test.sphinx.col.1per.white.scale.new test.sphinx.col.3per.blue test.sphinx.col.3per.blue.new test.sphinx.col.4per.blue test.sphinx.col.4per.blue.new test.sphinx.col.7per.snow test.sphinx.col.7per.snow.MPICH test.sphinx.col.7per.snow.MPICH.new test.sphinx.col.7per.snow.new test.sphinx.col.7per.snow.shmem test.sphinx.col.7per.snow.shmem.new test.sphinx.col.8per.snow test.sphinx.col.8per.snow.MPICH test.sphinx.col.8per.snow.MPICH.new test.sphinx.col.8per.snow.new test.sphinx.col.8per.snow.shmem test.sphinx.col.8per.snow.shmem.new test.sphinx.col2.1per.blue test.sphinx.col2.1per.blue.16 test.sphinx.col2.1per.blue.MPICH test.sphinx.col2.1per.snow test.sphinx.col2.1per.snow.8 test.sphinx.col2.1per.snow.MPICH test.sphinx.col2.1per.snow.MPICH.8 test.sphinx.col2.3per.blue test.sphinx.col2.4per.blue test.sphinx.col2.4per.blue.8 test.sphinx.col2.7per.snow test.sphinx.col2.7per.snow.MPICH test.sphinx.col2.7per.snow.shmem test.sphinx.col2.7per.snow.shmem.long test.sphinx.col2.8per.snow test.sphinx.col2.8per.snow.MPICH test.sphinx.col2.8per.snow.shmem test.sphinx.col2.8per.snow.shmem.long test.sphinx.p2p.1node.blue test.sphinx.p2p.1node.blue.shmem test.sphinx.p2p.1node.frost test.sphinx.p2p.1node.frost.shmem test.sphinx.p2p.1node.snow test.sphinx.p2p.1node.snow.MPICH test.sphinx.p2p.1node.snow.MPICH.shmem test.sphinx.p2p.1node.snow.shmem test.sphinx.p2p.2nodes.blue test.sphinx.p2p.2nodes.frost test.sphinx.p2p.2nodes.snow test.sphinx.p2p.2nodes.snow.MPICH test.sphinx.p2p.2nodes.snow.new test.sphinx.threads.blue test.sphinx.threads.frost test.sphinx.threads.snow test.sphinx.threads.snow.new | input files for ASCI PSE Milepost tests |
To build the code, type "make ARCH_COMPILER_MPI_OPTION", where ARCH_COMPILER_MPI_OPTION names the platform/compiler/MPI implementation that you want. For example, to build the IBM SP version that uses IBM's MPI library, type "make IBM", while "make IBM_MPICH" uses MPICH instead. For the full list of currently supported ARCH_COMPILER_MPI_OPTION choices, see the Makefile. It may be necessary to edit the Makefile.ARCH.COMPILER.MPI_OPTION file (depending on the location of your compiler, etc.) or to add a new one.
The basic command for running the code is:
sphinx_version input_filename
If input_filename is omitted, a simple MPI pingpong test is run. The exact command to run the code depends on the system (e.g. use poe on IBM SPs). Use the machine's parallel job start-up mechanism to set the number of MPI tasks. The OpenMP function omp_get_max_threads determines the limit on the number of OpenMP threads, although a lower limit can be specified in the input file.
This section details the input file format for Sphinx and the tests included in this distribution. Sphinx determines the parameters for a run by reading and parsing an input file. These parameters determine which tests to perform, as well as variables that control the harness operation, such as the number of iterations per repetition. The input file format consists of several different "modes" that determine which run parameters are being specified. The tests to perform are specified in one or more MEASUREMENTS modes, each consisting of a list of tests to perform; test entries include test-specific parameters. The input file format may seem complex because of the flexibility the test harness provides; however, experienced users can quickly create the desired input file.
In general, Sphinx input file processing is very forgiving: mode and parameter identifiers need only contain a string that uniquely determines them and may contain other extraneous characters, so processing usually completes even in the presence of typos. Further, the determining string is not case-sensitive. Each test in Sphinx has several different possible parameters; a sensible default is used if the input file does not specify a value for a given mode or parameter. The default values can be overridden for the entire input file or for a specific test.
The Sphinx input file format is fairly free-form: an "@" as the first character of a line changes the input mode, and modes can occur in any order. Most modes are optional; the only requirement is at least one MEASUREMENTS mode. The last occurrence of a mode determines the value for that mode, except for the COMMENT and MEASUREMENTS modes; for these two modes, multiple occurrences are concatenated to form a single value. The following table describes the Sphinx input file modes (names listed in ALL CAPS for historic reasons):
SPHINX INPUT FILE MODES
MODE | DESCRIPTION | DEFAULT |
COMMENT | any comments that you'd like to include in the input file; this mode can be used to omit test entries without deleting them from the file; not included in output file | NULL |
USER | a text field with no semantic implications; can be used to provide a short description of the user running the tests; included in output file | NULL |
MACHINE | a text field with no semantic implications; can be used to provide a short description of the machine on which the tests are run; included in output file | NULL |
NETWORK | a text field with no semantic implications; can be used to provide a short description of the network on which the tests are run; included in output file | NULL |
NODE | a text field with no semantic implications; can be used to provide a short description of the nodes of the machine on which the tests are run; included in output file | NULL |
MPILIB_NAME | a text field with no semantic implications; can be used to provide a short description of the MPI library with which the tests are run; included in output file | NULL |
OUTFILE | a text field that specifies the output filename | input_filename.out |
LOGFILE | a text field that specifies the log filename | input_filename.log |
CORRECT_FOR_OVERHEAD | A yes or no text field that specifies whether test results should be corrected for any test harness overhead incurred in the measurement; overhead is generally a function call but depends on the test | no |
MEMORY | an integer field that specifies the size in kilobytes of the buffer to allocate in each task for message passing tests; maximum message lengths are a function of this parameter and the test being run; generally maximum message lengths are equal to this parameter or half of it or this parameter divided by the number of tasks | 4096 (i.e. 4MB) |
MAXREPDEFAULT | an integer field that specifies default limit on the number of timings until a test is declared "UNSETTLED" | 20 |
MINREPDEFAULT | an integer field that specifies default minimum number of timings to average for a test result | 4 |
ITERSPERREPDEFAULT | an integer field that specifies default number of iterations per timing of the code being measured | 1 |
STANDARDDEVIATIONDEFAULT | a double field that specifies the default fraction of the mean of the timings that their standard deviation must fall below for a test to be declared settled. Sphinx uses the standard deviation, which may never fall below this fraction of the mean, unlike SKaMPI, which uses the standard error, which is guaranteed to achieve the percentage for a sufficiently large number of timings; thus MAXREPDEFAULT is more significant for Sphinx | 0.05 |
DIMENSIONS_DEFAULT | an integer field that specifies default number of independent variables for a test | 1 |
VARIATION | a text field that specifies the default independent variable; see below for valid independent variables | NO_VARIATION |
VARIATION_LIST | a space-delimited text field that specifies the default independent variables | NO_VARIATION for all positions |
SCALE | a text field that specifies the default scale to use for independent variable; see below for valid scale values | FIXED_LINEAR |
SCALE_LIST | a space-delimited text field that specifies the default scales | FIXED_LINEAR for all positions |
MAXSTEPSDEFAULT | an integer field that specifies default limit on the number of values for independent variables | 16 |
MAXSTEPSDEFAULT_LIST | a space-delimited integers field that specifies default limits on the numbers of values for independent variables | 16 for all positions |
START | an integer field that specifies the default minimum value to use for independent variables; MIN_ARGUMENT has Sphinx use the minimum value semantically allowed for the independent variable (e.g. 1 for number of tasks) | MIN_ARGUMENT |
START_LIST | a space-delimited integers field that specifies default minimum values to use for independent variables | MIN_ARGUMENT for all positions |
END | an integer field that specifies the default maximum value to use for independent variables; MAX_ARGUMENT has Sphinx use the maximum value semantically allowed for the independent variable (e.g. size of MPI_COMM_WORLD for number of tasks) | MAX_ARGUMENT |
END_LIST | a space-delimited integers field that specifies default maximum values to use for independent variables | MAX_ARGUMENT for all positions |
STEPWIDTH | a double field that specifies default distance between independent variable values | 1.00 |
STEPWIDTH_LIST | a space-delimited doubles field that specifies default distances between the values of independent variables | 1.00 for all positions |
MINDIST | SKaMPI artifact; an integer field that apparently was intended to specify a minimum distance between independent variable values; currently has no effect but may be supported in the future | 1 |
MINDIST_LIST | a space-delimited integers field that specifies MIN_DIST values | 1 for all positions |
MAXDIST | SKaMPI artifact; an integer field that apparently was intended to specify a maximum distance between independent variable values; currently has no effect but may be supported in the future (less likely than MINDIST) | 10 |
MAXDIST_LIST | a space-delimited integers field that specifies MAX_DIST values | 10 for all positions |
MESSAGELEN | an integer field that specifies default message length in bytes | 256 |
MAXOVERLAP | an integer field that specifies default maximum iterations of the overlap for loop | 0 |
THREADS | an integer field that specifies default number of threads | value returned by omp_get_max_threads |
WORK_FUNCTION_DEFAULT | a text field that specifies the default function used inside OpenMP loops; see below for a list of valid work function values | SIMPLE_WORK |
WORK_AMOUNT_DEFAULT | an integer field that specifies default duration of function used inside OpenMP loops | 10 |
SCHEDULE_DEFAULT | a text field that specifies default OpenMP schedule option; see below for a list of valid schedule options | STATIC_SCHED |
SCHEDULE_CAP_DEFAULT | an integer field that specifies default schedule cap for OpenMP tests | 10 |
SCHEDULE_CHUNK_DEFAULT | an integer field that specifies default OpenMP schedule chunk size | 1 |
OVERLAP_FUNCTION | a text field that specifies the default overlap function used in mixed non-blocking MPI/OpenMP tests; see below for valid overlap function values | SEQUENTIAL |
CHUNKS | an integer field that specifies default number of chunks | 6 |
MEASUREMENTS | a text field that determines the actual tests run, including any default parameter overrides; see below for a description of the format of this field | NULL |
Many of the modes have list variants, as indicated. These allow the specification of different defaults for the independent variables X0, X1, X2, ... These variants are needed because Sphinx supports multiple independent variables per test, such as varying both the message size and the number of tasks for an MPI collective test. If a test uses more independent variables than a corresponding list specifies, the non-list default is used for the additional positions.
The MEASUREMENTS mode is a structured text field. It describes the tests that will be run for the input file. The format is a series of test descriptions; blank lines between test descriptions are discarded. A test description consists of a name followed by a left curly brace ({) (optionally on a new line), followed by parameters specific to the test; a right curly brace (}) marks the end of the description. Each parameter field of a test description must be on a separate line; the general format of a parameter field line is "parameter_name = value". The following table describes the parameter fields of the test description:
SPHINX TEST DESCRIPTION FIELDS
PARAMETER NAME | DESCRIPTION |
Type | Type of test; this field determines the actual test run; see below for a description of the different test types available in Sphinx |
Correct_for_overhead | See CORRECT_FOR_OVERHEAD mode |
Max_Repetition | See MAXREPDEFAULT mode |
Min_Repetition | See MINREPDEFAULT mode |
Standard_Deviation | See STANDARDDEVIATIONDEFAULT mode |
Dimensions | See DIMENSIONS_DEFAULT mode |
Variation | See VARIATION_LIST mode |
Scale | See SCALE_LIST mode |
Max_Steps | See MAXSTEPSDEFAULT_LIST mode |
Start | See START_LIST mode |
End | See END_LIST mode |
Stepwidth | See STEPWIDTH_LIST mode |
Min_Distance | See MINDIST_LIST mode |
Max_Distance | See MAXDIST_LIST mode |
Default_Message_length | See MESSAGELEN mode |
Default_Chunks | See CHUNKS mode |
All fields of a test description other than the type field are optional. Test descriptions often consist only of a name, a {, a Type = X line and a }. Properly specified defaults enable this simplicity.
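As an illustration, a minimal input file in the style described above might look like the following. The mode values, measurement name, and Type value here are hypothetical placeholders (the exact Type identifiers are not reproduced in this document); consult the sphinx_defaults file and the test.sphinx* samples in the distribution for real values and exact layout.

```
@USER
Jane Doe
@MACHINE
16-node SMP cluster
@CORRECT_FOR_OVERHEAD
yes
@MEASUREMENTS
bcast_scaling {
  Type = <a test type from the table below>
  Dimensions = 2
  Variation = NODES LENGTH
  Scale = FIXED_LINEAR FIXED_LINEAR
  Max_Steps = 8 8
}
```

Here the test varies both the number of MPI tasks (NODES) and the message length (LENGTH), using the list forms of the Variation, Scale and Max_Steps fields as described above; all other parameters fall back to their defaults.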
A measurement name is a text string without spaces and need not have any relation to the test type. This is unfortunate; future versions may augment the names in the output file with a type-specific string.
If the same name is used for several test descriptions, Sphinx automatically extends the second and later occurrences with a unique integer. This mechanism ensures that every test description results in a test run.
SPHINX INDEPENDENT VARIABLE TYPES
NAME | SEMANTIC MEANING |
NO_VARIATION | No independent variable |
ITERS | Number of iterations per timing |
NODES | Number of MPI tasks |
LENGTH | Message length (output format is in bytes) |
ROOT | Root task (relevant to asynchronous MPI collective tests) |
ACKER | Task that sends acknowledgement message (relevant to fan-out MPI collective tests) |
OVERLAP | Computational overlap time |
SECOND_OVERLAP | Second computational overlap time |
MASTER_BINDING | CPU to which master (i.e. first/timing) thread is bound |
SLAVE_BINDING | CPU to which slave (i.e. second) thread is bound |
THREADS | Number of threads |
SCHEDULE | OpenMP scheduling option |
SCHEDULE_CAP | Iterations per OpenMP THREAD(?) |
SCHEDULE_CHUNK | OpenMP schedule chunk size option |
WORK_FUNCTION | Function used inside OpenMP loops |
WORK_AMOUNT | Parameter that determines duration of function used inside OpenMP loops |
OVERLAP_FUNCTION | Overlap function for mixed non-blocking MPI/OpenMP tests |
CHUNKS | Number of chunks for master/worker tests (SKaMPI artifact; use at your own risk) |
Some independent variables do not alter anything for some test types. In general, an effort has been made to allow variation of these variables, although some combinations may lead to internally detected errors. In any event, independent variable selections should be made with care, both to ensure that test descriptions test interesting variations and to keep the overall run time from becoming excessive.
SPHINX TEST TYPES
NUMBER | DESCRIPTION | TIMING RESULT |
MPI Ping-pong using MPI_Send and MPI_Recv | Round trip latency | |
MPI Ping-pong using MPI_Send and MPI_Recv with MPI_ANY_TAG | Round trip latency | |
MPI Ping-pong using MPI_Send and MPI_Irecv | Round trip latency | |
MPI Ping-pong using MPI_Send and MPI_Iprobe/MPI_Recv combination | Round trip latency | |
MPI Ping-pong using MPI_Ssend and MPI_Recv | Round trip latency | |
MPI Ping-pong using MPI_Isend and MPI_Recv | Round trip latency | |
MPI Ping-pong using MPI_Bsend and MPI_Recv | Round trip latency | |
MPI bidirectional communication using MPI_Sendrecv in both tasks | Operation latency | |
MPI bidirectional communication using MPI_Sendrecv_replace in both tasks | Operation latency | |
SKaMPI artifact: master/worker with MPI_Waitsome | Not clear (use at own risk) | |
SKaMPI artifact: master/worker with MPI_Waitany | Not clear (use at own risk) | |
SKaMPI artifact: master/worker with MPI_Recv with MPI_ANY_SOURCE | Not clear (use at own risk) | |
SKaMPI artifact: master/worker with MPI_Send | Not clear (use at own risk) | |
SKaMPI artifact: master/worker with MPI_Ssend | Not clear (use at own risk) | |
SKaMPI artifact: master/worker with MPI_Isend | Not clear (use at own risk) | |
SKaMPI artifact: master/worker with MPI_Bsend | Not clear (use at own risk) | |
Round of MPI_Bcast over all tasks | Lower bound of operation latency | |
Repeated MPI_Barrier calls (provides a reasonable lower bound of operation latency) | Per task overhead at task zero | |
Round of MPI_Reduce over all tasks | Lower bound of operation latency | |
Repeated MPI_Alltoall calls (provides a reasonable lower bound of operation latency) | Per task overhead at task zero | |
Repeated MPI_Scan calls | Per task overhead at task zero | |
Repeated MPI_Comm_split (provides a reasonable lower bound of operation latency) (note: "leaks" MPI_Comm results; future changes will eliminate this problem) | Per task overhead at task zero | |
Repeated memcpy calls | Time per memcpy call | |
Repeated MPI_Wtime calls | Clock overhead | |
Repeated MPI_Comm_rank calls | Time per MPI_Comm_rank call | |
Repeated MPI_Comm_size calls | Time per MPI_Comm_size call | |
Repeated MPI_Iprobe calls with no message expected | MPI_Iprobe call overhead | |
Repeated MPI_Buffer_attach and MPI_Buffer_detach | MPI_Buffer_attach/detach call overhead | |
Empty function call with point to point pattern | Function call overhead | |
Empty function call with master/worker pattern | Function call overhead | |
Empty function call with collective pattern | Function call overhead | |
Empty function call with simple pattern | Function call overhead | |
Round of MPI_Gather over all tasks | Lower bound of operation latency | |
Round of MPI_Scatter over all tasks | Lower bound of operation latency | |
Repeated MPI_Allgather calls (provides a reasonable lower bound of operation latency) | Per task overhead at task zero | |
Repeated MPI_Allreduce calls (provides a reasonable lower bound of operation latency) | Per task overhead at task zero | |
Round of MPI_Gatherv over all tasks | Lower bound of operation latency | |
Round of MPI_Scatterv over all tasks | Lower bound of operation latency | |
Repeated MPI_Allgatherv calls (provides a reasonable lower bound of operation latency) | Per task overhead at task zero | |
Repeated MPI_Alltoallv calls (provides a reasonable lower bound of operation latency) | Per task overhead at task zero | |
Repeated MPI_Reduce_scatter calls | Per task overhead at task zero | |
Repeated calls to MPI_Bcast, each followed by an MPI_Barrier call | Upper bound of operation latency | |
Repeated calls to MPI_Bcast | Per task overhead at root task | |
Round of MPI_Bcast over all tasks (identical to type 17) | Lower bound of operation latency | |
Repeated calls to MPI_Bcast, each followed by an acknowledgement from every other task to root task | Upper bound of operation latency | |
Repeated calls to MPI_Bcast, each followed by an acknowledgement from one task to root task; tested over all acknowledgers provides accurate measure of operation latency | Operation latency to acknowledging task | |
Repeated calls to MPI_Alltoall, each call followed by a barrier implemented with MPI_Send and MPI_Recv operations (provides a reasonable upper bound of operation latency) | Upper bound of operation latency | |
Repeated calls to MPI_Gather, each call followed by a broadcast implemented with MPI_Send and MPI_Recv operations (provides a reasonable upper bound of operation latency) | Upper bound of operation latency | |
Repeated calls to MPI_Scatter, each followed by an acknowledgement from one task to root task; tested over all acknowledgers provides accurate measure of operation latency | Operation latency to acknowledging task | |
Repeated calls to MPI_Allgather, each call followed by a barrier implemented with MPI_Send and MPI_Recv operations (provides a reasonable upper bound of operation latency) | Upper bound of operation latency | |
Repeated calls to MPI_Allreduce, each call followed by a barrier implemented with MPI_Send and MPI_Recv operations (provides a reasonable upper bound of operation latency) | Upper bound of operation latency | |
Repeated calls to MPI_Gatherv, each call followed by a broadcast implemented with MPI_Send and MPI_Recv operations (provides a reasonable upper bound of operation latency) | Upper bound of operation latency | |
Repeated calls to MPI_Scatterv, each followed by an acknowledgement from one task to root task; tested over all acknowledgers provides accurate measure of operation latency | Operation latency to acknowledging task | |
Repeated calls to MPI_Allgatherv, each call followed by a barrier implemented with MPI_Send and MPI_Recv operations (provides a reasonable upper bound of operation latency) | Upper bound of operation latency | |
Repeated calls to MPI_Reduce_scatter, each call followed by a barrier implemented with MPI_Send and MPI_Recv operations | Upper bound of operation latency | |
Repeated calls to MPI_Alltoallv, each call followed by a barrier implemented with MPI_Send and MPI_Recv operations (provides a reasonable upper bound of operation latency) | Upper bound of operation latency | |
Repeated calls to MPI_Reduce, each call followed by a broadcast implemented with MPI_Send and MPI_Recv operations (provides a reasonable upper bound of operation latency) | Upper bound of operation latency | |
Function call with for loop of number of tasks iterations in collective pattern | Overhead of function call with for loop | |
Computation used for non-blocking MPI tests | Time of overlap computation | |
Overlap of computation with MPI_Isend (not fully tested; use at own risk) | Overlap potential of MPI_Isend | |
Overlap of computation with MPI_Isend plus overlap of acknowledgement message (not fully tested; use at own risk) | Overlap potential of MPI_Isend | |
Overlap of computation with MPI_Irecv (not fully tested; use at own risk) | Overlap potential of MPI_Irecv | |
Repeated MPI_Reduce calls | Per task overhead at root task | |
Repeated MPI_Gather calls | Per task overhead at root task | |
Repeated MPI_Gatherv calls | Per task overhead at root task | |
Repeated MPI_Comm_dup calls (internal difference improves scalability compared to test 69) (note: "leaks" MPI_Comm results; future changes will eliminate this problem) | Per task overhead at task zero | |
Repeated MPI_Comm_split calls (internal difference improves scalability compared to test 22) (note: "leaks" MPI_Comm results; future changes will eliminate this problem) | Per task overhead at task zero | |
Repeated MPI_Comm_dup calls (note: "leaks" MPI_Comm results; future changes will eliminate this problem) | Per task overhead at task zero | |
Repeated calls to MPI_Scan, each followed by an acknowledgement from one task to task zero; tested over all acknowledgers provides accurate measure of operation latency | Operation latency to acknowledging task | |
Repeated MPI_Scan calls (internal difference improves scalability compared to test 21) | Per task overhead at task zero | |
Ping-pong using pthread_cond_signal and pthread_cond_wait | "Round trip" latency | |
Repeated calls to pthread_cond_signal; as many as the slave thread can wait for are caught | Overhead of pthread_cond_signal | |
Repeated uncaught calls to pthread_cond_signal | Overhead of pthread_cond_signal | |
Repeated calls to pthread_cond_wait; matching calls to pthread_cond_signal are made "as quickly as possible" | Overhead of pthread_cond_wait | |
Ping-pong using pthread_mutex_lock and pthread_mutex_unlock (four separate locks) | "Round trip" latency | |
Repeated interleaved pthread_mutex_lock and pthread_mutex_unlock calls (four separate locks, round robin access order) | Overhead of pthread_mutex_lock and pthread_mutex_unlock | |
Repeated pthread_mutex_lock and pthread_mutex_unlock calls (one lock) | Overhead of pthread_mutex_lock and pthread_mutex_unlock | |
Repeated spin on shared variable; measures per thread time slice when bound to the same CPU | Per thread time slice | |
Chain of pthread_create calls for detached process scope threads | Overhead of pthread_create | |
Repeated calls to sched_yield (thr_yield for Suns); measures thread context switch time when bound to the same CPU (use with care, depends on OS thread scheduling) | Thread context switch time | |
Repeated pthread_mutex_lock calls (large array of locks) then repeated calls of pthread_mutex_unlock calls (large array of locks); each set of calls is timed separately | Overhead of pthread_mutex_lock and overhead of pthread_mutex_unlock (separate measurements) | |
Repeated interleaved pthread_mutex_lock and pthread_mutex_unlock calls (two separate locks, round robin access order) | Overhead of pthread_mutex_lock and pthread_mutex_unlock | |
Repeated interleaved pthread_mutex_lock and pthread_mutex_unlock calls (three separate locks, round robin access order) | Overhead of pthread_mutex_lock and pthread_mutex_unlock | |
Repeated interleaved pthread_mutex_lock and pthread_mutex_unlock calls (five separate locks, round robin access order) | Overhead of pthread_mutex_lock and pthread_mutex_unlock | |
Repeated interleaved pthread_mutex_lock and pthread_mutex_unlock calls (six separate locks, round robin access order) | Overhead of pthread_mutex_lock and pthread_mutex_unlock | |
Ping-pong using pthread_mutex_lock and pthread_mutex_unlock (array of four locks) | "Round trip" latency | |
Repeated interleaved pthread_mutex_lock and pthread_mutex_unlock calls (array of four locks, round robin access order) | Overhead of pthread_mutex_lock and pthread_mutex_unlock | |
Repeated interleaved pthread_mutex_lock and pthread_mutex_unlock calls (large array of locks, round robin access order) | Overhead of pthread_mutex_lock and pthread_mutex_unlock | |
Repeated interleaved pthread_mutex_lock and pthread_mutex_unlock calls (seven separate locks, round robin access order) | Overhead of pthread_mutex_lock and pthread_mutex_unlock | |
Repeated interleaved pthread_mutex_lock and pthread_mutex_unlock calls (eight separate locks, round robin access order) | Overhead of pthread_mutex_lock and pthread_mutex_unlock | |
Repeated interleaved pthread_mutex_lock and pthread_mutex_unlock calls (nine separate locks, round robin access order) | Overhead of pthread_mutex_lock and pthread_mutex_unlock | |
Repeated interleaved pthread_mutex_lock and pthread_mutex_unlock calls (ten separate locks, round robin access order) | Overhead of pthread_mutex_lock and pthread_mutex_unlock | |
Repeated interleaved pthread_mutex_lock and pthread_mutex_unlock calls (large array of locks, round robin access order) | Overhead of pthread_mutex_lock and pthread_mutex_unlock | |
Repeated pthread_mutex_lock and pthread_mutex_unlock calls (one lock, two tight pairs of calls) | Overhead of pthread_mutex_lock and pthread_mutex_unlock | |
Repeated pthread_mutex_lock and pthread_mutex_unlock calls (one lock, three tight pairs of calls) | Overhead of pthread_mutex_lock and pthread_mutex_unlock | |
Repeated pthread_mutex_lock and pthread_mutex_unlock calls (one lock, four tight pairs of calls) | Overhead of pthread_mutex_lock and pthread_mutex_unlock | |
Repeated calls to sched_yield (thr_yield for Suns); measures thread context switch time when bound to the same CPU (uses two "new" threads; can overcome some scheduling quirks) (use with care, depends on OS thread scheduling) | Thread context switch time | |
Chain of pthread_create calls for detached system scope threads | Overhead of pthread_create | |
Chain of pthread_create calls for undetached process scope threads | Overhead of pthread_create | |
Chain of pthread_create calls for undetached system scope threads | Overhead of pthread_create | |
Function call with for loop of number of tasks iterations in simple pattern | Overhead of function call with for loop | |
Repeated calls to work function (not fully tested, use at own risk) | Reference measurement for OpenMP parallel construct | |
Repeated calls to an OpenMP parallel region of work function | Overhead of OpenMP parallel construct | |
Repeated calls to for loop over work function (not fully tested, use at own risk) | Reference measurement for OpenMP parallel for construct | |
Repeated calls to an OpenMP parallel for over work function | Overhead of OpenMP parallel for construct | |
Repeated calls to an OpenMP parallel for with variable chunk sizes over work function | Overhead of OpenMP parallel for with variable chunk sizes construct | |
Repeated calls to an OpenMP parallel for loop over work function (not fully tested, use at own risk) | Reference measurement for OpenMP ordered construct | |
Repeated calls to an OpenMP parallel for with ordered clause over work function | Overhead of OpenMP parallel for with ordered clause | |
Repeated calls to an OpenMP parallel for with ordered work function calls | Overhead of OpenMP ordered construct | |
Repeated calls to for loop over work function (not fully tested, use at own risk) | Reference measurement for OpenMP single and barrier constructs | |
Repeated calls to for loop over work function inside OpenMP single construct (not fully tested, use at own risk) | Overhead of OpenMP single construct | |
Repeated calls to for loop over work function following an OpenMP barrier construct (not fully tested, use at own risk) | Overhead of OpenMP barrier construct | |
Repeated calls to for loop over work function, results summed (not fully tested, use at own risk) | Reference measurement for OpenMP reduction construct | |
Repeated calls to an OpenMP parallel for loop with reduction clause over work function (not fully tested, use at own risk) | Overhead of OpenMP reduction construct | |
Repeated calls to integer increment and work function (not fully tested, use at own risk) | Overhead of OpenMP single construct | |
Repeated calls to for loop over integer increment inside an OpenMP atomic construct and work function (not fully tested, use at own risk) | Overhead of OpenMP barrier construct | |
Repeated calls to a mixed OpenMP/MPI barrier followed by work function call (provides a reasonable lower bound of operation latency) | OpenMP-test-style overhead of mixed OpenMP/MPI barrier | |
Repeated mixed OpenMP/MPI barrier calls (provides a reasonable lower bound of operation latency) | Per task overhead at task zero | |
Repeated calls to a mixed OpenMP/MPI reduce across all threads in all tasks followed by work function call (provides a reasonable lower bound of operation latency) | OpenMP-test-style overhead of mixed OpenMP/MPI all reduce | |
Repeated calls to mixed OpenMP/MPI reduce across all threads in all tasks (provides a reasonable lower bound of operation latency) | Per task overhead at task zero | |
Repeated calls to mixed OpenMP/MPI reduce across all threads in all tasks (essentially) followed by a mixed OpenMP/generic MPI barrier | Upper bound of operation latency | |
Computation used for non-blocking MPI mixed with OpenMP tests | Time of threaded overlap computation | |
Overlap of OpenMP threaded computation with MPI_Isend (not fully tested; use at own risk) | Overlap potential of MPI_Isend | |
Overlap of OpenMP threaded computation with MPI_Isend plus overlap of acknowledgement message (not fully tested; use at own risk) | Overlap potential of MPI_Isend | |
Overlap of OpenMP threaded computation with MPI_Irecv (not fully tested; use at own risk) | Overlap potential of MPI_Irecv | |
The preceding table provides only a brief description of the tests and their results; the referenced papers provide further detail. Of course, a complete understanding can only result from careful consideration of the code. For details of the corrections applied when the correct-for-overhead mode is used, consult the code.
SPHINX INDEPENDENT VARIABLE SCALES
NAME | DESCRIPTION |
FIXED_LINEAR | Fixed linear scale; use up to MAXSTEPS values exactly STEPWIDTH apart |
DYNAMIC_LINEAR | Dynamic linear scale; use values exactly STEPWIDTH apart, then fill in until either exactly MAXSTEPS values are used or no "holes" remain |
FIXED_LOGARITHMIC | Fixed logarithmic scale; use up to MAXSTEPS values "logarithmically" exactly STEPWIDTH apart |
DYNAMIC_LOGARITHMIC | Dynamic logarithmic scale; use values "logarithmically" exactly STEPWIDTH apart, then fill in until either exactly MAXSTEPS values are used or no "holes" remain |
Linear scales are reasonably intuitive; with a fixed logarithmic scale, a STEPWIDTH of the square root of two results in doubling the previous value. The default STEPWIDTH actually varies with the scale, since a STEPWIDTH of 1.00 would not result in any variation with logarithmic scales; the default is the square root of two for logarithmic scales.
SPHINX WORK FUNCTIONS
NAME | DESCRIPTION |
SIMPLE_WORK | A simple for loop of WORK_AMOUNT iterations, each iteration has a single FMA plus a few branch statements based on mod tests and possibly an integer shift |
BORS_WORK | Complex set of array operations; duration per WORK_AMOUNT unit is relatively long |
SPIN_TIMED_WORK | Loop over checks to see if work function has lasted WORK_AMOUNT nanoseconds |
SLEEP_TIMED_WORK | Loop over checks to see if work function has lasted WORK_AMOUNT nanoseconds, followed by sleep and usleep of the remaining time |
The effect of varying work functions should be limited to cache effects. A future paper will present results addressing the validity of this expectation.
SPHINX SCHEDULE VALUES
VALUE | DESCRIPTION |
STATIC_SCHED | static |
DYNAMIC_SCHED | dynamic |
GUIDED_SCHED | guided |
The standard OpenMP names for the scheduling options also suffice, since Sphinx uses a case-insensitive minimum-string mechanism to determine the value. Support for the OpenMP runtime schedule option may be added in the future.
SPHINX OVERLAP FUNCTION VALUES
VALUE | DESCRIPTION |
SEQUENTIAL | sequential work function (i.e. not in an OpenMP parallel region) |
PARALLEL | work function inside OpenMP parallel region |
PARALLEL_FOR | work function inside OpenMP parallel for construct |
PARALLEL_FOR_CHUNKS | work function inside OpenMP parallel for construct with variable chunks |
The log mechanism inherited from SKaMPI allows multiple runs of the same input file, running the full set of test descriptions to completion each time. If the log file contains an end-of-run message, then the log file and output file are moved to file names extended with an integer and a new run of the full set is started. Sphinx includes corrections for some bugs in this mechanism; these corrections ensure that each test description is run to completion exactly once, regardless of its name or the status of partial runs.
Sphinx retains a somewhat cumbersome output mechanism from SKaMPI. In particular, Sphinx can generate output to four different places: an output file, a log file, stdout and stderr. The stream to stdout comprises only simple informational messages; generally it can be redirected to /dev/null without consequence. The stream to stderr includes any error messages, in addition to some informational messages; since most of the errors are also directed to the log file, it can also be redirected to /dev/null without consequence in MOST cases (no promise that all error messages are duplicated). The log file also includes mostly informational messages. It is used for the restart mechanism, which can be very useful when tests are run through a batch system.
The output file is the most interesting of the Sphinx output streams. It is divided into two sections. The first section contains a summary of the default parameters, a listing of several of the text input modes and (often most importantly) a dump of the environment pointer. The second section contains the output records corresponding to the test descriptions in the input file. Each output record is again divided into two sections: the first contains a summary of the test description; the second contains the results for the data points specified by the test description (run-time settings, such as the number of MPI tasks, also affect data point selection). Each line of this second section is an output result: the first n fields are the independent variable values, and the next field is the test timing result in microseconds. The timing result is followed by the number of timings, the standard deviation of the timings that produced the result and then any informational flags for the timing set, such as UNSETTLED, which indicates that the standard deviation of the timings did not achieve the cut-off percentage.
Timing results are in a per "operation" form; thus the standard deviation generally applies to the timing result times the number of iterations of the operation. In addition, results may also have been corrected for overhead, which can be significant, particularly for the accurate tests of fan-out MPI collectives.
These benchmarks have been run on a wide variety of platforms: IBM SPs with four-way, eight-way and sixteen-way SMP nodes; clusters based on Compaq Alpha four-way and eight-way SMP nodes; a cluster based on SGI Origin 2000 256-way nodes; and several Sun SPARC-based systems with SMP nodes. The suite has also been used with MPICH-g for Grid-based computing tests. The largest runs have used over 1500 MPI tasks. The current test harness supports these large runs, although some of the collective tests do not scale well. Care should be exercised in the choice of tests when running large scale MPI tests; even then, poorly scaling MPI implementations will be difficult to test at large scales. The Pthreads tests are generally designed to require no more than three active threads at a time; additional tests that use more concurrent threads may be implemented in the future. All of the OpenMP tests scale up to at least 16 threads; their scaling should depend solely on the scaling of the implementation being tested.
The code uses MPI_Wtime to measure wall clock time throughout its tests. Future versions may include the option to use the Unix gettimeofday call so that a non-MPI version can be built. The time to run the code depends on the tests selected, the implementations being tested and the scale of the system being tested. The Pthreads tests typically can all be run in a few minutes; the OpenMP and MPI tests can take much longer.
1. R.H. Reussner, "User Manual of SKaMPI, Special Karlsruher MPI-Benchmark," Tech. Report, University of Karlsruhe, 1998.
2. B.R. de Supinski and J. May, "Benchmarking Pthreads Performance," Proc. of the 1999 Intl. Conf. on Parallel and Distributed Processing Techniques and Applications, 1999, pp. 1985-1991.
3. B.R. de Supinski and N. Karonis, "Accurately Measuring Broadcasts in a Computational Grid," Proc. of the 8th Intl. Symp. on High Performance Distributed Computing, 1999, pp. 29-37.
UCRL-MI-143659
Revised April 4, 2001
UCRL-CODE-99026