White PSE Milepost

Sphinx Microbenchmark Suite Description

About Sphinx

Sphinx is an integrated parallel microbenchmark suite. It was adapted from the Special Karlsruhe MPI (SKaMPI) Benchmark suite [1] by Bronis R. de Supinski and other members of the PSE/ASDE project, including John May and Bor Chan. LLNL adaptations include extensive tests of the Pthreads interface [2] and ongoing integration of the LLNL OpenMP Performance Suite. In addition, several new MPI tests have been added, primarily focusing on the performance of collective operations, including the first widely available tests that accurately measure the operation latency of fan-out collective operations such as MPI_Bcast [3]. The entire suite is implemented in C and has been run on a wide variety of platforms.

Sphinx is highly portable. The primary portability issue involves the use of processor bindings in the Pthreads tests. The mechanism for binding threads to processors varies widely between platforms, both in interface and in capabilities. Sphinx includes a module that abstracts most of these issues. However, care must be exercised, as several platforms have failed to exhibit appropriate binding behavior. Problems in this area that we previously detected under AIX have been resolved, in part due to our previously published results on Blue. Tests auxiliary to Sphinx demonstrate that AIX now alternates the scheduling of threads bound to the same CPU, while our tests have always indicated that AIX reliably binds individual threads (and not the entire process) to the selected CPU.

PSE's Role in Sphinx's Development / Availability

As discussed above, this suite was developed by Bronis R. de Supinski and other members of the PSE/ASDE project. Sphinx has been released for unrestricted distribution (UCRL CODE-99026). Sphinx source is available for download.

Results of PSE Milepost Sphinx Tests

Sphinx currently includes tests for measuring the performance of MPI, Pthreads, and OpenMP. We ran the full set of MPI and Pthreads tests, including scaling runs of most of the tests of MPI collective operations. This testing not only provides a performance baseline for these important run-time systems but also fully exercises their functionality. For OpenMP testing, we chose to use the stand-alone OpenMP tests, since we are still in the process of validating the OpenMP tests in Sphinx.

Our results on the Power3 systems (Snow, Frost, and White) generally met or exceeded our expectations. For example, we were pleased to see that the cost of creating a thread under Pthreads has decreased significantly compared to its cost on Blue. Similarly, the results of our various point-to-point MPI pingpong tests showed significantly increased bandwidth on the new systems and were consistent with IBM's performance claims. Despite the generally positive results, we did find some problems. For example, Pthreads mutexes perform significantly worse as more CPUs are added to the system, even though our tests use only two threads. We also observed scaling problems with the MPI_Allreduce implementations. We are continuing to investigate these problems and are actively working with IBM to resolve them. Output files from all of our Sphinx runs are available below. The remainder of this section discusses our Pthreads and MPI results separately and in detail.

Pthreads Results

We ran the full set of Pthreads tests included in Sphinx on Blue, Snow, and Frost. Our results generally indicate that IBM has improved its Pthreads implementation significantly. For example, we have observed an approximately 60% reduction in the number of cycles required to create a thread. However, we observed a substantial increase in the cost of exchanging information between threads, through either condition variables or mutexes, when the threads are not bound to specific CPUs by the user. Further testing revealed that setting the YIELDLOOPTIME environment variable to any value eliminates the problem for mutexes. We are working with IBM to identify the reason for this anomaly and are continuing to investigate the results for condition variables. The rest of this section presents graphs of our Pthreads results; again, the actual output records are available below.

Our Pthreads tests break down into roughly two categories: thread utility tests and thread synchronization tests. The utility tests measure the time for thread creation, for an active thread context switch, and for the time slice allotted to a thread. We do not discuss the time slice test in detail; this is generally an OS parameter, and our measurements agree with the OS installation settings. The synchronization tests measure the costs of using Pthreads condition variables and mutexes.

Throughout this section, we focus on results normalized to account for the different clock rates of the CPUs on the machines. As with the OpenMP testing, this normalization is reasonable since we expect the Pthreads routines to offer little opportunity to exploit the superscalar features of the PowerPC architecture. In all cases, the cycle count graphs are accompanied by links to graphs that directly compare the wall clock times.

We correct the measurements of our Pthreads tests for the overhead of the Sphinx measurement harness. This correction is important since this overhead can be a substantial portion of the total measurement for short operations like those used in our Pthreads tests. For most of our tests, the measurement harness overhead is a function call (it is slightly different for ones that use a chain of calls, like our thread creation test). We measure the function call overhead by repeatedly executing a function with an empty body. Our results for function call overhead on the three systems are all within a 3 cycle range, indicating that this cost has not changed between any of the machines.
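As an illustration, the following is a minimal sketch of such a function call overhead measurement; it is not the Sphinx harness itself, and the constants and the timer are illustrative choices.

/*
 * Minimal sketch of measuring function call overhead (illustrative code, not
 * the Sphinx harness): time a loop of calls to a nearly empty function and
 * divide by the iteration count.  The volatile counter keeps the compiler
 * from optimizing the call away.
 */
#include <stdio.h>
#include <sys/time.h>

#define CALLS 1000000L

static volatile long sink = 0;

static void empty_function(void)
{
    sink++;
}

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

int main(void)
{
    long i;
    double start = seconds();

    for (i = 0; i < CALLS; i++)
        empty_function();
    printf("function call overhead: %g s per call\n",
           (seconds() - start) / CALLS);
    return 0;
}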

We measure the cost of creating threads by starting a timer and then creating a thread. The created thread creates another thread and then exits. This process is repeated for a large number of threads. Before it exits, the final thread in the chain signals a condition variable that causes the original thread to stop the timer. Our measurement is then the total time divided by the number of threads created, i.e., the length of the chain. As stated above, our results indicate that this cost has decreased significantly on Snow compared to Blue, while the cost in cycles on Frost is essentially equal to the cost on Snow.
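The following is a minimal sketch of this chain measurement (illustrative names and constants, not the Sphinx source).

/*
 * Minimal sketch of the thread creation chain (illustrative, not the Sphinx
 * source): each thread creates its successor and exits; the last link signals
 * a condition variable so the timing thread can stop the clock.
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

#define CHAIN_LENGTH 1000L

static pthread_mutex_t done_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done_cond = PTHREAD_COND_INITIALIZER;
static int done = 0;

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

static void *chain_link(void *arg)
{
    long remaining = (long)arg;

    if (remaining > 0) {
        pthread_t next;
        pthread_create(&next, NULL, chain_link, (void *)(remaining - 1));
        pthread_detach(next);
    } else {
        /* last link in the chain: wake the timing thread */
        pthread_mutex_lock(&done_lock);
        done = 1;
        pthread_cond_signal(&done_cond);
        pthread_mutex_unlock(&done_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t first;
    double start = seconds();

    pthread_create(&first, NULL, chain_link, (void *)(CHAIN_LENGTH - 1));
    pthread_detach(first);

    pthread_mutex_lock(&done_lock);
    while (!done)
        pthread_cond_wait(&done_cond, &done_lock);
    pthread_mutex_unlock(&done_lock);

    printf("thread creation time: %g s per thread\n",
           (seconds() - start) / CHAIN_LENGTH);
    return 0;
}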

We bind two threads to the same CPU in our active context switch test. Each thread then repeatedly calls sched_yield. We obtain the time for the operating system to remove a yielding thread and schedule another one by dividing the time for these operations by the total number of yields. Our results indicate that this cost has decreased slightly on the new systems.
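A minimal sketch of this test follows; the platform-specific binding call, which Sphinx hides behind its binding module, is omitted and marked as hypothetical.

/*
 * Minimal sketch of the active context switch test (illustrative, not the
 * Sphinx source).  Both threads must be bound to the same CPU for every
 * sched_yield to force a switch to the other thread; without the binding, on
 * an idle multiprocessor the threads may run on different CPUs and the
 * measurement degenerates.
 */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <sys/time.h>

#define YIELDS 100000

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

static void *yielder(void *arg)
{
    int i;
    (void)arg;
    /* bind_this_thread_to_cpu(0);  -- hypothetical, platform-specific call */
    for (i = 0; i < YIELDS; i++)
        sched_yield();
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    double start = seconds();

    pthread_create(&t[0], NULL, yielder, NULL);
    pthread_create(&t[1], NULL, yielder, NULL);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    printf("active context switch: %g s per yield\n",
           (seconds() - start) / (2.0 * YIELDS));
    return 0;
}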

Our Pthreads synchronization tests break down into two categories. One set of tests measures the overhead of using condition variables or mutexes within a thread; the other measures the cost of using these mechanisms to communicate between threads. Results for the communication tests have revealed some problems. For example, the results of our condition pingpong test, in which two threads alternate signalling and waiting on a condition variable, have increased by nearly a factor of two on Frost compared to Blue or Snow.
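For concreteness, the following minimal sketch shows one way to structure such a condition variable pingpong (illustrative constants and structure, not the Sphinx source).

/*
 * Minimal sketch of a condition variable pingpong (illustrative, not the
 * Sphinx source): two threads alternate ownership of a turn flag, each
 * signalling the other through a single condition variable.
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

#define ROUND_TRIPS 100000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int turn = 0;                  /* which thread may proceed */

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

static void *pingpong(void *arg)
{
    int me = (int)(long)arg;
    int i;

    for (i = 0; i < ROUND_TRIPS; i++) {
        pthread_mutex_lock(&lock);
        while (turn != me)
            pthread_cond_wait(&cond, &lock);
        turn = 1 - me;                /* hand the turn to the other thread */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    double start = seconds();

    pthread_create(&t[0], NULL, pingpong, (void *)0L);
    pthread_create(&t[1], NULL, pingpong, (void *)1L);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    printf("condition pingpong: %g s per round trip\n",
           (seconds() - start) / ROUND_TRIPS);
    return 0;
}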

Problems also occur with using mutexes for interthread communication. In our mutex pingpong test, we use a set of mutexes to guarantee that the lock and unlock operations are ordered between two threads, resulting in a communication pattern similar to a message passing pingpong. Our initial results revealed a significant increase in the cost of these operations when the threads are not bound, the normal mode of operation for most applications. The degradation becomes worse as more CPUs are added to the system; this is counter-intuitive since our code has very few active threads (two user threads in addition to three or four system threads, such as the MPI progress thread). However, we have found that setting the YIELDLOOPTIME environment variable eliminates the problem. Setting the SPINLOOPTIME environment variable to a large value has a similar effect; unfortunately that choice significantly increases the cost of interthread communication for threads bound to the same CPU. We theorize that these tests indicate problems with AIX thread scheduling. These problems may result from the operating system being optimized for client server applications (as are common with web servers) and not for scientific computing using threads. Note that these environment variable settings do not significantly improve the Frost condition pingpong results.

Results of our synchronization overhead measurements are generally good. The costs of a condition signal and of mutex operations have decreased slightly. Note that the ILP results for the mutex operations use a large array of mutexes instead of a single mutex; this reduces the cost slightly on most systems, indicating that instruction-level parallelism has a small positive effect on these operations. The only synchronization overhead for which we get a negative result is the cost of a condition wait on Frost. We have found that setting the YIELDLOOPTIME environment variable reduces this cost on Frost by about 20%; setting SPINLOOPTIME appears to have no effect.

MPI Results

We have run the full range of MPI performance tests included in Sphinx on Blue, Snow, Frost, and White, using both IBM's MPI implementation and the MPICH implementation that uses the MPL device. Our results for the point-to-point tests, which consist primarily of pingpong measurements for various combinations of MPI send and receive operations, are consistent with the results of our NEWS05 testing. In particular, using on-node shared memory provided significantly greater bandwidth on all machines; this contrasts with previous testing on Blue, which had shown no significant difference between using on-node shared memory and using the switch for communication between tasks located on the same node. Our comparisons of MPICH and IBM's MPI generally showed significantly better performance for IBM's implementation on all of the machines, both for point-to-point and collective operations, although there were some exceptions. The output records for our MPI tests are available below; the remainder of this section presents a detailed analysis of our results.

Results for MPI Point-to-Point Operations

We ran the Sphinx point-to-point tests on Blue, Snow, and Frost. We used several configurations for the tests: communicating tasks on the same node, using either shared memory or the switch, and communicating tasks on different nodes (which requires using the switch). See below for output records of our point-to-point tests.

As mentioned above, most of the point-to-point tests included in Sphinx are variants of pingpong measurements. The tests measure the pingpong times for several message sizes; we then calculate the bandwidth implied by the timings. We focus on the bandwidth results; graphs of the raw times are available as links from the bandwidth graphs. The results for the MPI_Send/MPI_Recv pingpong, MPI_Send/MPI_Recv with MPI_ANY_TAG pingpong, MPI_Isend/MPI_Recv pingpong, MPI_Send/MPI_Irecv pingpong, MPI_Send/MPI_Iprobe followed by MPI_Recv pingpong, and MPI_Ssend/MPI_Recv pingpong are similar: peak bandwidth using on-node shared memory on Frost is about 500 MB/s, which is significantly higher than any of the other peak bandwidths. We note that using shared memory on Frost and Snow has a significantly greater effect than on Blue. The MPI_Bsend/MPI_Recv pingpong is the one test for which bandwidths are significantly lower; this results from the local completion semantics of the MPI_Bsend operation, which motivates implementations to copy the data into a local buffer before sending the message to its destination. In addition to these pingpong tests, the suite includes tests that measure bidirectional bandwidth. When both tasks use MPI_Sendrecv, we see that peak bidirectional bandwidth is slightly higher than the peak unidirectional bandwidth, as expected. The bandwidth decreases significantly when MPI_Sendrecv_replace is used, which is again due to the operation's semantics motivating implementations to use local memory copies.
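For concreteness, the following minimal sketch shows the basic structure of an MPI_Send/MPI_Recv pingpong bandwidth measurement; the repetition count, message sizes, and output format are illustrative and not taken from the Sphinx test code.

/*
 * Minimal pingpong sketch (illustrative, not the Sphinx source): task 0 sends
 * a message to task 1, which returns it.  Half the round-trip time gives the
 * one-way time, and message size divided by that time gives the implied
 * bandwidth.  Any additional tasks simply pass through the barrier.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 1000

static void pingpong(int size)
{
    char *buf = malloc(size);
    int rank, i;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double oneway = (MPI_Wtime() - start) / (2.0 * REPS);
    if (rank == 0)
        printf("%d bytes: %g s one way, %g MB/s\n",
               size, oneway, size / oneway / 1.0e6);
    free(buf);
}

int main(int argc, char **argv)
{
    int sizes[] = { 1, 1024, 65536, 1048576 };
    int i;

    MPI_Init(&argc, &argv);
    for (i = 0; i < 4; i++)
        pingpong(sizes[i]);
    MPI_Finalize();
    return 0;
}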

Results for MPI Collective Operations

We ran the full range of the collective tests in Sphinx on Blue and Snow (output records are available below). In addition, we ran a subset of the tests at scale on Frost and White (output records of scaling tests are also available below). In preparation for the Milepost testing, we made several changes to the Sphinx harness to improve its scalability. The most important of these was the elimination of several MPI_Bcast calls. Our initial results at scale indicated that these changes had significantly improved Sphinx scalability. However, several of the collective tests have inherently poor scaling characteristics. Often these problems result from the operation being tested (e.g. MPI_Alltoall). Nonetheless, our initial results motivated the implementation of several new, scalable tests, including the first accurate test of the operation latency of MPI_Scan.

Two different aspects of collective operation performance may be of interest to the application programmer. One of these, per-task overhead, is simply the processing time required to complete the operation at each task. The other, operation latency, is the total (minimum) time from when the first task starts the operation until the last task completes it. The tests in Sphinx focus on measuring operation latency; overhead tests are included from the original SKaMPI tests. However, these only measure overhead at task zero and not at each task. We are considering augmenting Sphinx with a full set of per-task overhead tests.

Semantically, we classify MPI collective operations as synchronous or asynchronous. In the synchronous operations, such as MPI_Barrier and MPI_Allreduce, every task supplies information to every other task. Thus, every task semantically begins and finishes the operation at the same time (implementations may not be synchronous despite the synchronous semantics). We further divide the asynchronous collectives into fan-in and fan-out operations. In fan-in operations, such as MPI_Reduce, one task receives information from every other task; in fan-out operations, such as MPI_Bcast, one task supplies information to every other task. The one exception to this classification scheme is MPI_Scan, which is asynchronous but has both fan-in (to task N-1) and fan-out (from task 0) characteristics. The type of a collective operation determines how we measure its performance.

ASDE research produced the design and implementation of the first accurate tests of the operation latency of fan-out operations [3]. The tests consist of repeated calls to the operation with the root receiving an acknowledgment from one other task between calls. For each task sending the acknowledgment, this test measures the operation latency to that acknowledger. The overall operation latency is then the maximum of these per-task latencies. We present results of the operation latency for all three fan-out collective operations on Blue and Snow:

  • MPI_Bcast: Small messages; Large Messages;
  • MPI_Scatter: Small messages; Large Messages;
  • MPI_Scatterv: Small messages; Large Messages;
    One observation consistently emerges from these results: using shared memory for on-node communication does not provide much benefit. In fact, performance is actually significantly worse for MPI_Scatterv when shared memory is used. This result is counter-intuitive and indicates that optimizations that take advantage of the shared memory could significantly improve performance.

    Since we do not know a priori which task receives the information last, we must measure the per-task latency of every task. As a result, these tests require a separate measurement for each task and, thus, inherently scale poorly. Note that when the communication topology is known, as with MPICH, we can determine a limited set of tasks that may determine the overall latency. We are investigating mechanisms to deduce the communication topology from small scale tests and, thus, to support accurate measurements of fan-out operations at scale.
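    To make the structure of one such per-task measurement concrete, the following is a minimal sketch of the acknowledgment-based MPI_Bcast latency test, written from the description above rather than from the Sphinx source; the constants are illustrative, and a complete test would also subtract the separately measured cost of the acknowledgment message.

/*
 * Sketch of the acknowledgment-based fan-out latency test (illustrative, not
 * the Sphinx source): for each candidate acknowledger, the root times
 * repeated broadcasts with a zero-byte acknowledgment between calls; the
 * operation latency is the maximum over all acknowledgers.  The acknowledgment
 * cost is not subtracted here.
 */
#include <mpi.h>
#include <stdio.h>

#define REPS  100
#define COUNT 1024

int main(int argc, char **argv)
{
    double buf[COUNT];
    double max_latency = 0.0;
    int rank, size, acker, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (acker = 1; acker < size; acker++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            MPI_Bcast(buf, COUNT, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            if (rank == acker)
                MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            else if (rank == 0)
                MPI_Recv(NULL, 0, MPI_BYTE, acker, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
        double latency = (MPI_Wtime() - start) / REPS;
        if (rank == 0 && latency > max_latency)
            max_latency = latency;
    }
    if (rank == 0)
        printf("MPI_Bcast latency bound (incl. one ack): %g s\n", max_latency);
    MPI_Finalize();
    return 0;
}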

    As a result of our Milepost testing, Sphinx now includes two tests for MPI_Scan. Our initial results motivated the implementation of a test for MPI_Scan that accurately measures its operation latency, i.e., the time required from the start of the operation until it is completed at all participating MPI tasks. The original test of MPI_Scan in Sphinx simply measured the time at task zero for repeated calls to MPI_Scan. Both our small scale and large scale results for this test, which measures the overhead at task zero of the MPI_Scan operation, suggest that the performance of MPICH's MPI_Scan implementation is constant as the number of tasks increases. The MPICH implementation uses a linear algorithm in which task zero sends to task one, which then sends to task two, and so on until finally task N-1 receives the partial result from task N-2. Thus, our original test suffers from the pipelining effect [4], in which the messages from successive MPI_Scan calls are completely overlapped. We solved this problem similarly to our method for accurately measuring fan-out operation latency: have task 0 wait to receive an acknowledgment from task N-1 between successive calls to MPI_Scan. Small scale and large scale results for this test demonstrate that the cost of MPI_Scan under MPICH increases linearly with the number of tasks, as we expect from our examination of the implementation. We note that the semantics of MPI_Scan support an accurate test that is also scalable.
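    The following minimal sketch, again based on the description above rather than on the Sphinx source, shows the acknowledgment-based MPI_Scan test; the constants are illustrative and the test requires at least two tasks.

/*
 * Sketch of the acknowledgment-based MPI_Scan latency test (illustrative, not
 * the Sphinx source): task N-1 sends a zero-byte message to task 0 after each
 * scan, so successive calls cannot pipeline.  Requires at least two tasks.
 */
#include <mpi.h>
#include <stdio.h>

#define REPS  100
#define COUNT 1024

int main(int argc, char **argv)
{
    double in[COUNT], out[COUNT];
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (i = 0; i < COUNT; i++)
        in[i] = rank;

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        MPI_Scan(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (rank == size - 1)
            MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        else if (rank == 0)
            MPI_Recv(NULL, 0, MPI_BYTE, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    if (rank == 0)
        printf("MPI_Scan latency bound (incl. one ack): %g s\n",
               (MPI_Wtime() - start) / REPS);
    MPI_Finalize();
    return 0;
}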

    No accurate tests of operation latency for fan-in and synchronous operations currently exist. For these operations, we instead use a combination of tests that produce bounds of the operation latency. Graphs of all of our results for the fan-in and synchronous operations are available below. The remainder of this section will focus on a few significant results after we discuss the form of the bounding tests.

    We measure an upper bound for both types of operations by performing the operation followed by a barrier. This test produces an upper bound because the barriers ensure that no overlap occurs between successive calls to the operation. The barrier is implemented in Sphinx with point-to-point messages to avoid differences in results arising from differences in MPI_Barrier implementations instead of the operation being measured.
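    As an illustration, a minimal sketch of this upper bound measurement for MPI_Allreduce appears below. The simple linear gather-and-release barrier shown here is only a stand-in; we do not claim it matches the point-to-point barrier actually used in Sphinx, and the constants are illustrative.

/*
 * Sketch of the upper bound measurement (illustrative, not the Sphinx
 * source): each repetition performs the operation under test followed by a
 * barrier built from point-to-point messages, so successive operations
 * cannot overlap.
 */
#include <mpi.h>
#include <stdio.h>

#define REPS  100
#define COUNT 1024

/* linear point-to-point barrier: gather zero-byte messages to task 0, then
 * release every task with another zero-byte message */
static void p2p_barrier(MPI_Comm comm)
{
    int rank, size, i;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == 0) {
        for (i = 1; i < size; i++)
            MPI_Recv(NULL, 0, MPI_BYTE, i, 0, comm, MPI_STATUS_IGNORE);
        for (i = 1; i < size; i++)
            MPI_Send(NULL, 0, MPI_BYTE, i, 1, comm);
    } else {
        MPI_Send(NULL, 0, MPI_BYTE, 0, 0, comm);
        MPI_Recv(NULL, 0, MPI_BYTE, 0, 1, comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    double in[COUNT], out[COUNT];
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < COUNT; i++)
        in[i] = rank;

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        MPI_Allreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        p2p_barrier(MPI_COMM_WORLD);
    }
    if (rank == 0)
        printf("MPI_Allreduce upper bound (incl. barrier): %g s per call\n",
               (MPI_Wtime() - start) / REPS);
    MPI_Finalize();
    return 0;
}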

    For a lower bound of the operation latency of a synchronous operation, we measure the time at task zero for repeated calls to the operation. This actually measures the overhead at task zero. However, the semantics of synchronous operations make this close to the operation latency for reasonable implementations.

    For a lower bound of the operation latency of a fan-in operation, our original test measured repeated rounds of the operation, with each round consisting of one call with each task as root. This test often reduces overlap between successive calls and thus provides a better lower bound. However, it does not scale, since the number of calls per round grows linearly with the number of tasks. For this reason, we implemented tests of the overhead of a single task for MPI_Reduce, MPI_Gather, and MPI_Gatherv. We present results for task zero in our single root lower bound results.
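    A minimal sketch of the rotating-root lower bound for MPI_Reduce appears below (illustrative constants, not the Sphinx source); the single-root variant simply fixes the root at task zero instead of looping over all roots.

/*
 * Sketch of the rotating-root lower bound test (illustrative, not the Sphinx
 * source): each round calls MPI_Reduce once with every task as root, which
 * limits overlap between successive calls but costs O(number of tasks)
 * operations per round.
 */
#include <mpi.h>
#include <stdio.h>

#define ROUNDS 100
#define COUNT  1024

int main(int argc, char **argv)
{
    double in[COUNT], out[COUNT];
    int rank, size, r, root, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (i = 0; i < COUNT; i++)
        in[i] = rank;

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (r = 0; r < ROUNDS; r++)
        for (root = 0; root < size; root++)
            MPI_Reduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, root,
                       MPI_COMM_WORLD);
    if (rank == 0)
        printf("lower bound: %g s per MPI_Reduce\n",
               (MPI_Wtime() - start) / (ROUNDS * (double)size));
    MPI_Finalize();
    return 0;
}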

    The new lower bound, task zero overhead test for MPI_Gatherv revealed poor performance on Snow for the IBM implementation but not for MPICH. The MPICH implementation was coded by Bronis de Supinski as part of the PSE/ASDE project. The original implementation, which we suspect is the one used by IBM, simply has every task send its information directly to the root. Our implementation instead uses a tree topology to funnel the information up to the root.
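    The following sketch illustrates only the tree idea: a binomial-tree gather of fixed-size contributions to root zero, written with standard MPI point-to-point calls. It is not the PSE MPI_Gatherv code, which must also handle variable counts, displacements, and arbitrary roots; the point is simply that the root receives O(log N) messages instead of N-1.

/*
 * Illustration of a binomial-tree gather (assumed simplification: fixed-size
 * contributions, root 0).  Each task collects the contiguous block of
 * contributions from its subtree and forwards the combined block to its
 * parent.
 */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* gather `count` doubles from every task into rootbuf on task 0 */
static void tree_gather(const double *mydata, double *rootbuf, int count,
                        MPI_Comm comm)
{
    int rank, size, mask, held;
    double *buf;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* working buffer large enough for this task's whole subtree */
    buf = (rank == 0) ? rootbuf
                      : (double *)malloc((size_t)size * count * sizeof(double));
    memcpy(buf, mydata, count * sizeof(double));
    held = 1;                         /* contributions currently held */

    for (mask = 1; mask < size; mask <<= 1) {
        if (rank & mask) {
            /* send everything gathered so far to the parent, then stop */
            MPI_Send(buf, held * count, MPI_DOUBLE, rank - mask, 0, comm);
            break;
        } else if (rank + mask < size) {
            /* receive the child's subtree: min(mask, size - child) tasks */
            int child = rank + mask;
            int nrecv = (size - child < mask) ? size - child : mask;
            MPI_Recv(buf + held * count, nrecv * count, MPI_DOUBLE, child, 0,
                     comm, MPI_STATUS_IGNORE);
            held += nrecv;
        }
    }
    if (rank != 0)
        free(buf);
}

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double mine[4] = { rank, rank, rank, rank };
    double *all = (rank == 0)
                      ? (double *)malloc((size_t)size * 4 * sizeof(double))
                      : NULL;
    tree_gather(mine, all, 4, MPI_COMM_WORLD);
    if (rank == 0) {
        /* all[] now holds every task's contribution in rank order */
        free(all);
    }
    MPI_Finalize();
    return 0;
}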

    Another interesting fact emerges from our results for MPI_Gather and MPI_Gatherv. Our test of MPI_Gatherv does not use the operation's ability to gather different data sizes from different tasks. Thus, we are simply using it when we could use MPI_Gather. However, MPI_Gatherv often outperforms MPI_Gather, particularly at small scale, as shown in the following graphs:

  • Comparison of MPI_Gather/Gatherv Lower Bounds: Small Scale; At Scale;
  • Comparison of MPI_Gather/Gatherv Upper Bounds: Small Scale; At Scale;
  • Comparison of MPI_Gather/Gatherv Single Root Lower Bounds: Small Scale; At Scale.
    Results of our MPI_Allreduce tests verify performance deficiencies in IBM's implementation that were discovered through separate testing focusing on this important operation. In particular, performance degrades substantially when the number of tasks is a power of two. This problem was caused by an optimization that was appropriate for systems with uniprocessor nodes but actually degraded performance with SMP nodes. A new library that eliminates this optimization has since been installed on our systems.

    We are still working with IBM to resolve some other performance problems for their collective implementations. Testing continues to indicate problems with MPI_Allreduce. In addition, as shown throughout our results, the performance gain with on-node shared memory compared to using the switch for all MPI communication is not as significant as we would expect, given the approximate reduction of three levels in tree height with sixteen-way SMP nodes.

    Bounds of Operation Latencies for Fan In Collective Operations:

  • MPI_Reduce Lower Bound: Small Scale; At Scale;
  • MPI_Reduce Upper Bound: Small Scale; At Scale;
  • MPI_Reduce Single Root Lower Bound (test implementation motivated by initial milepost results): Small Scale; At Scale;
  • MPI_Gather Lower Bound: Small Scale; At Scale;
  • MPI_Gather Upper Bound: Small Scale; At Scale;
  • MPI_Gather Single Root Lower Bound (test implementation motivated by initial milepost results): Small Scale; At Scale;
  • MPI_Gatherv Lower Bound: Small Scale;
  • MPI_Gatherv Upper Bound: Small Scale; At Scale;
  • MPI_Gatherv Single Root Lower Bound (test implementation motivated by initial milepost results): Small Scale; At Scale;

    Bounds of Operation Latencies for Synchronous Collective Operations:

  • MPI_Allreduce Lower Bound: Small Scale; At Scale;
  • MPI_Allreduce Upper Bound: Small Scale; At Scale;
  • MPI_Barrier Lower Bound: Small Scale; At Scale;
  • MPI_Alltoall Lower Bound: Small Scale; At Scale;
  • MPI_Allgather Lower Bound: Small Scale; At Scale;
  • MPI_Allgather Upper Bound: Small Scale; At Scale;
  • MPI_Alltoallv: Lower Bound; Upper Bound;
  • MPI_Allgatherv: Lower Bound; Upper Bound;
  • MPI_Reduce_scatter: Lower Bound; Upper Bound.

    What Problems Using Sphinx Did We Find / Fix During the Milepost Activities

    As discussed throughout our analysis of our results, the Milepost testing led to several improvements in the scalability of Sphinx. Prior to testing, we made changes to the Sphinx measurement harness that improved its scalability; these changes eliminated several unnecessary collective operations from the harness. In addition, the initial results of our Milepost tests motivated the design and implementation of several new, scalable tests.

    What Is the General Availability of Sphinx

    As stated above, Sphinx has been released for unrestricted distribution, and the source is available through a link from the ASDE home page.

    What Are the Known Limitations of Sphinx

    There are several limitations to this performance suite, including:

  • Most of the Pthreads tests require a capability to bind individual threads to processors;
  • OpenMP tests have not yet been validated;
  • Tests to measure stencil communication patterns are not included;
  • Tests to measure bisection bandwidth are not included;
  • Tests that provide accurate measurements of the operation latency for fan-out collective operations require at least two tasks;
  • Tests that provide accurate measurements of the operation latency for fan-out collective operations do not scale;
  • Tests that provide accurate measurements of the operation latency for synchronous or fan-in collective operations are not included (we are working on a design for accurate measurement of fan-in operations; to the best of our knowledge no such test has yet been designed for synchronous operations);
  • Tests that include the effects of mixing work with message operations are not fully implemented;

    We are continuing the development of Sphinx. Work in the current fiscal year will address many, if not all, of these limitations.

    Actual output records from Sphinx runs:

    PTHREADS TESTS

  • Blue
  • Snow
  • Frost

    MPI TESTS

    Point to Point Tests (one output record per machine, communication mechanism, and task placement):

  • Blue: Switch-Based Communication (Same Node; Different Nodes); Shared Memory Communication (Same Node);
  • Snow: Switch-Based Communication (Same Node; Different Nodes); Shared Memory Communication (Same Node);
  • Snow MPICH: Switch-Based Communication (Same Node; Different Nodes); Shared Memory Communication (Same Node);
  • Frost: Switch-Based Communication (Same Node; Different Nodes); Shared Memory Communication (Same Node).

    Collective Operations Tests (output records for Blue, Snow, Snow with on-node shared memory communication, and Snow MPICH):

  • Fan Out and Synchronous (Original): Blue at 1, 3, and 4 tasks per node; Snow at 1, 7, and 8 tasks per node; Snow with on-node shared memory communication at 7 and 8 tasks per node; Snow MPICH at 1, 7, and 8 tasks per node;
  • Fan Out and Synchronous (Implementation Motivated by Observations of Milepost Runs at Scale): Blue at 1, 3, and 4 tasks per node; Snow at 1, 7, and 8 tasks per node; Snow with on-node shared memory communication at 7 and 8 tasks per node; Snow MPICH at 1, 7, and 8 tasks per node;
  • Fan In: Blue at 1, 3, and 4 tasks per node (4 tasks per node continued in a second record); Snow at 1, 7, and 8 tasks per node; Snow with on-node shared memory communication at 7 and 8 tasks per node; Snow MPICH at 1, 7, and 8 tasks per node.

    Collective Operations Tests at Scale (output records without shared memory communication, with on-node shared memory communication, and for MPICH):

  • Fan Out and Synchronous (Original), run on White: no shared memory communication at 1, 15, and 16 tasks per node (16 tasks per node continued in a second record); on-node shared memory communication at 15 and 16 tasks per node (16 tasks per node continued in a second record); MPICH at 1, 15, and 16 tasks per node (15 and 16 tasks per node each continued in a second record);
  • Fan Out and Synchronous (Implementation Motivated by Observations of Milepost Runs at Scale), run on Frost: no shared memory communication at 1, 15, and 16 tasks per node; on-node shared memory communication at 15 and 16 tasks per node (15 tasks per node continued in a record from a run on 01/04/01); MPICH at 1, 15, and 16 tasks per node.

    The Milepost-motivated tests were run after the PTF5 upgrade. Examination of other runs made after the PTF5 upgrade revealed no significant performance differences for these tests; the output is available upon request.

    References

    1. R.H. Reussner, "User Manual of SKaMPI, Special Karlsruher MPI-Benchmark," Tech. Report, University of Karlsruhe, 1998.

    2. B.R. de Supinski and J. May, "Benchmarking Pthreads Performance," Proc. of the 1999 Intl. Conf. on Parallel and Distributed Processing Techniques and Applications, 1999, pp. 1985-1991.

    3. B.R. de Supinski and N. Karonis, "Accurately Measuring Broadcasts in a Computational Grid," Proc. of the 8th Intl. Symp. on High Performance Distributed Computing, 1999, pp. 29-37.

    4. M. Bernaschi and G. Iannello, "Collective Communication Operations: Experimental Results vs. Theory," Concurrency: Practice and Experience, 1998, Vol. 10, No. 5, pp. 359-386.

    For more information, contact: Bronis R. de Supinski, bronis@llnl.gov, (925) 422-1062.