Hashed Oct-Tree N-body Code: HOT

Background

There are several different computational approaches to model structure formation in the dark matter component. These include tree-based N-body, particle-in-cell (PIC) or particle-mesh (PM), and hybrid methods. Fundamentally, each of these methods attempts to solve the Vlasov-Poisson equations for a collisionless gas of gravitating particles. The structure of galaxy clusters and galactic halos depends on baryon hydrodynamics, as does the analysis of Lyman-alpha clouds; a complete cosmological simulation framework must therefore incorporate this physics.
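
For reference, the Vlasov-Poisson system referred to above can be written schematically (suppressing the cosmological expansion factors, which change the coefficients but not the structure of the equations) for the phase-space distribution f(x, v, t) of particles of mass m as

    \frac{\partial f}{\partial t}
      + \mathbf{v}\cdot\nabla_{\mathbf{x}} f
      - \nabla_{\mathbf{x}}\Phi\cdot\nabla_{\mathbf{v}} f = 0,
    \qquad
    \nabla^{2}\Phi = 4\pi G\rho,
    \qquad
    \rho(\mathbf{x},t) = m\int f(\mathbf{x},\mathbf{v},t)\,d^{3}v .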

One approach is to divide the inter-particle interactions into near and far sets. The forces due to distant particles can then be updated less frequently, or ignored completely if one decides the effects of distant particles are negligible. However, this risks significant errors which are hard to analyze. Also, any significant clustering in the system reduces the efficiency of the method, since a large fraction of the particles end up in a near set. Over the past several years, a number of methods have been introduced which allow N-body simulations to be performed on an arbitrary collection of bodies in time much less than N^2, where N is the number of particles, without imposition of a lattice. They all have in common the use of a truncated expansion (e.g., Taylor expansion, Legendre expansion, Poisson expansion) to approximate the contribution of many bodies with a single interaction. The resulting complexity is usually cited as N or N log N, but a careful analysis of which dependent variables should be held constant can often lead to different conclusions about the scaling; nevertheless, the final scaling is a major improvement over N^2.
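
As a concrete, if greatly simplified, illustration of the truncated-expansion idea, the sketch below keeps only the monopole (center-of-mass) term of a distant group of bodies and uses a standard Barnes-Hut-style opening test to decide when that approximation is acceptable. It is not HOT's actual multipole evaluation or cell-opening criterion; the data structure and names are illustrative only.

    #include <math.h>

    /* One node of a simplified oct-tree: either a single body (nchild == 0)
       or a cubical cell summarizing the bodies it contains. */
    typedef struct cell {
        double com[3];          /* center of mass of the bodies in the cell */
        double mass;            /* total mass of those bodies               */
        double size;            /* side length of the cell                  */
        int    nchild;          /* number of children; 0 for a single body  */
        struct cell *child[8];
    } cell_t;

    /* Accumulate the acceleration on a particle at pos[] due to cell c.
       If c is distant enough (size/r < theta), one monopole interaction
       replaces the many body-body interactions it summarizes. */
    static void accumulate_accel(const double pos[3], const cell_t *c,
                                 double theta, double eps2, double acc[3])
    {
        double dx = c->com[0] - pos[0];
        double dy = c->com[1] - pos[1];
        double dz = c->com[2] - pos[2];
        double r2 = dx*dx + dy*dy + dz*dz + eps2;

        if (c->nchild == 0 || c->size * c->size < theta * theta * r2) {
            double rinv  = 1.0 / sqrt(r2);
            double rinv3 = rinv * rinv * rinv;
            acc[0] += c->mass * dx * rinv3;
            acc[1] += c->mass * dy * rinv3;
            acc[2] += c->mass * dz * rinv3;
        } else {
            for (int i = 0; i < c->nchild; i++)
                accumulate_accel(pos, c->child[i], theta, eps2, acc);
        }
    }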

The basic algorithm underlying the HOT code may be divided into several stages. First, particles are domain decomposed into spatial groups. Second, a distributed tree data structure is constructed. In the main stage of the algorithm, this tree is traversed independently in each processor, with requests for non-local data being generated as needed. In our implementation, we assign to each particle a key based on Morton ordering. This maps the points in 3-dimensional space to a 1-dimensional list, maintaining as much spatial locality as possible. The domain decomposition is obtained by splitting this list into N_proc (number of processors) pieces. An efficient mechanism for latency hiding in the tree traversal phase of the algorithm is critical. To avoid stalls during non-local data access, we effectively do explicit 'context switching' using a software queue to keep track of which computations have been put aside waiting for messages to arrive.
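
A minimal sketch of the key construction, assuming positions have already been scaled to 10-bit integer coordinates (HOT's actual keys are wider and carry additional bookkeeping information; the function below is illustrative only):

    #include <stdint.h>

    /* Morton (bit-interleaved) key for a particle whose position has been
       mapped to integer coordinates in [0, 2^10).  Sorting particles by
       this key produces the 1-dimensional, locality-preserving list
       described above; cutting the sorted list into N_proc contiguous
       pieces gives the domain decomposition. */
    static uint32_t morton_key(uint32_t ix, uint32_t iy, uint32_t iz)
    {
        uint32_t key = 0;
        for (int b = 0; b < 10; b++) {
            key |= ((ix >> b) & 1u) << (3*b);
            key |= ((iy >> b) & 1u) << (3*b + 1);
            key |= ((iz >> b) & 1u) << (3*b + 2);
        }
        return key;
    }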

HOT has defined the state of the art in high-resolution cosmological N-body simulations over the last decade (see Performance section below). An SPH hydro capability for HOT has been developed and preliminary tests have been conducted successfully.

The main developers of HOT are Michael Warren (LANL) and John Salmon.

Code Performance

HOT has been run on a variety of parallel platforms over the last decade, e.g., on QB and QSC at LANL (Alpha clusters), on Seaborg at NERSC (IBM Power3), and on Beowulf clusters. Historical performance information is provided in the table below. Simulations with HOT have garnered Gordon Bell awards in 1992 and 1997 (First place, performance) and in 1997 (First place, price-performance).

Year  Computer                               Processors  GFlop/s  Mflop/s/proc
2003  Space Simulator (2.5 GHz P4 cluster)          288  1166           4050
2003  QB (HP AlphaServer ES45)                     3600  2793            775.8
2002  Seaborg (IBM SP3)                             256    57.7          225
2002  Green Destiny                                 212    38.9          183.5
2000  SGI Origin 2000                                64    13.1          205
1998  Avalon                                        128    16.16         126
1996  ASCI Red                                     6800   464.9           68.4
1995  TMC CM-5                                      512    14.06          27.5
1993  Intel Delta                                   512    10.02          19.6

Recent improvements of the HOT code inner loop using SSE instructions allow the calculation of about 200 million gravitational interactions per second per processor on a 2.53 GHz P4, for a computational rate of over 6 Gflops in single precision. With typical accuracy constraints requiring about 10000 interactions per particle, a single timestep with 1 billion particles requires 10^(13) interactions, which takes a couple of minutes on 1024 processors, including overheads. Overall, simulations requiring a few times 10^(17) flops should be achievable in about a day.
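
The timestep estimate follows directly from the rates quoted above:

    10^{9}\ \mathrm{particles} \times 10^{4}\ \mathrm{interactions/particle}
      = 10^{13}\ \mathrm{interactions},
    \qquad
    \frac{10^{13}}{(2\times 10^{8}\ \mathrm{s^{-1}\,proc^{-1}}) \times 1024\ \mathrm{proc}}
      \approx 50\ \mathrm{s}

of raw force evaluation per step, which becomes a couple of minutes once tree construction, communication, and other overheads are included.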

Execution time for HOT is dominated by the force calculation in the inner loop. We have collected performance figures on a variety of processors in the table below (Mflop/s on the gravitational micro-kernel benchmark). Significantly for runs on Pink, the SSE capability of the Intel Xeon/P4 currently puts it well ahead of any other processor family.

Processor                  libm (Mflop/s)  Karp (Mflop/s)
1250 MHz Alpha 21264C               935.2          1141.0
933 MHz Transmeta TM5800            189.5           373.2
375 MHz IBM Power3                  298.5           514.4
1800 MHz AMD Athlon XP              609.9           951.9
2530 MHz Intel P4 (SSE)            1170.0          6514.0
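
For orientation, the kind of loop measured by this micro-kernel looks roughly like the sketch below: Plummer-softened monopole interactions accumulated onto one particle, at roughly 30 flops per interaction, which is how a rate of 200 million interactions per second per processor translates into the ~6 Gflop/s quoted above. The 'libm' column uses the standard library square root, while the 'Karp' column replaces it with a faster in-line reciprocal square root (after A. Karp); the SSE figure evaluates several interactions at once in single precision. This is an illustrative reconstruction, not HOT's actual kernel.

    #include <math.h>

    /* Accumulate the Plummer-softened gravitational acceleration on one
       particle at (px, py, pz) from n source bodies.  Each pass through
       the loop is one "interaction".  Illustrative only, not HOT's kernel. */
    void grav_kernel(int n, const float xs[], const float ys[],
                     const float zs[], const float ms[],
                     float px, float py, float pz, float eps2, float acc[3])
    {
        float ax = 0.0f, ay = 0.0f, az = 0.0f;
        for (int j = 0; j < n; j++) {
            float dx = xs[j] - px;
            float dy = ys[j] - py;
            float dz = zs[j] - pz;
            float r2 = dx*dx + dy*dy + dz*dz + eps2;
            /* The libm variant uses sqrtf(); the Karp and SSE variants
               substitute a faster approximate reciprocal square root. */
            float rinv  = 1.0f / sqrtf(r2);
            float rinv3 = rinv * rinv * rinv;
            ax += ms[j] * dx * rinv3;
            ay += ms[j] * dy * rinv3;
            az += ms[j] * dz * rinv3;
        }
        acc[0] = ax; acc[1] = ay; acc[2] = az;
    }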


Salman Habib / LANL / revised November 03