Hashed Oct-Tree N-body Code: HOT

Background

There are several different computational approaches to model structure formation in the dark matter component. These include tree-based N-body, particle-in-cell (PIC) or particle-mesh (PM), and hybrid methods. Fundamentally, each of these methods attempts to solve the Vlasov-Poisson equations for a collisionless gas of gravitating particles. The structure of galaxy clusters and galactic halos depends on baryon hydrodynamics, as does the analysis of Lyman-alpha clouds; a complete cosmological simulation framework must therefore incorporate this physics.
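
For reference, the Vlasov-Poisson system referred to above can be written schematically (suppressing the cosmological expansion factors, which change the coefficients but not the structure of the equations) for the phase-space distribution f(x, v, t) of particles of mass m as

    \frac{\partial f}{\partial t}
      + \mathbf{v}\cdot\nabla_{\mathbf{x}} f
      - \nabla_{\mathbf{x}}\Phi\cdot\nabla_{\mathbf{v}} f = 0,
    \qquad
    \nabla^{2}\Phi = 4\pi G\rho,
    \qquad
    \rho(\mathbf{x},t) = m\int f(\mathbf{x},\mathbf{v},t)\,d^{3}v .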

One approach is to divide the inter-particle interactions into near and far sets. The forces due to distant particles can then be updated less frequently, or ignored completely if one decides the effects of distant particles are negligible. However, this risks significant errors which are hard to analyze. Also, any significant clustering in the system reduces the efficiency of the method, since a large fraction of the particles end up in a near set. Over the past several years, a number of methods have been introduced which allow N-body simulations to be performed on an arbitrary collection of bodies in time much less than N^2, where N is the number of particles, without imposition of a lattice. They all have in common the use of a truncated expansion (e.g., Taylor expansion, Legendre expansion, Poisson expansion) to approximate the contribution of many bodies with a single interaction. The resulting complexity is usually cited as N or N log N, but a careful analysis of which dependent variables should be held constant can often lead to different conclusions about the scaling; nevertheless, the final scaling is a major improvement over N^2.
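
As a concrete, if greatly simplified, illustration of the truncated-expansion idea, the sketch below keeps only the monopole (center-of-mass) term of a distant group of bodies and uses a standard Barnes-Hut-style opening test to decide when that approximation is acceptable. It is not HOT's actual multipole evaluation or cell-opening criterion; the data structure and names are illustrative only.

    #include <math.h>

    /* One node of a simplified oct-tree: either a single body (nchild == 0)
       or a cubical cell summarizing the bodies it contains. */
    typedef struct cell {
        double com[3];          /* center of mass of the bodies in the cell */
        double mass;            /* total mass of those bodies               */
        double size;            /* side length of the cell                  */
        int    nchild;          /* number of children; 0 for a single body  */
        struct cell *child[8];
    } cell_t;

    /* Accumulate the acceleration on a particle at pos[] due to cell c.
       If c is distant enough (size/r < theta), one monopole interaction
       replaces the many body-body interactions it summarizes. */
    static void accumulate_accel(const double pos[3], const cell_t *c,
                                 double theta, double eps2, double acc[3])
    {
        double dx = c->com[0] - pos[0];
        double dy = c->com[1] - pos[1];
        double dz = c->com[2] - pos[2];
        double r2 = dx*dx + dy*dy + dz*dz + eps2;

        if (c->nchild == 0 || c->size * c->size < theta * theta * r2) {
            double rinv  = 1.0 / sqrt(r2);
            double rinv3 = rinv * rinv * rinv;
            acc[0] += c->mass * dx * rinv3;
            acc[1] += c->mass * dy * rinv3;
            acc[2] += c->mass * dz * rinv3;
        } else {
            for (int i = 0; i < c->nchild; i++)
                accumulate_accel(pos, c->child[i], theta, eps2, acc);
        }
    }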

The basic algorithm underlying the HOT code may be divided into several stages. First, particles are domain decomposed into spatial groups. Second, a distributed tree data structure is constructed. In the main stage of the algorithm, this tree is traversed independently in each processor, with requests for non-local data being generated as needed. In our implementation, we assign to each particle a key based on Morton ordering. This maps the points in 3-dimensional space to a 1-dimensional list, maintaining as much spatial locality as possible. The domain decomposition is obtained by splitting this list into N_proc (number of processors) pieces. An efficient mechanism for latency hiding in the tree traversal phase of the algorithm is critical. To avoid stalls during non-local data access, we effectively do explicit 'context switching' using a software queue to keep track of which computations have been put aside waiting for messages to arrive.
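
A minimal sketch of the key construction, assuming positions have already been scaled to 10-bit integer coordinates (HOT's actual keys are wider and carry additional bookkeeping information; the function below is illustrative only):

    #include <stdint.h>

    /* Morton (bit-interleaved) key for a particle whose position has been
       mapped to integer coordinates in [0, 2^10).  Sorting particles by
       this key produces the 1-dimensional, locality-preserving list
       described above; cutting the sorted list into N_proc contiguous
       pieces gives the domain decomposition. */
    static uint32_t morton_key(uint32_t ix, uint32_t iy, uint32_t iz)
    {
        uint32_t key = 0;
        for (int b = 0; b < 10; b++) {
            key |= ((ix >> b) & 1u) << (3*b);
            key |= ((iy >> b) & 1u) << (3*b + 1);
            key |= ((iz >> b) & 1u) << (3*b + 2);
        }
        return key;
    }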

HOT has defined the state of the art in high-resolution cosmological N-body simulations over the last decade (see Performance section below). An SPH hydro capability for HOT has been developed and preliminary tests have been conducted successfully.

The main developers of HOT are Michael Warren (LANL) and John Salmon.

Code Performance

HOT has been run on a variety of parallel platforms over the last decade, e.g., on QB and QSC at LANL (Alpha clusters), on Seaborg at NERSC (IBM Power3), and on Beowulf clusters. Historical performance information is provided in the table below. Simulations with HOT have garnered Gordon Bell awards in 1992 and 1997 (First place, performance) and in 1997 (First place, price-performance).

Year  Computer                               Processors  GFlop/s  Mflop/s/proc
2003  Space Simulator (2.5 GHz P4 cluster)          288  1166           4050
2003  QB (HP AlphaServer ES45)                     3600  2793            775.8
2002  Seaborg (IBM SP3)                             256    57.7          225
2002  Green Destiny                                 212    38.9          183.5
2000  SGI Origin 2000                                64    13.1          205
1998  Avalon                                        128    16.16         126
1996  ASCI Red                                     6800   464.9           68.4
1995  TMC CM-5                                      512    14.06          27.5
1993  Intel Delta                                   512    10.02          19.6

Recent improvements of the HOT code inner loop using SSE instructions allow the calculation of about 200 million gravitational interactions per second per processor on a 2.53 GHz P4, for a computational rate of over 6 Gflops in single precision. With typical accuracy constraints requiring about 10000 interactions per particle, a single timestep with 1 billion particles requires 10^(13) interactions, which takes a couple of minutes on 1024 processors, including overheads. Overall, simulations requiring a few times 10^(17) flops should be achievable in about a day.
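
The timestep estimate follows directly from the rates quoted above:

    10^{9}\ \mathrm{particles} \times 10^{4}\ \mathrm{interactions/particle}
      = 10^{13}\ \mathrm{interactions},
    \qquad
    \frac{10^{13}}{(2\times 10^{8}\ \mathrm{s^{-1}\,proc^{-1}}) \times 1024\ \mathrm{proc}}
      \approx 50\ \mathrm{s}

of raw force evaluation per step, which becomes a couple of minutes once tree construction, communication, and other overheads are included.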

Execution time for HOT is dominated by the force calculation in the inner loop. We have collected performance figures on a variety of processors in the table below (Mflop/s on the gravitational micro-kernel benchmark). Significantly for runs on Pink, the SSE capability of the Intel Xeon/P4 currently puts it well ahead of any other processor family.

Processor                  libm (Mflop/s)  Karp (Mflop/s)
1250 MHz Alpha 21264C               935.2          1141.0
933 MHz Transmeta TM5800            189.5           373.2
375 MHz IBM Power3                  298.5           514.4
1800 MHz AMD Athlon XP              609.9           951.9
2530 MHz Intel P4 (SSE)            1170.0          6514.0
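
For orientation, the kind of loop measured by this micro-kernel looks roughly like the sketch below: Plummer-softened monopole interactions accumulated onto one particle, at roughly 30 flops per interaction, which is how a rate of 200 million interactions per second per processor translates into the ~6 Gflop/s quoted above. The 'libm' column uses the standard library square root, while the 'Karp' column replaces it with a faster in-line reciprocal square root (after A. Karp); the SSE figure evaluates several interactions at once in single precision. This is an illustrative reconstruction, not HOT's actual kernel.

    #include <math.h>

    /* Accumulate the Plummer-softened gravitational acceleration on one
       particle at (px, py, pz) from n source bodies.  Each pass through
       the loop is one "interaction".  Illustrative only, not HOT's kernel. */
    void grav_kernel(int n, const float xs[], const float ys[],
                     const float zs[], const float ms[],
                     float px, float py, float pz, float eps2, float acc[3])
    {
        float ax = 0.0f, ay = 0.0f, az = 0.0f;
        for (int j = 0; j < n; j++) {
            float dx = xs[j] - px;
            float dy = ys[j] - py;
            float dz = zs[j] - pz;
            float r2 = dx*dx + dy*dy + dz*dz + eps2;
            /* The libm variant uses sqrtf(); the Karp and SSE variants
               substitute a faster approximate reciprocal square root. */
            float rinv  = 1.0f / sqrtf(r2);
            float rinv3 = rinv * rinv * rinv;
            ax += ms[j] * dx * rinv3;
            ay += ms[j] * dy * rinv3;
            az += ms[j] * dz * rinv3;
        }
        acc[0] = ax; acc[1] = ay; acc[2] = az;
    }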


Salman Habib / LANL / revised November 03