"Return to Steve Plimpton's home page"_main.html :c
:line
MPPs versus Clusters - performance comparisons :h3
Sandia has a cluster-computing effort to build and use a series of
large-scale cluster supercomputers with off-the-shelf commodity
parts. Collectively these machines are called CPlant, which is meant
to evoke images of a factory (producing computational cycles) and of
a living/growing heterogeneous entity. Full details on the CPlant
design, hardware, and system software are given on the "CPlant home
page"_Cplant. Sandia also has a traditional massively parallel machine
with 9000 Intel Pentium processors called the Intel Tflops (ASCI Red),
which has been a production MPP platform for us for the last few years.
:link(Cplant,http://www.cs.sandia.gov/cplant)
I'm just a CPlant user, so here are the salient features of the
machine, as compared to Tflops, from an application point-of-view.
These numbers and all the timing results below are for the "alaska"
version of CPlant, which has about 300 DEC Alpha processors (500 MHz)
connected by Myrinet switches. A newer version, "siberia", with more
processors and a richer interconnect topology is just coming on-line
(summer 2000).
, Intel Tflops, DEC CPlant
Processor, Pentium (333 MHz), Alpha (500 MHz)
Peak Flops, 333 Mflops, 1000 Mflops
MPI Latency, 15 us, 105 us
MPI Bandwidth, 310 Mbytes/sec, 55 Mbytes/sec :c,tb
In a nutshell, the difference a cluster makes for applications is
faster single-processor performance and slower communication. As
you'll see below, this has a negative impact on application
scalability, particularly for fixed-size problems. Fixed-size speed-up
is running the same size problem on varying numbers of
processors. Scaled-size speed-up is running a problem which doubles in
size when you double the number of processors. The former is a more
stringent test of a machine's (and code's) scalability.
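As a concrete sketch of the two speed-up modes (the timings below are made up for illustration, not taken from any of the plots on this page), the corresponding parallel efficiencies can be computed as:

```python
def fixed_size_efficiency(t1, tp, p):
    """Same problem run on p processors: ideal time is t1/p."""
    return t1 / (p * tp)

def scaled_size_efficiency(t1, tp):
    """Work per processor held constant as p grows: ideal time is t1."""
    return t1 / tp

# Hypothetical timings (seconds):
# fixed-size: 100 s on 1 proc, 2.0 s on 64 procs
print(fixed_size_efficiency(100.0, 2.0, 64))   # 100/(64*2.0) = 0.78125
# scaled-size: a 64x larger problem on 64 procs takes 120 s
print(scaled_size_efficiency(100.0, 120.0))    # 100/120 ~ 0.83
```

Fixed-size efficiency divides by p because the ideal parallel time shrinks with the processor count; scaled-size efficiency does not, because the ideal time stays constant when the work per processor is constant.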
:line
Below are scalability results for several of my parallel codes, run on
both Tflops and CPlant. A paper with more details is listed below. A
whole group of people who made these comparisons for their codes is
listed in the "applications section"_apps of the "CPlant home
page"_Cplant.
:link(apps,http://www.cs.sandia.gov/cplant/apps/apps.html)
[IMPORTANT NOTE:] These timing results are mostly plotted as parallel
efficiency, which is scaled to 100% on a single processor for both
machines. A CPlant processor is typically 2x to 3x faster (in raw
speed) than the Intel Pentium on Tflops. Thus on a given number of
processors, the CPlant machine is often faster (in raw CPU time) than
Tflops even if its efficiency is lower.
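To make the note concrete, here is a hedged sketch (all timings are hypothetical, not read off the plots) of how per-machine normalization can leave the machine with the faster processors showing the lower efficiency:

```python
# Each machine's efficiency is normalized to 100% on ONE of ITS OWN
# processors, so a faster processor paired with slower communication
# can show lower efficiency yet better raw time.

def fixed_size_efficiency(t1, tp, p):
    return t1 / (p * tp)   # 1.0 (100%) means perfect speed-up

procs = 64
t1_slow, tp_slow = 300.0, 5.5   # made-up: slower CPU, scales well
t1_fast, tp_fast = 100.0, 2.5   # made-up: 3x faster CPU, scales worse

print(fixed_size_efficiency(t1_slow, tp_slow, procs))  # ~0.85
print(fixed_size_efficiency(t1_fast, tp_fast, procs))  # ~0.63
# The fast machine's efficiency is lower, but its raw 64-processor
# time (2.5 s) still beats the slow machine's (5.5 s).
```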
:line
This data is for a molecular dynamics (MD) benchmark of a
Lennard-Jones system of particles. A 3-d box of atoms is simulated at
liquid density using a standard 2.5 sigma cutoff. The simulation box
is partitioned across processors using a spatial-decomposition
algorithm -- look "here"_bench for more info on this benchmark.
:link(bench,http://lammps.sandia.gov/bench.html)
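As a side note on why a 2.5 sigma cutoff is the standard choice for this benchmark, here is a minimal sketch (in reduced LJ units, not code from the benchmark itself) showing that the 12-6 potential is already tiny at that distance:

```python
def lj_potential(r, sigma=1.0, epsilon=1.0):
    """Standard 12-6 Lennard-Jones pair potential."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

# The well depth is -epsilon at r = 2^(1/6) sigma; at the 2.5 sigma
# cutoff the potential has decayed to under 2% of that depth, so
# truncating there discards very little of the interaction.
print(lj_potential(2 ** (1 / 6)))  # -1.0 (the minimum, -epsilon)
print(lj_potential(2.5))           # ~ -0.0163
```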
The 1st plot is for an N = 32,000 atom system (fixed-size
efficiency). The 2nd plot is scaled-size efficiency with 32,000
atoms/processor -- i.e. on one processor an N = 32,000 simulation was
run, while on 256 processors, 8.2 million atoms were simulated. In both
cases, one-processor timings (per MD timestep) are shown at the lower
left of the plot.
:c,image(images/cluster_lj_fixed.gif)
:c,image(images/cluster_lj_scaled.gif)
:line
This data is for running a full-blown molecular dynamics simulation
with the "LAMMPS"_http://lammps.sandia.gov code of a lipid bilayer
solvated by water. A 12 Angstrom cutoff was used for the van der Waals
forces; long-range Coulombic forces were solved for using the
particle-mesh Ewald method.
The 1st plot of fixed-size results is for an N = 7134 atom bilayer. The
2nd plot is scaled-size results for 7134 atoms/processor -- i.e. on
one processor an N = 7134 simulation was run, while on 1024 processors,
7.3 million atoms were simulated. Again, one-processor timings for
both machines (per MD timestep) are shown at the lower left of the
plots. On the scaled-size problem, the CPlant machine does well until
256 processors, when the "parallel FFTs"_ffts take a big hit in
parallel efficiency.
:link(ffts,algorithms.html#ffts)
:c,image(images/cluster_lammps_fixed.gif)
:c,image(images/cluster_lammps_scaled.gif)
:line
This performance data is for a "radiation transport
solver"_rad.html. A 3-d unstructured (finite element) hexahedral mesh
is constructed, then partitioned across processors. A direct solution
to the Boltzmann equation for radiation flux is formulated by sweeping
across the grid in each of many ordinate directions
simultaneously. The communication required during the solve involves
many tiny messages being sent asynchronously to neighboring
processors. Thus it is a good test of CPlant's communication
capabilities.
Solutions were computed for three different grid sizes on varying
numbers of processors. For each simulation 80 ordinate directions (Sn
= 8) were computed using 2 energy groups. CPU times for 3 different
grid sizes running on each machine are shown in the figure. The CPlant
processors are about 2x faster, but the CPlant timings are not as
efficient on large numbers of processors (dotted lines are perfect
efficiency for CPlant).
:c,image(images/cluster_rad_all.gif)
:line
These plots are timing data for the "QuickSilver electromagnetics
code"_qs.html solving Maxwell's equations on a large 3-d structured
grid that is partitioned across processors. The first plot is for a
fields-only finite-difference solution to a pulsed wave form traveling
across the grid. The efficiencies are for a scaled-size problem with
27,000 grid cells/processor. Thus on 256 processors, 6.9 million grid
cells are used. The timings in the lower-left corner of the plot are
one-processor CPU times for computing the field equations in a single
grid cell for a single timestep. Single-processor CPlant performance
for the DEC Alpha is about 5x faster than the Tflops Intel
Pentium. The 2nd plot is for a QuickSilver problem with fields and
particles. Again, scaled-size efficiencies are shown for a problem
with 27,000 grid cells and 324,000 particles per processor. Thus on
256 processors, 83 million particles are being pushed across the grid
at every timestep. The one-processor timings are now CPU time per
particle per timestep.
:c,image(images/cluster_qs_field.gif)
:c,image(images/cluster_qs_part.gif)
:line
This paper has been accepted for a special issue of JPDC on cluster
computing. It contains some of the application results shown
above. It also has a concise description of CPlant, some low-level
communication benchmarks, and results for several of the NAS
benchmarks running on CPlant.
[Scalability and Performance of Two Large Linux Clusters],
R. Brightwell and S. J. Plimpton, J of Parallel and Distributed
Computing, 61, 1546-1569 (2001). ("abstract"_abs) ("postscript"_PS)
("ps.gz"_ps)
:link(abs,abstracts/jpdc01.html)
:link(PS,papers/jpdc01.ps)
:link(ps,papers/jpdc01.ps.gz)