This page lists LAMMPS performance on several benchmark problems, run on various machines, both in serial and parallel.
These are the parallel machines for which benchmark data is given. The "Processors" column is the largest number of processors on that machine on which LAMMPS was run. Message-passing bandwidth and latency are in units of MB/sec and microseconds at the MPI level, i.e. what a program like LAMMPS sees. More information on machine characteristics is given here.
Nickname | Vendor/Machine | Processors | Site | CPU | Interconnect | Bandwidth | Latency |
Spirit | HP Linux cluster | 512 | SNL | 3.4 GHz dual Xeons (64-bit) | Myrinet | 230 | 9 |
HPCx | IBM p690+ | 512 | Daresbury | 1.7 GHz Power4+ | custom | 1450 | 6 |
Blue Gene Light | IBM | 65536 | LLNL | 700 MHz PowerPC 440 | custom | 150 | 3 |
Red Storm | Cray XT3 | 10000 | SNL | 2.0 GHz Opteron | custom | 1100 | 7 |
Intel Xeon Dual Quad Core | Dell Precision 690 | 8 | SNL | 2.66 GHz Xeon | on-chip | ?? | ?? |
One-processor timings are also listed for some older machines; their characteristics are given here as well.
Nickname | Vendor/Machine | Processors | Site | CPU | Interconnect | Bandwidth | Latency |
ASCI Red | Intel | 1500 | SNL | 333 MHz Pentium II | custom | 310 | 18 |
Ross | custom Linux cluster | 64 | SNL | 500 MHz DEC Alpha | Myrinet | 100 | 65 |
Liberty | HP Linux cluster | 64 | SNL | 3.0 GHz dual Xeons (32-bit) | Myrinet | 230 | 9 |
Cheetah | IBM p690 | 64 | ORNL | 1.3 GHz Power4 | custom | 1490 | 7 |
For each of the 5 benchmarks, fixed- and scaled-size timings are shown in tables and in comparative plots. Fixed-size means that the same 32,000-atom problem was run on varying numbers of processors. Scaled-size means that when run on P processors, the simulation contained P times as many atoms as the one-processor run. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for about 1 billion atoms.
All listed CPU times are in seconds for 100 timesteps. Parallel efficiency is the ratio of ideal to actual run time. For example, if perfect speed-up would have given a run time of 10 seconds and the actual run time was 12 seconds, the efficiency is 10/12, or 83.3%. In most cases parallel runs were made on production machines while other jobs were running, which can sometimes degrade performance.
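For readers who want the arithmetic spelled out, here is a minimal C++ sketch (our own illustration, not part of the benchmark suite) of how the fixed- and scaled-size efficiencies on this page are computed; the 10-second and 12-second values are the hypothetical ones from the example above:

```cpp
#include <cstdio>

// Parallel efficiency as used on this page: ideal run time divided by
// actual run time, expressed as a percentage.

// Fixed-size: the ideal time on P processors is the 1-processor time
// divided by P, since the problem size stays constant.
double fixed_size_efficiency(double t1, double tp, int nprocs) {
  return 100.0 * (t1 / nprocs) / tp;
}

// Scaled-size: the ideal time on P processors equals the 1-processor
// time, since the problem grows in proportion to P.
double scaled_size_efficiency(double t1, double tp) {
  return 100.0 * t1 / tp;
}

int main() {
  // Hypothetical numbers from the text: ideal 10 s, actual 12 s.
  printf("%.1f%%\n", scaled_size_efficiency(10.0, 12.0));  // prints 83.3%
  return 0;
}
```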
The files needed to run these benchmarks are part of the LAMMPS distribution. If your platform is sufficiently different from the machines listed, you can send your timing results and machine info and we'll add them to this page. Note that the CPU time (in seconds) for a run is what appears in the "Loop time" line of the output log file, e.g.
Loop time of 3.89418 on 8 procs for 100 steps with 32000 atoms
These benchmarks are meant to span a range of simulation styles and computational expense for interaction forces. Since LAMMPS run time scales roughly linearly with the number of atoms simulated, you can use the timing and parallel efficiency data to estimate the CPU cost of problems you want to run on a given number of processors; a worked example follows the table below. As the data below illustrate, fixed-size problems generally achieve parallel efficiencies of 50% or better so long as there are at least a few hundred atoms per processor. Scaled-size problems generally achieve parallel efficiencies of 80% or more across a wide range of processor counts.
Thanks to the following individuals for running the various benchmarks:
This is a summary of single-processor LAMMPS performance, in CPU secs per atom per timestep, for the 5 benchmark problems. This is on a Dell 690 desktop Red Hat Linux box with dual quad-core 2.66 GHz Intel Xeon processors, using the Intel icc compiler. The ratios indicate that if the atomic LJ system has a normalized cost of 1.0, the bead-spring chain and granular systems run about 2x faster, while the EAM metal and solvated protein models run 2.7x and 18x slower, respectively. These differences are primarily due to the expense of computing a particular pairwise force field for a given number of neighbors per atom.
Problem: | LJ | Chain | EAM | Chute | Rhodopsin |
CPU/atom/step: | 1.35E-6 | 6.25E-7 | 3.62E-6 | 5.91E-7 | 2.47E-5 |
Ratio to LJ: | 1.0 | 0.46 | 2.69 | 0.44 | 18.4 |
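As a concrete instance of the cost estimate described earlier, here is a short C++ sketch (ours, for illustration only) that combines a per-atom cost from the table above with an assumed parallel efficiency to predict the wall-clock time of a run:

```cpp
#include <cstdio>

// Estimate wall-clock seconds for a LAMMPS run from the single-processor
// cost in CPU secs/atom/timestep (table above), assuming roughly linear
// scaling in atom count and a user-supplied parallel efficiency (a value
// you would read off the plots below).
double estimate_seconds(double cpu_per_atom_step, double natoms,
                        double nsteps, int nprocs, double efficiency) {
  return cpu_per_atom_step * natoms * nsteps / (nprocs * efficiency);
}

int main() {
  // Example: 1 million LJ atoms, 1000 steps, 64 procs, assumed 90%
  // parallel efficiency.
  double t = estimate_seconds(1.35e-6, 1.0e6, 1000.0, 64, 0.90);
  printf("estimated run time: %.0f seconds\n", t);  // ~23 seconds
  return 0;
}
```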
Atomic fluid:
Input script for this problem.
Performance data:
These plots show fixed-size parallel efficiency for the same 32K-atom problem run on different numbers of processors, and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for about 1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses. Click on the plot for a larger version.
Bead-spring polymer melt with 100-mer chains and FENE bonds:
Input script for this problem.
Performance data:
These plots show fixed-size parallel efficiency for the same 32K-atom problem run on different numbers of processors, and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for about 1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses. Click on the plot for a larger version.
Cu metallic solid with embedded atom method (EAM) potential:
Input script for this problem.
Performance data:
These plots show fixed-size parallel efficiency for the same 32K-atom problem run on different numbers of processors, and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for about 1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses. Click on the plot for a larger version.
Chute flow of packed granular particles with frictional history potential:
Input script for this problem.
Performance data:
These plots show fixed-size parallel efficiency for the same 32K-atom problem run on different numbers of processors, and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for about 1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses. Click on the plot for a larger version.
All-atom rhodopsin protein in solvated lipid bilayer with CHARMM force field, long-range Coulombics via PPPM (particle-particle particle-mesh), and SHAKE constraints. The model contains counter-ions and a reduced amount of water to make a 32K-atom system:
Input script for this problem.
Performance data:
These plots show fixed-size parallel efficiency for the same 32K-atom problem run on different numbers of processors, and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for about 1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses. Click on the plot for a larger version.
The Lennard-Jones benchmark problem described above (100 timesteps, reduced density of 0.8442, 2.5 sigma cutoff, etc.) has also been run on different machines as billion-atom and larger tests. For the LJ benchmark, LAMMPS requires a little less than half a terabyte of memory per billion atoms, most of which is used for neighbor lists.
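That memory figure can be sanity-checked with a back-of-the-envelope estimate. The per-atom byte counts in this sketch are our assumptions for illustration, not numbers taken from the LAMMPS source:

```cpp
#include <cstdio>

// Back-of-the-envelope memory estimate for the LJ benchmark. All byte
// counts are assumptions: double-precision position, velocity, and
// force arrays (3 x 8 bytes each), a neighbor list of ~55 4-byte
// entries per atom (the neighbor count implied by a 2.5 sigma cutoff
// at reduced density 0.8442), and assumed ghost-atom and bookkeeping
// overhead.
int main() {
  double bytes_per_atom = 3 * 8.0 + 3 * 8.0 + 3 * 8.0  // x, v, f arrays
                        + 55 * 4.0                     // neighbor list
                        + 150.0;                       // assumed overhead
  double natoms = 1.0e9;
  printf("~%.2f TB per billion atoms\n",
         bytes_per_atom * natoms / 1.0e12);  // prints ~0.44 TB
  return 0;
}
```

The estimate lands near the half-terabyte figure quoted above; the true footprint depends on the neighbor-list skin, ghost atoms, and build strategy.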
Machine | # of Atoms | Processors | CPU Time (secs) | Parallel Efficiency | Flop Rate |
Red Storm | 1 million | 1 | 235.3 | 100% | 270 Mflops |
Red Storm | 1 billion | 10000 | 25.1 | 93.6% | 2.53 Tflops |
Red Storm | 10 billion | 10000 | 246.8 | 95.2% | 2.57 Tflops |
Red Storm | 40 billion | 10000 | 979.0 | 96.0% | 2.59 Tflops |
Blue Gene Light | 1 million | 1 | 898.3 | 100% | 70.7 Mflops |
Blue Gene Light | 1 billion | 4096 | 227.6 | 96.3% | 279 Gflops |
Blue Gene Light | 1 billion | 32768 | 30.2 | 90.7% | 2.10 Tflops |
Blue Gene Light | 1 billion | 65536 | 16.0 | 85.6% | 3.97 Tflops |
Blue Gene Light | 10 billion | 65536 | 148.9 | 92.0% | 4.26 Tflops |
Blue Gene Light | 40 billion | 65536 | 585.4 | 93.6% | 4.34 Tflops |
ASCI Red | 750 million | 1500 | 1156 | 85.2% | 41.2 Gflops |
The parallel efficiencies are estimated from the per-atom CPU time for a large single-processor run on each machine. The aggregate flop rate is estimated from flop counts for the pairwise interactions, which dominate the run time. This is a conservative estimate in the sense that flops spent on atom pairs outside the force cutoff, on building neighbor lists, and on time integration are not counted.
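Here is a minimal C++ sketch of both estimates. The original per-pair flop values are not reproduced on this page, so the pair count and flops-per-pair numbers below are our guesses, chosen only because they land near the Red Storm table entries:

```cpp
#include <cstdio>

// Scaled-size parallel efficiency, estimated from the per-atom CPU time
// (t1 / n1) of a large single-processor run over the same number of
// timesteps as the parallel run.
double efficiency(double t1, double n1, double tp, double np, int nprocs) {
  double ideal = (t1 / n1) * np / nprocs;  // perfect-speedup run time
  return 100.0 * ideal / tp;
}

// Aggregate flop rate counting only pairwise interactions inside the
// cutoff. pairs_per_atom and flops_per_pair are ASSUMED values for
// illustration, not the figures used on this page.
double flop_rate(double natoms, double nsteps, double seconds,
                 double pairs_per_atom, double flops_per_pair) {
  return natoms * pairs_per_atom * flops_per_pair * nsteps / seconds;
}

int main() {
  // Red Storm: 1 million atoms on 1 proc took 235.3 s for 100 steps;
  // 1 billion atoms on 10000 procs took 25.1 s (table above).
  printf("efficiency: %.1f%%\n",
         efficiency(235.3, 1.0e6, 25.1, 1.0e9, 10000));  // ~93.7%

  // Guessed values: ~27.5 stored pairs/atom (a half neighbor list at
  // ~55 neighbors within the cutoff) and ~23 flops per LJ pair.
  printf("flop rate: %.2f Tflops\n",
         flop_rate(1.0e9, 100, 25.1, 27.5, 23) / 1.0e12);  // ~2.52
  return 0;
}
```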
This section lists characteristics of machines used in the benchmarking along with options used in compiling LAMMPS. The communication parameters are for bandwidth and latency at the MPI level, i.e. what a program like LAMMPS sees.
Linux desktop = desktop workstation running Red Hat Linux
Mac laptop = PowerBook G4 running OS X 10.3
Intel Xeon Dual Quad Core = Dell 690 desktop workstation running Red Hat Linux
ASCI Red = ASCI Intel Tflops MPP
Ross = CPlant DEC Alpha/Myrinet cluster
Liberty = Intel/Myrinet cluster packaged by HP
Spirit = Intel/Myrinet cluster packaged by HP
Cheetah = IBM p690 cluster
HPCx = IBM p690+ cluster
Blue Gene Light = IBM MPP
Red Storm = Cray MPP