
LAMMPS Benchmarks

This page lists LAMMPS performance on several benchmark problems, run on various machines, both in serial and parallel.

These are the parallel machines for which benchmark data is given. The "Processors" column is the largest number of processors on that machine that LAMMPS was run on. Message-passing bandwidth and latency are in units of MB/sec and microseconds at the MPI level, i.e. what a program like LAMMPS sees. More information on machine characteristics is given in the Machines section below.

Nickname | Vendor/Machine | Processors | Site | CPU | Interconnect | Bandwidth | Latency
Spirit | HP Linux cluster | 512 | SNL | 3.4 GHz dual Xeons (64-bit) | Myrinet | 230 | 9
HPCx | IBM p690+ | 512 | Daresbury | 1.7 GHz Power4+ | custom | 1450 | 6
Blue Gene Light | IBM | 65536 | LLNL | 700 MHz PowerPC 440 | custom | 150 | 3
Red Storm | Cray XT3 | 10000 | SNL | 2.0 GHz Opteron | custom | 1100 | 7
Intel Xeon Dual Quad Core | Dell Precision 690 | 8 | SNL | 2.66 GHz Xeon | on-chip | ?? | ??

One-processor timings are also listed for some older machines, whose characteristics are given below.

Nickname | Vendor/Machine | Processors | Site | CPU | Interconnect | Bandwidth | Latency
ASCI Red | Intel | 1500 | SNL | 333 MHz Pentium III | custom | 310 | 18
Ross | custom Linux cluster | 64 | SNL | 500 MHz DEC Alpha | Myrinet | 100 | 65
Liberty | HP Linux cluster | 64 | SNL | 3.0 GHz dual Xeons (32-bit) | Myrinet | 230 | 9
Cheetah | IBM p690 | 64 | ORNL | 1.3 GHz Power4 | custom | 1490 | 7

For each of the 5 benchmarks, fixed- and scaled-size timings are shown in tables and in comparative plots. Fixed-size means that the same 32,000-atom problem was run on varying numbers of processors. Scaled-size means that when run on P processors, the number of atoms in the simulation was P times larger than in the one-processor run. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for roughly 1 billion atoms.

All listed CPU times are in seconds for 100 timesteps. Parallel efficiencies refer to the ratio of ideal to actual run time. For example, if perfect speed-up would have given a run-time of 10 seconds, and the actual run time was 12 seconds, then the efficiency is 10/12 or 83.3%. In most cases parallel runs were made on production machines while other jobs were running, which can sometimes degrade performance.
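
For readers who want to reproduce these numbers, the efficiency definitions above reduce to a couple of one-liners. The following is a minimal sketch of ours (the helper names are not part of LAMMPS):

  # Parallel efficiency = ideal run time / actual run time, as defined above.
  def parallel_efficiency(ideal_time, actual_time):
      return ideal_time / actual_time

  # The example in the text: perfect speed-up would have given 10 s, actual was 12 s.
  print(f"{parallel_efficiency(10.0, 12.0):.1%}")   # 83.3%

  # For a fixed-size run on P processors the ideal time is t1/P, so the
  # efficiency is t1 / (P * tP).  For a scaled-size run the per-processor
  # workload is constant, so the ideal time equals t1 and efficiency is t1/tP.
  def fixed_size_efficiency(t1, tP, P):
      return parallel_efficiency(t1 / P, tP)

  def scaled_size_efficiency(t1, tP):
      return parallel_efficiency(t1, tP)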

The files needed to run these benchmarks are part of the LAMMPS distribution. If your platform is sufficiently different from the machines listed, you can send your timing results and machine info and we'll add them to this page. Note that the CPU time (in seconds) for a run is what appears in the "Loop time" line of the output log file, e.g.

Loop time of 3.89418 on 8 procs for 100 steps with 32000 atoms 
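
A log file can be scanned for these lines and converted to CPU seconds per atom per timestep, the metric used in the one-processor comparison below. This is a small sketch of ours, assuming the "Loop time" line has exactly the form shown above; it is not a script shipped with LAMMPS:

  import re, sys

  pattern = re.compile(
      r"Loop time of (\S+) on (\d+) procs for (\d+) steps with (\d+) atoms")

  with open(sys.argv[1]) as log:
      for line in log:
          m = pattern.search(line)
          if m:
              secs, procs, steps, atoms = m.groups()
              per_atom_step = float(secs) / (int(steps) * int(atoms))
              print(f"{secs} s on {procs} procs -> {per_atom_step:.3e} CPU secs/atom/step")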

These benchmarks are meant to span a range of simulation styles and computational expense for the interaction forces. Since LAMMPS run time scales roughly linearly with the number of atoms simulated, you can use the timing and parallel efficiency data to estimate the CPU cost of problems you want to run on a given number of processors. As the data below illustrate, fixed-size problems generally have parallel efficiencies of 50% or better so long as there are at least a few hundred atoms per processor. Scaled-size problems generally have parallel efficiencies of 80% or more across a wide range of processor counts.

Thanks to the following individuals for running the various benchmarks:


One processor comparisons

This is a summary of single-processor LAMMPS performance, in CPU seconds per atom per timestep, for the 5 benchmark problems, run on a Dell 690 desktop (Red Hat Linux) with dual quad-core 2.66 GHz Intel Xeon processors and compiled with the Intel icc compiler. The ratios indicate that if the atomic LJ system has a normalized cost of 1.0, the bead-spring chain and granular systems run about 2x faster, while the EAM metal and solvated protein models run 2.7x and 18x slower, respectively. These differences are primarily due to the expense of computing the particular pairwise force field for a given number of neighbors per atom.

Problem | LJ | Chain | EAM | Chute | Rhodopsin
CPU secs/atom/step | 1.35E-6 | 6.25E-7 | 3.62E-6 | 5.91E-7 | 2.47E-5
Ratio to LJ | 1.0 | 0.46 | 2.69 | 0.44 | 18.4
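
As a rough illustration of the cost estimate described above, the per-atom costs in this table can be scaled by atom count, timestep count, processor count, and an assumed parallel efficiency. This is a back-of-the-envelope sketch of ours, not a tool in the LAMMPS distribution; the example numbers in the last lines are arbitrary:

  # Single-processor costs from the table above, in CPU secs/atom/step.
  cpu_per_atom_step = {
      "lj": 1.35e-6, "chain": 6.25e-7, "eam": 3.62e-6,
      "chute": 5.91e-7, "rhodo": 2.47e-5,
  }

  # Rough model: time ~ cost/atom/step * atoms * steps / (procs * efficiency).
  def estimate_seconds(problem, natoms, nsteps, nprocs=1, efficiency=1.0):
      return cpu_per_atom_step[problem] * natoms * nsteps / (nprocs * efficiency)

  # e.g. a 256,000-atom rhodopsin-like run for 10,000 steps on 64 processors,
  # assuming 80% parallel efficiency -> roughly 1200 seconds.
  print(estimate_seconds("rhodo", 256_000, 10_000, nprocs=64, efficiency=0.8))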

Lennard-Jones liquid benchmark

Input script for this problem.

Atomic fluid with Lennard-Jones potential, 2.5 sigma cutoff, at a reduced density of 0.8442:

Performance data:

These plots show fixed-size parallel efficiency for the same 32K-atom problem run on different numbers of processors, and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for roughly 1 billion atoms. Each curve is normalized to 100% on one processor of the respective machine; one-processor timings are shown in parentheses.


Polymer chain melt benchmark

Input script for this problem.

Bead-spring polymer melt with 100-mer chains and FENE bonds:

Performance data:

These plots show fixed-size parallel efficiency for the same 32K-atom problem run on different numbers of processors, and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for roughly 1 billion atoms. Each curve is normalized to 100% on one processor of the respective machine; one-processor timings are shown in parentheses.


EAM metallic solid benchmark

Input script for this problem.

Cu metallic solid with embedded atom method (EAM) potential:

Performance data:

These plots show fixed-size parallel efficiency for the same 32K-atom problem run on different numbers of processors, and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for roughly 1 billion atoms. Each curve is normalized to 100% on one processor of the respective machine; one-processor timings are shown in parentheses.


Granular chute flow benchmark

Input script for this problem.

Chute flow of packed granular particles with frictional history potential:

Performance data:

These plots show fixed-size parallel efficiency for the same 32K-atom problem run on different numbers of processors, and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for roughly 1 billion atoms. Each curve is normalized to 100% on one processor of the respective machine; one-processor timings are shown in parentheses.


Rhodopsin protein benchmark

Input script for this problem.

All-atom rhodopsin protein in solvated lipid bilayer with CHARMM force field, long-range Coulombics via PPPM (particle-particle particle mesh), SHAKE constraints. This model contains counter-ions and a reduced amount of water to make a 32K atom system:

Performance data:

These plots show fixed-size parallel efficiency for the same 32K-atom problem run on different numbers of processors, and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for roughly 1 billion atoms. Each curve is normalized to 100% on one processor of the respective machine; one-processor timings are shown in parentheses.


Billion-atom LJ benchmarks

The Lennard-Jones benchmark problem described above (100 timesteps, reduced density of 0.8442, 2.5 sigma cutoff, etc) has been run on different machines for billion-atom tests. For the LJ benchmark LAMMPS requires a little less than 1/2 Terabyte of memory per billion atoms, which is used mostly for neighbor lists.
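
Reading the quoted figure of a little less than 1/2 terabyte per billion atoms as roughly 500 bytes per atom, the aggregate memory for other problem sizes can be estimated as in this sketch (our assumption and helper, not an exact LAMMPS memory accounting):

  BYTES_PER_ATOM = 500   # assumed from "a little less than 1/2 Terabyte per billion atoms"

  def lj_memory_terabytes(natoms):
      # Aggregate memory across all processors, dominated by neighbor lists.
      return natoms * BYTES_PER_ATOM / 1.0e12

  print(lj_memory_terabytes(40e9))   # ~20 TB for the 40-billion-atom runs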

Machine | # of Atoms | Processors | CPU Time (secs) | Parallel Efficiency | Flop Rate
Red Storm | 1 million | 1 | 235.3 | 100% | 270 Mflop
Red Storm | 1 billion | 10000 | 25.1 | 93.6% | 2.53 Tflop
Red Storm | 10 billion | 10000 | 246.8 | 95.2% | 2.57 Tflop
Red Storm | 40 billion | 10000 | 979.0 | 96.0% | 2.59 Tflop
Blue Gene Light | 1 million | 1 | 898.3 | 100% | 70.7 Mflop
Blue Gene Light | 1 billion | 4096 | 227.6 | 96.3% | 279 Gflop
Blue Gene Light | 1 billion | 32K | 30.2 | 90.7% | 2.10 Tflop
Blue Gene Light | 1 billion | 64K | 16.0 | 85.6% | 3.97 Tflop
Blue Gene Light | 10 billion | 64K | 148.9 | 92.0% | 4.26 Tflop
Blue Gene Light | 40 billion | 64K | 585.4 | 93.6% | 4.34 Tflop
ASCI Red | 750 million | 1500 | 1156 | 85.2% | 41.2 Gflop

The parallel efficiencies are estimated from the per-atom CPU time of a large single-processor run on each machine:
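
The per-atom reference times themselves are not reproduced on this page, but the estimate can be sketched using the single-processor Red Storm row in the table above (1 million atoms, 235.3 s for the 100-step run) as the reference; the helper below is ours:

  # Per-atom CPU time for the 100-step run, from the Red Storm one-processor row.
  t1_per_atom = 235.3 / 1.0e6

  def estimated_efficiency(natoms, nprocs, actual_seconds):
      # Ideal time assumes the one-processor per-atom cost scales perfectly.
      ideal = t1_per_atom * natoms / nprocs
      return ideal / actual_seconds

  # Red Storm, 1 billion atoms on 10000 procs, 25.1 s actual:
  print(f"{estimated_efficiency(1.0e9, 10000, 25.1):.1%}")   # ~93.7%, vs 93.6% in the table

The small difference from the tabulated 93.6% presumably reflects a more precise per-atom reference than the rounded numbers used here.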

The aggregate flop rate is estimated using the following values for the pairwise interactions, which dominate the run time. This is a conservative estimate in the sense that flops computed for atom pairs outside the force cutoff, for building neighbor lists, and for time integration are not counted.
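
The per-pair flop count and pairs-per-atom count that were originally listed here are not reproduced on this page, so the numbers in the sketch below are placeholders of ours, not the values actually used; only the form of the estimate is intended to be informative:

  # Placeholder inputs -- the original page listed specific values here.
  flops_per_pair = 23      # assumed flops per pairwise interaction inside the cutoff
  pairs_per_atom = 27.5    # assumed average in-cutoff pairs per atom

  def aggregate_flop_rate(natoms, nsteps, loop_seconds):
      # Pairwise-only estimate: flops/pair * pairs/atom * atoms * steps / run time.
      return flops_per_pair * pairs_per_atom * natoms * nsteps / loop_seconds

  # Red Storm, 1 billion atoms, 100 steps, 25.1 s -> ~2.5e12 flops/sec
  print(aggregate_flop_rate(1.0e9, 100, 25.1))

With these placeholder values the Red Storm 1-billion-atom row works out to about 2.5 Tflops, in line with the table, but they should not be read as the figures the original page gave.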


Machines

This section lists characteristics of machines used in the benchmarking along with options used in compiling LAMMPS. The communication parameters are for bandwidth and latency at the MPI level, i.e. what a program like LAMMPS sees.

Linux desktop = desktop workstation running Red Hat linux

Mac laptop = PowerBook G4 running OS X 10.3

Intel Xeon Dual Quad Core = Dell 690 desktop workstation running Red Hat linux

ASCI Red = ASCI Intel Tflops MPP

Ross = CPlant DEC Alpha/Myrinet cluster

Liberty = Intel/Myrinet cluster packaged by HP

Spirit = Intel/Myrinet cluster packaged by HP

Cheetah = IBM p690 cluster

HPCx = IBM p690+ cluster

Blue Gene Light = IBM MPP

Red Storm = Cray MPP