This page lists LAMMPS performance on several benchmark problems, run on various machines, both in serial and parallel.
These are the parallel machines for which benchmark data is given. The "Processors" column is the largest number of processors on that machine on which LAMMPS was run. Message-passing bandwidth and latency are in units of MB/sec and microseconds at the MPI level, i.e. what a program like LAMMPS sees. More information on machine characteristics is given here.
Nickname | Vendor/Machine | Processors | Site | CPU | Interconnect | Bandwidth | Latency |
Spirit | HP Linux cluster | 512 | SNL | 3.4 GHz dual Xeons (64-bit) | Myrinet | 230 | 9 |
HPCx | IBM p690+ | 512 | Daresbury | 1.7 GHz Power4+ | custom | 1450 | 6 |
Blue Gene Light | IBM | 65536 | LLNL | 700 MHz PowerPC 440 | custom | 150 | 3 |
Red Storm | Cray XT3 | 10000 | SNL | 2.0 GHz Opteron | custom | 1100 | 7 |
Intel Xeon Dual Quad Core | Dell Precision 690 | 8 | SNL | 2.66 GHz Xeon | on-chip | ?? | ?? |
One-processor timings are also listed for some older machines; their characteristics are given here as well.
Nickname | Vendor/Machine | Processors | Site | CPU | Interconnect | Bandwidth | Latency |
ASCI Red | Intel | 1500 | SNL | 333 MHz Pentium II | custom | 310 | 18 |
Ross | custom Linux cluster | 64 | SNL | 500 MHz DEC Alpha | Myrinet | 100 | 65 |
Liberty | HP Linux cluster | 64 | SNL | 3.0 GHz dual Xeons (32-bit) | Myrinet | 230 | 9 |
Cheetah | IBM p690 | 64 | ORNL | 1.3 GHz Power4 | custom | 1490 | 7 |
For each of the 5 benchmarks, fixed- and scaled-size timings are shown in tables and in comparative plots. Fixed-size means that the same 32,000-atom problem was run on varying numbers of processors. Scaled-size means that when run on P processors, the simulation contained P times as many atoms as the one-processor run. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for about 1 billion atoms.
All listed CPU times are in seconds for 100 timesteps. Parallel efficiency is the ratio of ideal to actual run time. For example, if perfect speed-up would have given a run time of 10 seconds and the actual run time was 12 seconds, the efficiency is 10/12, or 83.3%. In most cases parallel runs were made on production machines while other jobs were running, which can sometimes degrade performance.
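For readers who want the arithmetic spelled out, here is a minimal C++ sketch (our own illustration, not part of the benchmark suite) of how the fixed- and scaled-size efficiencies on this page are computed; the 10-second and 12-second values are the hypothetical ones from the example above:

```cpp
#include <cstdio>

// Parallel efficiency as used on this page: ideal run time divided by
// actual run time, expressed as a percentage.

// Fixed-size: the ideal time on P processors is the 1-processor time
// divided by P, since the problem size stays constant.
double fixed_size_efficiency(double t1, double tp, int nprocs) {
  return 100.0 * (t1 / nprocs) / tp;
}

// Scaled-size: the ideal time on P processors equals the 1-processor
// time, since the problem grows in proportion to P.
double scaled_size_efficiency(double t1, double tp) {
  return 100.0 * t1 / tp;
}

int main() {
  // Hypothetical numbers from the text: ideal 10 s, actual 12 s.
  printf("%.1f%%\n", scaled_size_efficiency(10.0, 12.0));  // prints 83.3%
  return 0;
}
```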
The files needed to run these benchmarks are part of the LAMMPS distribution. If your platform is sufficiently different from the machines listed, you can send your timing results and machine info and we'll add them to this page. Note that the CPU time (in seconds) for a run is what appears in the "Loop time" line of the output log file, e.g.
Loop time of 3.89418 on 8 procs for 100 steps with 32000 atoms
These benchmarks are meant to span a range of simulation styles and computational expense for interaction forces. Since LAMMPS run time scales roughly linearly with the number of atoms simulated, you can use the timing and parallel efficiency data to estimate the CPU cost of problems you want to run on a given number of processors; a worked example follows the table below. As the data below illustrate, fixed-size problems generally achieve parallel efficiencies of 50% or better so long as there are at least a few hundred atoms per processor. Scaled-size problems generally achieve parallel efficiencies of 80% or more across a wide range of processor counts.
Thanks to the following individuals for running the various benchmarks:
This is a summary of single-processor LAMMPS performance, in CPU secs per atom per timestep, for the 5 benchmark problems. This is on a Dell 690 desktop Red Hat Linux box with dual quad-core 2.66 GHz Intel Xeon processors, using the Intel icc compiler. The ratios indicate that if the atomic LJ system has a normalized cost of 1.0, the bead-spring chain and granular systems run about 2x faster, while the EAM metal and solvated protein models run 2.7x and 18x slower, respectively. These differences are primarily due to the expense of computing a particular pairwise force field for a given number of neighbors per atom.
Problem: | LJ | Chain | EAM | Chute | Rhodopsin |
CPU/atom/step: | 1.35E-6 | 6.25E-7 | 3.62E-6 | 5.91E-7 | 2.47E-5 |
Ratio to LJ: | 1.0 | 0.46 | 2.69 | 0.44 | 18.4 |
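As a concrete instance of the cost estimate described earlier, here is a short C++ sketch (ours, for illustration only) that combines a per-atom cost from the table above with an assumed parallel efficiency to predict the wall-clock time of a run:

```cpp
#include <cstdio>

// Estimate wall-clock seconds for a LAMMPS run from the single-processor
// cost in CPU secs/atom/timestep (table above), assuming roughly linear
// scaling in atom count and a user-supplied parallel efficiency (a value
// you would read off the plots below).
double estimate_seconds(double cpu_per_atom_step, double natoms,
                        double nsteps, int nprocs, double efficiency) {
  return cpu_per_atom_step * natoms * nsteps / (nprocs * efficiency);
}

int main() {
  // Example: 1 million LJ atoms, 1000 steps, 64 procs, assumed 90%
  // parallel efficiency.
  double t = estimate_seconds(1.35e-6, 1.0e6, 1000.0, 64, 0.90);
  printf("estimated run time: %.0f seconds\n", t);  // ~23 seconds
  return 0;
}
```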
Atomic fluid:
Input script for this problem.
Performance data:
These plots show fixed-size parallel efficiency for the same 32K-atom problem run on different numbers of processors, and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for about 1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses. Click on the plot for a larger version.
Bead-spring polymer melt with 100-mer chains and FENE bonds:
Input script for this problem.
Performance data:
These plots show fixed-size parallel efficiency for the same 32K-atom problem run on different numbers of processors, and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for about 1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses. Click on the plot for a larger version.
Cu metallic solid with embedded atom method (EAM) potential:
Input script for this problem.
Performance data:
These plots show fixed-size parallel efficiency for the same 32K-atom problem run on different numbers of processors, and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for about 1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses. Click on the plot for a larger version.
Chute flow of packed granular particles with frictional history potential:
Input script for this problem.
Performance data:
These plots show fixed-size parallel efficiency for the same 32K-atom problem run on different numbers of processors, and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for about 1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses. Click on the plot for a larger version.
All-atom rhodopsin protein in solvated lipid bilayer with CHARMM force field, long-range Coulombics via PPPM (particle-particle particle-mesh), and SHAKE constraints. The model contains counter-ions and a reduced amount of water to make a 32K-atom system:
Input script for this problem.
Performance data:
These plots show fixed-size parallel efficiency for the same 32K-atom problem run on different numbers of processors, and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms, and a 32K-processor run is for about 1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses. Click on the plot for a larger version.
The Lennard-Jones benchmark problem described above (100 timesteps, reduced density of 0.8442, 2.5 sigma cutoff, etc.) has also been run on different machines as billion-atom and larger tests. For the LJ benchmark, LAMMPS requires a little less than half a terabyte of memory per billion atoms, most of which is used for neighbor lists.
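That memory figure can be sanity-checked with a back-of-the-envelope estimate. The per-atom byte counts in this sketch are our assumptions for illustration, not numbers taken from the LAMMPS source:

```cpp
#include <cstdio>

// Back-of-the-envelope memory estimate for the LJ benchmark. All byte
// counts are assumptions: double-precision position, velocity, and
// force arrays (3 x 8 bytes each), a neighbor list of ~55 4-byte
// entries per atom (the neighbor count implied by a 2.5 sigma cutoff
// at reduced density 0.8442), and assumed ghost-atom and bookkeeping
// overhead.
int main() {
  double bytes_per_atom = 3 * 8.0 + 3 * 8.0 + 3 * 8.0  // x, v, f arrays
                        + 55 * 4.0                     // neighbor list
                        + 150.0;                       // assumed overhead
  double natoms = 1.0e9;
  printf("~%.2f TB per billion atoms\n",
         bytes_per_atom * natoms / 1.0e12);  // prints ~0.44 TB
  return 0;
}
```

The estimate lands near the half-terabyte figure quoted above; the true footprint depends on the neighbor-list skin, ghost atoms, and build strategy.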
Machine | # of Atoms | Processors | CPU Time (secs) | Parallel Efficiency | Flop Rate |
Red Storm | 1 million | 1 | 235.3 | 100% | 270 Mflops |
Red Storm | 1 billion | 10000 | 25.1 | 93.6% | 2.53 Tflops |
Red Storm | 10 billion | 10000 | 246.8 | 95.2% | 2.57 Tflops |
Red Storm | 40 billion | 10000 | 979.0 | 96.0% | 2.59 Tflops |
Blue Gene Light | 1 million | 1 | 898.3 | 100% | 70.7 Mflops |
Blue Gene Light | 1 billion | 4096 | 227.6 | 96.3% | 279 Gflops |
Blue Gene Light | 1 billion | 32768 | 30.2 | 90.7% | 2.10 Tflops |
Blue Gene Light | 1 billion | 65536 | 16.0 | 85.6% | 3.97 Tflops |
Blue Gene Light | 10 billion | 65536 | 148.9 | 92.0% | 4.26 Tflops |
Blue Gene Light | 40 billion | 65536 | 585.4 | 93.6% | 4.34 Tflops |
ASCI Red | 750 million | 1500 | 1156 | 85.2% | 41.2 Gflops |
The parallel efficiencies are estimated from the per-atom CPU time for a large single-processor run on each machine. The aggregate flop rate is estimated from flop counts for the pairwise interactions, which dominate the run time. This is a conservative estimate in the sense that flops spent on atom pairs outside the force cutoff, on building neighbor lists, and on time integration are not counted.
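Here is a minimal C++ sketch of both estimates. The original per-pair flop values are not reproduced on this page, so the pair count and flops-per-pair numbers below are our guesses, chosen only because they land near the Red Storm table entries:

```cpp
#include <cstdio>

// Scaled-size parallel efficiency, estimated from the per-atom CPU time
// (t1 / n1) of a large single-processor run over the same number of
// timesteps as the parallel run.
double efficiency(double t1, double n1, double tp, double np, int nprocs) {
  double ideal = (t1 / n1) * np / nprocs;  // perfect-speedup run time
  return 100.0 * ideal / tp;
}

// Aggregate flop rate counting only pairwise interactions inside the
// cutoff. pairs_per_atom and flops_per_pair are ASSUMED values for
// illustration, not the figures used on this page.
double flop_rate(double natoms, double nsteps, double seconds,
                 double pairs_per_atom, double flops_per_pair) {
  return natoms * pairs_per_atom * flops_per_pair * nsteps / seconds;
}

int main() {
  // Red Storm: 1 million atoms on 1 proc took 235.3 s for 100 steps;
  // 1 billion atoms on 10000 procs took 25.1 s (table above).
  printf("efficiency: %.1f%%\n",
         efficiency(235.3, 1.0e6, 25.1, 1.0e9, 10000));  // ~93.7%

  // Guessed values: ~27.5 stored pairs/atom (a half neighbor list at
  // ~55 neighbors within the cutoff) and ~23 flops per LJ pair.
  printf("flop rate: %.2f Tflops\n",
         flop_rate(1.0e9, 100, 25.1, 27.5, 23) / 1.0e12);  // ~2.52
  return 0;
}
```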
This section lists characteristics of machines used in the benchmarking along with options used in compiling LAMMPS. The communication parameters are for bandwidth and latency at the MPI level, i.e. what a program like LAMMPS sees.
Linux desktop = desktop workstation running Red Hat Linux
Mac laptop = PowerBook G4 running OS X 10.3
Intel Xeon Dual Quad Core = Dell 690 desktop workstation running Red Hat Linux
ASCI Red = ASCI Intel Tflops MPP
Ross = CPlant DEC Alpha/Myrinet cluster
Liberty = Intel/Myrinet cluster packaged by HP
Spirit = Intel/Myrinet cluster packaged by HP
Cheetah = IBM p690 cluster
HPCx = IBM p690+ cluster
Blue Gene Light = IBM MPP
Red Storm = Cray MPP