LAMMPS (“Large-scale Atomic/Molecular Massively Parallel Simulator”) is a molecular dynamics program from Sandia National Laboratories. LAMMPS makes use of MPI for parallel communication and is free, open-source software, distributed under the terms of the GNU General Public License.
Developer's Benchmarks http://lammps.sandia.gov/bench.html
HPCAC Best Practices http://www.hpcadvisorycouncil.com/pdf/LAMMPS_Best_Practice.pdf
mkdir LAMMPS
cd LAMMPS
mkdir tars
source /opt/intel/bin/compilervars.sh intel64
source /opt/intel/mkl/bin/mklvars.sh intel64
source /opt/intel/impi/4.1.0/bin64/mpivars.sh
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/lib
export C_INCLUDE_PATH=/usr/local/cuda/include
source bashrc
cd tars
Download the LAMMPS source (15 May 2015) here: lammps-stable.tar.gz
tar -xf lammps-stable.tar.gz
mv lammps-15May15 ..
cd ../lammps-15May15/
=Building x86 CPU Benchmark=
To build the Intel compiled LAMMPS binary, first edit the Makefile:
cd src
cp MAKE/OPTIONS/Makefile.intel_cpu MAKE/Makefile.intel_cpu
vim MAKE/Makefile.intel_cpu
FFT_INC = -DFFT_MKL
make yes-user-intel
make yes-user-omp
make intel_cpu
The lmp_intel_cpu binary should be produced.
cp lmp_intel_cpu ../bench
=Building GPU Benchmark=
Build the GPU binary using the python build script.
Edit the build.py file:
Edit the lmp_dir variable in line 21:
Correct the whitespace error in line 79:
cpu = opt = omp = 0
Next, edit the CUDA Makefile:
CC = mpiicpc
CCFLAGS = -O3 -xHost
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpiicpc
LINKFLAGS = -O3 -xHost
LIB = -lstdc++
SIZE = size
LMP_INC = -DLAMMPS_GZIP -DLAMMPS_JPEG
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1
MPI_PATH =
MPI_LIB =
FFT_INC =
FFT_PATH =
FFT_LIB =
Build the binary with:
python build.py cuda
cp lmp_cuda ../
The stock “3d Lennard-Jones melt” test problem is used as the benchmark for this code. The problem size is set as a fixed number of particles per core.
A pre-configured input script is available here: cpu.tar.gz
For x86 CPU benchmarks, this is 500K particles per core. Therefore, if you wish to run the benchmark on 24 x86 cores, a total of 12,000K (500K × 24) particles is required.
To run the benchmark, use:
mpirun -np <N> -hostfile <HF> ./lmp_intel_cpu -sf intel -v x <X> -v y <Y> -v z <Z> -v t 100 < in.lj
where <N> is the number of cores, <HF> is the hostfile, and <X>, <Y> and <Z> are the problem scaling factors used to reach 500K particles per core. The input script has been pre-configured for 500K particles, so X, Y and Z scale the particle count up for larger core counts. To run the benchmark on more cores, simply scale X, Y and Z such that their product equals the number of cores desired.
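Before launching a run, it can be worth checking that the chosen factors match the core count. A minimal sketch (the CORES, X, Y, Z values are placeholders for your own job size):

```shell
#!/bin/sh
# Sanity-check the scaling factors before launching a CPU benchmark.
# Placeholder values: a 96-core job with X=6, Y=4, Z=4.
CORES=96
X=6; Y=4; Z=4

PRODUCT=$((X * Y * Z))
if [ "$PRODUCT" -ne "$CORES" ]; then
    echo "warning: X*Y*Z ($PRODUCT) does not equal core count ($CORES)" >&2
fi

# Total particle count at 500K particles per core, reported in thousands.
echo "total particles: $((PRODUCT * 500))K"
```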
For example, running on 4 nodes with 24 cores each, the run command would be:
mpirun -np 96 -hostfile hosts ./lmp_intel_cpu -sf intel -v x 6 -v y 4 -v z 4 -v t 100 < in.lj
This gives a total particle count of 500K × (6×4×4) = 48×10^6, which conforms to 500K particles per core (48×10^6 / 96 = 500K).
For the GPU benchmark, the same process is used as above, but with a larger problem size: 8M particles per GPU. Download the pre-configured input file here
mpirun -n <N> -hostfile <HF> ./lmp_cuda_mixed -c on -sf cuda -pk cuda 1 -v x 1 -v y 1 -v z 1 < in.lj
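As with the CPU case, the scaling factors should multiply to the number of processes, here one per GPU. A hedged sketch for a hypothetical 8-GPU job (the GPU count and hostfile name "hosts" are assumptions, not from the original guide):

```shell
#!/bin/sh
# Assumed example: 8 GPUs total; pick X, Y, Z so that X*Y*Z equals the GPU
# count. The GPU input file is pre-configured for 8M particles per GPU.
GPUS=8
X=2; Y=2; Z=2

# Total particle count, reported in millions (8M per GPU).
echo "total particles: $((X * Y * Z * 8))M"

# The run command would then be (commented out here; "hosts" is a placeholder):
# mpirun -n $GPUS -hostfile hosts ./lmp_cuda_mixed -c on -sf cuda -pk cuda 1 \
#     -v x $X -v y $Y -v z $Z < in.lj
```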
The results of the benchmark are reported as particle-timesteps per second. To calculate this value, take the number of particles in the simulation, multiply by the number of simulation timesteps, and divide by the runtime. For example:
500 000 [particles] * 300 [simulation timesteps] / 25.4 [seconds] = 5.9 x10^6 [particle-timesteps/second]
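The same arithmetic can be scripted when processing many runs; a small sketch using awk, fed with the numbers from the worked example above (these are illustrative values, not a real log):

```shell
#!/bin/sh
# Compute particle-timesteps/second from a finished run.
# Values taken from the worked example above.
PARTICLES=500000
TIMESTEPS=300
RUNTIME=25.4

awk -v p="$PARTICLES" -v t="$TIMESTEPS" -v s="$RUNTIME" \
    'BEGIN { printf "%.2e particle-timesteps/second\n", p * t / s }'
```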