This guide describes how to configure and run High Performance Linpack (HPL) on CUDA-compatible GPUs. In this guide the HPL benchmark is run only on the GPUs; however, there are simple steps to take in order to distribute work between both CPUs and GPUs. A CPU-only (x86) version of the HPL benchmark is performed by HPCC, presented in this guide: HPCC.
HPL Calculator http://hpl-calculator.sourceforge.net/
In order to obtain optimal results, a pre-compiled binary is used. This binary has been tested with Kepler-architecture Nvidia GPUs only (K20, K40 and K80). Download the latest version of the CUDA toolkit for your Linux distribution from the Nvidia site [https://developer.nvidia.com/cuda-downloads].
Note that if HPL is to be run across multiple hosts, you have two options: (1) manually install the CUDA driver on each node, or (2) install the driver to a non-standard shared directory (e.g. /opt).
sudo su
chmod +x cuda_x_linux_64.run
./cuda_x_linux_64.run
When prompted, choose to install the driver and the CUDA toolkit; the SDK examples are not required. You may also specify the installation locations if required. After the installation is complete, export the necessary paths (replacing the paths if you have changed them):
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/lib
export C_INCLUDE_PATH=/usr/local/cuda/include
The HPL binary has a hard dependency on the CUDA 5.5 libraries; however, newer versions are compatible (tested up to CUDA 7.0). Creating symbolic links therefore resolves the issue (correct the paths and the x.x version for your installation):
ln -s /usr/local/cuda/lib64/libcublas.so.x.x /usr/local/cuda/lib64/libcublas.so.5.5
ln -s /usr/local/cuda/lib64/libcudart.so.x.x /usr/local/cuda/lib64/libcudart.so.5.5
Return to your user account.
Ensure that you have an Intel compiler and MKL sourced:
source /opt/intel/bin/compilervars.sh intel64
source /opt/intel/mkl/bin/mklvars.sh intel64
As well as an up-to-date GCC and OpenMPI (compiled with ICS):
export PATH=/opt/gcc-4.9.2/bin/:$PATH
export LD_LIBRARY_PATH=/opt/gcc-4.9.2/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/mpc-1.0.2/lib/:/opt/mpfr-3.1.2/lib/:/opt/gmp-6.0.0/lib/:$LD_LIBRARY_PATH
export PATH=/opt/openmpi-1.8.4-intel/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-1.8.4-intel/lib:$LD_LIBRARY_PATH
Download the source here:
tar -xf Cuda-hpl.tar.gz
cd CUDAHPL
Confirm that the CUDA HPL binary, xhpl, has all of its dependencies resolved:
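One way to check (a sketch, assuming xhpl is in the current directory) is with ldd, which lists each shared library the binary needs:

```shell
# Any line containing "not found" means a library is missing from
# LD_LIBRARY_PATH; no such lines means all dependencies resolved.
ldd ./xhpl | grep "not found"
```

If this prints nothing, the paths exported above are sufficient.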
Edit the HPL.dat file to set a suitable N (problem size) value, which should be chosen to fit into the GPU's VRAM (though increasing this value further can increase performance). Use the calculator linked above. The P and Q values should be chosen such that their product equals the number of MPI ranks (processing cores) desired, where a single GPU is counted as one core. If P and Q are not equal, Q should be assigned the larger value; however, keep them as close as possible (i.e. P=3, Q=4 rather than P=2, Q=6). For x86-only runs, the NB value is typically chosen in the range 192-256; for GPU runs, a higher value of 892-1024 is ideal.
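The sizing rule above can be sketched as a quick calculation. The matrix holds N×N double-precision (8-byte) elements, so N is roughly sqrt(VRAM/8), rounded down to a multiple of NB. The 12 GB figure below is an assumption (e.g. a K40); substitute your GPU's memory:

```shell
# Hypothetical sizing sketch: pick N so the N x N double-precision
# matrix fits in GPU VRAM, rounded down to a multiple of NB.
VRAM_BYTES=$((12 * 1024 * 1024 * 1024))   # assumption: 12 GB of VRAM
NB=1024                                   # block size for GPU runs
# N is roughly sqrt(VRAM_BYTES / 8); use awk for the square root.
N_RAW=$(awk -v b="$VRAM_BYTES" 'BEGIN { printf "%d", sqrt(b / 8) }')
N=$(( (N_RAW / NB) * NB ))
echo "N=$N NB=$NB"
```

In practice N should be somewhat smaller than this upper bound to leave VRAM headroom for the CUDA runtime and workspace buffers.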
A number of scripts are included in the tarball, corresponding to the number of GPUs per host. Before running the benchmark, edit the 'CPU_CORES_PER_GPU' variable in the script. This value should be chosen such that the total number of assigned cores for all GPUs does not exceed the number of CPU cores available on the host (i.e. if the host has 16 cores and there are 2 GPUs, CPU_CORES_PER_GPU should not be greater than 8). In practice, this value should be 5 or greater to avoid bottlenecking the GPUs.
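The constraint above amounts to a simple division. A minimal sketch, assuming a 16-core host with 2 GPUs (both values are placeholders for your hardware):

```shell
# Hypothetical check that CPU_CORES_PER_GPU does not oversubscribe
# the host: the safe maximum is CPU cores divided by GPUs.
CPU_CORES=16       # assumption: cores on the host (see `nproc`)
GPUS_PER_HOST=2    # assumption: GPUs installed in the host
MAX_CORES_PER_GPU=$(( CPU_CORES / GPUS_PER_HOST ))
echo "CPU_CORES_PER_GPU should be at most $MAX_CORES_PER_GPU"
```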
Before running HPL, ensure that the GPUs are in persistence mode and ECC is disabled.
sudo nvidia-smi --persistence-mode=1
sudo nvidia-smi --ecc-config=0
The benchmark can then be run using this script:
mpirun -np <N> -hostfile <HF> ./run_linpack_1_gpu_per_node
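As a sketch, for a two-node run with one GPU per node, <N> would be 2 (one MPI rank per GPU) and the hostfile would list one slot per GPU. The hostnames below are placeholders:

```
node01 slots=1
node02 slots=1
```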
In order to change the workload split between the GPUs and CPUs, the 'GPU_DGEMM_SPLIT' variable in the run script can be changed. By default it is set to 1.0 (100% GPU offload).
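For example, to offload 80% of the DGEMM work to the GPUs and leave the remainder to the CPU cores (the 0.80 value is illustrative; tune it for your hardware):

```shell
# Hypothetical edit to the run script: send 80% of DGEMM work to the
# GPUs, leaving 20% for the CPU cores.
export GPU_DGEMM_SPLIT=0.80
echo "GPU_DGEMM_SPLIT=$GPU_DGEMM_SPLIT"
```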