Several versions of WRF, built with different combinations of compiler and MPI implementation, are installed on the filesystem in /apps/chpc/earth/. Tests have indicated a very large performance benefit from using the Intel compiler and Intel MPI rather than the GNU compiler with OpenMPI or MPICH. MPICH versions need the mpirun argument -iface ib0 to force the use of the InfiniBand network. Please note that it is essential to set an unlimited stack size for the Intel-compiled versions, as done in the example scripts.

To set up an appropriate environment, "source" the setWRF file in the required directory with a command of the following type:

. /apps/chpc/earth/WRF-3.8-impi/setWRF

This command should be placed in the PBS Pro job submission script.

Users need to develop their own workflows, but it is also practical to execute the pre-processing steps geogrid.exe, ungrib.exe, metgrid.exe and real.exe on a single node in an interactive session. To obtain an interactive session, give the command

qsub -I -q smp -P <AAAA0000>

where <AAAA0000> should be replaced with your project code. Do not try to run these pre-processing steps from the login shell, as the shared login node cannot sustain a high workload. For large cases, the real.exe pre-processing step may run into memory constraints. In that case, run real.exe in parallel over the requested number of nodes, but with only one process per node, as per the example script.
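The "one process per node" approach for real.exe relies on reducing the PBS node file, which lists each host once per MPI process, to a list of unique hosts. The following is a minimal sketch of that step only. The helper names unique_hosts and count_nodes are hypothetical, and the commented mpirun line assumes that real.exe is on the PATH after sourcing setWRF.

```shell
#!/bin/bash
# Hypothetical helper: print the unique host names in a PBS-style node file
# (one host name per line, repeated once per MPI process on that host).
unique_hosts() {
    sort -u "$1"
}

# Hypothetical helper: count the distinct nodes listed in the node file.
count_nodes() {
    unique_hosts "$1" | wc -l | tr -d ' '
}

# Inside a PBS job, PBS_NODEFILE is set by the scheduler.
# Outside a job this block does nothing.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    unique_hosts "$PBS_NODEFILE" > hosts
    echo "Distinct nodes in this job: $(count_nodes "$PBS_NODEFILE")"
    # Assumed launch line, one real.exe process per node:
    # mpirun -np "$(count_nodes "$PBS_NODEFILE")" -machinefile hosts real.exe &> real.out
fi
```

Writing the unique hosts to a separate machine file means that wrf.exe can still be launched afterwards with the full $PBS_NODEFILE.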
WRF-4.0 has been installed with support for OpenMP. It is therefore possible to run with the same total number of cores, but fewer MPI processes. By default, the setWRF script sets the environment variable OMP_NUM_THREADS to 1. If you want to experiment with OpenMP, set OMP_NUM_THREADS in your job script after sourcing the setWRF script. Testing on Lengau has confirmed that there are substantial performance benefits to be obtained from using OpenMP. Benchmark results are given below, but using 6 MPI ranks per node, with 4 OpenMP threads per MPI rank, appears to be close to optimal. Although the WRF-4 / gcc-8.3.0 / mpich-3.3 installation also supports OpenMP, performance testing indicates that this version does not benefit from using OpenMP. The version compiled with the PGI compiler is not yet functional.
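When experimenting with other thread counts, the number of MPI ranks per node multiplied by the number of OpenMP threads per rank should equal the number of cores per node (24 in the examples on this page). The following sketch works out the matching PBS resource request. The function names ranks_per_node and pbs_select_line are hypothetical and are not part of the CHPC installation.

```shell
#!/bin/bash
# Hypothetical helper: MPI ranks per node for a given core count and
# OpenMP thread count. Fails if the cores do not divide evenly.
ranks_per_node() {
    local cores=$1 threads=$2
    if [ "$threads" -le 0 ] || [ $(( cores % threads )) -ne 0 ]; then
        echo "cores per node ($cores) not divisible by threads ($threads)" >&2
        return 1
    fi
    echo $(( cores / threads ))
}

# Hypothetical helper: print the PBS select line for a hybrid MPI/OpenMP job.
pbs_select_line() {
    local nodes=$1 cores=$2 threads=$3 ranks
    ranks=$(ranks_per_node "$cores" "$threads") || return 1
    echo "#PBS -l select=${nodes}:ncpus=${cores}:mpiprocs=${ranks}"
}

# 10 nodes, 24 cores per node, 4 OpenMP threads per MPI rank
pbs_select_line 10 24 4
```

For 10 nodes with 4 threads per rank this prints a request with mpiprocs=6, which matches the layout reported above as close to optimal.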
WRF versions compiled with support for parallel NetCDF have pnc in the directory name. When a large number of CPU cores is used, WRF's run time becomes dominated by the time taken to write hourly output files. Appropriate use of parallel NetCDF can dramatically reduce the I/O time.
Making effective use of Parallel NetCDF with I/O quilting requires some changes to the namelist.input file as well as the PBS script.
With acknowledgement to John Michalakes (NREL) and Andrew Porter (STFC Daresbury).
If you build WRF yourself, run configure with the PNETCDF environment variable set to the PnetCDF installation directory.
WRF installations with the added chemistry model and kinetic pre-processor are also available in /apps/chpc/earth/, and contain, unsurprisingly, CHEM in the directory name. As per the above instructions, source the setWRF script in that directory to set up a suitable environment.
The following script runs the WPS pre-processing steps, real.exe and wrf.exe:

#!/bin/bash
#### For the distributed memory versions of the code that we use at CHPC, mpiprocs should be equal to ncpus
#### Here we have selected the maximum resources available to a regular CHPC user
#### Obviously provide your own project identifier
#### For your own benefit, try to estimate a realistic walltime request. Over-estimating the
#### wallclock requirement interferes with efficient scheduling, will delay the launch of the job,
#### and ties up more of your CPU-time allocation until the job has finished.
#PBS -l select=10:ncpus=24:mpiprocs=24 -q normal -P TEST1234
#PBS -l walltime=3:00:00
#PBS -o /home/username/scratch/WRFV3_test/run/stdout
#PBS -e /home/username/scratch/WRFV3_test/run/stderr
#PBS -m abe
#PBS -M firstname.lastname@example.org
### Source the WRF-3.8 environment:
export WRFDIR=/apps/chpc/earth/WRF-3.8-impi_hwl
. $WRFDIR/setWRF
# Set the stack size unlimited for the intel compiler
ulimit -s unlimited
##### Running commands
# Set PBS_JOBDIR to where YOUR simulation will be run
export PBS_JOBDIR=/home/username/scratch/WRFV3_test/run
# First though, change to YOUR WPS directory
export WPS_DIR=/export/home/username/scratch/WPS_test
cd $WPS_DIR
# Clean the directory of old files
rm FILE*
rm GRIB*
rm geo_em*
rm met_em*
# Link to the grib files, obviously use the location of YOUR grib files
./link_grib.csh ../DATA_test/GFS_*
# Run geogrid.exe
geogrid.exe &> geogrid.out
# Run ungrib.exe
ungrib.exe &> ungrib.out
# Run metgrid.exe
metgrid.exe &> metgrid.out
# Now change to the main job directory
cd $PBS_JOBDIR
# Link the met_em* data files into this directory
ln -s $WPS_DIR/met_em* ./
# Figure out how many processes to use for wrf.exe
nproc=$(wc -l < $PBS_NODEFILE)
# Now figure out how many nodes are being used
sort -u $PBS_NODEFILE > hosts
# Number of nodes to be used for real.exe
nnodes=$(wc -l < hosts)
# Run real.exe with one process per node
exe=$WRFDIR/WRFV3/run/real.exe
mpirun -np $nnodes -machinefile hosts $exe &> real.out
# Run wrf.exe with the full number of processes
exe=$WRFDIR/WRFV3/run/wrf.exe
mpirun -np $nproc -machinefile $PBS_NODEFILE $exe &> wrf.out
The following script runs wrf.exe only, with Parallel NetCDF:
#!/bin/bash
#### For the distributed memory versions of the code that we use at CHPC, mpiprocs should be equal to ncpus
#### Here we have selected the maximum resources available to a regular CHPC user
#### Obviously provide your own project identifier
#### For your own benefit, try to estimate a realistic walltime request. Over-estimating the
#### wallclock requirement interferes with efficient scheduling, will delay the launch of the job,
#### and ties up more of your CPU-time allocation until the job has finished.
#PBS -l select=10:ncpus=24:mpiprocs=24 -q normal -P TEST1234
#PBS -l walltime=3:00:00
#PBS -o /home/username/scratch/WRFV3_test/run/stdout
#PBS -e /home/username/scratch/WRFV3_test/run/stderr
#PBS -m abe
#PBS -M email@example.com
### Source the WRF-3.8 environment with parallel NetCDF:
export WRFDIR=/apps/chpc/earth/WRF-3.8-pnc-impi
. $WRFDIR/setWRF
# Set the stack size unlimited for the intel compiler
ulimit -s unlimited
##### Running commands
# Set PBS_JOBDIR to where YOUR simulation will be run
export PBS_JOBDIR=/home/username/scratch/WRFV3_test/run
cd $PBS_JOBDIR
exe=$WRFDIR/WRFV3/run/wrf.exe
# Clear and re-set the lustre striping for the job directory. For the lustre configuration
# used by CHPC, a stripe count of 12 should work well.
lfs setstripe -d $PBS_JOBDIR
lfs setstripe -c 12 ./
## For this example, assume that nproc_x=8, nproc_y=28, nio_tasks_per_group=4 and nio_groups=4, for a total
## of 16 I/O processes and 224 solver processes, therefore 240 MPI processes in total.
mpirun -np 240 -machinefile $PBS_NODEFILE $exe &> wrf.out
The following script runs wrf.exe only, with Parallel NetCDF and OpenMP. Please note that this should provide close to optimal performance:
#!/bin/bash
### Request 10 compute nodes with 6 MPI processes per node
#PBS -l select=10:ncpus=24:mpiprocs=6:nodetype=haswell_reg
#PBS -q normal
#PBS -P ERTH1234
#PBS -l walltime=06:00:00
#PBS -N WRF4-10X6X4
#PBS -o /home/userid/lustre/WRFrun/wrf4.out
#PBS -e /home/userid/lustre/WRFrun/wrf4.err
### These two stack size settings are essential for use with Intel-compiled code
ulimit -s unlimited
export OMP_STACKSIZE=2G
### Source the appropriate environment script
. /apps/chpc/earth/WRF-4.0-pnc-impi/setWRF
### Set PBSJOBDIR to the run directory (not to the stdout file named above)
export PBSJOBDIR=/home/userid/lustre/WRFrun
cd $PBSJOBDIR
### Get total number of MPI ranks
nproc=$(wc -l < $PBS_NODEFILE)
### Clear and set the lustre stripes for the working directory
lfs setstripe -d $PBSJOBDIR
lfs setstripe -c 12 ./
### Issue the command line, passing the number of OpenMP threads.
### These affinity settings work OK, but may be unnecessary. YMMV.
time mpirun -machinefile $PBS_NODEFILE -np $nproc \
     -genv OMP_NUM_THREADS 4 \
     -genv KMP_AFFINITY "verbose,granularity=core,compact,0,1" \
     -bind-to socket -map-by socket wrf.exe > runWRF.out
For use with Parallel NetCDF, the namelist.input file has to contain the following:

 .
&time_control
 .
 .
 io_form_history = 11
 io_form_restart = 11
 io_form_input = 2
 io_form_boundary = 2
 nocolons = T
 .
 .
&domains
 .
 .
 numtiles = 1,
 nproc_x = 7,
 nproc_y = 8,
 .
 .
&namelist_quilt
 nio_tasks_per_group = 2,
 nio_groups = 2,
/

In this particular example, we are using 10 nodes with 6 MPI processes per node, for a total of 60. However, four of these processes are consumed by the I/O quilting (2 nio_groups with 2 nio_tasks_per_group each), leaving only 56 MPI processes for performing the calculations, hence the nproc_x by nproc_y partitioning of 7×8. It should be noted that in this case, using relatively few nodes, the performance sacrificed by setting processing cores aside for I/O is not justified. However, if more nodes are used, there will be a substantial overall saving from activating parallel I/O.

If PnetCDF is not used, but OpenMP is used, set nproc_x and nproc_y to -1, and numtiles equal to the number of OpenMP threads.
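The rank arithmetic for this namelist (solver ranks plus I/O quilting ranks must equal the number of MPI processes launched) is easy to get wrong when the node count changes. The following sketch checks it before a job is submitted. The function names total_ranks and check_layout are hypothetical, and the values have to be copied by hand from namelist.input and the PBS script.

```shell
#!/bin/bash
# Hypothetical helper: total MPI ranks implied by the namelist settings.
# Solver ranks = nproc_x * nproc_y
# I/O ranks    = nio_groups * nio_tasks_per_group
total_ranks() {
    local nproc_x=$1 nproc_y=$2 nio_groups=$3 nio_tasks=$4
    echo $(( nproc_x * nproc_y + nio_groups * nio_tasks ))
}

# Hypothetical helper: succeed only if the namelist layout matches the
# number of MPI ranks that the job will launch.
check_layout() {
    local requested=$1
    shift
    local needed
    needed=$(total_ranks "$@")
    if [ "$needed" -ne "$requested" ]; then
        echo "namelist needs $needed MPI ranks, job provides $requested" >&2
        return 1
    fi
    echo "layout OK: $needed MPI ranks"
}

# 10 nodes x 6 ranks = 60; nproc_x=7, nproc_y=8, nio_groups=2, nio_tasks_per_group=2
check_layout 60 7 8 2 2
```

The same check applied to the MPI-only Parallel NetCDF example (8×28 solver ranks plus 4×4 I/O ranks) gives 240 ranks, matching the 10 nodes with 24 processes each requested there.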
For post-processing, ARW-Post, NCL and GrADS have been installed, and the necessary paths and environment variables are set up by sourcing the setWRF file. In addition, ncview is available in /apps/chpc/earth/ncview-2.1.7-gcc. Source the script file setNCView in that directory to set up a suitable environment. Alternatively, use just the binary /apps/chpc/earth/ncview-2.1.7-gcc/utils/bin/ncview. For graphics, refer to the Remote Visualization page for instructions on setting up a VNC session.
The four line graphs below illustrate parallel scaling of WRF-3.7 and WRF-3.8 on the cluster, using MPI parallelisation only. Please note that if you are using a large number of cores, writing hourly output files will significantly slow down the run. Use a version of WRF compiled with parallel NetCDF support, and an appropriate input file to overcome this. Check in the rsl.out.0000 file to see how much time is being used for writing output files. If it takes much more than 2 or 3 seconds to write an output file, use parallel NetCDF. Using all cores per node produces the best performance per node, but it is also a case of diminishing returns, with very little advantage gained from the last few cores per node.
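To see how long the output writes are taking, the timing lines in rsl.out.0000 can be summarised with a short awk filter. This is a sketch under an assumption about the log format: it expects lines of the form "Timing for Writing <file> for domain 1: 3.41 elapsed seconds", so check a few lines of your own rsl.out.0000 first. The function name write_time_summary is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: summarise output-writing times reported in an rsl file.
# ASSUMPTION: the relevant lines look like
#   Timing for Writing <file> for domain 1: 3.41 elapsed seconds
write_time_summary() {
    awk '
        /^ *Timing for Writing/ {
            # The number of seconds is the field just before "elapsed"
            for (i = 2; i <= NF; i++) {
                if ($i == "elapsed") {
                    t = $(i - 1) + 0
                    n++
                    total += t
                    if (t > max) max = t
                }
            }
        }
        END {
            if (n == 0) { print "no write timings found"; exit }
            printf "writes=%d mean=%.2f max=%.2f\n", n, total / n, max
        }
    ' "$1"
}

# Only runs if an rsl.out.0000 file exists in the current directory
if [ -r rsl.out.0000 ]; then
    write_time_summary rsl.out.0000
fi
```

If the mean reported here is well above the 2 to 3 seconds mentioned above, the run is a candidate for a parallel NetCDF build.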
The bar graph explores the different parallelisation options with WRF-4.0. In this case, the number of nodes used was kept to the standard 10 nodes available on the normal queue. The following conclusions are made from this study: