
Running ARW / WRF at the CHPC

There are several versions of WRF, built with different combinations of compiler and MPI implementation, installed on the filesystem in /apps/chpc/earth/. The latest version is WRF-4.1.1, built with the Intel compiler. Tests have indicated a very large benefit from using the Intel compiler and MPI rather than the Gnu compiler with OpenMPI or MPICH. MPICH versions need the mpirun argument -iface ib0 to force use of the Infiniband network. Please note that it is essential to set an unlimited stack size for the Intel-compiled version, as is done in the example scripts below.

To set up an appropriate environment, “source” the setWRF file in the required directory with a command of this form: . /apps/chpc/earth/WRF-3.8-impi/setWRF . This command should be placed in the PBS-Pro job submission script.

Users need to develop their own workflows, but it is also practical to execute the pre-processing steps geogrid.exe, ungrib.exe, metgrid.exe and real.exe in single-node mode in an interactive session. Simply give the command qsub -I -q smp -P <AAAA0000>, where <AAAA0000> should be replaced with your project code, to obtain an interactive session. Do not try to run these pre-processing steps from the login shell, as the shared login node cannot sustain a high workload. The real.exe pre-processing step may run into memory constraints for large cases. In that case, run real.exe in parallel over the requested number of nodes, but with only one process per node, as per the example script.

OpenMP

WRF-4.0 and WRF-4.1.1 have been installed with support for OpenMP. It is therefore possible to run with the same total number of cores, but fewer MPI processes. Testing on Lengau has confirmed that there are substantial performance benefits to be obtained from using OpenMP. Benchmark results are given below; it appears to be close to optimal to use 6 MPI ranks per node, with 4 OpenMP threads per MPI rank. By default, the environment variable OMP_NUM_THREADS is set to 1 in the setWRF script; if you want to experiment with OpenMP, set this variable in your job script after sourcing the setWRF script. Although the WRF-4 / gcc-8.3.0 / mpich-3.3 installation also supports OpenMP, performance testing indicates that this version does not benefit from using OpenMP. The version compiled with the PGI compiler is competitive with the Intel version when using MPI only, but also does not benefit from adding OpenMP.
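As a sketch, the 6-ranks-per-node, 4-threads-per-rank layout discussed above can be expressed in the job script like this (the node and rank counts are the example values from the benchmarks, not requirements):

```shell
# Illustrative layout: 10 nodes x 6 MPI ranks x 4 OpenMP threads = all 240 cores.
# Take the node and rank counts from your own PBS select line.
nodes=10
ranks_per_node=6
export OMP_NUM_THREADS=4     # set AFTER sourcing setWRF, which defaults this to 1
total_ranks=$(( nodes * ranks_per_node ))
total_cores=$(( total_ranks * OMP_NUM_THREADS ))
echo "$total_ranks MPI ranks, $total_cores cores in total"
```

The point of the arithmetic is that ranks × threads should fill the 24 cores of each node, so fewer MPI ranks always means proportionally more OpenMP threads.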

WRF, Parallel NetCDF and I/O Quilting

WRF versions compiled with support for parallel NetCDF have pnc in the directory name. When a large number of CPU cores is used, WRF's run time becomes dominated by the time taken to produce hourly outputs. Appropriate use of parallel NetCDF can dramatically reduce this I/O time.

Making effective use of Parallel NetCDF with I/O quilting requires some changes to the namelist.input file as well as the PBS script.

  • It is recommended that the domain decomposition be specified manually by setting appropriate values for nproc_x, nproc_y, nio_tasks_per_group and nio_groups in namelist.input.
  • However, for the real.exe pre-processing step, it is necessary to have nproc_x and nproc_y set to -1.
  • There is conflicting advice on suitable values for the above parameters. Our experimentation shows that setting nio_groups=2 works quite well, and nio_tasks_per_group should divide into nproc_y. For example, if nproc_y=24, nio_tasks_per_group=12 should work acceptably well. However, the benefit of using so many nio_tasks_per_group is minimal. Using as few as 2 or 4 also works well.
  • The nocolons flag should be set to T. This should also be done in the namelist.wps file to ensure that met_em* files without colons are built for WRF.
  • Set lustre stripe count for the output files (see example script). Using multiples of 12 works well on the CHPC cluster.
  • Issue the mpirun command with nproc = (nproc_x * nproc_y) + (nio_tasks_per_group * nio_groups) processes in total
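The process-count arithmetic in the last bullet can be sketched in the job script; the values below are examples only and must of course mirror your own namelist.input:

```shell
# Example values only -- these must match the settings in namelist.input
nproc_x=8
nproc_y=28
nio_tasks_per_group=4
nio_groups=4
# Total MPI ranks = solver ranks + quilt (I/O) ranks
nproc=$(( nproc_x * nproc_y + nio_tasks_per_group * nio_groups ))
echo $nproc
# then:  mpirun -np $nproc -machinefile $PBS_NODEFILE wrf.exe
```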

Notes on compiling WRF for use with Parallel NetCDF and quilting

With acknowledgement to John Michalakes (NREL) and Andrew Porter (STFC Daresbury).

Configure with the environment variable PNETCDF set in the shell to the PnetCDF installation directory, then:

  • Edit configure.wrf and modify ARCHFLAGS to add (or remove) -DPNETCDF_QUILT
  • touch frame/module_io_quilt.F and frame/module_quilt_outbuf_ops.F
  • Recompile (there is no need to recompile the whole code if it has already been built)
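The sequence above might look as follows in the shell. The PNETCDF path is an example, and a mock source tree is created here purely so the commands can be traced outside a real WRF checkout; on the cluster, cd into your actual WRF source directory instead:

```shell
# Mock WRF source tree so the sequence can be traced end-to-end;
# replace with:  cd /path/to/your/WRF  on the cluster.
WRF_SRC=$(mktemp -d)
mkdir -p "$WRF_SRC/frame"
cd "$WRF_SRC"
# 1. Point configure at the PnetCDF installation (example path)
export PNETCDF=/apps/chpc/earth/pnetcdf-1.11.2
# 2. After editing configure.wrf by hand to add -DPNETCDF_QUILT to ARCHFLAGS,
#    force the quilting modules to be rebuilt:
touch frame/module_io_quilt.F frame/module_quilt_outbuf_ops.F
# 3. Recompile only what changed, e.g.:  ./compile em_real
ls frame/
```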

WRFCHEM

WRF installations with the added chemistry model and kinetic pre-processor are also available in /apps/chpc/earth/, and contain, unsurprisingly, CHEM in the directory name. As per the above instructions, source the setWRF script in that directory to set up a suitable environment.

Example scripts

runWRF.qsub
#!/bin/bash 
#### For the distributed memory versions of the code that we use at CHPC, mpiprocs should be equal to ncpus
#### Here we have selected the maximum resources available to a regular CHPC user
####  Obviously provide your own project identifier
#### For your own benefit, try to estimate a realistic walltime request.  Over-estimating the 
#### wallclock requirement interferes with efficient scheduling, will delay the launch of the job,
#### and ties up more of your CPU-time allocation until the job has finished.
#PBS -l select=10:ncpus=24:mpiprocs=24 -q normal -P TEST1234
#PBS -l walltime=3:00:00
#PBS -o /home/username/scratch/WRFV3_test/run/stdout
#PBS -e /home/username/scratch/WRFV3_test/run/stderr
#PBS -m abe
#PBS -M username@unseenuniversity.ac.za
### Source the WRF-4.1.1 environment:
export WRFDIR=/apps/chpc/earth/WRF-4.1.1-pnc-impi
. $WRFDIR/setWRF
# Set the stack size unlimited for the intel compiler
ulimit -s unlimited
##### Running commands
# Set PBS_JOBDIR to where YOUR simulation will be run
export PBS_JOBDIR=/home/username/scratch/WRFV3_test/run
# First though, change to YOUR WPS directory
export WPS_DIR=/export/home/username/scratch/WPS_test
cd $WPS_DIR
# Clean the directory of old files
rm -f FILE*
rm -f GRIB*
rm -f geo_em*
rm -f met_em*
# Link to the grib files, obviously use the location of YOUR grib files
./link_grib.csh ../DATA_test/GFS_* 
# Run geogrid.exe
geogrid.exe &> geogrid.out
# Run ungrib.exe
ungrib.exe &> ungrib.out
# Run metgrid.exe
metgrid.exe &> metgrid.out
# Now change to the main job directory
cd $PBS_JOBDIR
# Link the met_em* data files into this directory
ln -s $WPS_DIR/met_em* ./
# Figure out how many processes to use for wrf.exe
nproc=`cat $PBS_NODEFILE | wc -l`
# Now figure out how many nodes are being used
cat $PBS_NODEFILE | sort -u > hosts
# Number of nodes to be used for real.exe
nnodes=`cat hosts | wc -l`
# Run real.exe with one process per node
exe=$WRFDIR/WRF/run/real.exe
mpirun -np $nnodes -machinefile hosts $exe &> real.out
# Run wrf.exe with the full number of processes
exe=$WRFDIR/WRF/run/wrf.exe
mpirun -np $nproc -machinefile $PBS_NODEFILE $exe &> wrf.out

The following script runs wrf.exe only, with Parallel NetCDF:

runWRF_pnc.qsub
#!/bin/bash 
#### For the distributed memory versions of the code that we use at CHPC, mpiprocs should be equal to ncpus
#### Here we have selected the maximum resources available to a regular CHPC user
####  Obviously provide your own project identifier
#### For your own benefit, try to estimate a realistic walltime request.  Over-estimating the 
#### wallclock requirement interferes with efficient scheduling, will delay the launch of the job,
#### and ties up more of your CPU-time allocation until the job has finished.
#PBS -l select=10:ncpus=24:mpiprocs=24 -q normal -P TEST1234
#PBS -l walltime=3:00:00
#PBS -o /home/username/scratch/WRFV3_test/run/stdout
#PBS -e /home/username/scratch/WRFV3_test/run/stderr
#PBS -m abe
#PBS -M username@unseenuniversity.ac.za
### Source the WRF-4.1.1 environment with parallel NetCDF:
export WRFDIR=/apps/chpc/earth/WRF-4.1.1-pnc-impi
. $WRFDIR/setWRF
# Set the stack size unlimited for the intel compiler
ulimit -s unlimited
##### Running commands
# Set PBS_JOBDIR to where YOUR simulation will be run
export PBS_JOBDIR=/home/username/scratch/WRFV3_test/run
cd $PBS_JOBDIR
exe=$WRFDIR/WRF/run/wrf.exe
# Clear and re-set the lustre striping for the job directory.  For the lustre configuration
# used by CHPC, a stripe count of 12 should work well.
lfs setstripe -d $PBS_JOBDIR 
lfs setstripe -c 12 ./
## For this example, assume that nproc_x=8, nproc_y=28, nio_tasks_per_group=4 and nio_groups=4, for a total
## of 16 I/O processes and 224 solver processes, therefore 240 MPI processes in total.
mpirun -np 240 -machinefile $PBS_NODEFILE $exe &> wrf.out

The following script runs wrf.exe only, with Parallel NetCDF and OpenMP. Please note that this should provide close to optimal performance:

runWRF_pnc_omp.qsub
#!/bin/bash
### Request 10 compute nodes with 6 MPI processes per node
#PBS -l select=10:ncpus=24:mpiprocs=6:nodetype=haswell_reg
#PBS -q normal
#PBS -P ERTH1234
#PBS -l walltime=06:00:00
#PBS -N WRF4-10X6X4
#PBS -o /home/userid/lustre/WRFrun/wrf4.out
#PBS -e /home/userid/lustre/WRFrun/wrf4.err
### These two stack size settings are essential for use with Intel-compiled code
ulimit -s unlimited
export OMP_STACKSIZE=2G
### Source the appropriate environment script
. /apps/chpc/earth/WRF-4.1.1-pnc-impi/setWRF
export PBSJOBDIR=/home/userid/lustre/WRFrun
cd $PBSJOBDIR
### Get total number of MPI ranks
nproc=`cat $PBS_NODEFILE | wc -l`
### Clear and set the lustre stripes for the working directory
lfs setstripe -d $PBSJOBDIR
lfs setstripe -c 12 ./
### Issue the command line, passing the number of OpenMP threads.
### These affinity settings work OK, but may be unnecessary. YMMV.
time mpirun -machinefile $PBS_NODEFILE -np $nproc -genv OMP_NUM_THREADS 4 -genv KMP_AFFINITY "verbose,granularity=core,compact,0,1" -bind-to socket -map-by socket wrf.exe > runWRF.out

For use with Parallel NetCDF, the namelist.input file has to contain the settings listed below. (If PnetCDF is not used but OpenMP is, set nproc_x and nproc_y to -1, and numtiles equal to the number of OpenMP threads.) In this particular example, we are using 10 nodes with 6 MPI processes per node, for a total of 60. However, four of these processes are consumed by the I/O tiling / quilting (2 nio_groups with 2 nio_tasks_per_group each), leaving only 56 MPI processes for performing the calculations, hence the nproc_x × nproc_y partitioning of 7×8. It should be noted that in this case, using relatively few nodes, the performance sacrifice of setting processing cores aside for I/O is not justified. However, if more nodes are used, there will be a substantial overall saving from activating parallel I/O.

namelist.input
.
&time_control
.
.
 io_form_history                     = 11
 io_form_restart                     = 11
 io_form_input                       = 2
 io_form_boundary                    = 2
 nocolons                            = T
.
.
&domains
.
.
 numtiles                   = 1,
 nproc_x                    = 7,
 nproc_y                    = 8,
.
.
&namelist_quilt
 nio_tasks_per_group        = 2,
 nio_groups                 = 2,
 /
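The rank bookkeeping for this example can be checked with a little shell arithmetic (the numbers are taken from the PBS select line and the namelist fragment above):

```shell
# 10 nodes x 6 mpiprocs from the PBS select line
total_ranks=$(( 10 * 6 ))
# Quilt ranks from &namelist_quilt: nio_groups * nio_tasks_per_group
io_ranks=$(( 2 * 2 ))
# What remains must equal nproc_x * nproc_y from &domains
solver_ranks=$(( total_ranks - io_ranks ))
echo "$solver_ranks solver ranks = 7 x 8 = $(( 7 * 8 ))"
```

If the two numbers disagree, wrf.exe will abort at start-up, so this check is worth a moment before submitting a long job.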

For post-processing, ARW-Post, NCL and GrADS have been installed, and the necessary paths and environment variables set up by sourcing the setWRF file. In addition, ncview is also available as /apps/chpc/earth/ncview-2.1.7-gcc. Source the script file setNCView in that directory in order to set up a suitable environment. Alternatively, use just the binary /apps/chpc/earth/ncview-2.1.7-gcc/utils/bin/ncview. For graphics, refer to the Remote Visualization page for instructions on setting up a VNC session.

Parallel scaling

The four line graphs below illustrate parallel scaling of WRF-3.7 and WRF-3.8 on the cluster, using MPI parallelisation only. Please note that if you are using a large number of cores, writing hourly output files will significantly slow down the run. Use a version of WRF compiled with parallel NetCDF support, and an appropriate input file to overcome this. Check in the rsl.out.0000 file to see how much time is being used for writing output files. If it takes much more than 2 or 3 seconds to write an output file, use parallel NetCDF. Using all cores per node produces the best performance per node, but it is also a case of diminishing returns, with very little advantage gained from the last few cores per node.
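The output-write times mentioned above appear in rsl.out.0000 as "Timing for Writing" lines, and can be extracted with grep and awk. The file fragment below is mocked up purely for illustration; on the cluster, run the grep against the real rsl.out.0000 in your run directory:

```shell
# Mock rsl.out.0000 fragment for illustration only; the real file is
# written by wrf.exe into the run directory.
cat > rsl.out.0000 <<'EOF'
Timing for main: time 2019-01-01_00:00:30 on domain 1: 1.95000 elapsed seconds
Timing for Writing wrfout_d01_2019-01-01_01:00:00 for domain 1: 45.62000 elapsed seconds
EOF
# Print the elapsed seconds of each history write; values well above
# 2-3 s suggest switching to a parallel-NetCDF build.
grep 'Timing for Writing' rsl.out.0000 | awk '{print $(NF-2)}'
```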

The bar graph explores the different parallelisation options with WRF-4.0. In this case, the number of nodes used was kept to the standard 10 nodes available on the normal queue. The following conclusions are made from this study:

  • Using 24 cores per node instead of 12 cores produces a relatively modest improvement
  • Trading off MPI processes in favour of more OpenMP threads improves performance up to 4 OpenMP threads, when using the Intel-compiled version
  • Although there are many options for setting process and thread affinity, the defaults generally work quite well
  • For this particular problem set on only 10 nodes, there is a small penalty associated with using parallel I/O, but this will be reversed if more nodes are used
  • Although the PGI version is competitive with the Intel version when using MPI only, it does not benefit from using OpenMP processes as well. The GCC version is generally not competitive, and gets worse when OpenMP is also used.

/var/www/wiki/data/pages/howto/wrf.txt · Last modified: 2019/06/18 14:55 by ccrosby