====== Running ARW / WRF at the CHPC ======
There are several versions of WRF, using different combinations of compiler and MPI implementation, installed on the filesystem in ''/apps/chpc/earth/''.  The latest version is WRF-4.1.1, built with the Intel compiler.  Tests have indicated a very large benefit from using the Intel compiler and MPI rather than the Gnu compiler with OpenMPI or MPICH.  MPICH versions need the mpirun argument ''-iface ib0'' to force them to use the Infiniband network.  Please note that it is essential to set an unlimited stack size for the Intel-compiled version, as done in the script below.  To set up an appropriate environment, "source" the setWRF file in the required directory with the following type of command: ''. /apps/chpc/earth/WRF-3.8-impi/setWRF''.  This command should be placed in the PBS-Pro job submission script.  Users need to develop their own workflows, but it is also practical to execute the pre-processing steps ''geogrid.exe, ungrib.exe, metgrid.exe and real.exe'' in single-node mode with an interactive session.  Simply give the command ''qsub -I -q smp -P <AAAA0000>'', where <AAAA0000> should be replaced with **your** project code, to obtain an interactive session.  Do not try to run these pre-processing steps from the login shell, as the shared login node cannot sustain a high workload.  The real.exe pre-processing step for large cases may run into memory constraints.  In that case, run real.exe in parallel over the requested number of nodes, but with only one process per node, as per the example script.
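As a rough sketch (not taken from the original page), such an interactive pre-processing session could look as follows; the working directory is a placeholder, and the WPS executables are assumed to be run from your own WPS run directory:

<file>
# Request an interactive session (replace AAAA0000 with your project code)
qsub -I -q smp -P AAAA0000
# In the interactive session, set up the WRF environment
. /apps/chpc/earth/WRF-3.8-impi/setWRF
# Move to your own WPS working directory (placeholder path)
cd /home/username/scratch/WPS_run
# Run the pre-processing steps in sequence
./geogrid.exe
./ungrib.exe
./metgrid.exe
# For smaller cases real.exe can also be run here; for large cases run it in parallel as per the example script below
</file>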
  
==== OpenMP ====
WRF-4.0 and WRF-4.1.1 have been installed with support for OpenMP.  It is therefore possible to run using the same number of cores in total, but fewer MPI processes.  By default, the environment variable **OMP_NUM_THREADS** is set to 1 in the setWRF script.  Testing on Lengau has confirmed that there are substantial performance benefits to be obtained from using OpenMP.  Benchmark results are given below, but it appears to be close to optimal to use 6 MPI ranks per node, with 4 OpenMP threads per MPI rank.  If you want to experiment with OpenMP, set this variable in your job script **after** sourcing the setWRF script.  Although the WRF-4 / gcc-8.3.0 / mpich-3.3 installation also supports OpenMP, performance testing indicates that this version does not benefit from using OpenMP.  The version compiled with the PGI compiler is competitive with the Intel version when using MPI only, but also does not benefit from adding OpenMP.
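As an illustration of the hybrid setup described above, a minimal sketch of the relevant job-script lines is given below.  The ''-ppn'' (ranks per node) option is an Intel MPI assumption, and the node and core counts are only examples; a machinefile with six entries per node would achieve the same placement.

<file>
export WRFDIR=/apps/chpc/earth/WRF-4.1.1-pnc-impi
. $WRFDIR/setWRF
# Override the default of 1 OpenMP thread, *after* sourcing setWRF
export OMP_NUM_THREADS=4
# Example: 10 nodes x 6 MPI ranks per node x 4 OpenMP threads = 240 cores in total
mpirun -np 60 -ppn 6 $WRFDIR/WRF/run/wrf.exe &> wrf.out
</file>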
  
==== WRF, Parallel NetCDF and I/O Quilting ====
#PBS -m abe
#PBS -M username@unseenuniversity.ac.za
### Source the WRF-4.1.1 environment:
export WRFDIR=/apps/chpc/earth/WRF-4.1.1-pnc-impi
. $WRFDIR/setWRF
# Set the stack size unlimited for the intel compiler
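# (The actual command is not shown in this excerpt; the standard way to do this would be)
ulimit -s unlimited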
nnodes=`cat hosts | wc -l`
# Run real.exe with one process per node
exe=$WRFDIR/WRF/run/real.exe
mpirun -np $nnodes -machinefile hosts $exe &> real.out
# Run wrf.exe with the full number of processes
exe=$WRFDIR/WRF/run/wrf.exe
mpirun -np $nproc -machinefile $PBS_NODEFILE $exe &> wrf.out
</file>
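The ''hosts'' file with one entry per node, used by the real.exe step above, is built in a part of the script not shown here.  One possible way of constructing it from the PBS node file (an illustration, not necessarily the method used in the original script) is:

<file>
# $PBS_NODEFILE lists one line per allocated core; reduce it to one line per node
sort -u $PBS_NODEFILE > hosts
</file>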
#PBS -m abe
#PBS -M username@unseenuniversity.ac.za
### Source the WRF-4.1.1 environment with parallel NetCDF:
export WRFDIR=/apps/chpc/earth/WRF-4.1.1-pnc-impi
. $WRFDIR/setWRF
# Set the stack size unlimited for the intel compiler
export PBS_JOBDIR=/home/username/scratch/WRFV3_test/run
cd $PBS_JOBDIR
exe=$WRFDIR/WRF/run/wrf.exe
# Clear and re-set the lustre striping for the job directory.  For the lustre configuration
# used by CHPC, a stripe size of 12 should work well.
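# (The striping commands themselves are not shown in this excerpt.  One way of doing this,
#  reading "a stripe size of 12" as a stripe count of 12, would be:)
lfs setstripe -d $PBS_JOBDIR
lfs setstripe -c 12 $PBS_JOBDIR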
export OMP_STACKSIZE=2G
### Source the appropriate environment script
. /apps/chpc/earth/WRF-4.1.1-pnc-impi/setWRF
export PBSJOBDIR=/home/userid/lustre/WRFrun/wrf4.out
cd $PBSJOBDIR
The bar graph explores the different parallelisation options with WRF-4.0.  In this case, the number of nodes used was kept to the standard 10 nodes available on the normal queue.  The following conclusions can be drawn from this study:
  * Using 24 cores per node instead of 12 cores produces a relatively modest improvement
  * Trading off MPI processes in favour of more OpenMP threads improves performance up to 4 OpenMP threads, when using the Intel-compiled version
  * Although there are many options for setting process and thread affinity, the defaults generally work quite well
  * For this particular problem set on only 10 nodes, there is a small penalty associated with using parallel I/O, but this will be reversed if more nodes are used
  * Although the PGI version is competitive with the Intel version when using MPI only, it does not benefit from adding OpenMP threads.  The GCC version is generally not competitive, and gets worse when OpenMP is also used.
  
{{:howto:wrf4_omp_pnc_scaling2.png?direct&500|}}
{{:howto:wrf_01.png?direct&500|}}
{{:howto:wrf_02.png?direct&500|}}