CHPC Quick Start Guide

This guide is intended for experienced HPC users and provides a summary of the essential components of the systems available at the CHPC. For more detailed information on the subjects below see the full User Guide.

NOTE: the new system is still under construction and information here and in the User Guide is incomplete and subject to sudden change.

Overview: 24 472 cores

The CHPC's brand new Dell Linux cluster is up and running.

The new system is a homogeneous cluster based on Intel 5th-generation CPUs. As of February 2016 it has 1008 compute nodes, each with 24 cores and 128 GiB of memory, and five large-memory “fat” nodes, each with 56 cores and 1 TiB of memory, all interconnected with FDR (56 Gb/s) InfiniBand and accessing 4 PB of shared storage over the Lustre filesystem.

Logging in

To connect to the new system, ssh to lengau.chpc.ac.za and log in using the username and password sent to you by the CHPC:

ssh username@lengau.chpc.ac.za

The new system is running CentOS 7.0 and uses the Bash shell by default.

You should change your password after logging in for the first time, using the passwd command. Password rules: 10 characters, with at least one of each of the following character types: upper case letters, lower case letters, numbers, and special characters. Use ssh keys wherever possible.
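
For example, if you choose to use ssh keys, a key pair can be generated on your workstation and installed on the cluster with the standard OpenSSH tools (the key type shown is only a common choice):

ssh-keygen -t ed25519
ssh-copy-id username@lengau.chpc.ac.za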

scp/rsync

To transfer data onto or off the CHPC cluster, use the scp or rsync commands and connect to or from the dedicated transfer server scp.chpc.ac.za, not the login node.

Examples

From the command line on your Linux workstation:

scp filetocopy.tar.gz yourusername@scp.chpc.ac.za:/mnt/lustre/users/yourusername/run15/

transfers the file filetocopy.tar.gz from your local computer to the Lustre file system on the CHPC cluster, under the run15/ subdirectory of your scratch directory /mnt/lustre/users/yourusername/ (where yourusername is replaced by your user name on the CHPC cluster).
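
rsync is better suited to large or interrupted transfers because it can resume and only copies what has changed. A typical invocation might look like this (the local directory name is illustrative and the flags are a common choice, not a requirement):

rsync -avP rundata/ yourusername@scp.chpc.ac.za:/mnt/lustre/users/yourusername/run15/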

Read more on connecting to the CHPC...

Shared Filesystems

The new cluster provides both NFS and Lustre filesystems, served over InfiniBand:

Mount point        File system   Size    Quota   Backup   Access
/home              NFS           80 TB   15 GB   Yes      Yes
/mnt/lustre        Lustre        4 PB    none    No       Yes
/lustre/SCRATCH5   Lustre        1 PB    none    No       No longer available
/apps              NFS           20 TB   none    Yes      On request
/lustre/data       Lustre        1 PB    none    No       On request only

Quotas

The /home file system is managed by quotas and a strict limit of 15 GB (15 000 000 000 bytes) is applied to it. Please take care not to fill up your home directory. Use /mnt/lustre/users/yourusername to store large files. If your project requires access to large files over a long duration (more than 60 days), please submit a request to the helpdesk.

You can see how much you are currently using with the du command:

du --si -s $HOME
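
The same command can be used to check the space consumed under your Lustre scratch directory (the path below assumes the standard layout described above):

du --si -s /mnt/lustre/users/$USER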

IMPORTANT

Make sure that all jobs use a working directory on the Lustre file system. Do not use your home directory for the working directory of your job. Use the directory allocated to you on the fast Lustre parallel file system:

/mnt/lustre/users/USERNAME/

where USERNAME is replaced by your user name on the CHPC cluster.

Always provide the full absolute path to your Lustre sub-directories. Do not rely on a symbolic link from your home directory.

Software

Software resides in /apps, which is an NFS file system mounted on all nodes:

/apps/       Description                                                     Comment
chpc/        Application codes supported by CHPC                             (See below)
compilers/   Compilers, other programming languages and development tools
libs/        Libraries
scripts/     Modules and other environment setup scripts
tools/       Miscellaneous software tools
user/        Code installed by a user research programme                     Not supported by CHPC.

Application Codes by Scientific Domain

/apps/chpc/   Scientific domain
astro/        Astrophysics & Cosmology
bio/          Bioinformatics
chem/         Chemistry
compmech/     Mechanics
cs/           Computer Science
earth/        Earth
image/        Image Processing
material/     Material Science
phys/         Physics
space/        Space

Modules

CHPC uses the GNU modules utility, which manipulates your environment, to provide access to the supported software in /apps/.

Each of the major CHPC applications has a modulefile that sets, unsets, appends to, or prepends to environment variables such as $PATH, $LD_LIBRARY_PATH, $INCLUDE and $MANPATH for the specific application. Each modulefile also sets functions or aliases for use with the application. You need only invoke a single command to configure the application or programming environment properly. The general format of this command is:

module load <module_name>

where <module_name> is the name of the module to load. It also supports Tab-key completion of command parameters.

For a list of available modules:

module avail

The module command may be abbreviated and optionally given a search term, e.g.:

module ava chpc/open

To see a synopsis of a particular modulefile's operations:

module help <module_name>

To see currently loaded modules:

module list

To remove a module:

module unload <module_name>

After upgrades of software in /apps/, new modulefiles are created to reflect the changes made to the environment variables.

Disclaimer: Codes in /apps/user/ are not supported by the CHPC and the TE for each research programme is required to create the appropriate module file or startup script.

Compilers

Supported compilers for C, C++ and Fortran are found in /apps/compilers along with interpreters for programming languages like Python.

For MPI programs, the appropriate libraries and mpi* compiler wrapper scripts are also available.

GNU Compiler Collection

The recommended compiler and MPI library combination is GCC 5.1.0 with OpenMPI 1.8.8, accessed by loading both modules:

module add gcc/5.1.0
module add chpc/openmpi/1.8.8/gcc-5.1.0
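
With these modules loaded, MPI code can be compiled with the usual OpenMPI wrapper scripts, for example (source and binary names are illustrative):

mpicc -O2 -o hello_mpi hello_mpi.c
mpif90 -O2 -o hello_mpi hello_mpi.f90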

Intel compiler and Intel MPI

The module for the Intel compiler and Intel MPI is loaded with

module load chpc/parallel_studio_xe/64/16.0.1/2016.1.150
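
With this module loaded, Intel MPI provides its own wrapper scripts for the Intel compilers, for example (file names are illustrative):

mpiicc -O2 -o hello_mpi hello_mpi.c
mpiifort -O2 -o hello_mpi hello_mpi.f90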

Scheduler

The CHPC cluster uses PBSPro as its job scheduler. With the exception of interactive jobs, all jobs are submitted to a batch queuing system and only execute when the requested resources become available. All batch jobs are queued according to priority. A user's priority is not static: the CHPC uses the “Fairshare” facility of PBSPro to modify priority based on activity. This is done to ensure the finite resources of the CHPC cluster are shared fairly amongst all users.

Queues

The default queue, workq, is no longer to be used.

The available queues are:

Queue name   Max. cores   Min. cores   Max. jobs   Max. jobs   Max. time   Notes                                     Access
             per job      per job      in queue    running     (hrs)
serial       24           1            ???         ???         48          For single-node non-parallel jobs.
smp          24           1            20          10          96          For single-node parallel jobs.
normal       240          48           20          10          48          The standard queue for parallel jobs.
large        2400         264          10          5           48          For large parallel runs.                  Restricted
bigmem       280          28           4           1           48          For the large memory (1 TiB RAM) nodes.   Restricted
vis          24           1            1           1           3           Visualisation node.
test         24           1            1           1           3           Normal nodes, for testing only.

Notes:

  • A standard compute node has 24 cores and 128 GiB of memory (RAM).
  • Each large memory node has 56 cores and 1 TiB of memory.
  • Access to the large and bigmem queues is restricted and by special application only.
  • Additional restrictions:
Queue name   Max. total simultaneous running cores
normal       480
large        4800

PBS Pro commands

qstat   View queued jobs.
qsub    Submit a job to the scheduler.
qdel    Delete one of your jobs from the queue.

Job script parameters

Parameters for any job submission are specified as #PBS comments in the job script file or as options to the qsub command. The essential options for the CHPC cluster include:

 -l select=10:ncpus=24:mpiprocs=24

sets the size of the job in number of processors:

select=N     number of nodes needed
ncpus=N      number of cores per node
mpiprocs=N   number of MPI ranks (processes) per node

 -l walltime=4:00:00

sets the total expected wall clock time in hours:minutes:seconds. Note the wall clock limits for each queue.

The job size and wall clock time must be within the limits imposed on the queue used:

 -q normal

to specify the queue.

Each job will draw from the allocation of cpu-hours granted to your Research Programme:

 -P PRJT1234

specifies the project identifier short name, which is needed to identify the Research Programme allocation you will draw from for this job. Ask your PI for the project short name and replace PRJT1234 with it.
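
These options can equally well be passed directly to qsub instead of being embedded as #PBS comments in the script; for example (the project name and script name are illustrative):

qsub -q normal -P PRJT1234 -l select=10:ncpus=24:mpiprocs=24 -l walltime=4:00:00 myjob.sh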

Restricted queues

The large and bigmem queues are restricted to users who have need for them. If you are granted access to these queues then you should specify that you are a member of the largeq or bigmemq groups. For example:

#PBS -q large
#PBS -W group_list=largeq

Example job scripts

An MPI program using 240 cores

Using the normal queue to run WRF:

#!/bin/bash
#PBS -l select=10:ncpus=24:mpiprocs=24:nodetype=haswell_reg
#PBS -P PRJT1234
#PBS -q normal
#PBS -l walltime=4:00:00
#PBS -o /mnt/lustre/users/USERNAME/WRF_Tests/WRFV3/run2km_100/wrf.out
#PBS -e /mnt/lustre/users/USERNAME/WRF_Tests/WRFV3/run2km_100/wrf.err
#PBS -m abe
#PBS -M your.email@address
ulimit -s unlimited
. /apps/chpc/earth/WRF-3.7-impi/setWRF
cd /mnt/lustre/users/USERNAME/WRF_Tests/WRFV3/run2km_100
rm wrfout* rsl*
nproc=`cat $PBS_NODEFILE | wc -l`
echo nproc is $nproc
cat $PBS_NODEFILE
time mpirun -np $nproc wrf.exe > runWRF.out

Assuming the above job script is saved as the text file example.job, the command to submit it to the PBSPro scheduler is:

qsub example.job

No additional parameters are needed for the qsub command since all the PBS parameters are specified within the job script file.

IMPORTANT

Note that in the above job script example the working directory is on the Lustre file system. Do not use your home directory for the working directory of your job. Use the directory allocated to you on the fast Lustre parallel file system:

/mnt/lustre/users/USERNAME/

where USERNAME is replaced by your user name on the CHPC cluster.

Always provide the full absolute path to your Lustre sub-directories. Do not rely on a symbolic link from your home directory.

Hybrid MPI/OpenMP

For example, to request an MPI job on one node with 12 cores per MPI rank, so that each MPI process can launch 12 OpenMP threads, change the -l parameters:

#PBS -l select=1:ncpus=24:mpiprocs=2:nodetype=haswell_reg

This gives two MPI ranks on the node, so launch the program with mpirun -np 2 …, with each rank running 12 OpenMP threads (see the sketch below).
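
Inside the job script the corresponding launch commands might look like this (the binary name hybrid_app is illustrative; the thread count and any binding options depend on your application and MPI library):

export OMP_NUM_THREADS=12
mpirun -np 2 ./hybrid_app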

Example interactive job request

To request an interactive session on a single node, the full command for qsub is:

qsub -I -P PROJ0101 -q smp -l select=1:ncpus=24:mpiprocs=24:nodetype=haswell_reg

Note:

  • -I selects an interactive job
  • you still must specify your project
  • the queue must be smp, serial or test
  • interactive jobs only get one node: select=1
  • for the smp queue you can request several cores: ncpus=24
  • you can run MPI code: indicate how many ranks you want with mpiprocs=

If you find your interactive session timing out too soon then add -l walltime=4:0:0 to the above command line to request the maximum 4 hours.
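
For example, the full command with the maximum wall time requested would be (project name as above, for illustration):

qsub -I -P PROJ0101 -q smp -l select=1:ncpus=24:mpiprocs=24:nodetype=haswell_reg -l walltime=4:0:0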
