
Welcome to the Bioinformatics at the CHPC wiki page! This page describes the basic procedures involved in getting your programs running at the CHPC, rather than how to do any particular bioinformatics analysis. For the most part we will assume you have at least a little familiarity with Linux. Much of this information is available elsewhere in the CHPC's wiki (probably in more detail), but here we are trying to have everything accessible in one place for the bioinformatics community. Please do read the quick start guide before continuing and pay special attention to the sections on queues and parameters.
The CHPC has a Globus endpoint named CHPC-Lengau for transferring data to and from the cluster storage via GridFTP. Access it via http://globus.org/ using your cluster username and password (the same credentials you use to log in via ssh).
Galaxy GUI access to the cluster has been provided in the past and may be made available in the future if there is enough demand for it from our users.
Before you can access the command line you need an account. To get one, you and your PI should both follow the instructions to apply for resources.
Once your registration has been approved, Linux and OSX users can simply open a terminal and connect to the server via ssh using a command of the form:
localuser@my_linux:~ $ ssh username@lengau.chpc.ac.za
Last login: Mon Feb 29 14:05:35 2016 from 10.128.23.235
username@login1:~ $
where username is the username you are assigned upon registration. Windows users can download the PuTTY client.
Once connected, users can: use the modules system to get access to bioinformatics programs; create job scripts using editors such as vim or nano; and finally submit and monitor their jobs.
For now, a quick and simple way of getting access to the bioinformatics software is via the module command. Running:
username@login2:~ $ module avail
will present you with the various modules available on the system and you should see something like:
------------------------------------------------ /cm/local/modulefiles ------------------------------------------------
cluster-tools/7.1        freeipmi/1.4.8       mvapich2/mlnx/gcc/64/2.1    use.own
cluster-tools-dell/7.1   gcc/5.1.0            null                        version
cmd                      ipmitool/1.8.15      openldap
cmsh                     module-git           openmpi/mlnx/gcc/64/1.8.8
dot                      module-info          shared
----------------------------------------------- /cm/shared/modulefiles ------------------------------------------------
acml/gcc/64/5.3.1                           chpc/python/anaconda/2
acml/gcc/fma4/5.3.1                         chpc/python/anaconda/3
acml/gcc/mp/64/5.3.1                        chpc/qespresso/5.3.0/openmpi-1.8.8/gcc-5.1.0
acml/gcc/mp/fma4/5.3.1                      chpc/R/3.2.3-gcc5.1.0
acml/gcc-int64/64/5.3.1                     chpc/vasp/5.3/openmpi-1.8.8/gcc-5.1.0
acml/gcc-int64/fma4/5.3.1                   chpc/zlib/1.2.8/intel/16.0.1
acml/gcc-int64/mp/64/5.3.1                  cmgui/7.1
acml/gcc-int64/mp/fma4/5.3.1                default-environment
acml/open64/64/5.3.1                        gdb/7.9
acml/open64/fma4/5.3.1                      hdf5/1.6.10
acml/open64/mp/64/5.3.1                     hdf5_18/1.8.14
acml/open64/mp/fma4/5.3.1                   hpl/2.1
acml/open64-int64/64/5.3.1                  hwloc/1.9.1
acml/open64-int64/fma4/5.3.1                intel/compiler/64/15.0/2015.5.223
acml/open64-int64/mp/64/5.3.1               intel-cluster-checker/2.2.2
acml/open64-int64/mp/fma4/5.3.1             intel-cluster-runtime/ia32/3.7
blas/gcc/64/3.5.0                           intel-cluster-runtime/intel64/3.7
blas/open64/64/3.5.0                        intel-cluster-runtime/mic/3.7
bonnie++/1.97.1                             intel-tbb-oss/ia32/43_20150424oss
chpc/amber/12/openmpi-1.8.8/gcc-5.1.0       intel-tbb-oss/intel64/43_20150424oss
chpc/amber/14/openmpi-1.8.8/gcc-5.1.0       iozone/3_430
chpc/BIOMODULES                             iperf/3.0.11
chpc/cp2k/2.6.2/openmpi-1.8.8/gcc-5.1.0     lapack/gcc/64/3.5.0
...
Bioinformatics modules would add a lot to that list, so they have a list of their own. Running
username@login2:~ $ module add chpc/BIOMODULES
followed by
username@login2:~ $ module avail
will result in the following being added to the above list (this list will expand considerably as various applications are added to the system):
----------------------------------------- /apps/chpc/scripts/modules/bio/app ------------------------------------------
anaconda/2      doxygen/1.8.11   java/1.8.0_73          ncbi-blast/2.3.0/intel   R/3.2.3-gcc5.1.0
anaconda/3      git/2.8.1        mpiblast/1.6.0         python/2.7.11            texlive/2015
cmake/3.5.1     htop/2.0.1       ncbi-blast/2.3.0/gcc   python/3.5.1
Now, to make use of BLAST, say, one can type:
username@login2:~ $ module add ncbi-blast/2.3.0/gcc
The appropriate environment variables are then set (usually this is as simple as adding a directory to the search path). Running:
username@login2:~ $ module list
will show which modules have been loaded. Whereas:
username@login2:~ $ module del modulename
will unload a module. And finally:
username@login2:~ $ module show modulename
will show what module modulename actually does.
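A quick way to confirm what a loaded module has done (a sanity check, not part of the documented workflow) is to ask the shell where a tool now resolves from and to inspect your search path:
which blastn    # should point into the directory added by the ncbi-blast module
echo $PATH      # the module's bin directory appears near the front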
Next one must create a job script such as the one below:
#!/bin/bash
#PBS -l select=1:ncpus=2
#PBS -l walltime=10:00:00
#PBS -q serial
#PBS -P SHORTNAME
#PBS -o /mnt/lustre/users/username/my_data/stdout.txt
#PBS -e /mnt/lustre/users/username/my_data/stderr.txt
#PBS -N TophatEcoli
#PBS -M myemailaddress@someplace.com
#PBS -m b
module add chpc/BIOMODULES
module add tophat/2.1.1
NP=`cat ${PBS_NODEFILE} | wc -l`
EXE="tophat"
ARGS="--num-threads ${NP} someindex reads1 reads2 -o output_dir"
cd /mnt/lustre/users/username/my_data
${EXE} ${ARGS}
Note that username should be your username and SHORTNAME should be your research programme's code. More details on the job script file can be found in our PBS quickstart guide.
Finally submit your job using:
username@login2:~ $ qsub my_job.qsub
192757.sched01
username@login2:~ $
where 192757.sched01 is the jobID that is returned.
Jobs can then be monitored/controlled in several ways:
username@login2:~ $ qstat -u username

sched01:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
192759.sched01  username serial   TophatEcol    --    1  24    --  00:02 Q   --
username@login2:~ $
username@login2:~ $ qstat -f 192759.sched01
Job Id: 192759.sched01
    Job_Name = TophatEcoli
    Job_Owner = username@login2.cm.cluster
    resources_used.cpupercent = 0
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.ncpus = 96
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    job_state = R
    queue = serial
    server = sched01
    Checkpoint = u
    ctime = Mon Oct 10 06:57:13 2016
    Error_Path = login2.cm.cluster:/mnt/lustre/users/username/my_data/stderr.txt
    exec_host = cnode0962/0*24
    exec_vnode = (cnode0962:ncpus=24)
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Oct 10 06:57:15 2016
    Output_Path = login2.cm.cluster:/mnt/lustre/users/username/my_data/stdout.txt
    Priority = 0
    qtime = Mon Oct 10 06:57:13 2016
    Rerunable = True
    Resource_List.ncpus = 24
    Resource_List.nodect = 1
    Resource_List.place = free
    Resource_List.select = 1:ncpus=24
    Resource_List.walltime = 00:02:00
    stime = Mon Oct 10 06:57:15 2016
    session_id = 36609
    jobdir = /mnt/lustre/users/username
    substate = 42
    Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash,
        PBS_O_HOME=/home/dane,PBS_O_LOGNAME=username,
        PBS_O_WORKDIR=/mnt/lustre/users/username/my_data,
        PBS_O_LANG=en_ZA.UTF-8,
        PBS_O_PATH=/apps/chpc/bio/anaconda3/bin:/apps/chpc/bio/R/3.3.1/gcc-6.2
        .0/bin:/apps/chpc/bio/bzip2/1.0.6/bin:/apps/chpc/bio/curl/7.50.0/bin:/a
        pps/chpc/bio/lib/png/1.6.21/bin:/apps/chpc/bio/openmpi/2.0.0/gcc-6.2.0_
        java-1.8.0_73/bin:...
    comment = Job run at Mon Oct 10 at 06:57 on (cnode0962:ncpus=24)+(cnode0966
        :ncpus=24)+(cnode0971:ncpus=24)+(cnode0983:ncpus=24)
    etime = Mon Oct 10 06:57:13 2016
    umask = 22
    run_count = 1
    eligible_time = 00:00:00
    Submit_arguments = my_job.qsub
    pset = rack=cx14
    project = SHORTNAME
username@login01:~ $
username@login01:~ $ qdel 192759.sched01
username@login01:~ $
Many scientific software tools rely on specific versions of libraries, compilers, and dependencies that often conflict with each other or with system-wide installations. Conda is a powerful, language-agnostic environment and package manager that helps solve this problem by allowing users to manage Python, R, C/C++, FORTRAN, and other language ecosystems in isolated environments.
For most use cases, especially in bioinformatics, CHPC provides pre-built, shared Conda environments installed under:
'/apps/chpc/bio/anaconda3-2020.02/envs'
These environments are curated by CHPC staff to include commonly used tools in genomics, transcriptomics, and other domains.
To access Conda functionality, first load the required modules:
module load chpc/BIOMODULES
module load conda_init
The second module updates your .bashrc file by adding necessary shell variables. To apply these changes, you can either log out and log back in, or run:
source ~/.bashrc
After this setup you won't need to load additional modules for your jobs; only the eval and conda activate steps are required.
Activate Conda shell integration:
eval "$(conda shell.bash hook)"
This command sets up your shell environment to recognize Conda commands like `conda activate`.
conda info --envs
This will display all available shared Conda environments and their paths.
conda activate nameOfTheEnv
Replace nameOfTheEnv with the name of an environment from the previous step.
Tip: If you're unsure which environment to use, contact CHPC support or explore the environment's contents with `conda list`.
Note: You do not need to install anything when using shared environments.
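As a sketch of how this fits into a batch job, the fragment below activates a shared environment inside a PBS script. The environment name my_shared_env is a placeholder; substitute one of the names reported by conda info --envs, and adjust the resources to your analysis:
#!/bin/bash
#PBS -l select=1:ncpus=4
#PBS -l walltime=02:00:00
#PBS -q serial
#PBS -P SHORTNAME
#PBS -N conda_example

# conda_init should already have updated your ~/.bashrc (see above);
# loading the modules again here is harmless if you want to be explicit
module add chpc/BIOMODULES
module add conda_init

# make conda commands available in this non-interactive shell
eval "$(conda shell.bash hook)"

# activate a shared environment (placeholder name)
conda activate my_shared_env

cd /mnt/lustre/users/username/my_data
# run a tool provided by the environment, e.g.:
# samtools --version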
If you need software that is not included in the shared environments, you may create your own private Conda environment. This gives you full control over the software stack and package versions.
Important: Do not install environments in your home directory (/home/<username>); use your Lustre project storage instead.
ssh to username@scp.chpc.ac.za; the password is the same as the one you use on Lengau.
module load chpc/BIOMODULES
module load conda_init
eval "$(conda shell.bash hook)"
conda create --prefix /mnt/lustre/<username>/myenv python=3.10
This will create a Conda environment at the specified path with Python 3.10 installed. You can replace the Python version or leave it out if not needed.
conda activate /mnt/lustre/<username>/myenv
After activation, you can install any packages you need.
conda install mamba -n base -c conda-forge
Tip: Mamba is a drop-in replacement for Conda that uses a faster dependency solver written in C++. Once installed, you can use `mamba` instead of `conda` for installing packages:
mamba install numpy pandas
This significantly speeds up installations and environment solves, especially when working with large scientific packages.
conda install numpy pandas matplotlib
You can install packages one by one, or include them during environment creation:
conda create --prefix /mnt/lustre/<username>/myenv python=3.10 numpy pandas
Old or unused environments can be removed to free up space:
conda remove --prefix /mnt/lustre/<username>/myenv --all
A few notes on private environments:
- Create your environments under /mnt/lustre/<username>.
- Avoid the $HOME directory, it may lead to quota issues or slow performance.
- Use the --prefix flag to create environments with absolute paths, especially on clusters where --name may default to $HOME.
- Do not install environments into /apps or $HOME.
Singularity is an open-source, cross-platform container platform specifically designed for scientific and high-performance computing environments. It prioritizes reproducibility, portability, and security, all essential for scientific workflows. Singularity enables users to package entire workflows, including software, libraries, and environment settings, into a single container image. This ensures consistent application execution across various systems without modification. This capability simplifies the migration of complex computational environments and supports reproducible research practices. A detailed Singularity user guide is available here.
Singularity images for commonly used bioinformatics tools are stored in the following directories:
/apps/chpc/bio
/home/apps/chpc/bio
To view the available .sif images, run the following script:
#!/bin/bash

dirs=("/home/apps/chpc/bio" "/apps/chpc/bio")

# Loop through and search for .sif files only in immediate subdirectories
for dir in "${dirs[@]}"; do
    if [ -d "$dir" ]; then
        echo "Searching for .sif files under $dir (only first subfolder level):"
        find "$dir" -mindepth 2 -maxdepth 2 -type f -name "*.sif" -readable -exec ls -al {} \; 2>/dev/null
    else
        echo "Directory $dir does not exist."
    fi
done
Before pulling a new image, run the script above to check whether it is already available. Only pull an image yourself if you are confident in what you are doing and plan to remove it afterwards.
⚠️ Important: Singularity image files can be very large and may consume significant storage in your Lustre or project directory. Please remove any images you no longer need to help conserve shared storage resources.
To pull Singularity images from public container registries (like DockerHub), follow these steps:
SSH into the CHPC Globus node:
ssh username@globus.chpc.ac.za
Load the Singularity module:
module load chpc/singularity
Navigate to your desired working directory:
cd /path/to/working_directory
Pull the image from DockerHub (or another registry):
singularity pull docker://repository/image:tag
This downloads and converts the image into a local .sif file saved in your current directory.
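For example, pulling the FastQC container that is referenced again in the Nextflow section below would look something like this (the working directory and tag shown are placeholders; any public image follows the same pattern):
module load chpc/singularity
cd /mnt/lustre/users/username/containers
singularity pull docker://biocontainers/fastqc:v0.11.9_cv8
# by default the image is written to the current directory,
# here as fastqc_v0.11.9_cv8.sif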
To run a tool from an image, use:
singularity exec /path/to/image.sif <command> <OPTIONS>
Most bioinformatics containers don’t include large reference datasets. Instead, bind external directories at runtime.
CHPC provides bioinformatics databases at:
/mnt/lustre/bsp/DB
These include reference genomes, annotation and index files used by BWA, BLAST, Kraken2, etc.
How to Bind a Database Directory
Use the --bind (or -B) option:
singularity exec --bind /mnt/lustre/bsp/DB:/databases /path/to/my_image.sif <your_command>
Explanation:
/mnt/lustre/bsp/DB:/databases → Host path mapped to container path
/path/to/my_image.sif → Singularity image path
<your_command> → The tool command (e.g., bwa index, blastn, etc.)
Inside the container, always refer to the database as /databases unless the tool's manual specifies otherwise.
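As an illustration, a hypothetical BLAST search against a database under the bound path could look like the following; the image path, database name and file names are placeholders:
singularity exec --bind /mnt/lustre/bsp/DB:/databases /path/to/blast_image.sif \
    blastn -db /databases/nt -query my_reads.fasta -out my_hits.txt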
To bind multiple directories, use a comma-separated list:
singularity exec --bind /mnt/lustre/bsp/DB:/databases,/mnt/lustre/username/data:/data /path/to/my_image.sif <your_command>
This binds:
- /mnt/lustre/bsp/DB to /databases
- /mnt/lustre/username/data to /data
Use those paths in your tools or pipelines.
Best Practices
1. Use absolute paths for bindings.
2. Keep host/container paths logical (e.g., /databases, /data).
3. Clean up containers and intermediate data regularly.
To clean up:
rm /path/to/image.sif
rm -rf /path/to/working_directory/*.sif
Clean the Singularity cache periodically:
singularity cache clean
An example PBS job script that runs a command inside a container:
#!/bin/bash
#PBS -N singularity_job
#PBS -q normal
#PBS -l select=1:ncpus=24
#PBS -l walltime=12:00:00
#PBS -o singularity_output.log
#PBS -e singularity_error.log
#PBS -M your.email@domain.com
#PBS -m abe

# Load necessary modules
module load chpc/singularity

# Change to your working directory
cd $PBS_O_WORKDIR

# Run your command inside the container
singularity exec --bind /mnt/lustre/projects/<your_project>:/data my_image.sif <your_command_inside_container>
Nextflow is a free and open-source workflow management system that enables the development and execution of data analysis pipelines. It simplifies complex computational workflows and ensures reproducibility, scalability, and portability—whether you’re working on a laptop, HPC cluster, or in the cloud. Nextflow workflows are written using DSL2, allowing modular code design and seamless integration with container technologies like Docker, Singularity, Conda, or manual installations. Official documentation is available here.
CHPC supports Nextflow workflows through Singularity containers. Since compute nodes have no internet access, all dependencies must be downloaded in advance on the login node.
Log into the CHPC login node using your Lengau credentials:
ssh username@scp.chpc.ac.za
Use this session to prepare your workflow and submit jobs.
Load the necessary environment modules:
module load chpc/BIOMODULES nextflow
module load chpc/singularity
Note: Modules must be reloaded in every new session unless added to your ~/.bashrc.
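If you prefer to have them loaded automatically, you could append the same module lines to your ~/.bashrc (optional; shown here exactly as used above):
# load Nextflow and Singularity in every new session
module load chpc/BIOMODULES nextflow
module load chpc/singularity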
Pull your workflow and dependencies on the login1 node:
Pull the workflow:
nextflow pull nf-core/rnaseq
Run a test execution:
nextflow run nf-core/rnaseq -profile test
This will:
Cache the workflow in ~/.nextflow/assets/
Download containers (if configured)
Retrieve auxiliary files and dependencies
Workflow code is stored in:
~/.nextflow/assets
Container images are stored in:
~/.singularity
🗂 Finding the nextflow.config File
After pulling a workflow, you’ll typically find the nextflow.config file in its root directory.
Example:
cd ~/.nextflow/assets/nf-core/rnaseq/
ls
Look for:
nextflow.config
If missing, config files may reside in the conf/ directory or be fetched remotely. You can always override settings by creating your own nextflow.config.
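For example, a custom config can be supplied at run time with Nextflow's -c option (my_custom.config is a placeholder file name); settings in it override the pipeline defaults:
nextflow run nf-core/rnaseq -c my_custom.config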
🚫 No Manual PBS Scripts Needed
Nextflow automatically generates and submits PBS scripts. You only define resources in nextflow.config.
Configure the PBS executor and resource usage for workflow processes:
process {
    executor = 'pbs'

    withLabel: big_job {
        queue  = 'smp'
        cpus   = 24
        memory = '120 GB'
        time   = '24h'
    }
}
Customize resources for specific process groups using labels:
process {
    executor = 'pbs'

    withLabel: big_job {
        cpus   = 16
        memory = '64 GB'
        time   = '12h'
        queue  = 'smp'
    }

    withLabel: short_job {
        cpus   = 1
        memory = '1 GB'
        time   = '15m'
        queue  = 'smp'
    }
}
Use the label in your pipeline:
process bigTask {
    label 'big_job'
    ...
}
Enable Singularity support:
singularity.enabled = true
singularity.autoMounts = true
Specify containers:
From Docker Hub:
process.container = 'docker://biocontainers/fastqc:v0.11.9_cv8'
From local image:
process.container = '/path/to/image.sif'
Set cache directory to avoid re-downloads:
singularity.cacheDir = '/path/to/.singularity'
Customize PBS job submission:
executor {
    name = 'pbs'
    queueSize = 20
}

// extra qsub flags are passed through the process scope
process.clusterOptions = '-V -m abe -M your@email.com'
Profiles let you switch configurations easily:
profiles {
    standard {
        process.executor = 'pbs'
        process.queue    = 'smp'
    }

    local {
        process.executor = 'local'
        docker.enabled   = false
    }

    cluster_singularity {
        process.executor    = 'pbs'
        singularity.enabled = true
        process.container   = 'file:///path/to/container.sif'
    }
}
Run a profile:
nextflow run main.nf -profile cluster_singularity
Run jobs on compute nodes without internet access:
nextflow run ~/.nextflow/assets/nf-core/rnaseq \
    -profile singularity -offline
⚠️ Always include -offline on compute nodes to prevent online fetching.
Each Nextflow process runs in its own unique work directory (e.g. work/ab/xyz123), containing:
.command.run — generated PBS job script
.command.sh — wrapped shell script
.command.log — job output
.exitcode — exit status
To inspect a failed job:
cd work/ab/xyz123/
less .command.log
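Once the underlying problem is fixed, Nextflow's -resume flag lets the run continue from its cached results instead of recomputing every step, for example:
nextflow run ~/.nextflow/assets/nf-core/rnaseq \
    -profile singularity -offline -resume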
nextflow.config centralizes all pipeline settings.
No need to write PBS scripts manually.
Resources, container usage, and submission options are all configurable.
Profiles improve portability and reproducibility.
Offline mode is essential for CHPC compute node compatibility.
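Pulling the snippets above together, a minimal CHPC-oriented nextflow.config might look like the sketch below; the queue names, cache path and resource figures are illustrative and should be adapted to your own workflow:
// minimal example nextflow.config for PBS + Singularity on Lengau
process {
    executor = 'pbs'
    queue    = 'smp'

    withLabel: big_job {
        cpus   = 24
        memory = '120 GB'
        time   = '24h'
    }
}

singularity {
    enabled    = true
    autoMounts = true
    cacheDir   = '/mnt/lustre/users/username/.singularity'
}

executor {
    queueSize = 20
}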
🧠 Tip: For workflows requiring reference data, bind directories just like with containers:
nextflow run /path/to/my_pipeline -profile singularity -offline \
    --input /data/input.fastq \
    --genomeDir /mnt/lustre/bsp/DB/genomes
🧹 Clean Up: Nextflow stores all its cache files in your home directory, so it's important to clean up these files once you're finished using a workflow to avoid running out of space.
rm -rf ~/.nextflow/assets/
rm -rf ~/.nextflow/tmp
rm -rf ~/.singularity
If you encounter issues or need a specific tool installed, contact the CHPC support team at:
Include your job script and all errors encountered.
Several examples of running BLAST can be found in this page on using fault-tolerant BLAST.
Big thanks to Peter van Heusden for developing this script.
#!/bin/bash
WORKDIR="/lustre/users/${USER}/blast_proj"
INPUT_FASTA=${WORKDIR}/data_set.fa.gz
BLAST_E_VAL="1e-3"
BLAST_DB="/mnt/lustre/bsp/NCBI/BLAST/nr"
THREADS=24
BLAST_HOURS=0
BLAST_MINUTES=30
ID_FMT="%01d"
SPLIT_PREFIX="sub_set"
MAIL_ADDRESS="youremail@somewhere.ac.za"

zcat ${INPUT_FASTA} | csplit -z -f ${WORKDIR}/${SPLIT_PREFIX} -b "${ID_FMT}.split.fasta" - '/^>/' '{*}'

NUM_PARTS=$(ls sub_set*.split.fasta | wc -l)
START=0
END=$(expr $NUM_PARTS - 1)

TMPSCRIPT=thejob.sh

# note: make a distinction between variables set by the containing script (e.g. WORKDIR) and
# ones set in the script (e.g. INDEX). The ones set in the script need to be escaped out
cat >${TMPSCRIPT} << END
#!/bin/bash
#PBS -l select=1:ncpus=${THREADS}
#PBS -l place=excl:group=nodetype
#PBS -l walltime=${BLAST_HOURS}:${BLAST_MINUTES}:00
#PBS -q normal
#PBS -m ae
#PBS -M ${MAIL_ADDRESS}
. /etc/profile.d/modules.sh
module add chpc/BIOMODULES
module add ncbi-blast/2.6.0
INDEX="${WORKDIR}/${SPLIT_PREFIX}\${PBS_ARRAY_INDEX}"
INFILE="\${INDEX}.split.fasta"
OUTFILE="\${INDEX}.blastx.xml"
cd ${WORKDIR}
blastx -num_threads 8 -evalue ${BLAST_E_VAL} -db ${BLAST_DB} -outfmt 5 -query \${INFILE} -out \${OUTFILE}
END

BLAST_JOBID=$(qsub -N sunblast -J ${START}-${END} ${TMPSCRIPT} | cut -d. -f1)
echo "submitted: ${BLAST_JOBID}"
rm ${TMPSCRIPT}

cat >${TMPSCRIPT} << END
#!/bin/bash
#PBS -l select=1:ncpus=1
#PBS -l place=free
#PBS -l walltime=1:00:00
#PBS -q workq
#PBS -m ae
#PBS -M ${MAIL_ADDRESS}
#PBS -W depend=afterok:${BLAST_JOBID}
cd ${WORKDIR}
tar jcf blast-xml-output.tar.bz2 *.blastx.xml
END

qsub -N tarblast ${TMPSCRIPT}
rm ${TMPSCRIPT}
This script is designed to be run from the login node; it creates the job scripts themselves and submits them. There are a number of things to notice.
Things to note about this script: bowtie currently does not run across multiple nodes, so using anything other than select=1 will result in compute resources being wasted.
Then your job script called bowtie_script.qsub will look something like this:
#! /bin/bash
#PBS -l select=1:ncpus=24
#PBS -l place=excl
#PBS -l walltime=06:00:00
#PBS -q workq
#PBS -o /home/username/lustre/some_reads/stdout.txt
#PBS -e /home/username/lustre/some_reads/stderr.txt
#PBS -M youremail@address.com
#PBS -m be
#PBS -N bowtiejob
##################
MODULEPATH=/opt/gridware/bioinformatics/modules:$MODULEPATH
source /etc/profile.d/modules.sh
#######module add
module add bowtie2/2.2.2
NP=`cat ${PBS_NODEFILE} | wc -l`
EXE="bowtie2"
forward_reads="A_reads_1.fq,B_reads_1.fq"
reverse_reads="A_reads_2.fq,B_reads_2.fq"
output_file="piggy_hits.sam"
# bowtie2 takes the index with -x and writes SAM output with -S
ARGS="-x sscrofa -q --threads ${NP} -1 ${forward_reads} -2 ${reverse_reads} -S ${output_file}"
${EXE} ${ARGS}
Note: username should contain your actual user name!
Finally submit your job using:
user@login01:~ $ qsub bowtie_script.qsub
If you would like to try running namd2 on the GPU, please take a look at this.
The job script that follows is for running NAMD over InfiniBand. Note that this does not use MPI, so the script is somewhat different from other scripts you may see here.
#!/bin/bash
#PBS -l select=10:ncpus=12:mpiprocs=12
#PBS -l place=excl
#PBS -l walltime=00:05:00
#PBS -q workq
#PBS -o /home/username/lustre/namd2/stdout.txt
#PBS -e /home/username/lustre/namd2/stderr.txt
#PBS -m ae
#PBS -M youremail@address.com
#PBS -N NAMD_bench
. /etc/profile.d/modules.sh
MODULEPATH=/opt/gridware/bioinformatics/modules:${MODULEPATH}
module add NAMD/2.10_ibverbs
cd /export/home/${USER}/scratch5/namd2
pbspro_namd apoa1.namd
Finally submit your job using:
user@login01:~ $ qsub namd.qsub
There is an example here of how one might use R on Lengau.
Databases are accessible on the cluster in the directory
/mnt/lustre/bsp/DB
Please contact us to request software updates/installs; download big datasets; get advice on the best way to run your analysis; or to tell us what is/isn't working!