Bioinformatics at the CHPC

Welcome to the bioinformatics at the CHPC wiki page! This page describes the basic procedures involved in getting your programs running at the CHPC, rather than how to do any particular bioinformatics analysis. If anything is unclear, please hover your mouse over the superscripts! 1) For the most part we will be assuming you have at least a little familiarity with Linux. Much of this information is available elsewhere in the CHPC's wiki (probably in more detail), but here we try to have everything accessible in one place for the bioinformatics community. Please do read the quick start guide before continuing, and pay special attention to the sections on queues and parameters 2).

The CHPC has a Globus endpoint: Look for CHPC-Lengau to transfer data to/from the cluster storage. Access it via http://globus.org/ and your cluster username/password.

Web Portal Access

Galaxy GUI access to the cluster has been provided in the past and may be made available in the future if there is enough demand for it from our users.

To transfer files onto the cluster using GridFTP, the http://globus.org/ system can be used; it is accessible via our endpoint named CHPC-Lengau. Use the same credentials you use to log in via ssh.

Command Line Access

Before you can use the command line you need an account. To get one, you and your PI should both follow the instructions to apply for resources.

Once your registration has been approved, Linux and macOS users can simply open a terminal and connect to the server via ssh using a command of the form 3):

localuser@my_linux:~ $ ssh username@lengau.chpc.ac.za
Last login: Mon Feb 29 14:05:35 2016 from 10.128.23.235
username@login1:~ $

where username is the username you are assigned upon registration. Windows users can download the PuTTY client 4).

Once connected users can: use the modules system to get access to bioinformatics programs; create job scripts using editors such as vim5) or nano6); and finally submit and monitor their jobs.

Using Modules

For now, a quick and simple way of getting access to the bioinformatics software is to use the module command. Running:

username@login2:~ $ module avail

will present you with the various modules available on the system and you should see something like:

------------------------------------------------ /cm/local/modulefiles ------------------------------------------------
cluster-tools/7.1         freeipmi/1.4.8            mvapich2/mlnx/gcc/64/2.1  use.own
cluster-tools-dell/7.1    gcc/5.1.0                 null                      version
cmd                       ipmitool/1.8.15           openldap
cmsh                      module-git                openmpi/mlnx/gcc/64/1.8.8
dot                       module-info               shared

----------------------------------------------- /cm/shared/modulefiles ------------------------------------------------
acml/gcc/64/5.3.1                            chpc/python/anaconda/2
acml/gcc/fma4/5.3.1                          chpc/python/anaconda/3
acml/gcc/mp/64/5.3.1                         chpc/qespresso/5.3.0/openmpi-1.8.8/gcc-5.1.0
acml/gcc/mp/fma4/5.3.1                       chpc/R/3.2.3-gcc5.1.0
acml/gcc-int64/64/5.3.1                      chpc/vasp/5.3/openmpi-1.8.8/gcc-5.1.0
acml/gcc-int64/fma4/5.3.1                    chpc/zlib/1.2.8/intel/16.0.1
acml/gcc-int64/mp/64/5.3.1                   cmgui/7.1
acml/gcc-int64/mp/fma4/5.3.1                 default-environment
acml/open64/64/5.3.1                         gdb/7.9
acml/open64/fma4/5.3.1                       hdf5/1.6.10
acml/open64/mp/64/5.3.1                      hdf5_18/1.8.14
acml/open64/mp/fma4/5.3.1                    hpl/2.1
acml/open64-int64/64/5.3.1                   hwloc/1.9.1
acml/open64-int64/fma4/5.3.1                 intel/compiler/64/15.0/2015.5.223
acml/open64-int64/mp/64/5.3.1                intel-cluster-checker/2.2.2
acml/open64-int64/mp/fma4/5.3.1              intel-cluster-runtime/ia32/3.7
blas/gcc/64/3.5.0                            intel-cluster-runtime/intel64/3.7
blas/open64/64/3.5.0                         intel-cluster-runtime/mic/3.7
bonnie++/1.97.1                              intel-tbb-oss/ia32/43_20150424oss
chpc/amber/12/openmpi-1.8.8/gcc-5.1.0        intel-tbb-oss/intel64/43_20150424oss
chpc/amber/14/openmpi-1.8.8/gcc-5.1.0        iozone/3_430
chpc/BIOMODULES                              iperf/3.0.11
chpc/cp2k/2.6.2/openmpi-1.8.8/gcc-5.1.0      lapack/gcc/64/3.5.0
...

The bioinformatics modules would add a lot to that list, and so they have a list of their own. Running

username@login2:~ $ module add chpc/BIOMODULES

followed by

username@login2:~ $ module avail

will result in the following being added to the above list (this list will grow as more applications are added to the system):

----------------------------------------- /apps/chpc/scripts/modules/bio/app ------------------------------------------
anaconda/2             doxygen/1.8.11         java/1.8.0_73          ncbi-blast/2.3.0/intel R/3.2.3-gcc5.1.0
anaconda/3             git/2.8.1              mpiblast/1.6.0         python/2.7.11          texlive/2015
cmake/3.5.1            htop/2.0.1             ncbi-blast/2.3.0/gcc   python/3.5.1

Now, to make use of BLAST, say, one can type:

username@login2:~ $ module add ncbi-blast/2.3.0/gcc
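
To confirm that the executables provided by the module are now on your path you can, for example, run:

username@login2:~ $ which blastn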

The appropriate environment variables are then set (usually this is as simple as adding a directory to the search path). Running:

username@login2:~ $ module list

will show which modules have been loaded. Whereas:

username@login2:~ $ module del modulename

will unload a module. And finally:

username@login2:~ $ module show modulename

will show what module modulename actually does.

Create Job Scripts

Next one must create a job script such as the one below:

my_job.qsub
#!/bin/bash
#PBS -l select=1:ncpus=2
#PBS -l walltime=10:00:00
#PBS -q serial
#PBS -P SHORTNAME
#PBS -o /mnt/lustre/users/username/my_data/stdout.txt
#PBS -e /mnt/lustre/users/username/my_data/stderr.txt
#PBS -N TophatEcoli
#PBS -M myemailaddress@someplace.com
#PBS -m b
 
module add chpc/BIOMODULES
module add tophat/2.1.1
 
NP=`cat ${PBS_NODEFILE} | wc -l`
 
EXE="tophat"
ARGS="--num-threads ${NP} someindex reads1 reads2 -o output_dir"
 
cd /mnt/lustre/users/username/my_data
${EXE} ${ARGS}

Note that username should be your username and SHORTNAME should be your research programme's code. More details on the job script file can be found in our PBS quickstart guide.

Submit Job Script

Finally submit your job using:

username@login2:~ $ qsub my_job.qsub
 
192757.sched01
username@login2:~ $

where 192757.sched01 is the jobID that is returned.

Monitor jobs

Jobs can then be monitored/controlled in several ways:

qstat

check status of pending and running jobs
username@login2:~ $ qstat -u username
 
sched01: 
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
192759.sched01  username serial   TophatEcol     --   1  24    --  00:02 Q   -- 
 
username@login2:~ $
check status of a particular job
username@login2:~ $ qstat -f 192759.sched01
Job Id: 192759.sched01
    Job_Name = TophatEcoli
    Job_Owner = username@login2.cm.cluster
    resources_used.cpupercent = 0
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.ncpus = 96
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    job_state = R
    queue = serial
    server = sched01
    Checkpoint = u
    ctime = Mon Oct 10 06:57:13 2016
    Error_Path = login2.cm.cluster:/mnt/lustre/users/username/my_data/stderr.txt
    exec_host = cnode0962/0*24
    exec_vnode = (cnode0962:ncpus=24)
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Oct 10 06:57:15 2016
    Output_Path = login2.cm.cluster:/mnt/lustre/users/username/my_data/stdout.txt
    Priority = 0
    qtime = Mon Oct 10 06:57:13 2016
    Rerunable = True
    Resource_List.ncpus = 24
    Resource_List.nodect = 1
    Resource_List.place = free
    Resource_List.select = 1:ncpus=24
    Resource_List.walltime = 00:02:00
    stime = Mon Oct 10 06:57:15 2016
    session_id = 36609
    jobdir = /mnt/lustre/users/username
    substate = 42
    Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash,
        PBS_O_HOME=/home/username,PBS_O_LOGNAME=username,PBS_O_WORKDIR=/mnt/lustre/users/username/my_data,
        PBS_O_LANG=en_ZA.UTF-8,
        PBS_O_PATH=/apps/chpc/bio/anaconda3/bin:/apps/chpc/bio/R/3.3.1/gcc-6.2
        .0/bin:/apps/chpc/bio/bzip2/1.0.6/bin:/apps/chpc/bio/curl/7.50.0/bin:/a
        pps/chpc/bio/lib/png/1.6.21/bin:/apps/chpc/bio/openmpi/2.0.0/gcc-6.2.0_
        java-1.8.0_73/bin:...
    comment = Job run at Mon Oct 10 at 06:57 on (cnode0962:ncpus=24)+(cnode0966
        :ncpus=24)+(cnode0971:ncpus=24)+(cnode0983:ncpus=24)
    etime = Mon Oct 10 06:57:13 2016
    umask = 22
    run_count = 1
    eligible_time = 00:00:00
    Submit_arguments = my_job.qsub
    pset = rack=cx14
    project = SHORTNAME
 
username@login01:~ $
cancel a job
username@login01:~ $ qdel 192759.sched01
username@login01:~ $

Software Environments

Conda

Many scientific software tools rely on specific versions of libraries, compilers, and dependencies that often conflict with each other or with system-wide installations. Conda is a powerful, language-agnostic environment and package manager that helps solve this problem by allowing users to manage Python, R, C/C++, FORTRAN, and other language ecosystems in isolated environments.

Shared Conda Environments

For most use cases, especially in bioinformatics, CHPC provides pre-built, shared Conda environments installed under:

'/apps/chpc/bio/anaconda3-2020.02/envs'

These environments are curated by CHPC staff to include commonly used tools in genomics, transcriptomics, and other domains.

Step-by-step usage

Step 1: Load required modules

To access Conda functionality, first load the required modules:

module load chpc/BIOMODULES
module load conda_init

The second module updates your .bashrc file by adding necessary shell variables. To apply these changes, you can either log out and log back in, or run:

source ~/.bashrc

After this setup, you won’t need to load additional modules for your jobs—only the eval and conda activate steps are required.

Step 2: Initialize Conda in your shell

Activate Conda shell integration:

eval "$(conda shell.bash hook)"

This command sets up your shell environment to recognize Conda commands like `conda activate`.

Step 3: List available environments

conda info --envs

This will display all available shared Conda environments and their paths.

Step 4: Activate a shared environment

conda activate nameOfTheEnv

Replace nameOfTheEnv with the name of an environment from the previous step.

Tip: If you're unsure which environment to use, contact CHPC support or explore the environment's contents with `conda list`.
Note: You do not need to install anything when using shared environments.
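
As an illustration, this is roughly how a shared environment would be used inside a PBS job script (a minimal sketch; the environment name bio_env and the tool run at the end are placeholders, so pick a real name from conda info --envs):

#!/bin/bash
#PBS -l select=1:ncpus=4
#PBS -l walltime=02:00:00
#PBS -q serial
#PBS -P SHORTNAME

# conda_init has already updated ~/.bashrc, so only these two steps are needed in the job
eval "$(conda shell.bash hook)"
conda activate bio_env

cd /mnt/lustre/users/username/my_data
samtools --version    # placeholder: any tool provided by the shared environment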

Creating Private Conda Environments

If you need software that is not included in the shared environments, you may create your own private Conda environment. This gives you full control over the software stack and package versions.

Important: Do not install environments in your home directory (/home/<username>); use your Lustre project storage instead.

Step-by-step setup

First, ssh to username@scp.chpc.ac.za (the password is the same as the one you use on Lengau).

Step 1: Load Conda

module load chpc/BIOMODULES
module load conda_init
eval "$(conda shell.bash hook)"

Step 2: Create a new environment

conda create --prefix /mnt/lustre/<username>/myenv python=3.10

This will create a Conda environment at the specified path with Python 3.10 installed. You can replace the Python version or leave it out if not needed.

Step 3: Activate your environment

conda activate /mnt/lustre/<username>/myenv

After activation, you can install any packages you need.

Step 4: (Optional) Install Mamba for faster package management

conda install mamba -n base -c conda-forge
Tip: Mamba is a drop-in replacement for Conda that uses a faster dependency solver written in C++. Once installed, you can use `mamba` instead of `conda` for installing packages:
mamba install numpy pandas

This significantly speeds up installations and environment solves, especially when working with large scientific packages.

Step 5: Install packages

conda install numpy pandas matplotlib

You can install packages one by one, or include them during environment creation:

conda create --prefix /mnt/lustre/<username>/myenv python=3.10 numpy pandas

Step 6: Remove unused environments

Old or unused environments can be removed to free up space:

conda remove --prefix /mnt/lustre/<username>/myenv --all

Best Practices

  • ✅ Use shared environments whenever possible for consistency and faster setup.
  • 📁 Create private environments only in Lustre directories, such as /mnt/lustre/<username>.
  • ⚠️ Do not use Conda in your $HOME directory; it may lead to quota issues or slow performance.
  • 📌 Use the --prefix flag to create environments with absolute paths, especially on clusters where --name may default to $HOME.
  • 🧼 Periodically clean up unused environments with `conda remove --all`.
  • 🔁 Reuse environment definitions by exporting and sharing them with others or for reproducibility (see the example below).
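
For example, an environment's definition can be exported to a YAML file and recreated elsewhere (a minimal sketch; the file name environment.yml and the target prefix are placeholders):

conda activate /mnt/lustre/<username>/myenv
conda env export > environment.yml
conda env create --prefix /mnt/lustre/<username>/myenv_copy --file environment.yml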

Troubleshooting

  • Conda not recognized? Make sure you loaded `conda_init` and ran `eval "$(conda shell.bash hook)"`.
  • 🚫 Permission denied? You might be trying to write to a restricted directory like /apps or $HOME.
  • 🔄 Environment behaving unexpectedly? Try deactivating (`conda deactivate`) and reactivating, or recreate the environment.
  • 🧪 Conflicts during install? Use `conda clean --all` to clear caches and retry with a minimal environment.

Singularity

Singularity is an open-source, cross-platform container platform specifically designed for scientific and high-performance computing environments. It prioritizes reproducibility, portability, and security, all essential for scientific workflows. Singularity enables users to package entire workflows, including software, libraries, and environment settings, into a single container image. This ensures consistent application execution across various systems without modification. This capability simplifies the migration of complex computational environments and supports reproducible research practices. A detailed Singularity user guide is available here.

Location of Bioinformatics Singularity Images at CHPC

Singularity images for commonly used bioinformatics tools are stored in the following directories:

/apps/chpc/bio  
/home/apps/chpc/bio

To view the available .sif images, run the following script:

#!/bin/bash
dirs=("/home/apps/chpc/bio" "/apps/chpc/bio")
 
# Loop through and search for .sif files only in immediate subdirectories
for dir in "${dirs[@]}"; do
    if [ -d "$dir" ]; then
        echo "Searching for .sif files under $dir (only first subfolder level):"
        find "$dir" -mindepth 2 -maxdepth 2 -type f -name "*.sif" -readable -exec ls -al {} \; 2>/dev/null
    else
        echo "Directory $dir does not exist."
    fi
done

Pulling a Singularity Image

Before pulling a new image, run the script above to check whether it is already available. Only pull an image yourself if you are confident in what you are doing and plan to remove it afterwards.

⚠️ Important: Singularity image files can be very large and may consume significant storage in your Lustre or project directory. Please remove any images you no longer need to help conserve shared storage resources.

To pull Singularity images from public container registries (like DockerHub), follow these steps:

SSH into the CHPC Globus node:

ssh username@globus.chpc.ac.za

Load the Singularity module:

module load chpc/singularity

Navigate to your desired working directory:

cd /path/to/working_directory

Pull the image from DockerHub (or another registry):

singularity pull docker://repository/image:tag

This downloads and converts the image into a local .sif file saved in your current directory.

Running Singularity

singularity exec /path/to/image.sif <command> <OPTIONS>
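
For instance, assuming you have already pulled the FastQC container referenced later on this page (docker://biocontainers/fastqc:v0.11.9_cv8), which singularity pull saves as fastqc_v0.11.9_cv8.sif by default, you could run:

singularity exec fastqc_v0.11.9_cv8.sif fastqc --version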

Running Singularity with External Databases

Most bioinformatics containers don’t include large reference datasets. Instead, bind external directories at runtime.

CHPC provides bioinformatics databases at:

/mnt/lustre/bsp/DB

These include reference genomes, annotation and index files used by BWA, BLAST, Kraken2, etc.

How to Bind a Database Directory

Use the --bind (or -B) option:

singularity exec --bind /mnt/lustre/bsp/DB:/databases /path/to/my_image.sif <your_command>

Explanation:

/mnt/lustre/bsp/DB:/databases → Host path mapped to container path

/path/to/my_image.sif → Singularity image path

<your_command> → The tool command (e.g., bwa index, blastn, etc.)

Inside the container, always refer to the database as /databases, unless the tool's manual specifies otherwise.

Binding Multiple Directories

Use a comma-separated list:

singularity exec --bind /mnt/lustre/bsp/DB:/databases,/mnt/lustre/username/data:/data /path/to/my_image.sif <your_command>

This binds:

- /mnt/lustre/bsp/DB to /databases
- /mnt/lustre/username/data to /data

Use those paths in your tools or pipelines.
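
As a concrete sketch (the image path, database subdirectory and file names here are placeholders; check /mnt/lustre/bsp/DB for the actual layout):

singularity exec --bind /mnt/lustre/bsp/DB:/databases,/mnt/lustre/username/data:/data \
    /path/to/blast_image.sif \
    blastn -db /databases/nt/nt -query /data/reads.fasta -out /data/reads_vs_nt.tsv -outfmt 6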

Best Practices

1. Use absolute paths for bindings.

2. Keep host/container paths logical (e.g., /databases, /data).

3. Clean up containers and intermediate data regularly.

To clean up:

rm /path/to/image.sif
rm -rf /path/to/working_directory/*.sif

Clean the Singularity cache periodically:

singularity cache clean
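
If you want to see what is in the cache and how much space it uses (assuming a Singularity 3.x client), you can run:

singularity cache list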

PBS Template for Running a Singularity Container

#!/bin/bash
#PBS -N singularity_job
#PBS -q normal
#PBS -P SHORTNAME
#PBS -l select=1:ncpus=24
#PBS -l walltime=12:00:00
#PBS -o singularity_output.log
#PBS -e singularity_error.log
#PBS -M your.email@domain.com
#PBS -m abe
 
# Load necessary modules
module load chpc/singularity
 
# Change to your working directory
cd $PBS_O_WORKDIR
 
# Run your command inside the container
singularity exec --bind /mnt/lustre/projects/<your_project>:/data my_image.sif <your_command_inside_container>

Nextflow

Nextflow is a free and open-source workflow management system that enables the development and execution of data analysis pipelines. It simplifies complex computational workflows and ensures reproducibility, scalability, and portability—whether you’re working on a laptop, HPC cluster, or in the cloud. Nextflow workflows are written using DSL2, allowing modular code design and seamless integration with container technologies like Docker, Singularity, Conda, or manual installations. Official documentation is available here.

Running Nextflow on the CHPC Cluster

CHPC supports Nextflow workflows through Singularity containers. Since compute nodes have no internet access, all dependencies must be downloaded in advance on the login node.

1. Connect to the Login Node

Log into the CHPC login node using your Lengau credentials:

 ssh username@scp.chpc.ac.za 

Use this session to prepare your workflow and submit jobs.

2. Load the Nextflow Module

Load the necessary environment modules:

module load chpc/BIOMODULES nextflow
module load chpc/singularity

Note: Modules must be reloaded in every new session unless added to your ~/.bashrc.

3. Pull Workflow and Dependencies

Pull your workflow and its dependencies on the login node:

Pull the workflow:

 nextflow pull nf-core/rnaseq 

Run a test execution:

 nextflow run nf-core/rnaseq -profile test 

This will:

Cache the workflow in ~/.nextflow/assets/

Download containers (if configured)

Retrieve auxiliary files and dependencies

Cached Files and Workflow Structure

Workflow code is stored in:

~/.nextflow/assets

Container images are stored in:

~/.singularity

🗂 Finding the nextflow.config File

After pulling a workflow, you’ll typically find the nextflow.config file in its root directory.

Example:

cd ~/.nextflow/assets/nf-core/rnaseq/
ls

Look for:

nextflow.config

If missing, config files may reside in the conf/ directory or be fetched remotely. You can always override settings by creating your own nextflow.config.

🚫 No Manual PBS Scripts Needed

Nextflow automatically generates and submits PBS scripts. You only define resources in nextflow.config.

⚙️ Configuration with nextflow.config

🔧 1. Global Resource Settings

Set default resource usage for all workflow processes:

 process {
    executor = 'pbs'
    queue    = 'smp'
    cpus     = 24
    memory   = '120 GB'
    time     = '24h'
}
 
🏷️ 2. Custom Resource Labels

Customize resources for specific process groups using labels:

process {
    executor = 'pbs'
 
    withLabel: big_job {
        cpus = 16
        memory = '64 GB'
        time = '12h'
        queue = 'smp'
    }
 
    withLabel: short_job {
        cpus = 1
        memory = '1 GB'
        time = '15m'
        queue = 'smp'
    }
}

Use the label in your pipeline:

process bigTask {
    label 'big_job'
    ...
}

📦 3. Singularity Integration

Enable Singularity support:

 singularity.enabled = true 
singularity.autoMounts = true 

Specify containers:

From Docker Hub:

process.container = 'docker://biocontainers/fastqc:v0.11.9_cv8'

From local image:

process.container = '/path/to/image.sif'

Set cache directory to avoid re-downloads:

 singularity.cacheDir = '/path/to/.singularity'

🖥️ 4. PBS Executor Settings

Customize PBS job submission:

executor {
  name = 'pbs'
  queueSize = 20
  submitOptions = '-V -m abe -M your@email.com'
}

📂 5. Using Profiles

Profiles let you switch configurations easily:

 profiles {
  standard {
    process.executor = 'pbs'
    process.queue = 'smp'
  }
 
  local {
    process.executor = 'local'
    docker.enabled = false
  }
 
  cluster_singularity {
    process.executor = 'pbs'
    singularity.enabled = true
    process.container = 'file:///path/to/container.sif'
  }
}
 

Run a profile:

 nextflow run main.nf -profile cluster_singularity

🚫 Offline Mode

Run jobs on compute nodes without internet access:

 nextflow run ~/.nextflow/assets/nf-core/rnaseq \
    -profile singularity -offline

⚠️ Always include -offline on compute nodes to prevent online fetching.

🧭 Debugging and Logs

Each Nextflow process generates a unique work directory (work/ab/xyz123), containing:

.command.run — generated PBS job script

.command.sh — wrapped shell script

.command.log — job output

.exitcode — exit status

To inspect a failed job:

 cd work/ab/xyz123/
less .command.log 
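
Once the problem is fixed, you can usually resume the pipeline from its cached results rather than rerunning everything, using Nextflow's standard -resume option (shown here with the same offline invocation as above):

 nextflow run ~/.nextflow/assets/nf-core/rnaseq -profile singularity -offline -resume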

✅ Summary

nextflow.config centralizes all pipeline settings.

No need to write PBS scripts manually.

Resources, container usage, and submission options are all configurable.

Profiles improve portability and reproducibility.

Offline mode is essential for CHPC compute node compatibility.

🧠 Tip: For workflows requiring reference data, bind directories just like with containers:

 nextflow run /path/to/my_pipeline -profile singularity -offline \
    --input /data/input.fastq \
    --genomeDir /mnt/lustre/bsp/DB/genomes

🧹 Clean Up: Nextflow stores all its cache files in your home directory, so it's important to clean up these files once you're finished using a workflow to avoid running out of space.

rm -rf ~/.nextflow/assets/ 
rm -rf ~/.nextflow/tmp 
rm -rf ~/.singularity

Need Help?

If you encounter issues or need a specific tool installed, please contact the CHPC support team.

Include your job script and all errors encountered.

Basic examples

Blast

Running Blast using GNU parallel

Several examples of running BLAST can be found on the fault-tolerant BLAST page.
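
As a minimal sketch of the general idea, assuming the query has already been split into chunk_*.fasta files and that a GNU parallel module is available through BIOMODULES (the module name, database path and file names below are placeholders):

#!/bin/bash
#PBS -l select=1:ncpus=24
#PBS -l walltime=04:00:00
#PBS -q normal
#PBS -P SHORTNAME

module add chpc/BIOMODULES
module add ncbi-blast/2.3.0/gcc
module add parallel            # placeholder: use the actual GNU parallel module name

cd /mnt/lustre/users/username/blast_proj

# Run 4 blastn processes at a time, each with 6 threads (4 x 6 = 24 cores)
ls chunk_*.fasta | parallel -j 4 \
    'blastn -num_threads 6 -db /mnt/lustre/bsp/DB/nt -query {} -out {.}.blastn.tsv -outfmt 6'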

Running Blast on sun cluster

Big thanks to Peter van Heusden for developing this script.

sun_blast.sh
#!/bin/bash
 
WORKDIR="/lustre/users/${USER}/blast_proj"
INPUT_FASTA=${WORKDIR}/data_set.fa.gz
BLAST_E_VAL="1e-3"
BLAST_DB="/mnt/lustre/bsp/NCBI/BLAST/nr"
THREADS=24
BLAST_HOURS=0
BLAST_MINUTES=30
ID_FMT="%01d"
SPLIT_PREFIX="sub_set"
MAIL_ADDRESS="youremail@somewhere.ac.za"
 
zcat ${INPUT_FASTA} | csplit -z -f ${WORKDIR}/${SPLIT_PREFIX} -b "${ID_FMT}.split.fasta" - '/^>/' '{*}'
 
NUM_PARTS=$(ls ${WORKDIR}/${SPLIT_PREFIX}*.split.fasta | wc -l)
START=0
END=$(expr $NUM_PARTS - 1)
 
TMPSCRIPT=thejob.sh
# note: make a distinction between variables set by the containing script (e.g. WORKDIR) and
# ones set in the script (e.g. INDEX). The ones set in the script need to be escaped out
cat >${TMPSCRIPT} << END
#!/bin/bash
#PBS -l select=1:ncpus=${THREADS}
#PBS -l place=excl:group=nodetype
#PBS -l walltime=${BLAST_HOURS}:${BLAST_MINUTES}:00
#PBS -q normal
#PBS -m ae
#PBS -M ${MAIL_ADDRESS}
 
. /etc/profile.d/modules.sh
module add chpc/BIOMODULES
module add ncbi-blast/2.6.0
 
INDEX="${WORKDIR}/${SPLIT_PREFIX}\${PBS_ARRAY_INDEX}"
INFILE="\${INDEX}.split.fasta"
OUTFILE="\${INDEX}.blastx.xml"
 
cd ${WORKDIR}
blastx -num_threads ${THREADS} -evalue ${BLAST_E_VAL} -db ${BLAST_DB} -outfmt 5 -query \${INFILE} -out \${OUTFILE}
END
 
BLAST_JOBID=$(qsub -N sunblast -J ${START}-${END} ${TMPSCRIPT} | cut -d. -f1)
echo "submitted: ${BLAST_JOBID}"
 
rm ${TMPSCRIPT}
 
cat >${TMPSCRIPT} << END
#!/bin/bash
#PBS -l select=1:ncpus=1
#PBS -l place=free
#PBS -l walltime=1:00:00
#PBS -q workq
#PBS -m ae
#PBS -M ${MAIL_ADDRESS}
#PBS -W depend=afterok:${BLAST_JOBID}
 
cd ${WORKDIR}
tar jcf blast-xml-output.tar.bz2 *.blastx.xml
END
 
qsub -N tarblast ${TMPSCRIPT}
 
rm ${TMPSCRIPT}

This script is designed to be run from the login node. It creates the job scripts themselves and submits them. There are a number of things to notice:

  1. The use of heredocs. These allow us to embed scripts that are to be run into another script. Here the text between "cat >${TMPSCRIPT} << END" and "END" is written into the file ${TMPSCRIPT}.
  2. The use of job arrays. These allow us to submit multiple independent jobs as sub-jobs of one larger script (see the minimal sketch after this list). The line "BLAST_JOBID=$(qsub -N sunblast -J ${START}-${END} ${TMPSCRIPT} | cut -d. -f1)" does several things:
    • It submits a job array with the -J option, which takes a STARTing and an ENDing number. The END value is in turn set by the line "NUM_PARTS=$(ls ${WORKDIR}/${SPLIT_PREFIX}*.split.fasta | wc -l)", which counts the number of sub-fasta files created by the csplit 7) command.
    • The "cut -d. -f1" grabs the job identifier that the scheduler returns when the job is submitted. This is assigned to the variable BLAST_JOBID.
    • Note that job arrays create the environment variable PBS_ARRAY_INDEX, which is used here to build both blast's input file name and its output file name.
    • Another important aspect of the job array is that the walltime parameter should be the longest time you would expect any single sub-job to run. In this case we have divided a fasta file into many smaller fasta files, one per sequence. If your original sequences have widely differing lengths it may pay to divide them differently, perhaps so that the sub-fastas end up with similar sizes.
  3. The use of job dependencies. We see this in the second heredoc in the line "#PBS -W depend=afterok:${BLAST_JOBID}". It tells the scheduler that the job may only run after the job with ID ${BLAST_JOBID} has finished successfully, i.e. this job will not run if there are problems with the first job.
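
As a minimal, standalone sketch of job arrays and dependencies (the script names, resources and project code below are placeholders):

array_demo.qsub
#!/bin/bash
#PBS -l select=1:ncpus=1
#PBS -l walltime=00:10:00
#PBS -q serial
#PBS -P SHORTNAME
# Each sub-job of the array sees its own value of PBS_ARRAY_INDEX
echo "Processing chunk number ${PBS_ARRAY_INDEX}"

Submitting it as an array of 10 sub-jobs, with a second job that only runs once the whole array has finished successfully:

ARRAY_ID=$(qsub -J 0-9 array_demo.qsub | cut -d. -f1)
qsub -W depend=afterok:${ARRAY_ID} tidy_up.qsub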

bowtie

Things to note about this script: bowtie does not run across multiple nodes, so requesting anything other than select=1 will result in compute resources being wasted 8).

Job script

Then your job script called bowtie_script.qsub will look something like this:

bowtie_script.qsub
#! /bin/bash
#PBS -l select=1:ncpus=24
#PBS -l place=excl
#PBS -l walltime=06:00:00
#PBS -q workq
#PBS -o /home/username/lustre/some_reads/stdout.txt
#PBS -e /home/username/lustre/some_reads/stderr.txt
#PBS -M youremail@address.com
#PBS -m be
#PBS -N bowtiejob
 
##################
MODULEPATH=/opt/gridware/bioinformatics/modules:$MODULEPATH
source /etc/profile.d/modules.sh
 
#######module add
module add bowtie2/2.2.2
 
NP=`cat ${PBS_NODEFILE} | wc -l`
 
EXE="bowtie2"
 
forward_reads="A_reads_1.fq,B_reads_1.fq"
reverse_reads="A_reads_2.fq,B_reads_2.fq"
output_file="piggy_hits.sam"
ARGS="-x sscrofa --threads ${NP} -q -1 ${forward_reads} -2 ${reverse_reads} -S ${output_file}"
 
${EXE} ${ARGS}

Note: username should contain your actual user name!

Submit your job

Finally submit your job using:

user@login01:~ $ qsub bowtie_script.qsub

NAMD2

If you would like to try running namd2 on the GPU please take a look at this.

The job script that follows is for running NAMD over InfiniBand. Note that this does not use MPI, so the script is somewhat different from other scripts you may see here.

Job script

namd.qsub
#!/bin/bash
#PBS -l select=10:ncpus=12:mpiprocs=12
#PBS -l place=excl
#PBS -l walltime=00:05:00
#PBS -q workq
#PBS -o /home/username/lustre/namd2/stdout.txt
#PBS -e /home/username/lustre/namd2/stderr.txt
#PBS -m ae
#PBS -M youremail@address.com
#PBS -N NAMD_bench
 
. /etc/profile.d/modules.sh
MODULEPATH=/opt/gridware/bioinformatics/modules:${MODULEPATH}
module add NAMD/2.10_ibverbs
 
cd /export/home/${USER}/scratch5/namd2
 
pbspro_namd apoa1.namd

Submit your job

Finally submit your job using:

user@login01:~ $ qsub namd.qsub

R/bioconductor

There is an example here of how one might use R on Lengau:

http://wiki.chpc.ac.za/howto:r#r_bioconductor

tophat

tuxedo

biopython

velvet

SOAP

Advanced examples

Databases

Databases are accessible on the cluster in the directory

/mnt/lustre/bsp/DB

Support

Please contact us to request software updates/installs; download big datasets; get advice on the best way to run your analysis; or to tell us what is/isn't working!

1)
Because they might just give you some additional hints ;-)
2)
especially the -P parameter
3)
Note that localuser@my_linux:~ $ is not part of the command
4)
Here is the getting started with putty guide
7)
csplit is a very useful tool – google it!
8)
Both because it will only run on a single node, and telling a process to use more threads than it has cores usually results in inefficiencies.