Bioinformatics at the CHPC

Welcome to the bioinformatics at the CHPC wiki page! This page describes the basic procedures involved in getting your programs running at the CHPC, rather than how to carry out any particular bioinformatics analysis. If anything is unclear, please hover your mouse over the superscripts!1) For the most part we assume you have at least a little familiarity with Linux. Much of this information is available elsewhere in the CHPC's wiki (often in more detail), but here we try to have everything accessible in one place for the bioinformatics community. Please do read the quick start guide before continuing, and pay special attention to the sections on queues and parameters2).

The CHPC has a Globus endpoint named CHPC-Lengau for transferring data to and from the cluster storage. Access it via http://globus.org/ using your cluster username and password.

Web Portal Access

Galaxy GUI access to the cluster has been provided in the past and may be made available in the future if there is enough demand for it from our users.

To transfer files to the cluster using GridFTP, the http://globus.org/ system can be used; it is accessible via our endpoint named CHPC-Lengau. Use the same credentials that you use to log in via ssh.

Command Line Access

Before you can use the command line you need an account. To get an account, you and your PI should follow the instructions to apply for resources.

Once your registration has been approved, Linux and OS X users can simply open a terminal and connect to the server via ssh using a command of the form3):

localuser@my_linux:~ $ ssh username@lengau.chpc.ac.za
Last login: Mon Feb 29 14:05:35 2016 from 10.128.23.235
username@login1:~ $

where username is the username you are assigned upon registration. Windows users can download the PuTTY client4).

Once connected users can: use the modules system to get access to bioinformatics programs; create job scripts using editors such as vim5) or nano6); and finally submit and monitor their jobs.

Using Modules

A quick and simple way to get access to the bioinformatics software is the module command. Running:

username@login2:~ $ module avail

will present you with the various modules available on the system and you should see something like:

------------------------------------------------ /cm/local/modulefiles ------------------------------------------------
cluster-tools/7.1         freeipmi/1.4.8            mvapich2/mlnx/gcc/64/2.1  use.own
cluster-tools-dell/7.1    gcc/5.1.0                 null                      version
cmd                       ipmitool/1.8.15           openldap
cmsh                      module-git                openmpi/mlnx/gcc/64/1.8.8
dot                       module-info               shared

----------------------------------------------- /cm/shared/modulefiles ------------------------------------------------
acml/gcc/64/5.3.1                            chpc/python/anaconda/2
acml/gcc/fma4/5.3.1                          chpc/python/anaconda/3
acml/gcc/mp/64/5.3.1                         chpc/qespresso/5.3.0/openmpi-1.8.8/gcc-5.1.0
acml/gcc/mp/fma4/5.3.1                       chpc/R/3.2.3-gcc5.1.0
acml/gcc-int64/64/5.3.1                      chpc/vasp/5.3/openmpi-1.8.8/gcc-5.1.0
acml/gcc-int64/fma4/5.3.1                    chpc/zlib/1.2.8/intel/16.0.1
acml/gcc-int64/mp/64/5.3.1                   cmgui/7.1
acml/gcc-int64/mp/fma4/5.3.1                 default-environment
acml/open64/64/5.3.1                         gdb/7.9
acml/open64/fma4/5.3.1                       hdf5/1.6.10
acml/open64/mp/64/5.3.1                      hdf5_18/1.8.14
acml/open64/mp/fma4/5.3.1                    hpl/2.1
acml/open64-int64/64/5.3.1                   hwloc/1.9.1
acml/open64-int64/fma4/5.3.1                 intel/compiler/64/15.0/2015.5.223
acml/open64-int64/mp/64/5.3.1                intel-cluster-checker/2.2.2
acml/open64-int64/mp/fma4/5.3.1              intel-cluster-runtime/ia32/3.7
blas/gcc/64/3.5.0                            intel-cluster-runtime/intel64/3.7
blas/open64/64/3.5.0                         intel-cluster-runtime/mic/3.7
bonnie++/1.97.1                              intel-tbb-oss/ia32/43_20150424oss
chpc/amber/12/openmpi-1.8.8/gcc-5.1.0        intel-tbb-oss/intel64/43_20150424oss
chpc/amber/14/openmpi-1.8.8/gcc-5.1.0        iozone/3_430
chpc/BIOMODULES                              iperf/3.0.11
chpc/cp2k/2.6.2/openmpi-1.8.8/gcc-5.1.0      lapack/gcc/64/3.5.0
...

The bioinformatics modules would add considerably to that list, so they are kept in a list of their own. Running

username@login2:~ $ module add chpc/BIOMODULES

followed by

username@login2:~ $ module avail

will result in the following being added to the above list (this list will grow as further applications are added to the system):

----------------------------------------- /apps/chpc/scripts/modules/bio/app ------------------------------------------
anaconda/2             doxygen/1.8.11         java/1.8.0_73          ncbi-blast/2.3.0/intel R/3.2.3-gcc5.1.0
anaconda/3             git/2.8.1              mpiblast/1.6.0         python/2.7.11          texlive/2015
cmake/3.5.1            htop/2.0.1             ncbi-blast/2.3.0/gcc   python/3.5.1

Now, to make use of BLAST, say, one can type:

username@login2:~ $ module add ncbi-blast/2.3.0/gcc

The appropriate environment variables are then set (usually this is as simple as adding a directory to the search path). Running:

username@login2:~ $ module list

will show which modules are currently loaded, while:

username@login2:~ $ module del modulename

will unload a module. And finally:

username@login2:~ $ module show modulename

will show what module modulename actually does.
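
As a quick sanity check after loading a module, you can confirm that the program is now on your search path. A minimal sketch, using the ncbi-blast module from above (the output of which is omitted here; it should print a path under the module's installation directory):

username@login2:~ $ module add chpc/BIOMODULES
username@login2:~ $ module add ncbi-blast/2.3.0/gcc
username@login2:~ $ which blastn
username@login2:~ $ module del ncbi-blast/2.3.0/gcc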

Create Job Scripts

Next one must create a job script such as the one below:

my_job.qsub
#!/bin/bash
#PBS -l select=1:ncpus=2
#PBS -l walltime=10:00:00
#PBS -q serial
#PBS -P SHORTNAME
#PBS -o /mnt/lustre/users/username/my_data/stdout.txt
#PBS -e /mnt/lustre/users/username/my_data/stderr.txt
#PBS -N TophatEcoli
#PBS -M myemailaddress@someplace.com
#PBS -m b
 
module add chpc/BIOMODULES
module add tophat/2.1.1
 
NP=`cat ${PBS_NODEFILE} | wc -l`
 
EXE="tophat"
ARGS="--num-threads ${NP} someindex reads1 reads2 -o output_dir"
 
cd /mnt/lustre/users/username/my_data
${EXE} ${ARGS}

Note that username should be your username and SHORTNAME should be your research programme's code. More details on the job script file can be found in our PBS quickstart guide.
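
Before submitting, it is also worth making sure that the directory named in the -o and -e lines (and used by the cd in the script) actually exists on the Lustre file system, since that is where the job will run and write its output. A minimal sketch, assuming the paths used in my_job.qsub above (someindex, reads1 and reads2 are the placeholder input names from the script):

username@login2:~ $ mkdir -p /mnt/lustre/users/username/my_data
username@login2:~ $ cp someindex* reads1 reads2 /mnt/lustre/users/username/my_data/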

Submit Job Script

Finally submit your job using:

username@login2:~ $ qsub my_job.qsub
 
192757.sched01
username@login2:~ $

where 192757.sched01 is the jobID that is returned.
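
If you submit from a wrapper script rather than by hand, it can be handy to capture the returned jobID in a shell variable so that it can be reused later for monitoring, deleting, or setting up job dependencies. A minimal sketch:

#!/bin/bash
# capture the jobID (e.g. 192757.sched01) printed by qsub
JOBID=$(qsub my_job.qsub)
echo "submitted ${JOBID}"
 
# the same ID can then be passed to qstat or qdel
qstat -f ${JOBID}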

Monitor jobs

Jobs can then be monitored/controlled in several ways:

qstat

check the status of pending and running jobs
username@login2:~ $ qstat -u username
 
sched01: 
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
192759.sched01  username serial   TophatEcol     --   1  24    --  00:02 Q   -- 
 
username@login2:~ $
check the status of a particular job
username@login2:~ $ qstat -f 192759.sched01
Job Id: 192759.sched01
    Job_Name = TophatEcoli
    Job_Owner = username@login2.cm.cluster
    resources_used.cpupercent = 0
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.ncpus = 96
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    job_state = R
    queue = serial
    server = sched01
    Checkpoint = u
    ctime = Mon Oct 10 06:57:13 2016
    Error_Path = login2.cm.cluster:/mnt/lustre/users/username/my_data/stderr.txt
    exec_host = cnode0962/0*24
    exec_vnode = (cnode0962:ncpus=24)
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Oct 10 06:57:15 2016
    Output_Path = login2.cm.cluster:/mnt/lustre/users/username/my_data/stdout.txt
    Priority = 0
    qtime = Mon Oct 10 06:57:13 2016
    Rerunable = True
    Resource_List.ncpus = 24
    Resource_List.nodect = 1
    Resource_List.place = free
    Resource_List.select = 1:ncpus=24
    Resource_List.walltime = 00:02:00
    stime = Mon Oct 10 06:57:15 2016
    session_id = 36609
    jobdir = /mnt/lustre/users/username
    substate = 42
    Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash,
        PBS_O_HOME=/home/dane,PBS_O_LOGNAME=username,PBS_O_WORKDIR=/mnt/lustre/users/username/my_data,
        PBS_O_LANG=en_ZA.UTF-8,
        PBS_O_PATH=/apps/chpc/bio/anaconda3/bin:/apps/chpc/bio/R/3.3.1/gcc-6.2
        .0/bin:/apps/chpc/bio/bzip2/1.0.6/bin:/apps/chpc/bio/curl/7.50.0/bin:/a
        pps/chpc/bio/lib/png/1.6.21/bin:/apps/chpc/bio/openmpi/2.0.0/gcc-6.2.0_
        java-1.8.0_73/bin:...
    comment = Job run at Mon Oct 10 at 06:57 on (cnode0962:ncpus=24)+(cnode0966
        :ncpus=24)+(cnode0971:ncpus=24)+(cnode0983:ncpus=24)
    etime = Mon Oct 10 06:57:13 2016
    umask = 22
    run_count = 1
    eligible_time = 00:00:00
    Submit_arguments = my_job.qsub
    pset = rack=cx14
    project = SHORTNAME
 
username@login01:~ $
qdel

cancel a job
username@login01:~ $ qdel 192759.sched01
username@login01:~ $

Basic examples

Blast

Running Blast using gnu parallel

Several examples of running BLAST can be found on this page about using fault-tolerant BLAST.

Running Blast on sun cluster

Big thanks to Peter van Heusden for developing this script.

sun_blast.sh
#!/bin/bash
 
WORKDIR="/lustre/users/${USER}/blast_proj"
INPUT_FASTA=${WORKDIR}/data_set.fa.gz
BLAST_E_VAL="1e-3"
BLAST_DB="/mnt/lustre/bsp/NCBI/BLAST/nr"
THREADS=24
BLAST_HOURS=0
BLAST_MINUTES=30
ID_FMT="%01d"
SPLIT_PREFIX="sub_set"
MAIL_ADDRESS="youremail@somewhere.ac.za"
 
zcat ${INPUT_FASTA} | csplit -z -f ${WORKDIR}/${SPLIT_PREFIX} -b "${ID_FMT}.split.fasta" - '/^>/' '{*}'
 
NUM_PARTS=$(ls ${WORKDIR}/${SPLIT_PREFIX}*.split.fasta | wc -l)
START=0
END=$(expr $NUM_PARTS - 1)
 
TMPSCRIPT=thejob.sh
# note: make a distinction between variables set by the containing script (e.g. WORKDIR) and
# ones set in the script (e.g. INDEX). The ones set in the script need to be escaped out
cat >${TMPSCRIPT} << END
#!/bin/bash
#PBS -l select=1:ncpus=${THREADS}
#PBS -l place=excl:group=nodetype
#PBS -l walltime=${BLAST_HOURS}:${BLAST_MINUTES}:00
#PBS -q normal
#PBS -m ae
#PBS -M ${MAIL_ADDRESS}
 
. /etc/profile.d/modules.sh
module add chpc/BIOMODULES
module add ncbi-blast/2.6.0
 
INDEX="${WORKDIR}/${SPLIT_PREFIX}\${PBS_ARRAY_INDEX}"
INFILE="\${INDEX}.split.fasta"
OUTFILE="\${INDEX}.blastx.xml"
 
cd ${WORKDIR}
blastx -num_threads ${THREADS} -evalue ${BLAST_E_VAL} -db ${BLAST_DB} -outfmt 5 -query \${INFILE} -out \${OUTFILE}
END
 
BLAST_JOBID=$(qsub -N sunblast -J ${START}-${END} ${TMPSCRIPT} | cut -d. -f1)
echo "submitted: ${BLAST_JOBID}"
 
rm ${TMPSCRIPT}
 
cat >${TMPSCRIPT} << END
#!/bin/bash
#PBS -l select=1:ncpus=1
#PBS -l place=free
#PBS -l walltime=1:00:00
#PBS -q workq
#PBS -m ae
#PBS -M ${MAIL_ADDRESS}
#PBS -W depend=afterok:${BLAST_JOBID}
 
cd ${WORKDIR}
tar jcf blast-xml-output.tar.bz2 *.blastx.xml
END
 
qsub -N tarblast ${TMPSCRIPT}
 
rm ${TMPSCRIPT}

This script is designed to be run from the login node: it creates the job scripts themselves and submits them. There are a number of things to notice:

  1. The use of heredocs. These allow us to embed the scripts that are to be run inside another script. Here the text between “cat >${TMPSCRIPT} << END” and the closing “END” is written to the file ${TMPSCRIPT}.
  2. The use of job arrays. These allow us to submit multiple independent jobs as sub-jobs of a single array job. The line “BLAST_JOBID=$(qsub -N sunblast -J ${START}-${END} ${TMPSCRIPT} | cut -d. -f1)” does several things:
    • It submits a job array with the -J option, which takes a STARTing number and an ENDing number. The END value in turn comes from the line “NUM_PARTS=$(ls ${WORKDIR}/${SPLIT_PREFIX}*.split.fasta | wc -l)”, which counts the number of sub-fasta files created with the csplit7) command.
    • The “cut -d. -f1” grabs the job identifier that the scheduler returns when the job is submitted. This is assigned to the variable BLAST_JOBID.
    • Note that job arrays set the environment variable PBS_ARRAY_INDEX, which is used here to build both the BLAST input file name and the BLAST output file name.
    • Another important aspect of the job array is that the walltime parameter must cover the longest time you expect any single sub-job to run. In this case we have divided a fasta file into many smaller fasta files, one per sequence. If your original sequences have widely differing lengths it may pay to divide them differently, perhaps so that the sub-fastas end up a similar size.
  3. The use of job dependencies. We see it in the second heredoc in the line “#PBS -W depend=afterok:${BLAST_JOBID}”. This tells the scheduler that the job may only run after the job with ID ${BLAST_JOBID} has finished successfully, i.e. this job will not run if there are problems with the first job. A stripped-down sketch of the array-plus-dependency pattern follows this list.
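
To see the job-array and dependency mechanics in isolation, here is a stripped-down sketch of the same pattern. The file names array_job.qsub and cleanup_job.qsub are hypothetical stand-ins for your own job scripts:

#!/bin/bash
# submit a 4-part job array (sub-jobs receive PBS_ARRAY_INDEX values 0 to 3)
# and keep only the numeric job identifier returned by the scheduler
ARRAY_ID=$(qsub -N demo_array -J 0-3 array_job.qsub | cut -d. -f1)
 
# submit a clean-up job that the scheduler will only start once
# every sub-job in the array has finished successfully
qsub -N demo_cleanup -W depend=afterok:${ARRAY_ID} cleanup_job.qsub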

bowtie

Things to note about this script: bowtie does not run across multiple nodes, so requesting anything other than select=1 will result in compute resources being wasted8).

Job script

Your job script, called bowtie_script.qsub, will look something like this:

bowtie_script.qsub
#! /bin/bash
#PBS -l select=1:ncpus=24
#PBS -l place=excl
#PBS -l walltime=06:00:00
#PBS -q workq
#PBS -o /home/username/lustre/some_reads/stdout.txt
#PBS -e /home/username/lustre/some_reads/stderr.txt
#PBS -M youremail@address.com
#PBS -m be
#PBS -N bowtiejob
 
##################
MODULEPATH=/opt/gridware/bioinformatics/modules:$MODULEPATH
source /etc/profile.d/modules.sh
 
#######module add
module add bowtie2/2.2.2
 
NP=`cat ${PBS_NODEFILE} | wc -l`
 
EXE="bowtie2"
 
# comma-separated lists of forward and reverse read files
forward_reads="A_reads_1.fq,B_reads_1.fq"
reverse_reads="A_reads_2.fq,B_reads_2.fq"
output_file="piggy_hits.sam"
ARGS="-x sscrofa --threads ${NP} -q -1 ${forward_reads} -2 ${reverse_reads} -S ${output_file}"
 
${EXE} ${ARGS}

Note: username should be replaced with your actual user name!

Submit your job

Finally submit your job using:

user@login01:~ $ qsub bowtie_script.qsub

NAMD2

If you would like to try running namd2 on the GPU please take a look at this.

The job script that follows is for running NAMD over InfiniBand. Note that it does not use MPI, so the script is somewhat different from the other scripts you may see here.

Job script

namd.qsub
#!/bin/bash
#PBS -l select=10:ncpus=12:mpiprocs=12
#PBS -l place=excl
#PBS -l walltime=00:05:00
#PBS -q workq
#PBS -o /home/username/lustre/namd2/stdout.txt
#PBS -e /home/username/lustre/namd2/stderr.txt
#PBS -m ae
#PBS -M youremail@address.com
#PBS -N NAMD_bench
 
. /etc/profile.d/modules.sh
MODULEPATH=/opt/gridware/bioinformatics/modules:${MODULEPATH}
module add NAMD/2.10_ibverbs
 
cd /export/home/${USER}/scratch5/namd2
 
pbspro_namd apoa1.namd

Submit your job

Finally submit your job using:

user@login01:~ $ qsub namd.qsub

R/bioconductor

There is an example here of how one might use R on Lengau:

http://wiki.chpc.ac.za/howto:r#r_bioconductor

tophat

tuxedo

biopython

velvet

SOAP

Advanced examples

Databases

Databases are accessible on the cluster in the

/mnt/lustre/bsp/DB

directory.
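
For example, the sun_blast.sh script above points its BLAST_DB variable at the NCBI nr database under /mnt/lustre/bsp. To see which databases are currently provided, list the directories from a login node:

username@login1:~ $ ls /mnt/lustre/bsp/DB
username@login1:~ $ ls /mnt/lustre/bsp/NCBI/BLAST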

Support

Please contact us to request software updates/installs; download big datasets; get advice on the best way to run your analysis; or to tell us what is/isn't working!

1)
Because they might just give you some additional hints ;-)
2)
especially the -P parameter
3)
Note that localuser@my_linux:~ $ is not part of the command
4)
Here is the getting started with PuTTY guide
7)
csplit is a very useful tool – google it!
8)
Both because it will only run on a single node, and because telling a process to use more threads than there are cores usually results in inefficiencies.