The old version of this page can be found here.
While this message is displayed, please consider this page a work in progress.
Welcome to the bioinformatics at the CHPC wiki page! This page describes the basic procedures for getting your programs running at the CHPC, rather than how to do any particular bioinformatics analysis. If anything is unclear, please hover your mouse over the superscripts! 1) For the most part we assume you have at least a little familiarity with Linux. Much of this information is available elsewhere in the CHPC's wiki (probably in more detail), but here we try to have everything accessible in one place for the bioinformatics community. Please do read the quick start guide before continuing, and pay special attention to the sections on queues and parameters2).
The Bioinformatics Service Platform (BSP) has its own domain and website at http://bsp.ac.za/. We also host Globus endpoints: CHPC-BSP at chpcbio#bio.chpc.ac.za for tests, and CHPC-Globus at chpcbio#globus.chpc.ac.za for transferring data to/from the cluster storage. Access them via http://globus.org/ using your cluster username and password.
Galaxy access has been provided in the past and may be made available again in the future. Additionally, the JMS system will hopefully be integrated with the Lengau cluster soon.
To transfer files inward using GridFTP, the http://globus.org/ system can be used; it is accessible via our endpoint named chpcbio#globus.chpc.ac.za. You should use the same credentials used to log in via ssh.
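For smaller transfers, ordinary ssh-based tools against the login node also work. A minimal sketch, assuming your data lives under /mnt/lustre/users/username (substitute your own username and paths):

localuser@my_linux:~ $ rsync -avP my_reads/ username@lengau.chpc.ac.za:/mnt/lustre/users/username/my_reads/
localuser@my_linux:~ $ scp username@lengau.chpc.ac.za:/mnt/lustre/users/username/my_data/stdout.txt .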
Before you can use the command line you need an account. To get one, you and your PI should both follow the instructions to apply for resources.
Once your registration has been approved, Linux and OS X users can simply open a terminal and connect to the server via ssh using a command of the form3):
localuser@my_linux:~ $ ssh username@lengau.chpc.ac.za
Last login: Mon Feb 29 14:05:35 2016 from 10.128.23.235
username@login1:~ $
where username is the username you are assigned upon registration. Windows users can download the PuTTY client 4).
Once connected, users can: use the modules system to get access to bioinformatics programs; create job scripts using editors such as vim5) or nano6); and finally submit and monitor their jobs.
For now, a quick and simple way of getting access to the bioinformatics software is to use the module command. Running:
username@login2:~ $ module avail
will present you with the various modules available on the system and you should see something like:
------------------------------------------------ /cm/local/modulefiles ------------------------------------------------
cluster-tools/7.1       freeipmi/1.4.8     mvapich2/mlnx/gcc/64/2.1     use.own
cluster-tools-dell/7.1  gcc/5.1.0          null                         version
cmd                     ipmitool/1.8.15    openldap
cmsh                    module-git         openmpi/mlnx/gcc/64/1.8.8
dot                     module-info        shared
----------------------------------------------- /cm/shared/modulefiles ------------------------------------------------
acml/gcc/64/5.3.1                             chpc/python/anaconda/2
acml/gcc/fma4/5.3.1                           chpc/python/anaconda/3
acml/gcc/mp/64/5.3.1                          chpc/qespresso/5.3.0/openmpi-1.8.8/gcc-5.1.0
acml/gcc/mp/fma4/5.3.1                        chpc/R/3.2.3-gcc5.1.0
acml/gcc-int64/64/5.3.1                       chpc/vasp/5.3/openmpi-1.8.8/gcc-5.1.0
acml/gcc-int64/fma4/5.3.1                     chpc/zlib/1.2.8/intel/16.0.1
acml/gcc-int64/mp/64/5.3.1                    cmgui/7.1
acml/gcc-int64/mp/fma4/5.3.1                  default-environment
acml/open64/64/5.3.1                          gdb/7.9
acml/open64/fma4/5.3.1                        hdf5/1.6.10
acml/open64/mp/64/5.3.1                       hdf5_18/1.8.14
acml/open64/mp/fma4/5.3.1                     hpl/2.1
acml/open64-int64/64/5.3.1                    hwloc/1.9.1
acml/open64-int64/fma4/5.3.1                  intel/compiler/64/15.0/2015.5.223
acml/open64-int64/mp/64/5.3.1                 intel-cluster-checker/2.2.2
acml/open64-int64/mp/fma4/5.3.1               intel-cluster-runtime/ia32/3.7
blas/gcc/64/3.5.0                             intel-cluster-runtime/intel64/3.7
blas/open64/64/3.5.0                          intel-cluster-runtime/mic/3.7
bonnie++/1.97.1                               intel-tbb-oss/ia32/43_20150424oss
chpc/amber/12/openmpi-1.8.8/gcc-5.1.0         intel-tbb-oss/intel64/43_20150424oss
chpc/amber/14/openmpi-1.8.8/gcc-5.1.0         iozone/3_430
chpc/BIOMODULES                               iperf/3.0.11
chpc/cp2k/2.6.2/openmpi-1.8.8/gcc-5.1.0       lapack/gcc/64/3.5.0
chpc/dlpoly/1.9/openmpi-1.8.8/gcc-5.1.0       lapack/open64/64/3.5.0
chpc/dlpoly/4.07/openmpi-1.8.8/gcc-5.1.0      mpich/ge/gcc/64/3.1.4
chpc/dlpoly/4.08/openmpi-1.8.8/gcc-5.1.0      mpich/ge/open64/64/3.1.4
chpc/gaussian09/D01                           mpiexec/0.84_432
chpc/gaussian09/E01                           mvapich/gcc/64/1.2rc1
chpc/gromacs/5.1.2/openmpi-1.8.8/gcc-5.1.0    mvapich/open64/64/1.2rc1
chpc/hdf5/1.8.16/intel/16.0.1                 netcdf/gcc/64/4.3.3.1
chpc/lammps/16Feb16/openmpi-1.8.8/gcc-5.1.0   netcdf/open64/64/4.3.3.1
chpc/namd/2.11/openmpi-1.8.8/gcc-5.1.0        netperf/2.6.0
chpc/netcdf/4.4.0-C/intel/16.0.1              open64/4.5.2.1
chpc/netcdf/4.4.3-F/intel/16.0.1              openblas/dynamic/0.2.14
chpc/openmpi/1.10.2/gcc-5.1.0                 openlava/3.0
chpc/openmpi/1.10.2/intel-16.0.1              openmpi/pgi/64/1.8.5
chpc/openmpi/1.8.8/gcc-5.1.0                  puppet/3.7.5
chpc/openmpi/1.8.8/intel-16.0.1               scalapack/gcc/64/1.8.0
chpc/parallel_studio_xe/16.0.1/2016.1.150     scalapack/open64/64/1.8.0
chpc/parallel_studio_xe/64/16.0.1/2016.1.150  sge/2011.11p1
chpc/python/2.7.11                            slurm/14.11.6
chpc/python/3.5.1                             torque/5.1.0
The bioinformatics modules would add a lot to that list, so they have a list of their own. Running
username@login2:~ $ module add chpc/BIOMODULES
followed by
username@login2:~ $ module avail
will result in the following being added to the above list (this list will expand considerably as more applications are added to the system):
----------------------------------------- /apps/chpc/scripts/modules/bio/app ------------------------------------------
anaconda/2   doxygen/1.8.11  java/1.8.0_73         ncbi-blast/2.3.0/intel  R/3.2.3-gcc5.1.0
anaconda/3   git/2.8.1       mpiblast/1.6.0        python/2.7.11           texlive/2015
cmake/3.5.1  htop/2.0.1      ncbi-blast/2.3.0/gcc  python/3.5.1
Now, to make use of BLAST for example, one can type:
username@login2:~ $ module add ncbi-blast/2.3.0/gcc
This sets the appropriate environment variables (usually this is as simple as adding a directory to the search path). Running:
username@login2:~ $ module list
will show which modules have been loaded. Whereas:
username@login2:~ $ module del modulename
will unload a module. And finally:
username@login2:~ $ module show modulename
will show what module modulename actually does.
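The output of module show is simply the modulefile listing; typically it just prepends one or more directories to variables such as PATH. The snippet below is illustrative only, and the modulefile contents and install paths shown are hypothetical (the real ones on the cluster will differ):

username@login2:~ $ module show ncbi-blast/2.3.0/gcc
-------------------------------------------------------------------
/apps/chpc/scripts/modules/bio/app/ncbi-blast/2.3.0/gcc:

module-whatis    NCBI BLAST+ 2.3.0 built with gcc
prepend-path     PATH /apps/chpc/bio/ncbi-blast/2.3.0/gcc/bin
-------------------------------------------------------------------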
Next one must create a job script such as the one below:
#!/bin/bash
#PBS -l select=1:ncpus=2
#PBS -l walltime=10:00:00
#PBS -q serial
#PBS -P SHORTNAME
#PBS -o /mnt/lustre/users/username/my_data/stdout.txt
#PBS -e /mnt/lustre/users/username/my_data/stderr.txt
#PBS -N TophatEcoli
#PBS -M myemailaddress@someplace.com
#PBS -m b

module add chpc/BIOMODULES
module add tophat/2.1.1

NP=`cat ${PBS_NODEFILE} | wc -l`
EXE="tophat"
ARGS="--num-threads ${NP} someindex reads1 reads2 -o output_dir"

cd /mnt/lustre/users/username/my_data
${EXE} ${ARGS}
Note that username should be replaced with your username and SHORTNAME with your research programme's code. More details on the job script file can be found in our PBS quickstart guide.
Finally submit your job using:
username@login2:~ $ qsub my_job.qsub
192757.sched01
username@login2:~ $
where 192757.sched01 is the jobID that is returned.
Jobs can then be monitored/controlled in several ways:
username@login2:~ $ qstat -u username

sched01:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
192759.sched01  username serial   TophatEcol    --    1  24    --  00:02 Q   --
username@login2:~ $
username@login2:~ $ qstat -f 192759.sched01
Job Id: 192759.sched01
    Job_Name = TophatEcoli
    Job_Owner = username@login2.cm.cluster
    resources_used.cpupercent = 0
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.ncpus = 96
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    job_state = R
    queue = serial
    server = sched01
    Checkpoint = u
    ctime = Mon Oct 10 06:57:13 2016
    Error_Path = login2.cm.cluster:/mnt/lustre/users/username/my_data/stderr.txt
    exec_host = cnode0962/0*24
    exec_vnode = (cnode0962:ncpus=24)
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Oct 10 06:57:15 2016
    Output_Path = login2.cm.cluster:/mnt/lustre/users/username/my_data/stdout.txt
    Priority = 0
    qtime = Mon Oct 10 06:57:13 2016
    Rerunable = True
    Resource_List.ncpus = 24
    Resource_List.nodect = 1
    Resource_List.place = free
    Resource_List.select = 1:ncpus=24
    Resource_List.walltime = 00:02:00
    stime = Mon Oct 10 06:57:15 2016
    session_id = 36609
    jobdir = /mnt/lustre/users/username
    substate = 42
    Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash,
        PBS_O_HOME=/home/dane,PBS_O_LOGNAME=username,
        PBS_O_WORKDIR=/mnt/lustre/users/username/my_data,PBS_O_LANG=en_ZA.UTF-8,
        PBS_O_PATH=/apps/chpc/bio/anaconda3/bin:/apps/chpc/bio/R/3.3.1/gcc-6.2.0/bin:
        /apps/chpc/bio/bzip2/1.0.6/bin:/apps/chpc/bio/curl/7.50.0/bin:
        /apps/chpc/bio/lib/png/1.6.21/bin:
        /apps/chpc/bio/openmpi/2.0.0/gcc-6.2.0_java-1.8.0_73/bin:...
    comment = Job run at Mon Oct 10 at 06:57 on (cnode0962:ncpus=24)+(cnode0966:ncpus=24)+
        (cnode0971:ncpus=24)+(cnode0983:ncpus=24)
    etime = Mon Oct 10 06:57:13 2016
    umask = 22
    run_count = 1
    eligible_time = 00:00:00
    Submit_arguments = my_job.qsub
    pset = rack=cx14
    project = SHORTNAME
username@login01:~ $
username@login01:~ $ qdel 192759.sched01
username@login01:~ $
Several examples of running BLAST can be found on this page about using fault-tolerant BLAST.
Big thanks to Peter van Heusden for developing this script.
#!/bin/bash
WORKDIR="/lustre/users/${USER}/blast_proj"
INPUT_FASTA=${WORKDIR}/data_set.fa.gz
BLAST_E_VAL="1e-3"
BLAST_DB="/mnt/lustre/bsp/NCBI/BLAST/nr"
THREADS=24
BLAST_HOURS=0
BLAST_MINUTES=30
ID_FMT="%01d"
SPLIT_PREFIX="sub_set"
MAIL_ADDRESS="youremail@somewhere.ac.za"

zcat ${INPUT_FASTA} | csplit -z -f ${WORKDIR}/${SPLIT_PREFIX} -b "${ID_FMT}.split.fasta" - '/^>/' '{*}'

NUM_PARTS=$(ls sub_set*.split.fasta | wc -l)
START=0
END=$(expr $NUM_PARTS - 1)

TMPSCRIPT=thejob.sh

# note: make a distinction between variables set by the containing script (e.g. WORKDIR) and
# ones set in the script (e.g. INDEX). The ones set in the script need to be escaped out
cat >${TMPSCRIPT} << END
#!/bin/bash
#PBS -l select=1:ncpus=${THREADS}
#PBS -l place=excl:group=nodetype
#PBS -l walltime=${BLAST_HOURS}:${BLAST_MINUTES}:00
#PBS -q normal
#PBS -m ae
#PBS -M ${MAIL_ADDRESS}
. /etc/profile.d/modules.sh
module add chpc/BIOMODULES
module add ncbi-blast/2.6.0
INDEX="${WORKDIR}/${SPLIT_PREFIX}\${PBS_ARRAY_INDEX}"
INFILE="\${INDEX}.split.fasta"
OUTFILE="\${INDEX}.blastx.xml"
cd ${WORKDIR}
blastx -num_threads 8 -evalue ${BLAST_E_VAL} -db ${BLAST_DB} -outfmt 5 -query \${INFILE} -out \${OUTFILE}
END

BLAST_JOBID=$(qsub -N sunblast -J ${START}-${END} ${TMPSCRIPT} | cut -d. -f1)
echo "submitted: ${BLAST_JOBID}"
rm ${TMPSCRIPT}

cat >${TMPSCRIPT} << END
#!/bin/bash
#PBS -l select=1:ncpus=1
#PBS -l place=free
#PBS -l walltime=1:00:00
#PBS -q workq
#PBS -m ae
#PBS -M ${MAIL_ADDRESS}
#PBS -W depend=afterok:${BLAST_JOBID}
cd ${WORKDIR}
tar jcf blast-xml-output.tar.bz2 *.blastx.xml
END

qsub -N tarblast ${TMPSCRIPT}
rm ${TMPSCRIPT}
This script is designed to be run from the login node: it creates the job scripts themselves and submits them. There are a number of things to notice: variables set by the containing script (e.g. WORKDIR) are expanded when the here-documents are written, while variables that must be evaluated inside the jobs themselves (e.g. INDEX and PBS_ARRAY_INDEX) are escaped with a backslash; and the second job uses -W depend=afterok so that it only runs once the BLAST array job has finished successfully.
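The escaping rule is the part that most often trips people up. Here is a minimal, stand-alone sketch (all names are made up for illustration) showing which variables are expanded while the job script is being written and which are left for PBS to expand at run time:

#!/bin/bash
# OUTER is expanded now, while demo_job.sh is being written.
OUTER="/lustre/users/${USER}/demo"

cat > demo_job.sh << END
#!/bin/bash
#PBS -l select=1:ncpus=1
#PBS -l walltime=0:05:00
# the line below was filled in by the generating script:
cd ${OUTER}
# \${PBS_ARRAY_INDEX} is escaped, so it survives into the job script
# and is only expanded by PBS when the array sub-job actually runs:
echo "processing chunk \${PBS_ARRAY_INDEX}" > chunk_\${PBS_ARRAY_INDEX}.log
END

cat demo_job.sh   # inspect the generated job script before submitting it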
If you would like to try running GROMACS on the GPU, please take a look at this.
The job script that follows is for running an MPI-compiled version of GROMACS 4.6.1 on the Nehalem nodes. There are many different versions of GROMACS; to see what's available, try:
user@login01:~ $ module avail
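Since module avail produces a long listing, it can help to filter it for the package you are after. On most Environment Modules installations the listing is written to standard error, hence the redirection in this sketch:

user@login01:~ $ module avail 2>&1 | grep -i gromacs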
The following example is for working with one of the “_nehalem” GROMACS modules – note that it is quite important to use the correct version, as the format of the input data changes between versions…
#!/bin/bash
#PBS -l select=10:ncpus=8:mpiprocs=8:jobtype=nehalem,place=excl
#PBS -l walltime=00:40:00
#PBS -q workq
#PBS -M user@someinstitution.ac.za
#PBS -m be
#PBS -V
#PBS -e /lustre/SCRATCH5/users/USERNAME/gromacs_data/std_err.txt
#PBS -o /lustre/SCRATCH5/users/USERNAME/gromacs_data/std_out.txt
#PBS -N GROMACS_JOB
#PBS -mb

MODULEPATH=/opt/gridware/bioinformatics/modules:$MODULEPATH
source /etc/profile.d/modules.sh
#######module add
module add gromacs/4.6.1_nehalem

OMP_NUM_THREADS=1
NP=`cat ${PBS_NODEFILE} | wc -l`
EXE="mdrun_mpi"
ARGS="-s XXX -deffnm YYYY"

cd /lustre/SCRATCH5/users/USERNAME/gromacs_data
mpirun -np ${NP} -machinefile ${PBS_NODEFILE} ${EXE} ${ARGS}
Finally submit your job using:
user@login01:~ $ qsub gromacs_nehalem.qsub
If you would like to try running namd2 on the GPU please take a look at this.
The job script that follows is for running NAMD over the InfiniBand network. Note that this does not use MPI, so the script is somewhat different from the other scripts you may see here.
#!/bin/bash
#PBS -l select=10:ncpus=12:mpiprocs=12
#PBS -l place=excl
#PBS -l walltime=00:05:00
#PBS -q workq
#PBS -o /export/home/username/scratch5/namd2/stdout.txt
#PBS -e /export/home/username/scratch5/namd2/stderr.txt
#PBS -m ae
#PBS -M youremail@address.com
#PBS -N NAMD_bench

. /etc/profile.d/modules.sh
MODULEPATH=/opt/gridware/bioinformatics/modules:${MODULEPATH}
module add NAMD/2.10_ibverbs

cd /export/home/${USER}/scratch5/namd2
pbspro_namd apoa1.namd
Finally submit your job using:
user@login01:~ $ qsub namd.qsub
Things to note about this script: Bowtie 2 currently does not run across multiple nodes, so using anything other than select=1 will result in compute resources being wasted9).
Your job script, called bowtie_script.qsub, will look something like this:
#! /bin/bash
#PBS -l select=1:ncpus=12
#PBS -l place=excl
#PBS -l walltime=06:00:00
#PBS -q workq
#PBS -o /lustre/SCRATCH5/users/username/some_reads/stdout.txt
#PBS -e /lustre/SCRATCH5/users/username/some_reads/stderr.txt
#PBS -M youremail@address.com
#PBS -m be
#PBS -N bowtiejob
##################
MODULEPATH=/opt/gridware/bioinformatics/modules:$MODULEPATH
source /etc/profile.d/modules.sh
#######module add
module add bowtie2/2.2.2

NP=`cat ${PBS_NODEFILE} | wc -l`
EXE="bowtie2"
forward_reads="A_reads_1.fq,B_reads_1.fq"
reverse_reads="A_reads_2.fq,B_reads_2.fq"
output_file="piggy_hits.sam"
ARGS="-x sscrofa --threads ${NP} -q -1 ${forward_reads} -2 ${reverse_reads} -S ${output_file}"

cd /lustre/SCRATCH5/users/username/some_reads
${EXE} ${ARGS}
Note: username should be replaced with your actual user name!
Finally submit your job using:
user@login01:~ $ qsub bowtie_script.qsub
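The script above assumes that a Bowtie 2 index with the basename sscrofa already exists in the working directory. If you still need to build one, a minimal sketch (the genome FASTA file name here is hypothetical) is to run bowtie2-build, either interactively for small genomes or from its own job script:

MODULEPATH=/opt/gridware/bioinformatics/modules:$MODULEPATH
source /etc/profile.d/modules.sh
module add bowtie2/2.2.2
cd /lustre/SCRATCH5/users/username/some_reads
# writes sscrofa.1.bt2, sscrofa.2.bt2, ... alongside the reads
bowtie2-build Sus_scrofa_genome.fa sscrofa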
There is an example here of how one might use R at the CHPC.
Databases are accessible on the cluster in the /lustre/SCRATCH5/groups/bioinfo/DBs directory. Alternatively, they are also mirrored on the bio machine.
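For NCBI BLAST+ searches it can be convenient to point the BLASTDB environment variable at that directory, so that -db can be given a bare database name instead of a full path. A sketch, assuming a protein database such as nr has been mirrored there (check the directory listing first; the query file name is illustrative):

module add chpc/BIOMODULES
module add ncbi-blast/2.3.0/gcc
export BLASTDB=/lustre/SCRATCH5/groups/bioinfo/DBs
# with BLASTDB set, -db nr resolves inside the shared databases directory
blastx -query my_transcripts.fa -db nr -evalue 1e-3 -outfmt 5 -out my_transcripts.blastx.xml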
Please contact us to: request software updates/installs; download big datasets; get advice on the best way to run your analysis; or to tell us what is/isn't working!