Welcome to the bioinformatics at the CHPC wiki page! This page describes the basic procedures involved in getting your programs running at the CHPC, rather than how to do any particular bioinformatics analysis. If anything is unclear, please hover your mouse over the superscripts! 1) For the most part we will assume you have at least a little familiarity with Linux. Much of this information is available elsewhere in the CHPC's wiki (probably in more detail), but here we are trying to have everything accessible in one place for the bioinformatics community. Please do read the quick start guide before continuing and pay special attention to the sections on queues and parameters2).
The CHPC has a Globus endpoint: Look for CHPC-Lengau to transfer data to/from the cluster storage. Access it via http://globus.org/ and your cluster username/password.
Galaxy GUI access to the cluster has been provided in the past and may be made available in the future if there is enough demand for it from our users.
To transfer files inward using gridftp, the http://globus.org/ system can be used; it is accessible via our endpoint named CHPC-Lengau. You should use the same credentials used to log in via ssh.
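For smaller transfers, plain scp or rsync over ssh also works. A minimal sketch; the target directory /mnt/lustre/users/username is simply the Lustre working directory used in the examples further down this page, so substitute your own paths:

# copy a single file to your Lustre directory on the cluster (paths are illustrative)
scp my_reads.fastq.gz username@lengau.chpc.ac.za:/mnt/lustre/users/username/my_data/
# rsync can resume interrupted transfers and only copies files that have changed
rsync -avP my_project/ username@lengau.chpc.ac.za:/mnt/lustre/users/username/my_project/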
Before you can use the command line you need an account. To get one, you and your PI should both follow the instructions to apply for resources.
Once your registration has been approved, Linux and OSX users can simply open a terminal and connect to the server via ssh using a command of the form3):
localuser@my_linux:~ $ ssh username@lengau.chpc.ac.za
Last login: Mon Feb 29 14:05:35 2016 from 10.128.23.235
username@login1:~ $
where username is the username you are assigned upon registration. Windows users can download the PuTTY client4).
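If you connect often, you may want to set up ssh keys so that you are not prompted for a password each time. A minimal sketch, run on your local machine and assuming the cluster permits public-key authentication:

ssh-keygen -t ed25519                    # generate a key pair; accept the defaults and choose a passphrase
ssh-copy-id username@lengau.chpc.ac.za   # append your public key to ~/.ssh/authorized_keys on the cluster
ssh username@lengau.chpc.ac.za           # subsequent logins authenticate with the key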
Once connected users can: use the modules system to get access to bioinformatics programs; create job scripts using editors such as vim5) or nano6); and finally submit and monitor their jobs.
For now, a quick and simple way of getting access to the bioinformatics software is the module system. Running:
username@login2:~ $ module avail
will present you with the various modules available on the system and you should see something like:
------------------------------------------------ /cm/local/modulefiles ------------------------------------------------
cluster-tools/7.1        freeipmi/1.4.8     mvapich2/mlnx/gcc/64/2.1    use.own
cluster-tools-dell/7.1   gcc/5.1.0          null                        version
cmd                      ipmitool/1.8.15    openldap
cmsh                     module-git         openmpi/mlnx/gcc/64/1.8.8
dot                      module-info        shared
----------------------------------------------- /cm/shared/modulefiles ------------------------------------------------
acml/gcc/64/5.3.1                          chpc/python/anaconda/2
acml/gcc/fma4/5.3.1                        chpc/python/anaconda/3
acml/gcc/mp/64/5.3.1                       chpc/qespresso/5.3.0/openmpi-1.8.8/gcc-5.1.0
acml/gcc/mp/fma4/5.3.1                     chpc/R/3.2.3-gcc5.1.0
acml/gcc-int64/64/5.3.1                    chpc/vasp/5.3/openmpi-1.8.8/gcc-5.1.0
acml/gcc-int64/fma4/5.3.1                  chpc/zlib/1.2.8/intel/16.0.1
acml/gcc-int64/mp/64/5.3.1                 cmgui/7.1
acml/gcc-int64/mp/fma4/5.3.1               default-environment
acml/open64/64/5.3.1                       gdb/7.9
acml/open64/fma4/5.3.1                     hdf5/1.6.10
acml/open64/mp/64/5.3.1                    hdf5_18/1.8.14
acml/open64/mp/fma4/5.3.1                  hpl/2.1
acml/open64-int64/64/5.3.1                 hwloc/1.9.1
acml/open64-int64/fma4/5.3.1               intel/compiler/64/15.0/2015.5.223
acml/open64-int64/mp/64/5.3.1              intel-cluster-checker/2.2.2
acml/open64-int64/mp/fma4/5.3.1            intel-cluster-runtime/ia32/3.7
blas/gcc/64/3.5.0                          intel-cluster-runtime/intel64/3.7
blas/open64/64/3.5.0                       intel-cluster-runtime/mic/3.7
bonnie++/1.97.1                            intel-tbb-oss/ia32/43_20150424oss
chpc/amber/12/openmpi-1.8.8/gcc-5.1.0      intel-tbb-oss/intel64/43_20150424oss
chpc/amber/14/openmpi-1.8.8/gcc-5.1.0      iozone/3_430
chpc/BIOMODULES                            iperf/3.0.11
chpc/cp2k/2.6.2/openmpi-1.8.8/gcc-5.1.0    lapack/gcc/64/3.5.0
...
The bioinformatics modules would add a great deal to that list, so they have a list of their own. Running
username@login2:~ $ module add chpc/BIOMODULES
followed by
username@login2:~ $ module avail
will result in the following being added to the above list (this list will grow as more applications are installed on the system):
----------------------------------------- /apps/chpc/scripts/modules/bio/app ------------------------------------------
anaconda/2    doxygen/1.8.11   java/1.8.0_73          ncbi-blast/2.3.0/intel   R/3.2.3-gcc5.1.0
anaconda/3    git/2.8.1        mpiblast/1.6.0         python/2.7.11            texlive/2015
cmake/3.5.1   htop/2.0.1       ncbi-blast/2.3.0/gcc   python/3.5.1
Now, to make use of BLAST, say, one can type:
username@login2:~ $ module add ncbi-blast/2.3.0/gcc
The appropriate environment variables are then set (usually this is as simple as adding a directory to the search path). Running:
username@login2:~ $ module list
will show which modules have been loaded, while:
username@login2:~ $ module del modulename
will unload a module. And finally:
username@login2:~ $ module show modulename
will show what module modulename actually does.
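A quick way to check that a loaded module has done its job is to confirm that its executables are now on your PATH. A small sketch using the BLAST module loaded above; the exact path and version printed will depend on the installation:

username@login2:~ $ which blastx        # should point into the directory the module added to PATH
username@login2:~ $ blastx -version     # confirm the expected BLAST+ version is picked up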
Next one must create a job script such as the one below:
#!/bin/bash
#PBS -l select=1:ncpus=2
#PBS -l walltime=10:00:00
#PBS -q serial
#PBS -P SHORTNAME
#PBS -o /mnt/lustre/users/username/my_data/stdout.txt
#PBS -e /mnt/lustre/users/username/my_data/stderr.txt
#PBS -N TophatEcoli
#PBS -M myemailaddress@someplace.com
#PBS -m b

module add chpc/BIOMODULES
module add tophat/2.1.1

NP=`cat ${PBS_NODEFILE} | wc -l`
EXE="tophat"
ARGS="--num-threads ${NP} someindex reads1 reads2 -o output_dir"

cd /mnt/lustre/users/username/my_data
${EXE} ${ARGS}
Note that username should be replaced with your own username and SHORTNAME with your research programme's code. More details on the job script file can be found in our PBS quickstart guide.
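If you would like to test commands by hand before committing them to a job script, PBS can also give you an interactive session on a compute node with qsub -I. A sketch using the same queue, project code and resource request as the script above; adjust to suit your own work:

username@login2:~ $ qsub -I -q serial -P SHORTNAME -l select=1:ncpus=2 -l walltime=1:00:00

Once the job starts you are placed on a compute node, where you can module add and run your commands directly; type exit to end the session.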
Finally submit your job using:
username@login2:~ $ qsub my_job.qsub
192757.sched01
username@login2:~ $
where 192757.sched01 is the jobID that is returned.
Jobs can then be monitored/controlled in several ways:
username@login2:~ $ qstat -u username

sched01:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
192759.sched01  username serial   TophatEcol    --    1  24    --  00:02 Q   --
username@login2:~ $
username@login2:~ $ qstat -f 192759.sched01
Job Id: 192759.sched01
    Job_Name = TophatEcoli
    Job_Owner = username@login2.cm.cluster
    resources_used.cpupercent = 0
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.ncpus = 96
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    job_state = R
    queue = serial
    server = sched01
    Checkpoint = u
    ctime = Mon Oct 10 06:57:13 2016
    Error_Path = login2.cm.cluster:/mnt/lustre/users/username/my_data/stderr.txt
    exec_host = cnode0962/0*24
    exec_vnode = (cnode0962:ncpus=24)
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Oct 10 06:57:15 2016
    Output_Path = login2.cm.cluster:/mnt/lustre/users/username/my_data/stdout.txt
    Priority = 0
    qtime = Mon Oct 10 06:57:13 2016
    Rerunable = True
    Resource_List.ncpus = 24
    Resource_List.nodect = 1
    Resource_List.place = free
    Resource_List.select = 1:ncpus=24
    Resource_List.walltime = 00:02:00
    stime = Mon Oct 10 06:57:15 2016
    session_id = 36609
    jobdir = /mnt/lustre/users/username
    substate = 42
    Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash,
        PBS_O_HOME=/home/dane,PBS_O_LOGNAME=username,
        PBS_O_WORKDIR=/mnt/lustre/users/username/my_data,PBS_O_LANG=en_ZA.UTF-8,
        PBS_O_PATH=/apps/chpc/bio/anaconda3/bin:/apps/chpc/bio/R/3.3.1/gcc-6.2
        .0/bin:/apps/chpc/bio/bzip2/1.0.6/bin:/apps/chpc/bio/curl/7.50.0/bin:/a
        pps/chpc/bio/lib/png/1.6.21/bin:/apps/chpc/bio/openmpi/2.0.0/gcc-6.2.0_
        java-1.8.0_73/bin:...
    comment = Job run at Mon Oct 10 at 06:57 on (cnode0962:ncpus=24)+(cnode0966
        :ncpus=24)+(cnode0971:ncpus=24)+(cnode0983:ncpus=24)
    etime = Mon Oct 10 06:57:13 2016
    umask = 22
    run_count = 1
    eligible_time = 00:00:00
    Submit_arguments = my_job.qsub
    pset = rack=cx14
    project = SHORTNAME
username@login01:~ $
username@login01:~ $ qdel 192759.sched01
username@login01:~ $
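If you would rather not keep retyping qstat, the standard watch utility will re-run it for you at a fixed interval (press Ctrl-C to stop):

username@login2:~ $ watch -n 60 qstat -u username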
Several examples of running BLAST can be found on this page about using fault-tolerant BLAST.
Big thanks to Peter van Heusden for developing this script.
#!/bin/bash
WORKDIR="/lustre/users/${USER}/blast_proj"
INPUT_FASTA=${WORKDIR}/data_set.fa.gz
BLAST_E_VAL="1e-3"
BLAST_DB="/mnt/lustre/bsp/NCBI/BLAST/nr"
THREADS=24
BLAST_HOURS=0
BLAST_MINUTES=30
ID_FMT="%01d"
SPLIT_PREFIX="sub_set"
MAIL_ADDRESS="youremail@somewhere.ac.za"

zcat ${INPUT_FASTA} | csplit -z -f ${WORKDIR}/${SPLIT_PREFIX} -b "${ID_FMT}.split.fasta" - '/^>/' '{*}'

NUM_PARTS=$(ls sub_set*.split.fasta | wc -l)
START=0
END=$(expr $NUM_PARTS - 1)

TMPSCRIPT=thejob.sh

# note: make a distinction between variables set by the containing script (e.g. WORKDIR) and
# ones set in the script (e.g. INDEX). The ones set in the script need to be escaped out
cat >${TMPSCRIPT} << END
#!/bin/bash
#PBS -l select=1:ncpus=${THREADS}
#PBS -l place=excl:group=nodetype
#PBS -l walltime=${BLAST_HOURS}:${BLAST_MINUTES}:00
#PBS -q normal
#PBS -m ae
#PBS -M ${MAIL_ADDRESS}

. /etc/profile.d/modules.sh
module add chpc/BIOMODULES
module add ncbi-blast/2.6.0

INDEX="${WORKDIR}/${SPLIT_PREFIX}\${PBS_ARRAY_INDEX}"
INFILE="\${INDEX}.split.fasta"
OUTFILE="\${INDEX}.blastx.xml"

cd ${WORKDIR}

blastx -num_threads 8 -evalue ${BLAST_E_VAL} -db ${BLAST_DB} -outfmt 5 -query \${INFILE} -out \${OUTFILE}
END

BLAST_JOBID=$(qsub -N sunblast -J ${START}-${END} ${TMPSCRIPT} | cut -d. -f1)
echo "submitted: ${BLAST_JOBID}"
rm ${TMPSCRIPT}

cat >${TMPSCRIPT} << END
#!/bin/bash
#PBS -l select=1:ncpus=1
#PBS -l place=free
#PBS -l walltime=1:00:00
#PBS -q workq
#PBS -m ae
#PBS -M ${MAIL_ADDRESS}
#PBS -W depend=afterok:${BLAST_JOBID}

cd ${WORKDIR}

tar jcf blast-xml-output.tar.bz2 *.blastx.xml
END

qsub -N tarblast ${TMPSCRIPT}
rm ${TMPSCRIPT}
This script is designed to be run from the login node – it creates the job scripts themselves and submits them. There are a number of things to notice:
Things to note about this script: bowtie currently does not run across multiple nodes, so using anything other than select=1 will result in compute resources being wasted8).
Then your job script called bowtie_script.qsub will look something like this:
#! /bin/bash
#PBS -l select=1:ncpus=24
#PBS -l place=excl
#PBS -l walltime=06:00:00
#PBS -q workq
#PBS -o /home/username/lustre/some_reads/stdout.txt
#PBS -e /home/username/lustre/some_reads/stderr.txt
#PBS -M youremail@address.com
#PBS -m be
#PBS -N bowtiejob
##################
MODULEPATH=/opt/gridware/bioinformatics/modules:$MODULEPATH
source /etc/profile.d/modules.sh
#######module add
module add bowtie2/2.2.2

NP=`cat ${PBS_NODEFILE} | wc -l`
EXE="bowtie2"
# paired-end reads: -1 takes the forward reads, -2 the matching reverse reads
forward_reads="A_reads_1.fq,B_reads_1.fq"
reverse_reads="A_reads_2.fq,B_reads_2.fq"
output_file="piggy_hits.sam"
# bowtie2 takes the index via -x and writes SAM output via -S
ARGS="-x sscrofa --threads ${NP} -q -1 ${forward_reads} -2 ${reverse_reads} -S ${output_file}"

${EXE} ${ARGS}
Note: replace username with your actual user name!
Finally submit your job using:
user@login01:~ $ qsub bowtie_script.qsub
If you would like to try running namd2 on the GPU please take a look at this.
The job script that follows is for running NAMD over InfiniBand. Note that this does not use MPI, so the script is somewhat different from other scripts you may see here.
#!/bin/bash
#PBS -l select=10:ncpus=12:mpiprocs=12
#PBS -l place=excl
#PBS -l walltime=00:05:00
#PBS -q workq
#PBS -o /home/username/lustre/namd2/stdout.txt
#PBS -e /home/username/lustre/namd2/stderr.txt
#PBS -m ae
#PBS -M youremail@address.com
#PBS -N NAMD_bench

. /etc/profile.d/modules.sh
MODULEPATH=/opt/gridware/bioinformatics/modules:${MODULEPATH}
module add NAMD/2.10_ibverbs

cd /export/home/${USER}/scratch5/namd2
pbspro_namd apoa1.namd
Finally submit your job using:
user@login01:~ $ qsub namd.qsub
There is an example here of how one might use R on Lengau.
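In the meantime, a minimal R batch job in the same style as the job scripts above might look something like this. A sketch only: the module name R/3.2.3-gcc5.1.0 is taken from the module listing earlier on this page, while my_analysis.R, the project code and the working directory are placeholders to replace with your own:

#!/bin/bash
#PBS -l select=1:ncpus=1
#PBS -l walltime=02:00:00
#PBS -q serial
#PBS -P SHORTNAME
#PBS -N R_example
#PBS -o /mnt/lustre/users/username/r_work/stdout.txt
#PBS -e /mnt/lustre/users/username/r_work/stderr.txt

module add chpc/BIOMODULES
module add R/3.2.3-gcc5.1.0

cd /mnt/lustre/users/username/r_work
Rscript my_analysis.R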
Databases are accessible on the cluster in the /mnt/lustre/bsp/DB directory.
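For example, the job-array script above points blastx at the nr database under this tree. You can browse what is available before hard-coding a path in your own script:

username@login2:~ $ ls /mnt/lustre/bsp/DB              # see which databases are provided
username@login2:~ $ ls /mnt/lustre/bsp/NCBI/BLAST/     # e.g. the BLAST databases used in the script above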
Please contact us to request software updates or installs, to have large datasets downloaded, to get advice on the best way to run your analysis, or to tell us what is or isn't working!