 +====== Bioinformatics at the CHPC ======
  
Welcome to the bioinformatics at the CHPC wiki page! This page describes the basic procedures involved in getting your programs running at the CHPC, rather than how to do any particular bioinformatics analysis. If anything is unclear, please hover your mouse over the superscripts! ((Because they might just give you some hints ;-))) For the most part we assume you have at least a little familiarity with Linux. Much of this information is available elsewhere in the CHPC's wiki (probably in more detail), but here we try to keep everything the bioinformatics community needs in one place.
 +
The Bioinformatics Service Platform (BSP) has recently obtained its own domain and website at http://bsp.ac.za/. We also host Globus endpoints at chpcbio#bio.chpc.ac.za and chpcbio#globus.chpc.ac.za.
 +
 +===== Web Portal Access =====
 +
Web-based access to the CHPC cluster is available via a //Galaxy//(([[http://galaxyproject.org/]])) web interface at [[http://galaxy.chpc.ac.za/]]. Another workflow-based system, //chipster//(([[http://chipster.csc.fi/|chipster website]])), runs on a dedicated VM rather than on the cluster and is available at [[http://chipster.chpc.ac.za/]].
 +
To transfer files to the CHPC using GridFTP, the [[http://globus.org/]] system can be used; it is accessible via our endpoint named ''chpcbio#globus.chpc.ac.za''. Use the same credentials that you use to log in via ssh.
 +
 +
 +===== Command Line Access =====
 +
Various open-source packages have been pre-installed at the CHPC. For the moment they are on the SUN cluster, but where appropriate they will be ported to the other architectures. First one must [[http://www.chpc.ac.za/index.php/contact-us/apply-for-resources-form|apply for resources]] to gain access to the cluster. Once your registration has been approved, Linux and OSX users can simply open a terminal and connect to the server via ssh using a command of the form((Note that //**localuser@my_linux:~ $**// is not part of the command)):
 +<code bash>​localuser@my_linux:​~ $ ssh username@sun.chpc.ac.za
 +Last login: Tue Jan 28 14:05:35 2014 from 10.128.23.235
 +username@login01:​~ $</​code>​
where //username// is the username you are assigned upon registration. Windows users can [[http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html|download the PuTTY client]]((Here is the [[http://the.earth.li/~sgtatham/putty/0.63/htmldoc/Chapter2.html#gs|getting started with PuTTY]] guide)). Once connected, users can: [[howto:bioinformatics#Using Modules|use the modules system]] to get access to bioinformatics programs; [[howto:bioinformatics#Create Job Scripts|create job scripts]] using editors such as //vim//(([[http://vimdoc.sourceforge.net/htmldoc/usr_02.html|vim user guide]])) or //nano//(([[http://www.nano-editor.org/dist/v2.0/nano.html|nano user guide]])); and finally [[howto:bioinformatics#Submit Job Script|submit]] and [[howto:bioinformatics#Monitor jobs|monitor]] their jobs.
 +
 +==== Using Modules ====
 +
For now, a quick and simple way of getting access to the bioinformatics software is to use the //module// command. First of all, one should ensure that:
 +<code bash>​export MODULEPATH=/​opt/​gridware/​bioinformatics/​modules:​$MODULEPATH</​code>​
exists in your **//~/.profile//** file((it is not included by default)). Then running:<code bash>username@login01:~ $ module avail</code> will present you with the various modules available on the system, and you should see something like:
 +<​file>​
 +------------------------ /​opt/​gridware/​bioinformatics/​modules -------------------------
 +R/​2.15.2 ​                   bowtie/​0.12.9 ​              ​gcc_4.7.2_libs
 +R/​2.15.3 ​                   bowtie/​1.0.0 ​               lam/7.1.4
 +R/​3.0.0 ​                    ​bowtie2/​2.1.0 ​              ​latex/​texlive_2012
 +R/​default ​                  ​clustal/​clustal-omega-1.1.0 mpich2/1.5
 +beagle/​beagle_lib-r1090 ​    ​clustal/​clustalw-2.1 ​       perl/5.16.3
 +beagle/​default ​             clustal/​clustalw-MPI-1.82 ​  ​python/​2.7.3
 +beast/​beast-1.7.2 ​          ​cufflinks/​2.0.2 ​            ​samtools/​0.1.18
 +beast/​default ​              ​cufflinks/​2.1.1 ​            ​tophat/​2.0.8b
 +bioperl/​1.6.1 ​              ​emboss/​6.5.7 ​               velvet/​1.2.08
 +
 +------------------- /​opt/​gridware/​modules-3.2.7/​Modules/​3.2.7/​CHPC --------------------
 +amber/​12(default) ​              ​intel2012
 +clustertools ​                   inteltools
 +dell/​default-environment ​       mvapich2/​1.8-gnu
 +dell/​moab ​                      ​mvapich2/​1.8-r5668
 +dell/​openmpi/​intel/​1.4.4 ​       netcdf/​gnu-4.1.2
 +dell/​torque/​2.5.12 ​             netcdf/​intel-4.1.2
 +dlpoly/​2.20-impi ​               openfst/​1.3.3-gnu
 +dlpoly/​2.20-steve-impi ​         openfst/​1.3.3-intel
 +dlpoly/​3.07-impi ​               openmpi/​openmpi-1.6.1-gnu
 +dlpoly/​3.09-iompi ​              ​openmpi/​openmpi-1.6.1-gnu.bak
 +espresso/​3.1.2 ​                 openmpi/​openmpi-1.6.1-intel
 +fftw/​3.3.2-intel ​               openmpi/​openmpi-1.6.1-intel.bak
 +g09                             ​sapt/​2008
 +gcc/​4.6.3 ​                      ​sunstudio
 +gcc/​4.7.2 ​                      tau
 +intel                           ​zlib/​1.2.7
 +</​file>​
 +
Now, to make use of tophat, say, one can type:<code bash>username@login01:~ $ module add tophat/2.0.8b</code>The appropriate environment variables are then set (usually this is as simple as adding a directory to the search path). Notice that there are often several versions of a package available, e.g. **//R//** versions **//2.15.2//**, **//2.15.3//** and **//3.0.0//**. The module system allows you to choose which specific version you'd like to use by running a command such as **''module add R/2.15.3''**. **Note:** in general it is better to be specific about which version you'd like rather than assuming the system will pick the right one. Running:<code bash>username@login01:~ $ module list</code> will show which modules have been loaded, whereas:<code bash>username@login01:~ $ module del modulename</code> will unload a module. And finally:<code bash>username@login01:~ $ module show modulename</code> will show what module //modulename// actually does.
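
For example, to load a specific version of R and check what is currently loaded (a minimal illustration -- the exact output of ''module list'' will depend on what else you have loaded):
<code bash>
username@login01:~ $ module add R/2.15.3      # load a specific version rather than R/default
username@login01:~ $ module list              # confirm what is currently loaded
username@login01:~ $ module show R/2.15.3     # inspect what the modulefile changes
username@login01:~ $ module del R/2.15.3      # unload it again when done
</code>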
 +
 +==== Create Job Scripts ====
 +
 +Next one must create a job script such as the one below:
 +<file bash my_job.qsub>​
 +#!/bin/bash
#PBS -l select=1:ncpus=12:mpiprocs=12:jobtype=dell,place=excl
 +#PBS -l walltime=10:​00:​00
 +#PBS -q workq
 +#PBS -V
 +#PBS -o /​export/​home/​username/​scratch5/​NGS_data/​stdout.txt
 +#PBS -e /​export/​home/​username/​scratch5/​NGS_data/​stderr.txt
 +#PBS -N TophatEcoli
 +#PBS -M myemailaddress@someplace.com
 +#PBS -m b
 +
 +source /​etc/​profile.d/​modules.sh
 +MODULEPATH=/​opt/​gridware/​bioinformatics/​modules:​${MODULEPATH}
 +module add tophat/​2.0.9
 +
 +NP=`cat ${PBS_NODEFILE} | wc -l`
 +
 +EXE="​tophat"​
 +ARGS="​--num-threads ${NP} someindex reads1 reads2 -o output_dir"​
 +
 +cd /​export/​home/​username/​scratch5/​NGS_data/​
 +${EXE} ${ARGS}
 +</​file>​
 +
Note that //username// in the paths above should be replaced with your own username. More details on the job script file can be found [[quick:pbspro|in our PBS quickstart guide]].
 +
 +==== Submit Job Script ====
 +
 +Finally submit your job using:
 +<code bash>​username@login01:​~ $ qsub my_job.qsub
 +
 +13614.chpcmoab0
 +username@login01:​~ $</​code>​
where //13614.chpcmoab0// is the //jobID// that is returned.
 +
 +==== Monitor jobs ====
 +
 +Jobs can then be monitored/​controlled in several ways:
 +=== qstat ===
 +
 +== check status of pending and running jobs ==
 +<code bash>
 +username@login01:​~ $ qstat -u username
 +
 +chpcmoab01:
 +                                                            Req'​d ​ Req'​d ​  Elap
 +Job ID          Username Queue    Jobname ​   SessID NDS TSK Memory Time  S Time
 +--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
 +13614.chpcmoab0 username workq    TophatEcol 17546   ​1 ​  ​12 ​  ​-- ​ 01:00 R 00:00
 +username@login01:​~ $
 +</​code>​
 +
 +== check status of particular job ==
 +<code bash>
 +username@login01:​~ $ qstat -f 13614.chpcmoab01
 +Job Id: 13614.chpcmoab01
 +    Job_Name = TophatEcoli
 +    Job_Owner = username@login01
 +    resources_used.cpupercent = 0
 +    resources_used.cput = 00:00:00
 +    resources_used.mem = 16796kb
 +    resources_used.ncpus = 12
 +    resources_used.vmem = 166064kb
 +    resources_used.walltime = 00:02:32
 +    job_state = R
 +    queue = workq
 +    server = chpcmoab01
 +    Checkpoint = u
 +    ctime = Tue Jan 28 13:15:41 2014
 +    Error_Path = /​export/​home/​username/​scratch5/​NGS_data/​stderr.txt
 +    exec_host = cnode-9-34/​2*12
 +    exec_vnode = (cnode-9-34:​ncpus=12)
 +    Hold_Types = n
 +    interactive = True
 +    Join_Path = n
 +    Keep_Files = n
 +    Mail_Points = a
 +    mtime = Tue Jan 28 13:15:42 2014
 +    Output_Path = /​export/​home/​username/​scratch5/​NGS_data/​stdout.txt
 +    Priority = 0
 +    qtime = Tue Jan 28 13:15:41 2014
 +    Rerunable = False
 +    Resource_List.ncpus = 12
 +    Resource_List.nodect = 1
 +    Resource_List.place = free
 +    Resource_List.select = 1:​ncpus=12:​jobtype=dell
 +    Resource_List.walltime = 20:00:00
 +    stime = Tue Jan 28 13:15:42 2014
 +    session_id = 16154
 +    jobdir = /​export/​home/​username/​scratch5/​NGS_data
 +    substate = 42
 +    Variable_List = PBS_O_SYSTEM=Linux,​PBS_O_SHELL=/​bin/​bash,​
 +        PBS_O_HOME=/​export/​home/​user,​PBS_O_LOGNAME=username,​
 +        PBS_O_WORKDIR=/​export/​home/​username/​scratch5/​NGS_data,​
 +        PBS_O_LANG=en_US.UTF-8,​
 +        PBS_O_PATH=/​opt/​gridware/​bioinformatics/​emacs/​emacs-24.3/​bin:/​export/​h
 +        ome/​username/​local/​bin:/​usr/​lib64/​qt-3.3/​bin:/​opt/​pbs/​default/​sbin/:/​op
 +        t/​pbs/​default/​bin/:/​usr/​kerberos/​bin:/​usr/​local/​bin:/​bin:/​usr/​bin,​
 +        PBS_O_MAIL=/​var/​spool/​mail/​username,​PBS_O_QUEUE=workq,​
 +        PBS_O_HOST=login01
 +    comment = Job run at Tue Jan 28 at 13:15 on (cnode-9-34:​ncpus=12)
 +    etime = Tue Jan 28 13:15:41 2014
 +    Submit_arguments = -I -l select=1:​ncpus=12:​mpiprocs=12:​jobtype=dell,​
 +        place=free -N TophatEcoli -l walltime=20:​00:​00
 +    project = _pbs_project_default
 +
 +username@login01:​~ $
 +</​code>​
== cancel a job (qdel) ==
 +<code bash>
 +username@login01:​~ $ qdel 13614.chpcmoab01
 +username@login01:​~ $
 +</​code>​
 +===== Basic examples =====
 +
 +==== Blast ====
 +
 +=== Running Blast on the M9000 ===
 +
One thing to note is that the main Lustre scratch is not available on the M9000, so jobs must be run in your home directory (or a sub-directory of it), unless you ask the helpdesk to create a scratch directory for you on the M9000 itself.
 +
 +== Job script ==
 +Your job script will look something like this((Note you can click on the tab //​my_job.qsub//​ to download this if you wish to use it as a template. Or you can just copy and paste...)): ​
 +<file bash my_job.qsub>​
 +#! /bin/bash
 +#PBS -l select=1:​ncpus=128:​mpiprocs=128:​jobtype=spark
 +#PBS -l place=free
 +#PBS -l walltime=06:​00:​00
 +#PBS -q spark
 +#PBS -o /​export/​home/​username/​blastjob/​stdout.txt
 +#PBS -e /​export/​home/​username/​blastjob/​stderr.txt
 +#PBS -M youremail@address.com
 +#PBS -m be
 +#PBS -N m9000_blast
 +
 +# NOTE: The M9000 has its own scratch space separate from main Lustre storage ​
 +# So run in your home, or a subdir of home, or request via helpdesk that a
 +# scratch directory be created for you on the m9000, eg. in ''/​scratch/​work/​username''​
 +
 +cd /​export/​home/​username/​blastjob
 +NP=`cat $PBS_NODEFILE | wc -l`
 +
 +EXE="/​opt/​gridware/​bioinformatics/​m9000/​ncbi-blast-2.2.24/​bin/​blastx"​
# use blast's -out option rather than shell redirection: a ">" inside a variable is passed
# to blastx as a literal argument, not interpreted by the shell
ARGS="-db /scratch/work/bioinfo/BLASTDB/nr -query my_seqs.fasta -evalue 0.001 -num_alignments 20 -outfmt 5 -num_threads ${NP} -out my_results.xml"
 +
 +$EXE $ARGS
 +</​file>​
 +
Of course one should set the parameters as required (setting a small e-value is recommended, as is limiting the number of alignments). Blast2GO users should remember to set //-outfmt// to //5// for XML output. Note that one should also select the correct **//EXE//**cutable and **//-db//**: //blastx//, //blastn// and //blastp// are available for the former, while //nr// and //nt// are available for the latter.
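
For example, a nucleotide query against the //nt// database would swap in //blastn// and //nt//. A minimal sketch, assuming //nt// sits alongside //nr// in the same BLASTDB directory used above (do check the actual paths on the system first):
<code bash>
EXE="/opt/gridware/bioinformatics/m9000/ncbi-blast-2.2.24/bin/blastn"
ARGS="-db /scratch/work/bioinfo/BLASTDB/nt -query my_seqs.fasta -evalue 0.001 -num_alignments 20 -outfmt 5 -num_threads ${NP} -out my_results.xml"

$EXE $ARGS
</code>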
 +== Submit your job ==
 +Finally submit your job using:<​code bash>​user@login01:​~ $ qsub my_job.qsub</​code>​
 +
 +=== Running Blast on sun cluster ===
 +
 +Big thanks to Peter van Heusden for developing this script.
 +
 +<file bash sun_blast.sh>​
#!/bin/bash

WORKDIR="/export/home/${USER}/scratch5/blast_proj"
INPUT_FASTA=${WORKDIR}/data_set.fa.gz
BLAST_E_VAL="1e-3"
BLAST_DB="/lustre/SCRATCH5/groups/bioinfo/DBs/BLAST/nr"
JOBTYPE=nehalem
THREADS=8
BLAST_HOURS=0
BLAST_MINUTES=30
ID_FMT="%01d"
SPLIT_PREFIX="sub_set"
MAIL_ADDRESS="youremail@somewhere.ac.za"

# split the (gzipped) input into one fasta file per sequence
zcat ${INPUT_FASTA} | csplit -z -f ${WORKDIR}/${SPLIT_PREFIX} -b "${ID_FMT}.split.fasta" - '/^>/' '{*}'

# count how many split files were produced (assumes this script is run from inside ${WORKDIR})
NUM_PARTS=$(ls sub_set*.split.fasta | wc -l)
START=0
END=$(expr $NUM_PARTS - 1)

TMPSCRIPT=thejob.sh
# note: make a distinction between variables set by the containing script (e.g. WORKDIR) and
# ones set in the job script itself (e.g. INDEX). The ones set in the job script need to be escaped out
cat >${TMPSCRIPT} << END
#!/bin/bash

#PBS -l select=1:ncpus=${THREADS}:jobtype=${JOBTYPE}
#PBS -l place=excl:group=nodetype
#PBS -l walltime=${BLAST_HOURS}:${BLAST_MINUTES}:00
#PBS -q workq
#PBS -m ae
#PBS -M ${MAIL_ADDRESS}

. /etc/profile.d/modules.sh
module add blast/2.2.29+

INDEX="${WORKDIR}/${SPLIT_PREFIX}\${PBS_ARRAY_INDEX}"
INFILE="\${INDEX}.split.fasta"
OUTFILE="\${INDEX}.blastx.xml"

cd ${WORKDIR}
blastx -num_threads ${THREADS} -evalue ${BLAST_E_VAL} -db ${BLAST_DB} -outfmt 5 -query \${INFILE} -out \${OUTFILE}
END

# submit the array job (one sub-job per sub-fasta file) and keep its job id
BLAST_JOBID=$(qsub -N sunblast -J ${START}-${END} ${TMPSCRIPT} | cut -d. -f1)
echo "submitted: ${BLAST_JOBID}"

rm ${TMPSCRIPT}

# second job: bundle up the XML output, but only after the whole array has finished successfully
cat >${TMPSCRIPT} << END
#!/bin/bash

#PBS -l select=1:ncpus=1:jobtype=${JOBTYPE}
#PBS -l place=free:group=nodetype
#PBS -l walltime=1:00:00
#PBS -q workq
#PBS -m ae
#PBS -M ${MAIL_ADDRESS}
#PBS -W depend=afterok:${BLAST_JOBID}

cd ${WORKDIR}
tar jcf blast-xml-output.tar.bz2 *.blastx.xml
END

qsub -N tarblast ${TMPSCRIPT}

rm ${TMPSCRIPT}

 +</​file>​
 +
This script is designed to be run from the login node -- it creates the job scripts themselves and submits them. There are a number of things to notice:
  - The use of //heredoc//s. These allow us to embed the scripts that are to be run inside another script: the text between "//cat >${TMPSCRIPT} << END//" and the closing "//END//" is written to the file //${TMPSCRIPT}//.
 +  - The use of job-arrays -- these allow us to submit multiple independent jobs as sub-jobs of one larger script. The line "//​BLAST_JOBID=$(qsub -N sunblast -J ${START}-${END} ${TMPSCRIPT} | cut -d. -f1)//"​ does multiple things:
 +    * It submits a job-array with the **-J** option which contains a //​START//​ing number and an //END//ing number. The //END// value in turn is informed by the line "//​NUM_PARTS=$(ls sub_set*.split.fasta | wc -l)//" which counts the number of sub-fasta files which were created using the "//​csplit//"​((csplit is a very useful tool -- google it!)) command.
 +    * the "//cut -d. -f1//" is used to grab the job identifier that is returned from the scheduler when the job is submitted. This is assigned to the variable //​BLAST_JOBID//​.
 +    * Note that job-arrays create the environmental variable //​PBS_ARRAY_INDEX//​ which is used as a parameter for both the blast'​s input file and the blast'​s output file parameters.
    * Another important aspect of the job array is that the //walltime// parameter is the longest time you'd expect any single sub-job to run. In this case we've divided a fasta file into many smaller fasta files -- one per sequence. If your original sequences have widely differing lengths, it may pay to divide the input differently -- perhaps so that the sub-fastas end up with similar sizes.
  - The use of job dependencies. We see it in the second //heredoc//, in the line "//#PBS -W depend=afterok:${BLAST_JOBID}//". This tells the scheduler to run the second job only after the job with ID ${BLAST_JOBID} has finished successfully, i.e. the second job will not run if there are problems with the first. A stripped-down version of this pattern is sketched below.
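
The following toy example distils the array-plus-dependency pattern out of the script above; all names and resource requests here are hypothetical placeholders. It quotes the first heredoc delimiter so that ''${PBS_ARRAY_INDEX}'' is left for PBS to expand at run time, which is an alternative to the backslash-escaping used above.
<code bash>
# a 4-element array job: each sub-job just reports its index
cat > array.sh << 'END'
#!/bin/bash
#PBS -l select=1:ncpus=1
#PBS -l walltime=00:05:00
#PBS -q workq
echo "processing part ${PBS_ARRAY_INDEX}"
END

# submit the array job and capture its job id for the dependency below
ARRAY_ID=$(qsub -N toyarray -J 0-3 array.sh | cut -d. -f1)

# the clean-up job only starts once every sub-job has finished successfully
cat > cleanup.sh << END
#!/bin/bash
#PBS -l select=1:ncpus=1
#PBS -l walltime=00:05:00
#PBS -q workq
#PBS -W depend=afterok:${ARRAY_ID}
echo "all array sub-jobs finished OK"
END

qsub -N toycleanup cleanup.sh
</code>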
 +==== Blast2Go ====
 +A local instance of blast2go is available at the CHPC. It is accessible from outside the CHPC, however it does require you to set up some port forwarding.
 +=== Port Forwarding ===
This is accomplished by setting up port forwarding in your SSH session. On Windows this is usually done in PuTTY, and on unix/osx it is done on the command line. **Note: This connection must stay open for as long as you wish to use the CHPC's blast2go database.**
 +== PuTTY ==
 +When setting up the ssh connection you should go to: //​connection -> SSH -> Tunnels//. Then add **3306** for //Source port// and **10.128.15.90:​3306** for //​Destination//​. Then click //Add//.
 +{{ :​howto:​bioinformatics:​puttyconfigurationsshportforwarding.png?​500 |}}
 +
 +You should then save your session (so that you don't have to fill this information in every time) and connect (using your normal CHPC login details).
 +{{ :​howto:​bioinformatics:​puttyconfigurationsession.png?​500 |}}
 +== SSH ==
 +Your normal ssh command will change to look more like this:
 +<code bash>​localuser@my_linux:​~ $ ssh username@sun.chpc.ac.za -L 3306:​10.128.15.90:​3306
 +Last login: Tue Jan 28 14:05:35 2014 from 10.128.23.235
 +username@login01:​~ $</​code>​
 +=== blast2go Configuration ===
 +First you should go to the [[http://​www.blast2go.com/​b2ghome|blast2go website]] and start blast2go as normal by clicking on the **please click here** link((You may want to change the memory specifications here if you have lots of sequences)).
 +{{ :​howto:​bioinformatics:​blast2gowebstart.png?​600 |}}
 +
A small file will be downloaded; run it, and blast2go will then download the rest of the application.
 +
 +{{ :​howto:​bioinformatics:​blast2godownloading.png?​300 |}}
 +
 +Once blast2go is running go to: //Tools -> General Settings -> DataAccess Settings//.
 +{{ :​howto:​bioinformatics:​blast2goinitialsetup.png?​600 |}}
 +
 +Then set:
 +  * **Own Database**
 +  * //DB Name//: **b2gdbFeb2014**
 +  * //DB Host//: **localhost**
 +  * //DB User//: **b2guser**
 +  * //DB Password//: **blast4it**
 +{{ :​howto:​bioinformatics:​blast2godataaccessconfiguration.png?​600 |}}
 +and click //OK//.
Then you will see in the bottom window/tab that it has connected to the database((Open database connection to database 'b2gdbFeb2014' on 'localhost' as 'b2guser', with...)), and this should also be confirmed right at the bottom in the status message((Connected to own database: localhost: b2gdbFeb2014)).
 +{{ :​howto:​bioinformatics:​blast2godataaccessconfigurationdone.png?​600 |}}
 +
Finally, you may test that everything is working as expected by clicking on the white arrow in the green circle and confirming that you get the GO graph.
 +{{ :​howto:​bioinformatics:​blast2gotest.png?​600 |}}
 +
 +==== Gromacs ====
 +
 +If you would like to try running gromacs on the gpu please take a look at [[howto:​gpu_gromacs|this]].
 +
The job script that follows is for running an MPI-compiled version of gromacs 4.6.1 on nehalem. There are many different versions of gromacs; to see what's available, try:<code bash>user@login01:~ $ module avail</code>
 +
The following example is for working with one of the "_nehalem" gromacs modules -- note that it is quite important to use the correct version, as the input data format changes between versions.
 +
 +=== Job script ===
 +<file bash gromacs_nehalem.qsub>​
 +#!/bin/bash
 +#PBS -l select=10:​ncpus=8:​mpiprocs=8:​jobtype=nehalem,​place=excl
 +#PBS -l walltime=00:​40:​00
 +#PBS -q workq
 +#PBS -M user@someinstitution.ac.za
 +#PBS -m be
 +#PBS -V
 +#PBS -e /​lustre/​SCRATCH5/​users/​USERNAME/​gromacs_data/​std_err.txt
 +#PBS -o /​lustre/​SCRATCH5/​users/​USERNAME/​gromacs_data/​std_out.txt
 +#PBS -N GROMACS_JOB
 +
 +MODULEPATH=/​opt/​gridware/​bioinformatics/​modules:​$MODULEPATH
 +source /​etc/​profile.d/​modules.sh
 +
 +#######​module add
 +module add gromacs/​4.6.1_nehalem
 +
# one OpenMP thread per MPI rank; export so that mdrun_mpi picks it up
export OMP_NUM_THREADS=1
 +
 +NP=`cat ${PBS_NODEFILE} | wc -l`
 +
 +EXE="​mdrun_mpi"​
 +ARGS="​-s XXX -deffnm YYYY"
 +
 +cd /​lustre/​SCRATCH5/​users/​USERNAME/​gromacs_data
 +mpirun -np ${NP} -machinefile ${PBS_NODEFILE} ${EXE} ${ARGS}
 +</​file>​
 +
 +=== Submit your job ===
 +Finally submit your job using:<​code bash>​user@login01:​~ $ qsub gromacs_nehalem.qsub</​code>​
 +
 +==== NAMD2 ====
 +
 +If you would like to try running namd2 on the GPU please take a look at [[howto:​gpu_namd|this]].
 +
The job script that follows is for running NAMD over InfiniBand. Note that this does not use MPI, so the script is somewhat different from other scripts you may see here.
 +
 +=== Job script ===
 +<file bash namd.qsub>​
 +#!/bin/bash
 +#PBS -l select=10:​ncpus=12:​mpiprocs=12
 +#PBS -l place=excl
 +#PBS -l walltime=00:​05:​00
 +#PBS -q workq
 +#PBS -o /​export/​home/​username/​scratch5/​namd2/​stdout.txt
 +#PBS -e /​export/​home/​username/​scratch5/​namd2/​stderr.txt
 +#PBS -m ae
 +#PBS -M youremail@address.com
 +#PBS -N NAMD_bench
 +
 +. /​etc/​profile.d/​modules.sh
 +MODULEPATH=/​opt/​gridware/​bioinformatics/​modules:​${MODULEPATH}
 +module add NAMD/​2.10_ibverbs
 +
 +cd /​export/​home/​${USER}/​scratch5/​namd2
 +
 +pbspro_namd apoa1.namd
 +</​file>​
 +
 +=== Submit your job ===
 +Finally submit your job using:<​code bash>​user@login01:​~ $ qsub namd.qsub</​code>​
 +
 +==== bowtie ====
 +
Things to note about this script: bowtie currently does not run across multiple nodes, so using anything other than //select=1// will result in compute resources being wasted((Both because it will only run on a single node, and because telling a process to use more threads than it has cores //usually// results in inefficiencies.)).
 +
 +=== Job script ===
Your job script, called //bowtie_script.qsub//, will look something like this:
 +<file bash bowtie_script.qsub>​
 +#! /bin/bash
#PBS -l select=1:ncpus=12:mpiprocs=12
 +#PBS -l place=excl
 +#PBS -l walltime=06:​00:​00
 +#PBS -q workq
 +#PBS -o /​lustre/​SCRATCH5/​users/​username/​some_reads/​stdout.txt
 +#PBS -e /​lustre/​SCRATCH5/​users/​username/​some_reads/​stderr.txt
 +#PBS -M youremail@address.com
 +#PBS -m be
 +#PBS -N bowtiejob
 +
 +##################​
 +MODULEPATH=/​opt/​gridware/​bioinformatics/​modules:​$MODULEPATH
 +source /​etc/​profile.d/​modules.sh
 + 
 +#######​module add
 +module add bowtie2/​2.2.2
 + 
 +NP=`cat ${PBS_NODEFILE} | wc -l`
 + 
 +EXE="​bowtie2"​
 +
forward_reads="A_reads_1.fq,B_reads_1.fq"
reverse_reads="A_reads_2.fq,B_reads_2.fq"
output_file="piggy_hits.sam"
# bowtie2 options: -x index basename, -1/-2 comma-separated paired read files, -S SAM output file
ARGS="-x sscrofa --threads ${NP} -q -1 ${forward_reads} -2 ${reverse_reads} -S ${output_file}"
 +
 +cd /​lustre/​SCRATCH5/​users/​username/​some_reads
 +${EXE} ${ARGS}
 +</​file>​
Note: //username// should be replaced with your actual user name!
 +
 +=== Submit your job ===
 +Finally submit your job using:<​code bash>​user@login01:​~ $ qsub bowtie_script.qsub</​code>​
 +
 +==== R/​bioconductor ====
 +
 +=== pbdR example ===
 +== Job scripts ==
 +<file bash pbdtest.qsub>​
#!/bin/bash
#PBS -l select=2:ncpus=8:mpiprocs=8:jobtype=nehalem,place=excl
 +#PBS -l walltime=00:​01:​00
 +#PBS -q workq
 +#PBS -M YOUREMAILADDRESS
 +#PBS -m be
 +#PBS -V
 +#PBS -e /​lustre/​SCRATCH5/​users/​USERNAME/​pbdR_test/​std_err.txt
 +#PBS -o /​lustre/​SCRATCH5/​users/​USERNAME/​pbdR_test/​std_out.txt
 +#PBS -N PBDR_TEST
 +
 +MODULEPATH=/​opt/​gridware/​bioinformatics/​modules:​$MODULEPATH
 +source /​etc/​profile.d/​modules.sh
 +module add R/3.2.0
 +
 +NP=`cat ${PBS_NODEFILE} | wc -l`
 +
 +cd /​lustre/​SCRATCH5/​users/​USERNAME/​pbdR_test/​
 +mpirun -np ${NP} -machinefile ${PBS_NODEFILE} Rscript test_script.R
 +
 +</​file>​
Note: __USERNAME__ should be replaced with your actual user name!
 +
 +<file R test_script.R>​
 +library(pbdMPI,​ quiet=TRUE)
 +init()
 +my.rank <- comm.rank()
 +comm.print(my.rank,​ all.rank=TRUE)
 +
 +finalize()
 +</​file>​
 +
 +== Submit your job ==
 +Finally submit your job using:<​code bash>​user@login01:​~ $ qsub pbdtest.qsub</​code>​
 +==== tophat ====
 +
 +==== cufflinks ====
 +
 +==== tuxedo ====
 +
 +==== biopython ====
 +
 +==== velvet ====
 +
 +==== SOAP ====
 +
 +
 +===== Advanced examples =====
 +
 +
 +===== Databases =====
Databases are accessible on the cluster in the ''/lustre/SCRATCH5/groups/bioinfo/DBs'' directory. Alternatively, they are also mirrored on the [[http://bio.chpc.ac.za/data/|bio]] machine.
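
For instance, to see what is available and to point NCBI BLAST at the formatted databases, something along these lines should work; the ''BLAST'' sub-directory matches the database paths used in the examples above, but do check the actual layout first:
<code bash>
# list the available databases from a login or compute node
ls /lustre/SCRATCH5/groups/bioinfo/DBs

# load a blast module first, e.g.
module add blast/2.2.29+

# NCBI BLAST searches $BLASTDB for databases referred to by name only
export BLASTDB=/lustre/SCRATCH5/groups/bioinfo/DBs/BLAST
blastn -db nt -query my_seqs.fasta -evalue 0.001 -outfmt 5 -out my_results.xml
</code>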
 +
 +===== Support =====
 +Please [[http://​www.chpc.ac.za/​index.php/​support-resources/​log-a-support-query|contact us]] to: request software updates/​installs;​ download big datasets; get advice on the best way to run your analysis; or to tell us what is/​isn'​t working!