GNU Parallel

What is GNU Parallel?

GNU Parallel is a shell tool for executing jobs in parallel on one or more machines. A job is typically a single command or a small script that has to be run many times, once per input; GNU Parallel distributes these runs over the available CPUs and nodes.
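As a minimal local illustration (run on a single machine, no scheduler involved), the following compresses every .txt file in the current directory, four jobs at a time:

ls *.txt | parallel -j 4 gzip --best {}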

Examples

Simple Example

What this example does is find a directory full of scripts and get GNU Parallel to run them across the nodes allocated to the job. The -j 6 option tells GNU Parallel to run 6 jobs per node, while the -M and --sshdelay 0.2 options tell it to use ssh's ControlMaster and to pause 0.2 seconds between establishing connections to the same node.

gnu_parallel.qsub
#!/bin/bash
#PBS -e /mnt/lustre/users/USERNAME/test_scripts/gnu_parallel.stderr.out
#PBS -o /mnt/lustre/users/USERNAME/test_scripts/gnu_parallel.stdout.out
#PBS -V
#PBS -P PROGRAMMESHORTNAME
#PBS -M youremailaddress
#PBS -l select=2:ncpus=24:nodetype=haswell_reg
#PBS -l walltime=00:01:00
#PBS -q normal
#PBS -m be
#PBS -r n
 
 
module add chpc/BIOMODULES
module add gnu-parallel
 
WORKING_DIR=/mnt/lustre/users/USERNAME/test_scripts
echo "Hello World! Main gnu parallel test thingy running here"
 
cd ${WORKING_DIR}
 
ls ${WORKING_DIR}/gnup_scripts/* | parallel -M --sshdelay 0.2 -j 6 -u --sshloginfile ${PBS_NODEFILE} "cd ${WORKING_DIR}/gnup_scripts; {} {}"

Then, inside the directory

/mnt/lustre/users/USERNAME/test_scripts/gnup_scripts

you can put a number of copies of the following file:

gnup.test.sh
#!/bin/bash
 
NOW=$(date +"%x %X")
echo "Hello World! The time is ${NOW} and I'm running on host: ${HOSTNAME}. I'm running task $1 :-)"
sleep 1

and make sure they are all executable, i.e.

chmod u+x gnup_scripts/*.sh
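For example, 50 numbered copies (matching the naming scheme in the output below) can be created with:

for i in $(seq 1 50)
do
  cp gnup.test.sh gnup_scripts/gnup.test.${i}.sh
done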

After running

qsub gnu_parallel.qsub

if you look inside

/mnt/lustre/users/USERNAME/test_scripts/gnu_parallel.stdout.out

you should see something like:

Hello World! Main gnu parallel test thingy running here
Hello World! The time is 05/05/2016 15:24:26 and I'm running on host: cnode0282. I'm running task /home/dane/test_scripts/gnup_scripts/gnup.test.12.sh :-)
Hello World! The time is 05/05/2016 15:24:26 and I'm running on host: cnode0282. I'm running task /home/dane/test_scripts/gnup_scripts/gnup.test.10.sh :-)
.
.
.
Hello World! The time is 05/05/2016 15:24:29 and I'm running on host: cnode0281. I'm running task /home/dane/test_scripts/gnup_scripts/gnup.test.9.sh :-)

Fault-tolerant Example

One problem we have is that jobs fail on the cluster. If one is making use of GNU Parallel, it is simple to include some fault-checking code in the form of log files. Building on the previous example, we write a script that is designed to fail randomly. It logs successful runs, skips any task whose log already records a completion, and therefore only re-runs tasks that previously failed.

gnup.test.sh
#!/bin/bash
 
LOGFILE="$0.log"  # create a log file based on the executable's name
echo "$(date +'%x %X'): STARTING" >> "${LOGFILE}"  # Log the start of the run
if ! grep -q "COMPLETE" "${LOGFILE}"  # check if it has already been run...
then
  echo "Hello World! The time is $(date +'%x %X') and I'm running on host: ${HOSTNAME}. I'm running task $1 :-)"
  sleep 1
  if [ $(( ( RANDOM % 10 )  + 1 )) -gt 4 ]  # will run successfully something like 60% of the time
  then
    echo "SUCCESS!"
    echo "$(date +'%x %X'):COMPLETE" >> "${LOGFILE}"
  else
    echo "Failure"  # Oh boo. This will not be logged
  fi
else
  echo "$(date +'%x %X'): NOT REPEATING" >> "${LOGFILE}"  # Log that run has already been performed
fi

So if we run the same PBS script, but this time point it at a directory containing, say, 50 copies of the above script, then we see:

[dane@login1]$ ls gnup_scripts/*.sh | wc -l; ls gnup_scripts/*.log | wc -l
50
50
[dane@login1]$ cat gnup_scripts/*.log | grep STARTING | wc -l; cat gnup_scripts/*.log | grep COMPLETE | wc -l ; cat gnup_scripts/*.log | grep "NOT REPEATING" | wc -l
50
34
0
[dane@login1]$ cat gnu_parallel.stdout.out | grep "SUCCESS" | wc -l; cat gnu_parallel.stdout.out | grep "Failure" | wc -l
34
16
[dane@login1]$ 

So we can see that it ran all 50 tasks: 34 completed successfully and 16 failed. If we run the same job again we see:

[dane@login1]$ cat gnup_scripts/*.log | grep STARTING | wc -l; cat gnup_scripts/*.log | grep COMPLETE | wc -l ; cat gnup_scripts/*.log | grep "NOT REPEATING" | wc -l
100
45
34
[dane@login1]$ cat gnu_parallel.stdout.out | grep "SUCCESS" | wc -l; cat gnu_parallel.stdout.out | grep "Failure" | wc -l
11
5

i.e. the 34 already-completed tasks were skipped, and of the 16 tasks that were re-run another 11 succeeded, leaving 5 for a further submission.
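Because every successful task writes COMPLETE to its log, it is also easy to list the tasks that are still outstanding. A small sketch, assuming the file layout above:

for script in gnup_scripts/*.sh
do
  grep -q "COMPLETE" "${script}.log" 2>/dev/null || echo "${script} not yet complete"
done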

Fault-tolerant Blast

The idea behind this script is that you have a working directory containing a tree structure that might look something like

DATE/EXPERIMENT/*.fasta

GNU Parallel simply takes each fasta file as input and runs BLAST on it with the parameters provided.
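For example, a layout (with hypothetical directory and file names) might be:

2016-05-12/experiment_A/sample01.fasta
2016-05-12/experiment_A/sample02.fasta
2016-05-12/experiment_B/sample01.fasta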

blast.gnup.qsub
#!/bin/bash
#PBS -e /mnt/lustre/users/USERNAME/blastjobs/blast.stderr.out
#PBS -o /mnt/lustre/users/USERNAME/blastjobs/blast.stdout.out
#PBS -P PROGRAMMESHORTNAME
#PBS -M youremail@address
#PBS -l select=4:ncpus=24:nodetype=haswell_reg
#PBS -l walltime=01:00:00
#PBS -q normal
#PBS -m be
 
cd ${PBS_O_WORKDIR}
 
module add chpc/BIOMODULES
module add blast
module add gnu-parallel
 
 
# The module sets the environment variable below
# and provides the path to the databases:
#BLASTDB="/mnt/lustre/bsp/DB/BLAST"
# To blast against, e.g., nt, simply use -db nt
# To list all available databases: ls $BLASTDB/*.?al
 
BLASTCMD=$(which blastn)
BLASTARGS="-evalue 0.005 -num_alignments 20 -outfmt 5 -num_threads 24 -db nt"
INPUTDIRS="DATE/*"
 
ls ${INPUTDIRS}/*.fasta | parallel -M --sshdelay 0.2 -j 1 -u --sshloginfile ${PBS_NODEFILE} "cd ${PBS_O_WORKDIR}; ${BLASTCMD} -query {} ${BLASTARGS} -out {}.xml && gzip --best {} {}.xml"

The idea behind the fault tolerance is very simple: once a BLAST run has completed successfully, the input file is compressed and given a .gz suffix, so it will not be picked up by subsequent runs (which explicitly look for .fasta files).
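To check on progress, or to force a particular input to be redone, the compressed files can be counted or restored, e.g. (the sample path below is hypothetical):

# inputs still waiting to be blasted
ls DATE/*/*.fasta 2>/dev/null | wc -l
# restore one input, and remove its old result, to force a re-run
gunzip DATE/EXPERIMENT/sample.fasta.gz
rm -f DATE/EXPERIMENT/sample.fasta.xml.gz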

Advanced Fault-tolerant Blast

This takes the previous example one step further and is useful for large jobs with very many independent fasta files (and thus individual BLAST runs). The added step is to copy the BLAST database into a RAM disk on each node at the beginning of the job, so that every BLAST process reads the database from local memory rather than from the shared filesystem.
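Before relying on this approach, it is worth checking that the database actually fits in the nodes' RAM disk; this is also why the select line below requests 120gb of memory. A quick sanity check (assuming the blast module has set BLASTDB):

du -csh ${BLASTDB}/nt*   # total size of the nt database files
df -h /dev/shm           # capacity of the RAM disk on the current node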

blast.advanced.gnup.qsub
#!/bin/bash
#PBS -e /mnt/lustre/users/USERNAME/blastjobs/blast.advanced.stderr.out
#PBS -o /mnt/lustre/users/USERNAME/blastjobs/blast.advanced.stdout.out
#PBS -P PROGRAMMESHORTNAME
#PBS -M youremail@address
#PBS -l select=4:ncpus=24:mem=120gb:nodetype=haswell_reg
#PBS -l walltime=01:00:00
#PBS -q normal
#PBS -m be
 
cd ${PBS_O_WORKDIR}
 
 
module add chpc/BIOMODULES
module add blast
module add gnu-parallel
 
BLASTDB="/mnt/lustre/bsp/NCBI/BLAST"
DB="nt"
BLASTCMD=$(which blastn)
BLASTARGS="-evalue 0.005 -num_alignments 20 -outfmt 5 -num_threads 24"
INPUTDIRS="2016-05-12/*"
 
NODES=$(cat ${PBS_NODEFILE} | sort | uniq)
 
# copy blast databases to ram disk
for node in ${NODES}
do
  ssh ${node} "mkdir -p /dev/shm/${USER}/BLAST && cp -r ${BLASTDB}/${DB}* /dev/shm/${USER}/BLAST && echo 'successfully added DBs on ${node}' || exit 1" &
done
 
wait  # wait for parallel copies to finish
 
ls ${INPUTDIRS}/*.fa | parallel -j 1 -u --sshloginfile ${PBS_NODEFILE} "cd ${PBS_O_WORKDIR}; ${BLASTCMD} -db /dev/shm/${USER}/BLAST/${DB} -query {} ${BLASTARGS} -out {}.xml && gzip --best {} {}.xml"
 
# clean up ram disk
for node in ${NODES}
do
  ssh ${node} "rm -rf /dev/shm/${USER}/BLAST && echo 'successfully deleted DBs on ${node}' || exit 1" &
done
 
wait

Advanced Generalised GNU Parallel

The idea behind this example is that you create a list of commands that need running and then hand this list to the script. If sub-jobs fail, there is an easily parsed log file so that the failing steps can be quickly identified. Sub-jobs that have run successfully are logged as such and are not re-run on subsequent job submissions.
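The job file is just a plain text file with one complete command per line, for example (the commands here are hypothetical):

gnu_parallel.jobs
./align.sh sample01.fastq
./align.sh sample02.fastq
./align.sh sample03.fastq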

This first script is a list of helper functions that are used in the main script.

log_support.sh
#!/bin/bash
 
##  Some simple helper log functions to track jobs success / failure   ##
## Created by Dane Kennedy @ the Centre for High Performance Computing ##
 
############## Create some useful functions ##############
 
# Echoes the current date/time in a nice format
function now { echo -n "$( date +"%F %X" )"; }
 
# Appends to log file
function log () {
  if [[ -v LOGFILE ]]
  then
    echo -e "$( now ): ${HOSTNAME}: $@" >> "${LOGFILE}"
  else
    echo -e "$( now ): ${HOSTNAME}: $@"
  fi
}
 
#Appends to log file and exits
function log_fail () {
  log "$@"
  exit 1
}
 
# returns true (in the bash sense of 0 exit status meaning success) if line exists in log file
function check_log () {
  if [[ -v LOGFILE ]]
  then
    if [[ -e ${LOGFILE} ]]
    then
      if grep -q "$@" "${LOGFILE}"
      then
        return 0
      fi
    fi
  fi
  return 1
}
 
# Checks for a line in the LOG file. If it exists, the command is not repeated. If it is not
# there, the command runs, and a successful run is recorded as such.
function check_run (){
  SUCCESS_LINE="\"$@\" SUCCESSFUL"
  FAIL_LINE="\"$@\" fail."
  if check_log "${SUCCESS_LINE}"
  then
    log "\"$@\" already successfully run. Not repeating."
    return 0
  fi
  log "Running \"$@\""
  $@ \
    && { log "${SUCCESS_LINE}"; return 0; } \
    || { log "${FAIL_LINE}"; return 1; }
}
 
 
# Same as above but exits with status 1 on failure.
function check_run_abort (){
  if ! check_run "$@"
  then
    log_fail "Aborting."
  fi
}

The main PBS script then sources these helpers and hands the job list to GNU Parallel:

myjob.sh
#!/bin/bash
#PBS -e /mnt/lustre/users/USERNAME/gnup/stderr.out
#PBS -o /mnt/lustre/users/USERNAME/gnup/stdout.out
#PBS -P PROGRAMMESHORTNAME
#PBS -M youremailaddress
#PBS -l select=2:ncpus=24:mpiprocs=16:nodetype=haswell_reg
#PBS -l walltime=48:00:00
#PBS -q normal
#PBS -m be
#PBS -r n
 
module add chpc/BIOMODULES
module add gnu-parallel
 
JOBSPERNODE=16
 
# Make sure we start the job in the right place.
cd -P ${PBS_O_WORKDIR}
 
# Set up logging stuffs
source log_support.sh
LOGFILE="gnuparallel.text.log"
THIS="GNU PARALLEL TEST"
 
# First check if the analysis has been run before. If it has, abort...
if check_log "${THIS} COMPLETED SUCCESSFULLY"
then
        log "${THIS} already successfully completed. Not repeating."
        exit 0
else
        log "Beginning ${THIS}"
fi
 
# Point to Job File -- this contains a list of commands to run, one per line 
JOBFILE="gnu_parallel.jobs"
 
# pass the commands on to gnu parallel which runs them with "check_run". It will record
# a successful completion if all sub-jobs complete successfully :-). Woot.
 
cat ${JOBFILE} | parallel -M --sshdelay 0.2 -j ${JOBSPERNODE} -u --sshloginfile ${PBS_NODEFILE} \
  "cd -P \"${PBS_O_WORKDIR}\"; . log_support.sh; LOGFILE=\"${LOGFILE}\"; check_run {}" \
  && log "${THIS} COMPLETED SUCCESSFULLY" \
  || log "${THIS} incomplete."
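After a run, the log file gnuparallel.text.log contains one timestamped line per event, in the format produced by the log function. Illustrative entries only (the commands come from the hypothetical job file above):

2016-05-12 10:00:01: cnode0282: Beginning GNU PARALLEL TEST
2016-05-12 10:00:02: cnode0282: Running "./align.sh sample01.fastq"
2016-05-12 10:03:12: cnode0282: "./align.sh sample01.fastq" SUCCESSFUL
2016-05-12 10:15:40: cnode0282: GNU PARALLEL TEST COMPLETED SUCCESSFULLY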