Tips and Tricks

The DOS vs Unix end of line character problem

If you have created an ASCII file on Windows and transferred it to the cluster, you may experience a subtle problem. DOS (and thus also Windows) terminates each line in a text file with both a carriage return and a linefeed character, whereas Unix (and thus also Linux) uses a linefeed character only. Some Linux applications have a built-in way of handling this difference, but most don't. A PBS script that has not been corrected will produce output that looks like “/bin/bash^M: bad interpreter: No such file or directory”. This problem is trivially easy to fix: simply run the utility dos2unix on the offending file. For example:

 dos2unix runMyJob.pbs 

Running the utility on a file that has already been converted will do no damage to it, and attempting to run it on a binary file will result in a warning message. There is also a unix2dos utility that can convert a file back to Windows format. These utilities are available on the login and visualisation nodes, but not the compute nodes.

Most codes use ASCII input or run script files. These may or may not be affected by this problem, but if you get weird or unexpected behaviour, run dos2unix on all the ASCII files.
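
The check described above can be sketched in a few lines of shell. This is only an illustration: has_crlf is a hypothetical helper name, and tr is used in place of dos2unix purely so that the sketch also runs on machines where dos2unix is not installed.

```shell
#!/bin/bash
# has_crlf FILE: succeed if FILE contains at least one carriage return.
# (has_crlf is a hypothetical helper name, not a CHPC utility.)
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Demonstration on a throwaway file written with DOS line endings
demo=$(mktemp)
printf '#!/bin/bash\r\necho hello\r\n' > "$demo"

if has_crlf "$demo"; then
    echo "DOS line endings found"
    # On the cluster you would simply run: dos2unix "$demo"
    # tr -d '\r' stands in for it here (assumption: dos2unix may be absent)
    tr -d '\r' < "$demo" > "$demo.unix" && mv "$demo.unix" "$demo"
fi

has_crlf "$demo" || echo "file is clean"
```

Looping has_crlf over all the input files of a case is a quick way to find the ones that still need converting.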

Using directory and file names containing spaces

The Linux operating system can deal with directory and file names containing spaces. This does not mean that your command line instruction, script or application is going to handle them correctly. The simple advice is “DON'T”. Also do not expect any sympathy from CHPC staff if you have used spaces and cannot find the reason why your job is not working correctly. For that matter, don't use double dots either. If you are having difficulties, and we see something that looks like this

 My File..data   

you will be entitled to a severe reprimand.
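
To see why such names cause trouble, and to find them before CHPC staff do, something like the following may help. It is a sketch only: list_bad_names is a hypothetical helper, not an installed utility.

```shell
#!/bin/bash
# list_bad_names DIR: print every name under DIR that contains a space
# or a double dot. (Hypothetical helper name.)
list_bad_names() {
    find "$1" \( -name '* *' -o -name '*..*' \) -print
}

# Why spaces hurt: an unquoted variable is split into separate words.
name="My File..data"
set -- $name        # unquoted: the shell splits the name at the space
echo "unquoted: $# arguments"
set -- "$name"      # quoted: the name stays a single argument
echo "quoted: $# arguments"
```

Any script or application that forgets the quotes will see two file names, neither of which exists.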

Keeping your ssh login sessions alive

By default, unused ssh sessions time out after about 20 minutes. You can keep your ssh session alive by following the instructions on How-To Geek. In summary, create a file called config in the ~/.ssh directory on your workstation. This file should contain the following line:

 ServerAliveInterval 60  
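
If you already have a config file with Host sections in it, a setting appended at the end would only apply to the last section. One way around this, sketched below, is to add the setting under a Host * section. The helper name add_keepalive is hypothetical, and the Host * wrapper is an addition to the instructions above rather than part of them.

```shell
#!/bin/bash
# add_keepalive FILE: add a keep-alive setting to an ssh client config file
# unless one is already present. (Hypothetical helper name.)
add_keepalive() {
    mkdir -p "$(dirname "$1")"
    touch "$1"
    # ssh_config keywords are case-insensitive, hence grep -i
    grep -qi '^[[:space:]]*ServerAliveInterval' "$1" ||
        printf 'Host *\n    ServerAliveInterval 60\n' >> "$1"
    chmod 600 "$1"
}

# Usage on your workstation:
#   add_keepalive ~/.ssh/config
```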

Using an interactive PBS session

The CHPC follows a fairly strict policy on CPU usage on the login node. As a consequence, any moderately intensive task, such as unzipping a large file, will be killed very quickly. In order to get around this problem, use an interactive session for more or less everything. The syntax for obtaining an interactive session is:

 qsub -I -l select=1:ncpus=4:mpiprocs=4 -q serial -P MECH1234 -l walltime=3:00:00   

Take note of the following:

  • Obviously use your project short code instead of MECH1234.
  • Yes, it is tedious to type that in every time. Edit your .bashrc file, and define an alias as follows: alias qsubi="qsub -I -l select=1:ncpus=4:mpiprocs=4 -q serial -P MECH1234 -l walltime=3:00:00" . Now typing in the command qsubi will do the job for you.
  • Please customize the command to suit your requirements. If you are going to need it all day, use walltime=8:00:00 . In this example we are asking for 4 processes. You can ask for more or less, depending on what you need to do. If you want a full node, use ncpus=24:mpiprocs=24.
  • You can use -l select=2:ncpus=24:mpiprocs=24 -q normal, for example, which will give you two complete nodes. This way you can test in interactive mode if your program runs in distributed parallel mode. You will need to know which nodes you have been given: cat $PBS_NODEFILE will show the contents of your machinefile.
  • Once you have an interactive session, you can also ssh into that node separately from the login node. This is very handy, because you can now get multiple sessions with different environments without having to exit and restart an interactive PBS session.
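
If you change the core count or walltime often, a shell function in .bashrc is more flexible than a fixed alias. The following is a sketch under assumptions: the optional arguments, the QSUB dry-run variable and the summarise_nodes helper are all illustrative inventions, and MECH1234 again stands for your own project short code.

```shell
#!/bin/bash
# qsubi [NCPUS] [WALLTIME]: request an interactive session on one node.
# Defaults match the example above: 4 cores for 3 hours.
# Set QSUB="echo qsub" to print the command instead of running it.
qsubi() {
    local ncpus="${1:-4}"
    local walltime="${2:-3:00:00}"
    ${QSUB:-qsub} -I -l "select=1:ncpus=${ncpus}:mpiprocs=${ncpus}" \
        -q serial -P MECH1234 -l "walltime=${walltime}"
}

# summarise_nodes FILE: list each host in a PBS node file with the number
# of MPI slots allocated on it.
summarise_nodes() {
    sort "$1" | uniq -c | awk '{print $2, $1}'
}

# Usage:
#   qsubi                  # 4 cores, 3 hours
#   qsubi 24 8:00:00       # a full node for the working day
#   summarise_nodes "$PBS_NODEFILE"   # inside the interactive session
```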

Running software with GUIs

The usual ssh sessions and interactive PBS sessions do not support graphics by default. If you need to run a software package with a GUI (many pre-processors, for example), you need a session with graphics capability. There are three ways of getting this:

  1. Use a VNC session to connect to one of the two visualization nodes, as per the instructions on Remote visualization.
  2. Use X-forwarding. This is only a realistic option if you are on a fast connection to the CHPC. ssh -X into lengau, then ssh -X from there to the compute node on which you already have an interactive PBS session (see above). Thanks to the wonders of Mesa and software rendering, quite sophisticated graphics processing may be done this way. Look for the Mesa modules if you need OpenGL-capable software to run in this manner.
  3. You can also get an X-capable interactive PBS session by appending -X to your qsub -I instruction. This will only work if your ssh-session into the login node has X-forwarding turned on, that is, ssh -X user@lengau.chpc.ac.za or ssh -Y user@lengau.chpc.ac.za
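
Before asking for an X-capable session, it is worth confirming that X-forwarding actually reached the login node. A minimal sketch, assuming that a set DISPLAY variable is a good enough indicator (check_x is a hypothetical helper name):

```shell
#!/bin/bash
# check_x: succeed only if this shell has an X display to pass on.
check_x() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set: log in with ssh -X or ssh -Y first" >&2
        return 1
    fi
    echo "X display available: $DISPLAY"
}

# Usage on the login node, before requesting the session:
#   check_x && echo "safe to append -X to the qsub -I command"
```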

Windows ssh clients

PuTTY is widely used, and also has an easy to use interface for setting up ssh-tunnels. However, MobaXterm also works extremely well, and has a number of additional advantages, such as:

  • Multiple tabs
  • Remembering passwords
  • X-forwarding that (mostly) works
  • Convenient graphical interface for setting up ssh-tunnels
  • A file explorer for transferring files
  • Linux-style mouse-button bindings

Transferring files to the cluster

Command line scp and rsync are the usual methods for data transfer. However, it is easy to make mistakes, and you need to have the path right. MobaXterm (see above) has an easy to use “drag & drop” interface. FileZilla is fast, easy to use and runs on Linux, Windows and OSX. A different option is to use sshfs to mount your lengau directory directly on your workstation. There is a Windows sshfs client that sort-of works. Sometimes.
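
For those who prefer the command line, wrapping rsync in a small function removes most of the opportunities for getting the path wrong. This is a sketch under assumptions: push_dir, CLUSTER_USER and the RSYNC dry-run variable are hypothetical names, jblogs is a placeholder user, and it assumes that the login host named elsewhere on this page accepts file transfers.

```shell
#!/bin/bash
# push_dir LOCAL_DIR REMOTE_DIR: copy a local directory to the cluster.
# Set RSYNC="echo rsync" to print the command instead of running it.
push_dir() {
    # -a preserves permissions and timestamps, -v lists files as they go,
    # -z compresses in transit, --partial keeps partly transferred files
    # so that an interrupted copy can be resumed.
    # No trailing slash on LOCAL_DIR: the directory itself is created
    # inside REMOTE_DIR.
    ${RSYNC:-rsync} -avz --partial "$1" \
        "${CLUSTER_USER:-jblogs}@lengau.chpc.ac.za:$2"
}

# Usage:
#   CLUSTER_USER=myname push_dir results /mnt/lustre/users/myname/
```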

How to qsub from a compute node

Lengau is set up in such a way that it is not possible to submit another PBS script (qsub) from a compute node. However, it is possible to have an ssh command in a PBS script, so the obvious solution is to ssh to the login node in order to submit another PBS script. If you want to submit another PBS script at the completion of the current one, insert a line like this at the end of your first script:

ssh login2 qsub /mnt/lustre/users/jblogs/scripts/PBS_Script1

Submit one script after completion of another

There are several situations where you may only want one job to run after completion of another. It may be to manage the load on a limited software license pool, or the first job may be a pre-processing step for the subsequent one, or the second job could be dependent on the data file written by the first one. One solution is to submit the second job from the PBS script of the first one, as described above. An alternative method is to use the depend option in PBS:

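# Submit the first job and capture the job identifier that qsub prints
jobid=$(qsub runJob01)
# Keep only the numeric part of the identifier
jobid=$(echo "$jobid" | awk 'match($0,/[0-9]+/){print substr($0,RSTART,RLENGTH)}')
# The second job becomes eligible only once the first has ended, whatever its exit status
qsub -W depend=afterany:$jobid runJob2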
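
The same idea extends to a longer chain of jobs. The function below is a sketch, not a CHPC-supplied tool: submit_chain is a hypothetical name, and it assumes, as the lines above do, that the first run of digits in the qsub output is the job number.

```shell
#!/bin/bash
# submit_chain SCRIPT...: submit the scripts in order, each one depending
# on the job before it.
submit_chain() {
    local prev="" out script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            out=$(qsub "$script") || return 1
        else
            out=$(qsub -W "depend=afterany:$prev" "$script") || return 1
        fi
        # Extract the numeric job number from whatever qsub printed
        prev=$(echo "$out" | awk 'match($0,/[0-9]+/){print substr($0,RSTART,RLENGTH)}')
        echo "$script -> job $prev"
    done
}

# Usage:
#   submit_chain runJob01 runJob02 runJob03
```

With afterany, each job starts once its predecessor has ended, whether it succeeded or not. PBS also offers afterok if a job should start only after a successful predecessor.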

Using the large queue

Add the following PBS directive to your submit-script:

#PBS -W group_list=largeq

How to run jobs that require very long wall times

It becomes difficult for the scheduler to fit in jobs that require very long wall times. It is instructive to think of the scheduler's task as a game of Tetris. It has to fit in contiguous blocks of varying length (time) and width (number of nodes). Very long narrow blocks are difficult to fit in without wasting a lot of space (computing resources). For these reasons, the CHPC's policies do not permit very long wall times. We prefer users to submit jobs that are effectively parallelized, and can therefore finish more quickly by using more simultaneous computing resources. If you do have a job that requires a very long wall time, use either a secondary qsub from your job script (see the paragraph “How to qsub from a compute node”) or alternatively a dependent qsub (see the paragraph “Submit one script after completion of another”). Both of these methods assume that your code can write restart files. If your code cannot write restart files, you have a serious problem which can be resolved in one of three ways:

  • If it is your own code, implement restart files immediately. What on earth are you trying to achieve by doing multi-day runs without a restart capability?
  • If it is an open-source code, take on the task of implementing a restart capability.
  • If it is a commercial code, inform the code developer that a restart capability is essential.
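
Once restart files are available, a long run can be split into segments that resubmit themselves. The sketch below shows only the decision logic, under stated assumptions: next_step, the marker file run.finished and the SEGMENT variable are hypothetical, and your code must create the marker (or some equivalent) when the run is genuinely complete.

```shell
#!/bin/bash
# next_step MARKER SEGMENT MAX: decide what a finished segment does next.
#   done      the marker file exists, so the run is complete
#   giveup    the segment limit is reached (guards against endless resubmission)
#   resubmit  otherwise
next_step() {
    if [ -e "$1" ]; then
        echo done
    elif [ "$2" -ge "$3" ]; then
        echo giveup
    else
        echo resubmit
    fi
}

# At the end of the job script, with SEGMENT passed in via qsub -v SEGMENT=n:
#   if [ "$(next_step run.finished "$SEGMENT" 10)" = resubmit ]; then
#       ssh login2 qsub -v SEGMENT=$((SEGMENT + 1)) /mnt/lustre/users/jblogs/scripts/PBS_Script1
#   fi
```

The segment limit is there so that a job which crashes immediately cannot resubmit itself indefinitely.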

Check health of compute nodes before starting the run

An HPC cluster consists of a very large number of compute nodes, and statistics dictate that larger numbers of components result in more failures. When doing especially large queue runs, your chances of encountering a faulty node are significant. Some software will immediately crash and terminate the job, but other software may simply hang. In the case of a faulty node, it is obviously better for the run to terminate immediately and send helpful diagnostics to the system administrators. This can be implemented easily by adding the following lines to your job script, immediately above the actual run command:

module add chpc/healthcheck/0.2
healthcheck -v || exit 1
module del chpc/healthcheck/0.2

The -v option will provide helpful diagnostics, but can be omitted if you want to avoid a substantial amount of additional output.

Dealing with zombies

Under certain unusual circumstances, a job can turn into a zombie job, which sits in the queue with “R” status, occupies resources, but does not actually run or produce meaningful output. Zombies resist killing by means of the qdel command, and need to be terminated with extreme prejudice. Use qdel -W force followed by the job number in order to accomplish this. The job will exit with status E, and, true to the tradition of zombies, linger on with this status in the visible queue for a while longer.

Last modified: 2018/10/19 11:30 by ccrosby