Tips and Tricks

Fault finding - Check your ssh keys

Parallel computing relies on passwordless ssh access between compute nodes. If passwordless ssh does not work, nothing else will. Therefore, if you find yourself with jobs that simply won't run, or that fail without even providing helpful error messages, first check that your ssh keys are correct. You can do this very simply by testing whether you can ssh from the login node into another service node without supplying a password:

ssh login1

or

ssh chpcviz1

for example. If you cannot get in without a password, you have a problem which first has to be corrected. The first thing to do is to check the permissions of the files in your $HOME/.ssh directory. There should be rw access for you only. It can be corrected with the command

chmod 0600 ~/.ssh/*
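
If the permissions are already correct but passwordless access still fails, the key pair itself may be missing or mismatched. The following is only a sketch, assuming the standard OpenSSH setup with a home directory shared across the nodes:

 # Generate a new key pair with an empty passphrase
 ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
 # Authorise the new public key for logins between nodes
 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
 chmod 0600 ~/.ssh/authorized_keys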

Managing your storage usage

You may want to do yourself the favour of checking with the commercial cloud providers how much they charge for data storage. Once you have picked yourself up off the floor, set about managing your level of usage of the CHPC's free Lustre storage resource.

Remember that the CHPC Lustre is not intended for long-term storage - it is temporary workspace, is limited and has been designed to be fast rather than reliable. The CHPC's official policies allow us to remove data that has not been used in the preceding 90 days. Please be pro-active about managing your data before we do it for you, without first asking your permission. In order to get a list of files that have not been accessed in the last 90 days, use the find command:

find  -type f -atime +90 -printf "%h/%f, %s, %AD \n"

which will produce a csv table with the path, size in bytes and last time of access. To make it even easier for yourself, simply delete these files automagically with find:

find  -type f -atime +90 -delete
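
To get a quick idea of how much space you are using in total, a simple option is du (the path is just an example - substitute your own Lustre directory):

 du -sh /mnt/lustre/users/jblogs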

If you are unfamiliar with the Linux command line, it can be painful to find your way around your files and directories. Consider using GNU Midnight Commander. Here is a short video which demonstrates its use on the cluster. It is available on Lengau by way of a module:

module load chpc/mc/4.8.17

Do not put echo statements in your .bashrc file

It is tempting to put echo statements in your .bashrc file, so that you can get a handy heads-up when you log in that some environment variables are being set. Please do not do this. The reason is that it breaks scp, which expects to see its protocol data over the stdin/stdout channels. There are ways around this (you can use Google, right?), but unless you know what you are doing, just don't put echo statements in a .bashrc file.
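
If you really want a login message, one common workaround (a sketch, not specific to the CHPC) is to print it only when the shell is interactive, so that scp and sftp sessions are left undisturbed:

 # In ~/.bashrc: only echo in interactive shells
 if [[ $- == *i* ]]; then
     echo "Environment configured for project MECH1234"
 fi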

Allowing or preventing rerunning

Under certain conditions, such as when recovering from system faults, PBS may rerun jobs that had previously been interrupted. Depending on your particular setup, this is not necessarily beneficial. If, for example, your software regularly writes restart data, but by default starts a run from scratch unless otherwise specified, you definitely want to suppress rerunning, because it would overwrite existing results. Add the following line to your PBS directives in the job script:

#PBS -r n

On the other hand, if your software is set up to resume automatically from the last data written, PBS should be permitted to rerun the process:

#PBS -r y

Checking for zombie processes

The PBS scheduler cleans up each compute node on the completion of a job. However, under certain conditions, it is possible for the scheduler to be unaware of rogue processes left on compute nodes. These may interfere with other users' processes. If your job is running significantly slower than expected, it may be worth checking for the presence of rogue processes. This can be done quite easily by adding the following lines to your job script, preferably just before launching the actual compute process.

# List any processes on your job's nodes that do not belong to you or to system accounts
myname=$(whoami)
for host in $(sort -u "$PBS_NODEFILE"); do
  echo "$host" $(ssh "$host" ps hua -N -u root,apache,dbus,nslcd,ntp,rpcuser,rpc,"$myname")
done

This will produce a list of your compute nodes, together with any processes not belonging to yourself or the system. If you do find compute processes belonging to other users, you should log into the compute node concerned, and run top to see if your processes are suffering as a result of the rogue process. Also submit a helpdesk ticket to inform the system administrators.

Please slow down and work methodically

  • Do NOT bring a complicated script from your own system and insist on trying to run it as is on 10 cluster nodes for your first cluster run. It has NEVER worked for anybody else, and it is NOT going to work for you.
  • Start by obtaining an interactive session.
  • Work through your process in step-by-step fashion and fix your problems at each step before advancing to the next stage.
  • Do not attempt to solve your entire problem in one go. It does not work. It WILL break, leaving you with a complicated mess to untangle, and it will not endear you to the CHPC staff tasked with sorting out YOUR mess.
  • Start with a smaller and simplified version of your problem, and satisfy yourself that it works on a single node.
  • Once you are happy that it works on a single node, try it on two nodes.
  • Do not move on before you have proved to yourself that it is working as expected, and is faster than a single node.
  • Add complexity and compute nodes only once you are totally satisfied that EVERYTHING is working properly.
  • Please bear in mind that you also need to prove to the CHPC that you are competent to run very large cases. You can only do this by starting with small cases and demonstrating that you can run them efficiently.
  • Now, and only now, may you start thinking of working on automating your process. It might just work.

The DOS vs Unix end of line character problem

If you have created an ASCII file on Windows and transferred the file to the cluster, you may experience a subtle problem. The background to this is that DOS (and thus also Windows) terminates each line in a text file with both a carriage return and a linefeed character. Unix (and thus also Linux) uses a linefeed character only. Some Linux applications have a built-in way of handling this difference, but most don't. A PBS script that has not been corrected will produce output that looks like /bin/bash^M: bad interpreter: No such file or directory. This problem is trivially easy to fix. Simply run the utility dos2unix on the offending file. For example:

 dos2unix runMyJob.pbs 

Running the utility on a file that has already been converted will do no damage, and attempting to run it on a binary file will result in a warning message. There is also a unix2dos utility that can convert a file back to Windows format. These utilities are available by default on the login and visualisation nodes, but not on the compute nodes, where you will need to load a module instead:

 module load chpc/compmech/dos2unix 

Most codes use ASCII input or run script files. These may or may not be affected by this problem, but if you get weird or unexpected behaviour, run dos2unix on all the ASCII files.

If you have a lot of files in sub-directories, use the command

 find . -type f -print0 | xargs -0 dos2unix 

to recursively go through your directories and change all the files.

Using directory and file names containing spaces

The Linux operating system can deal with directory and file names containing spaces. This does not mean that your command line instruction, script or application is going to handle it correctly. The simple answer is “DON'T”. Also do not expect any sympathy from CHPC staff if you have used spaces and cannot find the reason why your job is not working correctly. For that matter, don't use double dots either. If you are having difficulties, and we see something that looks like this

 My File..data   

you will be entitled to a severe reprimand.

Keeping your ssh login sessions alive

By default, unused ssh sessions time out after about 20 minutes. You can keep your ssh session alive by following the instructions on How-To Geek. In summary, in your ~/.ssh directory on your workstation, create a file called config. This file should contain the following lines; note the indent on the second line:

Host *
 ServerAliveInterval 60  

If you are using MobaXterm to access the cluster, follow the menus to “Settings - Configuration - SSH - SSH-Settings” to activate the SSH Keepalive option, as per this example.

Using an interactive PBS session

The CHPC follows a fairly strict policy on CPU usage on the login node. As a consequence, any moderately intensive task, such as unzipping a large file, will be killed very quickly. In order to get around this problem, use an interactive session for more or less everything. The syntax for obtaining an interactive session is:

 qsub -X -I -l select=1:ncpus=4:mpiprocs=4 -q serial -P MECH1234 -l walltime=3:00:00   

The -X option turns on X-forwarding. Take note of the following:

  • Obviously use your project short code instead of MECH1234.
  • Yes, it is tedious to type that in every time. Edit your .bashrc file, and define an alias as follows:
     alias qsubi="qsub -X -I -l select=1:ncpus=4:mpiprocs=4 -q serial -P MECH1234 -l walltime=3:00:00" 

    Now typing in the command qsubi will do the job for you.

  • Please customize the command to suit your requirements. If you are going to need it all day, use walltime=8:00:00. In this example we are asking for 4 processes. You can ask for more or fewer, depending on what you need to do. If you want a full node, use
     ncpus=24:mpiprocs=24 -q smp 
  • You can use
     -l select=2:ncpus=24:mpiprocs=24 -q normal 

    for example, which will give you two complete nodes. This way you can test in interactive mode whether your program runs in distributed parallel mode. You will also need to know which nodes you have been given:

     cat $PBS_NODEFILE 

    will give you the contents of your machinefile.

  • Once you have an interactive session, you can also ssh into that node separately from the login node. This is very handy, because you can now get multiple sessions with different environments without having to exit and restart an interactive PBS session.
  • This process also works if you need to interactively test a code on a GPU node. In this case the command would be something like
     qsub -X -I -l select=1:ncpus=8:mpiprocs=8:ngpus=1 -q gpu_1 -P MECH1234 

Running software with GUIs

The usual ssh-session and interactive PBS sessions do not by default support any graphics. If you need to run a software package with a GUI (many pre-processors, for example), you need a session with graphics capability. Here are some ways of getting this:

  1. Use a VNC session to connect to one of the two visualization nodes, as per the instructions on Remote visualization. Keep in mind that it is possible and practical to also get a VNC session directly on a compute node, without using one of the dedicated visualization nodes. For those unfamiliar with VNC, it is quite similar to the Remote Desktop in Windows.
  2. Use X-forwarding. This is only a realistic option if you are on a fast connection to the CHPC. ssh -X into lengau, then get an interactive PBS session with X-forwarding, as described in the previous section on Interactive PBS sessions. Thanks to the wonders of Mesa and software rendering, quite sophisticated graphics processing may be done this way. Look for the Mesa modules if you need OpenGL-capable software to run in this manner.
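
For example, to see which Mesa modules are currently available (the exact module names may differ; on most module systems the listing is written to stderr, hence the redirect):

 module avail 2>&1 | grep -i mesa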

Windows ssh clients

PuTTY is widely used, and also has an easy-to-use interface for setting up ssh-tunnels. However, MobaXterm also works extremely well, and has a number of additional advantages, such as:

  • Multiple tabs
  • Remembering passwords
  • X-forwarding that (mostly) works
  • Convenient graphical interface for setting up ssh-tunnels
  • A file explorer for transferring files
  • Linux-style mouse-button bindings

There are also other options such as Cygwin, which give you Linux functionality on a Windows system. It is also fairly straightforward to set up the Windows Subsystem for Linux, which allows you to install and run a Linux distribution directly inside Windows. This has some similarities to running a Linux Virtual Machine on a Windows computer. These three methods all provide you with a useful environment to experiment with and test things in Linux, but are definitely overkill if you just need an ssh client.

Many of the CHPC's users work with WinSCP, which offers powerful file transfer and management options. However, we find that its ssh-client is rather cumbersome.

Transferring files to the cluster

Transferring data with Globus

Large amounts of data should be transferred by means of Globus, which provides a GUI for managing your data transfers, although there is also an API which can be used to script file transfers. Globus is based on the GridFTP protocol, and is faster and more robust than scp-based methods.

Transferring data with FileSender

SANReN has a facility for staging fairly large quantities of data. Please take a look at the FileSender web page.

Transferring data with ssh / scp / rsync

Command-line scp and rsync are the usual methods for transferring smaller amounts of data. Remember to transfer files to scp.chpc.ac.za rather than lengau.chpc.ac.za. However, it is easy to make mistakes, and you need to have the path right. MobaXterm (see above) has an easy-to-use "drag & drop" interface. FileZilla is fast, easy to use and runs on Linux, Windows and OSX.
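
For example, from your workstation (jblogs and the paths are placeholders - use your own username and directories):

 scp bigresults.tar.gz jblogs@scp.chpc.ac.za:/mnt/lustre/users/jblogs/
 rsync -av --progress mycase/ jblogs@scp.chpc.ac.za:/mnt/lustre/users/jblogs/mycase/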

Using sshfs

A different option is to use sshfs to mount your lengau directory directly on your workstation. Take a look at these instructions. In summary:

Method 1 (tested successfully on Windows 10)

  1. Install WinFSP
  2. Install SSHFS-Win
  3. Use your Windows file explorer's "Home - Easy Access - Map as Drive" ("Map network drive" in Windows 11) menu sequence to get a panel where you can input your scp.chpc.ac.za login credentials and select a drive letter. You will need to tick "Connect using different credentials".
  4. The above process will mount your home directory. If you want your lustre directory instead, append ..\..\mnt\lustre3p\users to the path.
  5. Please don't use the login ID “joeblogs” …

Method 2 (tested successfully on Windows 11)

  1. Download and install WinFSP
  2. Download and install SSHFS-Win
  3. Download and install SSHFS-Win Manager
  4. Run SSHFS-Win Manager, set up a new connection and map your directory on the cluster to a drive on your workstation.

How to qsub from a compute node

You may have difficulties submitting a PBS job from a compute node, although as the cluster is currently configured, it generally does work. However, it is always possible to have an ssh command in a PBS script, so the obvious solution if you experience difficulties is to ssh to the login node and submit from there. For example, if you wanted to submit another PBS script at the completion of the current one, you could insert a line like this at the end of your first script:

ssh login2 qsub /mnt/lustre/users/jblogs/scripts/PBS_Script1

Submit one script after completion of another

There are several situations where you may only want one job to run after completion of another. It may be to manage the load on a limited software license pool, or the first job may be a pre-processing step for the subsequent one, or the second job could be dependent on the data file written by the first one. One solution is to submit the second job from the PBS script of the first one, as described above. An alternative method is to use the depend option in PBS:

jobid=`qsub runJob01`
jobid=`echo $jobid | awk 'match($0,/[0-9]+/){print substr($0,RSTART,RLENGTH)}'`
qsub -W depend=afterany:$jobid runJob2
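
If the second job should only start when the first one finishes successfully, rather than however it ends, the afterok dependency can be used instead:

 qsub -W depend=afterok:$jobid runJob2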

Using the large queue

Add the following PBS directive to your submit-script:

#PBS -W group_list=largeq

How to run jobs that require very long wall times

It becomes difficult for the scheduler to fit in jobs that require very long wall times. It is instructive to think of the scheduler's task as a game of Tetris. It has to fit in contiguous blocks of varying length (time) and width (number of nodes). Very long narrow blocks are difficult to fit in without wasting a lot of space (computing resources). For these reasons, the CHPC's policies do not permit very long wall times. We prefer users to submit jobs that are effectively parallelized, and can therefore finish more quickly by using more simultaneous computing resources. If you do have a job that requires a very long wall time, use either a secondary qsub from your job script (see the paragraph “How to qsub from a compute node”) or alternatively a dependent qsub (see the paragraph “Submit one script after completion of another”). Both of these methods assume that your code can write restart files. If your code cannot write restart files, you have a serious problem which can be resolved in one of three ways:

  • If it is your own code, implement restart files immediately. What on earth are you trying to achieve by doing multi-day runs without a restart capability?
  • If it is an open-source code, take on the task of implementing a restart capability.
  • If it is a commercial code, inform the code developer that a restart capability is essential.
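
For codes that do write restart files, a minimal sketch of a self-resubmitting job script is given below. Everything in it apart from the PBS directives is a placeholder - my_solver, restart.dat, the job_complete marker file and the script name runMyJob.pbs must be adapted to your own code and workflow:

#!/bin/bash
#PBS -P MECH1234
#PBS -q normal
#PBS -l select=1:ncpus=24:mpiprocs=24
#PBS -l walltime=12:00:00
#PBS -r n
cd $PBS_O_WORKDIR

# Keep track of how many times this script has resubmitted itself
: ${RUN_COUNT:=1}
MAX_RUNS=5

# Resume from the last restart file if one exists, otherwise start from scratch
if [ -f restart.dat ]; then
    my_solver --restart restart.dat
else
    my_solver --new
fi

# Chain the next segment from the login node, unless the run has finished
# (here the solver is assumed to write a job_complete marker file when done)
if [ $RUN_COUNT -lt $MAX_RUNS ] && [ ! -f job_complete ]; then
    ssh login2 "cd $PBS_O_WORKDIR && qsub -v RUN_COUNT=$((RUN_COUNT+1)) runMyJob.pbs"
fi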

In order to improve the efficiency of very long single-node jobs, a new queue was introduced with effect from 13 December 2018. The seriallong queue has a walltime limit of 144 hours, but can only be used with fewer than 13 cores. Because it results in node-sharing, it is mandatory to provide a memory resource request. The relevant lines in a PBS script should look something like this:

#PBS -l select=1:ncpus=8:mpiprocs=8:mem=24gb 
#PBS -q seriallong
#PBS -l walltime=120:00:00

Check health of compute nodes before starting the run

An HPC cluster consists of a very large number of compute nodes, and statistics dictate that larger numbers of components result in more failures. When doing especially large multi-node runs, your chances of encountering a faulty node are significant. Some software will immediately crash and terminate the job, but other software may simply hang. In the case of a faulty node, it is obviously better for the run to terminate immediately and send helpful diagnostics to the system administrators. This can be implemented easily by adding the following lines to your job script, immediately above the actual run command:

module add chpc/healthcheck/0.2
healthcheck -v || exit 1
module del chpc/healthcheck/0.2

The -v option will provide helpful diagnostics, but can be omitted if you want to avoid a substantial amount of additional output. You may also want to check for rogue or zombie processes which may slow down your calculations.

Dealing with zombies

Under certain unusual circumstances, a job can turn into a zombie job, which sits in the queue with “R” status, occupies resources, but does not actually run or produce meaningful output. Zombies resist killing by means of the qdel command, and need to be terminated with extreme prejudice. Use qdel -W force followed by the job number in order to accomplish this. The job will exit with status E, and, true to the tradition of zombies, linger on with this status in the visible queue for a while longer.
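
For example (the job number is a placeholder):

 qdel -W force 1234567.sched01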

Determining the status of your queued jobs

Your job/s may be queued for various reasons. When the cluster is oversubscribed, such as when there are loadshedding cycles and we do not have sufficient generator capacity, a major reason is that a large number of users' jobs are waiting in the queues. However, it is important to be aware that your job/s may be queued because your Research Programme (RP) allocation has run out and your Principal Investigator (PI) needs to provide the six-monthly feedback and/or contact your CHPC support scientist. It is also possible that you have specified a job which cannot ever run. Please check your jobs and queued jobs on the cluster using:

qstat -n1awu my_userid
qstat -f myqueued_jobid1 myqueued_jobid2  |grep comment

For example, for one of your queued jobs you may see something like:

qstat -f 5015940.sched01 |grep comment
comment = Not Running: Server per-project limit reached on resource ncpus

This indicates that your job is queued because your RP allocation has expired or has run out of cpu hours and your PI needs to submit feedback.

Other job comment messages include:

comment = Not Running: Insufficient amount of resource: nodetype
comment = Not Running: Insufficient amount of resource: ncpus
comment = Not Running: Insufficient amount of resource: ngpus (R: 2 A: 1 T:
comment = Not Running: User has reached queue smp running job limit.
comment = Not Running: User has reached queue normal running job limit.
comment = Not Running: User has reached queue serial running job limit.
comment = Not Running: User has reached queue seriallong running job limit.
comment = Not Running: User has reached queue gpu_1 running job limit.

The first 3 messages indicate that there are not enough resources of the particular types. The last 5 messages indicate that the user (you, if these are your own job numbers) has other jobs in the specified queue and has reached the limit on the number of running jobs per user in that queue.

If you see “Insufficient amount of resource” then it is worth checking whether your jobs do correctly request the resources.

qstat -f 4123211.sched01  |grep "List.select"
Resource_List.select = 1:ncpus=24:mpiprocs=24:mem=999gb
qstat -f 4123211.sched01  |grep "queue"
queue = smp

This example indicates a job in the smp queue requesting one node, 24 CPUs/cores, 24 MPI processes, and 999GB of memory. Note that the standard nodes on Lengau have either 64GB or 128GB of memory, so this memory request is inappropriate and the job will never run. In fact, in terms of memory one should specify at most 56gb or 120gb on such nodes, since each node needs some memory for system processes.
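
A corrected request for this example, assuming one of the 128GB nodes, would therefore look like:

#PBS -q smp
#PBS -l select=1:ncpus=24:mpiprocs=24:mem=120gb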

If you are running Materials Studio (Accelrys) jobs, you will likely need to first look up your job number on the cluster. Materials Studio does give you a job identification name which you can search for, for example:

qstat |grep My_MS_jobname
qstat|grep  MS_L2FXD
5046144.sched01   MS_L2FXD         accelrys                 0 Q accelrys  

In this example the job number is therefore 5046144.sched01; thereafter, please follow the procedure above.
