
Getting an Idea of Job Performance

First find out which nodes your running job is using:

qstat -awu <userid> -n1

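Here <userid> is your user ID on Lengau. If your job uses many nodes, the following form is more legible, because the node list is printed on its own lines below the job instead of being appended to the job line: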

qstat -awu <userid> -n

Then log in (“ssh” command) to the first node and, unless this is a single-node smp job, to at least one or two of the other nodes. On each node run the command “htop” (or “top”) to see how CPU, memory and other resources are being used on that node.

If, for example, only the first node shows any real usage, there is a potential problem: the code is running on one node only and the other nodes are being wasted. You need to investigate whether the code can parallelise across more than one node. Also check how many of the 24 cores on each node (normal queue) are actually in use, and whether that matches what you expect. Finally, look through the rest of the information htop provides to see whether anything looks suspicious. If you need advice, please submit a ticket to our Help Desk.
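As a rough complement to looking at htop by eye, the check described above can be sketched as a small shell helper. This is only a sketch under stated assumptions: the function names load_status and check_node are hypothetical, the 1-minute load average from /proc/loadavg is used as a crude stand-in for "cores in use", and the 50% and 110% thresholds are arbitrary illustrative values, not site policy.

```shell
# Hypothetical helper: compare a node's 1-minute load average with its core count.
# Usage: load_status <load> <cores>
# The thresholds below are illustrative assumptions only.
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        pct = 100 * load / cores
        if (pct < 50)       status = "underused"
        else if (pct > 110) status = "oversubscribed"
        else                status = "ok"
        printf "%.0f%% %s\n", pct, status
    }'
}

# Hypothetical helper: read the load average of one node over ssh.
# Usage: check_node <node> [cores]   (cores defaults to 24, as in the normal queue)
# You can normally only ssh to a node while your job is running on it.
check_node() {
    node_load=$(ssh "$1" cat /proc/loadavg | cut -d' ' -f1)
    echo "$1: $(load_status "$node_load" "${2:-24}")"
}
```

For example, running check_node cnode0263 from the login node would print the node name followed by a percentage and one of the three labels. A node reported as "underused" is a prompt to log in and look at htop properly; the load average alone cannot tell you why the cores are idle.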

An example (htop info not shown):

[alopis@login2 ~]$ qstat -awu alopis -n1
sched01:
Job ID           Username  Queue  Jobname  SessID  NDS  TSK  Memory  Time   S  Time
3672969.sched01  alopis    smp    3P_13    108699  1    24   --      00:30  R  00:10:04 cnode0263/0*24
[alopis@login2 ~]$ ssh cnode0263
Warning: Permanently added 'cnode0263,172.18.1.206' (ECDSA) to the list of known hosts.
[alopis@cnode0263 ~]$ htop
[alopis@cnode0263 ~]$ exit
logout
Connection to cnode0263 closed.
[alopis@login2 ~]$
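The last field of the qstat line in the example (cnode0263/0*24) names the node the job is running on. A minimal sketch for turning such a field into a plain list of node names is shown below. It assumes the usual PBS layout of node/index*cores entries joined by "+" for multi-node jobs; the function name nodes_from_exec_host and the second node name in the usage note are hypothetical.

```shell
# Hypothetical helper: list the unique node names in a PBS exec host string.
# Usage: nodes_from_exec_host 'cnode0263/0*24'
# Assumes entries look like node/index*cores and are joined by "+".
nodes_from_exec_host() {
    printf '%s\n' "$1" |
        tr '+' '\n' |      # one entry per line
        cut -d/ -f1 |      # keep the part before the "/"
        sort -u            # drop duplicate node names
}
```

Called with 'cnode0263/0*24' it prints cnode0263; called with a two-node string such as 'cnode0263/0*24+cnode0264/0*24' it prints one node name per line, which can then be fed to a loop that runs ssh against each node in turn.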
/app/dokuwiki/data/pages/jobperformance/start.txt · Last modified: 2021/12/09 16:42 (external edit)