There are several reasons why you would want to request more cores than you need, and they all depend on the hardware limits of the compute nodes.
A standard compute node on Lengau has two chips with 12 cores each, for a total of 24 cores. Each chip is limited by its thermal design power (TDP), which means that a few active cores can run faster (up to 3.5 GHz), but if all cores are running at 100%, each is limited to 2.6 GHz.
If your code is limited by CPU speed (frequency), you will probably want to use only 2 threads per chip to make sure each thread runs at the maximum 3.5 GHz.
However, that gives very little parallelism, so you may find the sweet spot is a few more cores at slightly lower speeds: e.g., 3 threads per chip at 3.3 GHz (6 threads total).
To find out, benchmark: you will have to request a whole single node and then time using 1 thread per chip, 2 per chip, 3, …
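Such a timing sweep could be scripted along the following lines. This is only a sketch under stated assumptions: it assumes the code is threaded with OpenMP and therefore reads the OMP_NUM_THREADS environment variable, and the function names time_run and sweep are made up for illustration. It sets only the total thread count; placing the threads on particular chips is a separate step.

```python
import os
import subprocess
import sys
import time


def time_run(command, threads_total):
    """Run `command` once and return its wall-clock time in seconds.

    Assumption: the program honours OMP_NUM_THREADS (OpenMP codes do).
    """
    env = dict(os.environ, OMP_NUM_THREADS=str(threads_total))
    start = time.perf_counter()
    subprocess.run(command, env=env, check=True)
    return time.perf_counter() - start


def sweep(command, threads_per_chip_values, chips=2):
    """Time `command` for each threads-per-chip value.

    Returns a dict mapping threads per chip to wall-clock seconds.
    """
    return {n: time_run(command, n * chips) for n in threads_per_chip_values}


if __name__ == "__main__":
    # Placeholder workload: replace with your own executable and arguments.
    command = [sys.executable, "-c", "pass"]
    for n, seconds in sweep(command, [1, 2, 3, 4, 6, 12]).items():
        print(f"{n} threads per chip ({2 * n} total): {seconds:.2f} s")
```

Run this inside a job that has the whole node to itself, otherwise other users' work on the same node will distort the timings.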
Each compute node has only one InfiniBand (IB) device, so all I/O to the file system and all communication between compute nodes (MPI) passes through this bottleneck.
If your code is limited by available I/O then you will want to use a whole node but with few threads accessing the IB device.
Each compute node has 126 GiB or 64 GiB of memory (RAM). Codes that create very large data structures may be limited by the amount of memory available per thread. Divide the total RAM by the amount each thread needs to estimate how many threads will fit on one node.
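The division can be written out as a small helper. This is a rough sketch: the function name and the 10 GiB and 2 GiB per-thread figures are hypothetical, and the estimate ignores the memory used by the operating system, so treat the result as an upper bound.

```python
def max_threads_by_memory(node_ram_gib, ram_per_thread_gib, cores_per_node=24):
    """Estimate how many threads fit in a node's RAM, capped at the core count."""
    if ram_per_thread_gib <= 0:
        raise ValueError("ram_per_thread_gib must be positive")
    fits_in_ram = int(node_ram_gib // ram_per_thread_gib)
    return min(cores_per_node, fits_in_ram)


# Hypothetical code needing 10 GiB per thread:
print(max_threads_by_memory(126, 10))  # 12
print(max_threads_by_memory(64, 10))   # 6
# Hypothetical code needing 2 GiB per thread: memory is not the limit,
# so the answer is capped at the 24 cores of a standard node.
print(max_threads_by_memory(126, 2))   # 24
```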
Performance is also affected by where your threads run, i.e., which cores on which chip are running your thread.
The operating system numbers the available CPUs (individual physical cores) as follows:
'PKG' means package i.e. chip; 'CORE' is the core within each chip; 'CPU' is the operating system number.
On a standard compute node we see that all even numbered CPUs are on one chip and all odd numbered CPUs are on the other chip.
To control how your threads are allocated across the chips, select CPUs by their operating system number or pass an affinity setting to your code.
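Given the even/odd numbering described above, the CPU numbers for a chosen layout can be generated as follows. This is an illustrative sketch: the function names are invented here, and "parity 0" simply means the chip holding the even-numbered CPUs. The comma-separated string is in the form accepted by tools that take a CPU list, for example taskset -c; check your own code's documentation for how it expects its affinity setting.

```python
def cpu_list(threads_per_chip, parities=(0, 1), cores_per_chip=12):
    """Return the OS CPU numbers for `threads_per_chip` threads on each chosen chip.

    Assumes the layout of a standard node: even-numbered CPUs on one chip
    (parity 0) and odd-numbered CPUs on the other (parity 1).
    """
    if not 1 <= threads_per_chip <= cores_per_chip:
        raise ValueError("threads_per_chip must be between 1 and cores_per_chip")
    cpus = []
    for parity in parities:
        cpus.extend(parity + 2 * i for i in range(threads_per_chip))
    return sorted(cpus)


def as_cpu_string(cpus):
    """Format CPU numbers as a comma-separated list."""
    return ",".join(str(c) for c in cpus)


print(cpu_list(3))                               # [0, 1, 2, 3, 4, 5]
print(cpu_list(3, parities=(0,)))                # [0, 2, 4]
print(as_cpu_string(cpu_list(2, parities=(1,))))  # 1,3
```

The first call spreads 6 threads over both chips (3 each); the second keeps 3 threads on a single chip, which is the layout to compare against when testing whether placement matters for your code.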
The only way to determine what limits you face is to measure the performance of your code on a representative problem size with different numbers of threads per node.
Once your job starts running you can ssh into the compute node and run dstat to see what is happening. But the output of the time command for the entire code run is what you plot against thread count to see the scaling.
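One possible way to turn those timings into numbers that are easy to plot is to compute the speedup and parallel efficiency relative to the smallest thread count measured. The sketch below assumes you have collected the wall-clock times yourself; the function name is hypothetical and the timings in the example are made-up placeholders, not measurements from Lengau.

```python
def scaling_table(timings):
    """Convert {thread_count: wall_seconds} into scaling figures.

    Returns a list of (threads, seconds, speedup, efficiency) tuples, where
    speedup and efficiency are relative to the smallest thread count given.
    """
    base_threads = min(timings)
    base_time = timings[base_threads]
    rows = []
    for threads in sorted(timings):
        seconds = timings[threads]
        speedup = base_time / seconds
        # Efficiency of 1.0 means perfect scaling from the baseline run.
        efficiency = speedup * base_threads / threads
        rows.append((threads, seconds, speedup, efficiency))
    return rows


# Made-up example timings in seconds, for illustration only.
example = {2: 100.0, 4: 60.0, 8: 50.0}
for threads, seconds, speedup, efficiency in scaling_table(example):
    print(f"{threads:3d} threads  {seconds:7.1f} s  "
          f"speedup {speedup:.2f}  efficiency {efficiency:.2f}")
```

An efficiency that drops sharply as threads are added suggests that one of the limits described above (clock speed, I/O, or memory) has been reached.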