  
This guide is intended for experienced HPC users and provides a summary of the essential components of the systems available at the CHPC.  For more detailed information on the subjects below see the full [[guide:start|User Guide]].
  
//docti cave//
  
  
===== New 3 PB Lustre storage system =====

With effect from 1 May 2021, a new 3 PB Lustre storage system has been implemented.

Please read: **[[howto:datamigration|Moving to the new 3 PB Lustre storage]]**

:!: The old 4 PB Lustre storage system **will be //decommissioned// on 1 June 2021**

====New directory path:====

The 3 PB Lustre is mounted on ''/mnt/lustre3p'' and, once you have [[howto:datamigration|copied over your files]], you will need to change the paths in your job scripts.

Search and replace your script files and change ''/mnt/lustre/'' to ''/mnt/lustre3p/''.

We will be updating all examples in this wiki to the new path; however, you may still come across an old example with the outdated path. To make such an example work, simply substitute ''/mnt/lustre3p/'' for ''/mnt/lustre/''.
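If you have many job scripts to update, a recursive search-and-replace can save some effort. The following is only a sketch: it assumes your scripts live under a directory called ''~/scripts'' (adjust this to your own layout) and that you have backups, since ''sed -i'' edits files in place:

<code>
# Find every file under ~/scripts that mentions the old path,
# then replace the old Lustre path with the new one in place.
grep -rl '/mnt/lustre/' ~/scripts | xargs sed -i 's|/mnt/lustre/|/mnt/lustre3p/|g'
</code>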
  
=====Overview: 32 832 cores=====
  
  
  * Maximum available memory on each type of node: ''mem=124gb'' (regular) or ''mem=61gb'' (regular with only 64GiB), and ''mem=1007gb'' (fat).
  
  
  
Once you have logged in, give some consideration to how you will be using your session on the login node.  If you are going to spend a long time logged in, doing a variety of tasks, it is best to get yourself [[http://wiki.chpc.ac.za/quick:start#example_interactive_job_request|an interactive PBS session]] to work in.  This way, if you need to do something demanding, it will not conflict with other users logged into the login node.
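As a quick illustration, an interactive session is requested with ''qsub -I''. This is only a sketch: the project code ''PRJT1234'' is a placeholder, and the exact select statement you need is given in the interactive job example linked above:

<code>
# Hypothetical interactive request: one full node on the smp queue for 2 hours
# (PRJT1234 is a placeholder project code -- substitute your own)
qsub -I -P PRJT1234 -q smp -l select=1:ncpus=24:mpiprocs=24 -l walltime=2:00:00
</code>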

====Trouble Logging in?====
Many users have their login blocked at some point. Usually this is because an incorrect password was entered more times than permitted (5 times). This restriction was put in place to prevent brute-force attacks by malicious individuals who want to gain access to your account.

  * If you cannot log in, the first step is to make sure that you typed your username, hostname (lengau.chpc.ac.za or scp.chpc.ac.za) and password correctly. It sounds obvious, but this is often the problem. It happens to CHPC staff too...
  * Next, check that you are not experiencing a network problem. If you see a message along the lines of "cannot resolve hostname", then your network is probably at fault (assuming that your spelling is correct).
  * If your network connection is fine, wait 30 minutes before attempting to log in again. After this period, the block should be removed automatically.
  * If this does not work, go to your user page on users.chpc.ac.za. A link on the left of that page allows you to change your password and also edit other details for your entry on our user database (email addresses, qualifications, institution, etc.). **Be sure that your password conforms to all requirements.**
  * If even changing the password does not help, please contact our helpdesk and ask for our assistance.

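If you are unsure whether the problem is the network, the hostname, or your credentials, a verbose SSH connection attempt (standard OpenSSH, run from your own workstation) can help narrow it down:

<code>
# Verbose output shows whether DNS resolution, the TCP connection,
# or the authentication step is the part that fails
ssh -v yourusername@lengau.chpc.ac.za
</code>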
  
==== Transferring Data ====
From the command line on your Linux workstation:
<code>
scp filetocopy.tar.gz yourusername@scp.chpc.ac.za:/mnt/lustre3p/users/yourusername/run15/
</code>
transfers the file //filetocopy.tar.gz// from your disk on your computer to
the Lustre file system on the CHPC cluster, under the //run15///
subdirectory of your scratch directory ///mnt/lustre3p/users/yourusername///
(where //yourusername// is replaced by your user name on the CHPC cluster).

=== Downloading files from other servers ===
You may need to download data from a server at another site.  Do not do this on **//login2//**!  Use **//scp.chpc.ac.za//** for this purpose.  The easiest way of doing this is with the **wget** command:

<code>
wget http://someserver.someuni.ac.za/pub/somefile.tgz
</code>

Very large files may be transferred more quickly by using a multi-threaded downloader. The easiest of these is **axel**; see [[https://github.com/axel-download-accelerator/axel|axel's GitHub page]].  The syntax is very simple:

<code>
module load chpc/compmech/axel/2.17.6
axel -n 4 -a http://someserver.someuni.ac.za/pub/somefile.tgz
</code>

  
[[guide:connect|Read more on connecting to the CHPC...]]
  
^ Mount point  ^  File System ^  Size ^  Quota ^  Backup ^ Access  ^
| ''/home''  | NFS  | 80 TB  | **15 GB**  | NO <sup>[1]</sup>  | Yes  |
| ''/mnt/lustre3p/users''  | Lustre  | 4 PB  | none :!:  | NO  | Yes  |
| ''/apps''  | NFS  | 20 TB  | none  | Yes  | On request  |
| ''/mnt/lustre3p/groups''  | Lustre  | 1 PB  | 1 TB <sup>[2]</sup>  | NO  | On request only  |

**Note 1:** Unfortunately, at the moment the CHPC cannot guarantee any backup of the ''/home'' file system owing to hardware limitations.

:!: **IMPORTANT NOTE:** Files older than 90 days on ''/mnt/lustre3p/users'' will be automatically deleted without any warning or advance notice.

**Note 2:** Access to ''/mnt/lustre3p/groups'' is by application only, and a quota will be assigned to the programme, to be shared by all members of that group.

It is essential that all files written by your job script are on Lustre: apart from causing scheduler errors, writing to your home directory costs performance, because home is on NFS, which is not a parallel file system. For the same reason it is also recommended that all files your job scripts read, especially large files or files read more than once, are on Lustre.
  
It is usually okay to keep binaries and libraries on home, since they are read once and loaded into RAM when your executable launches. But you may notice improved performance if they are also on Lustre.
====Quotas====
  
The ''/home'' file system is managed by quotas and a strict limit of 15 GB (15 000 000 000 bytes) is applied to it.  Please take care to not fill up your home directory.  Use ''/mnt/lustre3p/users/yourusername'' to store large files.  If your project requires access to large files over a long duration (more than 60 days) then please submit a request to helpdesk.
  
You can see how much you are currently using with the ''du'' command:
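(A minimal illustration using standard GNU coreutils options: ''-s'' summarises each argument and ''-h'' prints human-readable sizes.)

<code>
# Total disk usage of your home directory
du -sh ~
# Largest items directly under your home directory, sorted by size
du -sh ~/* | sort -h
</code>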
Make sure that all jobs use a working directory on the Lustre file system.  Do not use your home directory for the working directory of your job.  Use the directory allocated to you on the fast Lustre parallel file system:
<code>
/mnt/lustre3p/users/USERNAME/
</code>
where ''USERNAME'' is replaced by //your// user name on the CHPC cluster.
^ Queue Name  ^ Max. cores  ^ Min. cores  ^  Max. jobs  ^^  Max. time  ^  Notes  ^ Access  ^
^ :::  ^  per job  ^^  in queue  ^  running  ^  hrs  ^ :::  ^ :::  ^
| serial  |  23 |  1 |  24 |  10 |  48 | For single-node non-parallel jobs.  |  |
| seriallong  |  12 |  1 |  24 |  10 |  144 | For very long sub 1-node jobs.  |  |
| smp  |  24 |  24 |  20 |  10 |  96 | For single-node parallel jobs.  |  |
^ normal  ^  240 ^  25 ^  20 ^  10 ^  48 ^ The standard queue for parallel jobs ^  ^
| large  |  2400 |  264 |  10 |  5 |  48 | For large parallel runs  | //Restricted//  |
| express  |  2400 |  25 |  N/A |  100 total nodes |  96 | For paid commercial use only  | //Restricted//  |
| vis  |  12 |  1 |  1 |  1 |  3 | Visualisation node  |  |
| test  |  24 |  1 |  1 |  1 |  3 | Normal nodes, for testing only  |  |
| gpu_1  |  10 |  1 |    |  2 |  12 | Up to 10 cpus, 1 GPU  |  |
| gpu_2  |  20 |  1 |    |  2 |  12 | Up to 20 cpus, 2 GPUs  |  |
| gpu_3  |  36 |  1 |    |  2 |  12 | Up to 36 cpus, 3 GPUs  |  |
| gpu_4  |  40 |  1 |    |  2 |  12 | Up to 40 cpus, 4 GPUs  |  |
| gpu_long  |  20 |  1 |    |  1 |  24 | Up to 20 cpus, 1 or 2 GPUs  | //Restricted//  |

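As an illustration of how a GPU queue is typically addressed from a job script, here is a minimal sketch. The ''ngpus'' resource and the project code ''PRJT1234'' are assumptions based on standard PBS Pro usage, not confirmed CHPC syntax; check the GPU section of the [[guide:start|User Guide]] for the exact select statement required on Lengau:

<code>
#!/bin/bash
# Hypothetical request for the gpu_1 queue: 10 CPU cores and 1 GPU for 12 hours
#PBS -q gpu_1
#PBS -P PRJT1234
#PBS -l select=1:ncpus=10:ngpus=1
#PBS -l walltime=12:00:00
cd $PBS_O_WORKDIR
# Confirm that a GPU is visible to the job
nvidia-smi
</code>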
  
===Notes:===
#PBS -q smp
#PBS -l walltime=4:00:00
#PBS -o /mnt/lustre3p/users/USERNAME/OMP_test/test1.out
#PBS -e /mnt/lustre3p/users/USERNAME/OMP_test/test1.err
#PBS -m abe
#PBS -WMail_Users=youremail@address
ulimit -s unlimited

cd /mnt/lustre3p/users/USERNAME/OMP_test
nproc=`cat $PBS_NODEFILE | wc -l`
echo nproc is $nproc
#PBS -q normal
#PBS -l walltime=4:00:00
#PBS -o /mnt/lustre3p/users/USERNAME/WRF_Tests/WRFV3/run2km_100/wrf.out
#PBS -e /mnt/lustre3p/users/USERNAME/WRF_Tests/WRFV3/run2km_100/wrf.err
#PBS -m abe
#PBS -WMail_Users=youremail@address
ulimit -s unlimited
. /apps/chpc/earth/WRF-3.7-impi/setWRF
cd /mnt/lustre3p/users/USERNAME/WRF_Tests/WRFV3/run2km_100
rm wrfout* rsl*
nproc=`cat $PBS_NODEFILE | wc -l`
Note that in the above job script example the working directory is on the Lustre file system.  Do not use your home directory for the working directory of your job.  Use the directory allocated to you on the fast Lustre parallel file system:
<code>
/mnt/lustre3p/users/USERNAME/
</code>
where ''USERNAME'' is replaced by //your// user name on the CHPC cluster.