Checkpointing or Restarting a Failed QE Calculation

Most codes permit a checkpoint file to be created so that a job can be continued from a point shortly before it failed. Since QE 5.1, restarting from an arbitrary point in the code is no longer supported. Instead, a single checkpoint file can be written at a user-specified time.

Please see the descriptions of restart_mode and max_seconds in the QE documentation:
https://www.quantum-espresso.org/Doc/INPUT_PW.html#idm60

Please specify your outdir, which is the directory where the scratch files will be written:
outdir=' '

If your QE job runs on one node (the smp queue), the maximum walltime is 96 hours. For jobs on 2 or more nodes (the normal queue), the maximum walltime is 48 hours.

If you believe your job will need less time than the maximum walltime, it is important to request a shorter PBS walltime. You may also believe that your job will not complete within the 48 or 96 hours. In either case, you can then decide what max_seconds value to use in your QE input file. For example:

1. You believe your job will complete within 20 hours, or that around 20 hours is a vital point in the calculation from which you want to be able to restart, or you have seen your job fail previously after 20 hours:

Set your PBS job walltime to more than 20 hours (23 hours, for example), and choose max_seconds equivalent to 20 hours, so that QE stops and writes its checkpoint before PBS ends the job.

20 hours = 20 x 60 x 60 = 72 000 seconds

Initial first job:
max_seconds=72000
restart_mode ='from_scratch'

Subsequent jobs (2nd, 3rd etc):
max_seconds=72000
restart_mode ='restart'
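
The arithmetic and the walltime margin in case 1 can be checked with a short Python sketch. The function names below are illustrative and are not part of QE or PBS. The sketch assumes whole-hour walltimes written in the HH:MM:SS form used by a "#PBS -l walltime=..." directive.

```python
def hours_to_seconds(hours: float) -> int:
    """Convert a time limit in hours to whole seconds for max_seconds."""
    return int(round(hours * 3600))


def pbs_walltime(hours: int) -> str:
    """Format whole hours as HH:MM:SS for a '#PBS -l walltime=...' line."""
    return f"{hours:02d}:00:00"


def check_limits(max_seconds: int, walltime_hours: int) -> None:
    """Raise if max_seconds does not leave any time before the PBS walltime."""
    if max_seconds >= walltime_hours * 3600:
        raise ValueError("max_seconds must be smaller than the PBS walltime")


max_seconds = hours_to_seconds(20)   # checkpoint time used in case 1
check_limits(max_seconds, 23)        # 23 hour PBS walltime used in case 1
print(f"max_seconds={max_seconds}")
print(f"#PBS -l walltime={pbs_walltime(23)}")
```

Keeping the check separate from the conversion makes it easy to reuse with other walltimes.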

2. Your job may not complete within the maximum PBS walltime of 48 hours on the normal queue. In that case, choose max_seconds equivalent to less than 48 hours, such as 45 hours:

45 hours = 45 x 60 x 60 = 162 000 seconds

Initial first job:
max_seconds=162000
restart_mode ='from_scratch'

Subsequent jobs (2nd, 3rd etc):
max_seconds=162000
restart_mode ='restart'
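
The only difference between the first job and the subsequent jobs is the value of restart_mode. A minimal Python sketch that writes both variants of the &CONTROL namelist is given here. The function name and the outdir value './scratch' are hypothetical, and all other settings that a real pw.x input file needs are omitted.

```python
def control_namelist(max_seconds: int, first_job: bool, outdir: str) -> str:
    """Return a &CONTROL namelist fragment for a pw.x input file."""
    # The first job starts from scratch; every later job restarts.
    restart_mode = "from_scratch" if first_job else "restart"
    lines = [
        "&CONTROL",
        f"  restart_mode = '{restart_mode}'",
        f"  max_seconds  = {max_seconds}",
        f"  outdir       = '{outdir}'",
        "/",
    ]
    return "\n".join(lines)


# './scratch' is a placeholder; use your own outdir.
print(control_namelist(162000, first_job=True, outdir="./scratch"))
print(control_namelist(162000, first_job=False, outdir="./scratch"))
```

The same outdir is used for every job in the sketch, on the assumption that a restarted job has to find the files written by the previous job.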


3. Your job may not complete within the maximum PBS walltime of 96 hours on the smp queue. In that case, choose max_seconds equivalent to less than 96 hours, such as 93 hours:

93 hours = 93 x 60 x 60 = 334 800 seconds

Initial first job:
max_seconds=334800
restart_mode ='from_scratch'

Subsequent jobs (2nd, 3rd etc):
max_seconds=334800
restart_mode ='restart'
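
Cases 2 and 3 both leave 3 hours between max_seconds and the maximum walltime of the queue. A small Python sketch of that rule follows. The 3 hour margin is simply the one used in the examples on this page, not a QE or PBS requirement, and the names are illustrative.

```python
# Maximum walltimes in hours, as stated on this page.
QUEUE_MAX_HOURS = {"smp": 96, "normal": 48}


def suggest_max_seconds(queue: str, margin_hours: int = 3) -> int:
    """Return max_seconds that stops QE margin_hours before the queue limit."""
    max_hours = QUEUE_MAX_HOURS[queue]
    return (max_hours - margin_hours) * 3600


for queue in QUEUE_MAX_HOURS:
    print(queue, suggest_max_seconds(queue))
```

A larger margin may be sensible if writing the checkpoint files for your system takes a long time.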

Please remember to delete any unwanted QE restart files and temporary QE files! We all need to do our best to keep lustre/lustre3p free of data that is no longer needed.
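
Before deleting anything, it can help to see what is taking up space in your outdir. The Python sketch below only lists the top-level entries of a directory with their sizes and does not delete any files. The path './scratch' is a placeholder for your own outdir.

```python
from pathlib import Path


def entry_size(path: Path) -> int:
    """Return the size in bytes of a file, or of all files under a directory."""
    if path.is_file():
        return path.stat().st_size
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def report(outdir: str) -> list[tuple[str, int]]:
    """List the top-level entries of outdir, largest first. Nothing is deleted."""
    root = Path(outdir)
    if not root.is_dir():
        return []
    entries = [(p.name, entry_size(p)) for p in root.iterdir()]
    return sorted(entries, key=lambda item: item[1], reverse=True)


# './scratch' is a placeholder; point this at your own outdir.
for name, size in report("./scratch"):
    print(f"{size / 1e6:10.1f} MB  {name}")
```

Deleting is left as a manual step so that restart files you still need are not removed by accident.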

Last modified: 2021/12/09 16:42 (external edit)