howto:quantum_espresso_checkpointing_restarting

Most codes permit a checkpoint file to be created so that a job can be continued from a point shortly before it failed. Since QE 5.1, restarting from an arbitrary point in the code is no longer supported. Instead, only one checkpoint file can be written, at a user-specified time.

For details, look up restart_mode and max_seconds in the QE input documentation:

https://www.quantum-espresso.org/Doc/INPUT_PW.html#idm60

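Please specify your outdir, the directory where the scratch files will be written. Use the same outdir for the first job and for every restarted job, so that the checkpoint can be found.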

outdir=' '

If your QE job runs on a single node (the *smp* queue), the maximum walltime is 96 hours; for jobs on 2 or more nodes (the *normal* queue), the maximum walltime is 48 hours.

If you expect your job to need less time than the maximum walltime, it is important to request a shorter PBS walltime. Conversely, you may expect that your job will not complete within the 48 or 96 hours. In either case, you can then decide which *max_seconds* value to use in your QE input file. For example:

**1.** You expect your job to complete within about 20 hours, or around 20 hours is a vital point in the calculation from which you want to be able to restart, or you have previously seen your job fail after 20 hours:

Set your PBS job walltime to more than 20 hours (23 hours, for example), and choose max_seconds equivalent to 20 hours.

20 hours = 20 x 60 x 60 = 72 000 seconds

Initial first job:

max_seconds=72000

restart_mode ='from_scratch'

Subsequent jobs (2nd, 3rd etc):

max_seconds=72000

restart_mode ='restart'

**2.** Your job may not complete within the maximum PBS walltime of 48 hours on the *normal* queue, so choose max_seconds equivalent to less than 48 hours, such as 45 hours:

45 hours = 45 x 60 x 60 = 162 000 seconds

Initial first job:

max_seconds=162000

restart_mode ='from_scratch'

Subsequent jobs (2nd, 3rd etc):

max_seconds=162000

restart_mode ='restart'

**3.** Your job may not complete within the maximum PBS walltime of 96 hours on the *smp* queue, so choose max_seconds equivalent to less than 96 hours, such as 93 hours:

93 hours = 93 x 60 x 60 = 334 800 seconds

Initial first job:

max_seconds=334800

restart_mode ='from_scratch'

Subsequent jobs (2nd, 3rd etc):

max_seconds=334800

restart_mode ='restart'
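
The arithmetic in the three cases above can be sketched as a small Python helper. This is only an illustration: the function names `max_seconds_for` and `control_fragment`, the 3-hour default margin, and the example path are assumptions made here, not part of QE or PBS. Only `restart_mode`, `max_seconds` and `outdir` are actual QE input variables, and the queue limits are the ones quoted on this page.

```python
# Maximum PBS walltimes in hours, as stated on this page.
QUEUE_MAX_HOURS = {"smp": 96, "normal": 48}


def max_seconds_for(walltime_hours, margin_hours=3):
    """Return a max_seconds value that stops QE margin_hours before the PBS walltime.

    The 3-hour default margin is an assumption that mirrors the examples above
    (23 h -> 20 h, 48 h -> 45 h, 96 h -> 93 h).
    """
    if margin_hours < 0 or margin_hours >= walltime_hours:
        raise ValueError("margin must be non-negative and shorter than the walltime")
    return int((walltime_hours - margin_hours) * 3600)


def control_fragment(max_seconds, first_job, outdir):
    """Render the restart-related lines for the QE input file (hypothetical helper)."""
    mode = "from_scratch" if first_job else "restart"
    return (
        f"restart_mode = '{mode}'\n"
        f"max_seconds = {max_seconds}\n"
        f"outdir = '{outdir}'\n"
    )


if __name__ == "__main__":
    # '/path/to/your/outdir' is a placeholder; substitute your own scratch directory.
    seconds = max_seconds_for(QUEUE_MAX_HOURS["normal"])
    print(control_fragment(seconds, first_job=True, outdir="/path/to/your/outdir"))
    print(control_fragment(seconds, first_job=False, outdir="/path/to/your/outdir"))
```

The first printed fragment corresponds to the initial job and the second to every subsequent job; only restart_mode differs between them, while max_seconds and outdir stay the same.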

**Please remember to delete any unwanted QE restart files and unwanted temporary QE files! We need to do our best with lustre/lustre3p.**

Last modified: 2021/12/09 16:42 (external edit)