If you are using a Jupyter notebook to write your Python scripts, first export the .py file from Jupyter and then copy it onto the cluster.
Also ensure that your job files are copied to /mnt/lustre3p/users/YOURUSERNAME or a subdirectory thereof.
To test a job on a compute node first get onto an interactive node with the following:
qsub -I -P YOURPROGRAMME(E.G. CSCI1234) -q smp -l select=1:ncpus=24
Once on an interactive node (cnodeNNNN) you need to load up the appropriate modules:
module purge
module load chpc/python/3.6.1_gcc-6.3.0
Then cd /mnt/lustre3p/users/YOURUSERNAME or wherever you placed your .py file.
Finally run
python nameofyourfile.py
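Before launching your real job, it can be worth confirming that the interpreter from the loaded module is the one actually being run. A minimal sanity-check script (the name check_env.py is just an example) might look like this:

```python
# check_env.py -- minimal sanity check for the loaded Python module.
import sys
import platform

# Show which interpreter is running and its version.
print("Python executable:", sys.executable)
print("Python version:   ", platform.python_version())

# The chpc/python/3.6.1 module should give at least Python 3.6.
assert sys.version_info >= (3, 6), "unexpectedly old Python; check your loaded modules"
```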
If you import matplotlib in your python script you may end up with the following error:
ModuleNotFoundError: No module named '_tkinter'
If so, add the following to your Python script before resubmitting:
import matplotlib
matplotlib.use('agg')
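The backend must be selected before pyplot is first imported. For example, a complete headless script looks like this (a minimal sketch; the plot and output filename are illustrative):

```python
# Select the non-interactive 'agg' backend BEFORE importing pyplot,
# so no _tkinter / X display is needed on the compute node.
import matplotlib
matplotlib.use('agg')

import matplotlib.pyplot as plt
import os
import tempfile

# Plot something trivial and save it to a file instead of opening a window.
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_title("headless test plot")

out_path = os.path.join(tempfile.gettempdir(), "headless_test_plot.png")
fig.savefig(out_path)
plt.close(fig)
print("wrote", out_path)
```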
As with the CPU version, you can test your Python jobs on an interactive node:
qsub -I -P YOURPROGRAMME(E.G. CSCI1234) -q gpu_1 -l select=1:ncpus=10:ngpus=1
Once on an interactive node (gpuNNNN) you need to load up the appropriate modules:
module purge
module load chpc/python/anaconda/3-2021.05
To test that Python (and Tensorflow) sees the GPUs:
module purge
module load chpc/python/anaconda/3-2021.05
module load chpc/cuda/11.2/PCIe/11.2
python3
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
From the Tensorflow Guide.
The output will look something like this:
2020-11-03 18:17:58.640547: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-11-03 18:17:58.654899: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3b:00.0
2020-11-03 18:17:58.656022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:af:00.0
2020-11-03 18:17:58.657201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:d8:00.0
2020-11-03 18:17:58.658593: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-11-03 18:17:58.690365: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-11-03 18:17:58.718120: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-11-03 18:17:58.762931: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-11-03 18:17:58.795721: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-11-03 18:17:58.824856: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-11-03 18:17:58.870878: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-11-03 18:17:58.877470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2
Num GPUs Available: 3
>>>
Type exit() to quit the interactive Python session and return to the shell.
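On newer TensorFlow releases (2.1 and later) the same check can be written without the experimental namespace; a sketch, assuming the anaconda module provides a 2.x Tensorflow:

```python
import tensorflow as tf

# tf.config.list_physical_devices replaced the older
# tf.config.experimental.list_physical_devices in TensorFlow 2.1+.
gpus = tf.config.list_physical_devices('GPU')
print("Num GPUs Available:", len(gpus))
```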
Now you can try your Python code:
cd /mnt/lustre3p/users/YOURUSERNAME or wherever you placed your .py file.
When running on a single GPU you need to include the following in your .py file to ensure that not all the CPUs on the node are consumed, which would result in your job being killed by the scheduler:
session_conf = tf.ConfigProto(intra_op_parallelism_threads=10, inter_op_parallelism_threads=10)
sess = tf.Session(config=session_conf)
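Note that ConfigProto and Session belong to the TensorFlow 1.x API. If the module you load provides TensorFlow 2.x, the equivalent thread limits are set through tf.config.threading instead; a sketch, assuming the 10 cores requested in the qsub line above:

```python
import tensorflow as tf

# Limit TensorFlow to the 10 CPU cores requested from the scheduler,
# so the job does not oversubscribe the node. These calls must run
# before TensorFlow executes any operations.
tf.config.threading.set_intra_op_parallelism_threads(10)
tf.config.threading.set_inter_op_parallelism_threads(10)
```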
Finally run
python nameofyourfile.py
If you wish to run jobs through the scheduler then there are scripts on the login node that can help you set up a PBS submission script.
Once you are on the login node just type:
module purge
module load chpc/easy_scripts
qtensorflow_cpu (CPU version of Tensorflow)
or
qtensorflow_gpu (GPU version of Tensorflow)
Examples of what is needed when running the above scripts are provided below:
EXAMPLE1
Enter research programme name
CSCI1234
Enter python script name (with .py extension)
test.py
Enter total walltime (hour:minute)
2:00
Enter email address
testing@gmail.com
Generated pbs file for test
Do you wish to submit job to cluster (y/n)
y
EXAMPLE2
Enter research programme name
CSCI1234
Enter python script name (with .py extension)
test.py
Enter total walltime (hour:minute)

Enter email address
testing@gmail.com
Generated pbs file for test
Do you wish to submit job to cluster (y/n)
y
Please take note of the blank answer after the walltime prompt in EXAMPLE2; it corresponds to simply pressing the Enter key.
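For reference, a generated submission script will look roughly like the following. The directives are standard PBS, but the exact contents produced by the easy_scripts helpers may differ, so treat this as an illustrative sketch only:

```shell
#!/bin/bash
# Illustrative PBS script of the kind qtensorflow_cpu generates,
# using the EXAMPLE1 answers above. Paths and module names are
# assumptions; check the file the helper actually writes.
#PBS -P CSCI1234
#PBS -q smp
#PBS -l select=1:ncpus=24
#PBS -l walltime=2:00:00
#PBS -M testing@gmail.com
#PBS -m abe

module purge
module load chpc/python/3.6.1_gcc-6.3.0

# Run from the directory the job was submitted from.
cd ${PBS_O_WORKDIR}
python test.py
```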
First get onto a GPU node:
qsub -I -P YOURPROGRAMME(E.G. CSCI1234) -q gpu_1 -l select=1:ncpus=10:ngpus=1
To update Tensorflow you should use a conda environment:
module purge
module load chpc/python/anaconda/3-2021.05
Create your environment
conda create -n tf-gpu tensorflow-gpu python=3.8
conda activate tf-gpu
The above only needs to be done once; thereafter, all you will need to do is:
module purge
module load chpc/python/anaconda/3-2021.05
conda activate tf-gpu
in order to use your updated version of Tensorflow.