TensorFlow

If you write your Python scripts in a Jupyter notebook, first export the notebook as a .py file from Jupyter, then copy that file onto the cluster.
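
The usual way to export is Jupyter's own converter, e.g. `jupyter nbconvert --to script mynotebook.ipynb`. To illustrate what the export amounts to, here is a minimal sketch that pulls the code cells out of a notebook using only the Python standard library. It assumes a standard nbformat 4 notebook; the function name `notebook_to_script` and the file names are hypothetical, and the handling of magics is a simple heuristic rather than what nbconvert does.

```python
import json
from pathlib import Path


def notebook_to_script(ipynb_path, py_path):
    """Write the code cells of a notebook to a plain .py file.

    Assumes nbformat 4, where the notebook is JSON with a top-level
    "cells" list. Returns the number of code cells written.
    """
    notebook = json.loads(Path(ipynb_path).read_text(encoding="utf-8"))
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # "source" may be stored as one string or as a list of lines
        text = source if isinstance(source, str) else "".join(source)
        # Heuristic: comment out IPython magics (%...) and shell escapes
        # (!...), which are not valid in a plain Python script
        lines = [
            "# " + line if line.lstrip().startswith(("%", "!")) else line
            for line in text.splitlines()
        ]
        chunks.append("\n".join(lines))
    Path(py_path).write_text("\n\n".join(chunks) + "\n", encoding="utf-8")
    return len(chunks)


if __name__ == "__main__":
    # Hypothetical file names
    print(notebook_to_script("mynotebook.ipynb", "mynotebook.py"), "code cells written")
```

Check the resulting .py file before copying it over: notebook-only features such as inline plots will not work in a batch job.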

Also make sure the job files are copied to /mnt/lustre3p/users/YOURUSERNAME or a subdirectory of it.

Running TensorFlow on CPU nodes

To test a job on a compute node, first get onto an interactive node with the following, replacing YOURPROGRAMME with your research programme code (e.g. CSCI1234):

   qsub -I -P YOURPROGRAMME -q smp -l select=1:ncpus=24

Once on an interactive node (cnodeNNNN) you need to load up the appropriate modules:

   module purge
   module load chpc/python/3.6.1_gcc-6.3.0

Then cd to /mnt/lustre3p/users/YOURUSERNAME, or wherever you placed your .py file.

Finally, run:

   python nameofyourfile.py

If your Python script imports matplotlib, you may get the following error:

   ModuleNotFoundError: No module named '_tkinter'

If so, add the following to your Python script, before it imports matplotlib.pyplot, and run it again:

   import matplotlib
   matplotlib.use('agg')
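
The agg backend renders figures to files rather than to a window, which suits compute nodes that have no display. The sketch below shows the backend selection together with saving a figure to disk. The function name `save_loss_plot`, the sample data and the output file name are hypothetical.

```python
def save_loss_plot(losses, out_path):
    """Plot a list of loss values and save the figure to out_path."""
    # Select the backend before pyplot is imported
    import matplotlib
    matplotlib.use("agg")
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(range(1, len(losses) + 1), losses)
    ax.set_xlabel("epoch")
    ax.set_ylabel("loss")
    fig.savefig(out_path)  # write to disk; there is no display for plt.show()
    plt.close(fig)  # release the figure's memory
    return out_path


if __name__ == "__main__":
    # Hypothetical values, for illustration only
    save_loss_plot([1.0, 0.6, 0.4, 0.3], "loss.png")
```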

Running TensorFlow on GPU nodes

As with the CPU version, you can test your Python jobs on an interactive node (replace YOURPROGRAMME with your research programme code, e.g. CSCI1234):

   qsub -I -P YOURPROGRAMME -q gpu_1 -l select=1:ncpus=10:ngpus=1

Once on an interactive node (gpuNNNN) you need to load up appropriate modules:

   module purge
   module load chpc/python/anaconda/3-2021.05

To test that Python (and TensorFlow) sees the GPUs, load the CUDA module as well and start an interactive Python session:

   module purge
   module load chpc/python/anaconda/3-2021.05
   module load chpc/cuda/11.2/PCIe/11.2
   python3

Then, at the Python prompt, enter the following (taken from the TensorFlow Guide):

   import tensorflow as tf
   print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

The output will look something like this:

2020-11-03 18:17:58.640547: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-11-03 18:17:58.654899: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3b:00.0
2020-11-03 18:17:58.656022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:af:00.0
2020-11-03 18:17:58.657201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:d8:00.0
2020-11-03 18:17:58.658593: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-11-03 18:17:58.690365: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-11-03 18:17:58.718120: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-11-03 18:17:58.762931: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-11-03 18:17:58.795721: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-11-03 18:17:58.824856: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-11-03 18:17:58.870878: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-11-03 18:17:58.877470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2
Num GPUs Available:  3
>>> 
Type exit() to quit the interactive Python session and return to the shell.
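
The same check can be saved as a small script so that it can be rerun without typing at the Python prompt. This is a minimal sketch: the function name `count_visible_gpus` and the file name `check_gpus.py` are hypothetical, and the script reports a failed import instead of raising, on the assumption that a missing module load is the most likely cause.

```python
def count_visible_gpus():
    """Return the number of GPUs TensorFlow can see, or None if
    TensorFlow is not installed in the current environment."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Same call as in the interactive check above
    return len(tf.config.experimental.list_physical_devices("GPU"))


if __name__ == "__main__":
    n = count_visible_gpus()
    if n is None:
        print("TensorFlow could not be imported; check the loaded modules")
    else:
        print("Num GPUs Available:", n)
```

After loading the modules, run it with `python3 check_gpus.py`.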

Now you can try your Python code. First cd to /mnt/lustre3p/users/YOURUSERNAME, or wherever you placed your .py file:

   cd /mnt/lustre3p/users/YOURUSERNAME

When running on a single GPU, include the following in your .py file so that your job does not consume all the CPUs on the node, which would result in the scheduler killing the job:

    session_conf = tf.ConfigProto(intra_op_parallelism_threads=10,inter_op_parallelism_threads=10)
    sess = tf.Session(config=session_conf) 
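
Note that tf.ConfigProto and tf.Session belong to the TensorFlow 1.x API; in TensorFlow 2.x they are only available under tf.compat.v1. If your environment provides TensorFlow 2.x, the same limits can be set through tf.config.threading. The following is a minimal sketch under that assumption. The function name `limit_tf_threads` is hypothetical, and the value 10 is chosen to match ncpus=10 in the qsub request.

```python
def limit_tf_threads(num_threads=10):
    """Cap TensorFlow's CPU thread pools; returns False if TensorFlow
    is not installed. Call this before running any TensorFlow op."""
    try:
        import tensorflow as tf
    except ImportError:
        return False
    # Threads used inside a single op (e.g. one large matrix multiply)
    tf.config.threading.set_intra_op_parallelism_threads(num_threads)
    # Threads used to run independent ops concurrently
    tf.config.threading.set_inter_op_parallelism_threads(num_threads)
    return True


if __name__ == "__main__":
    # 10 matches ncpus=10 in the qsub request above
    limit_tf_threads(10)
```

These settings have to be applied at the top of the script, because TensorFlow does not allow them to be changed once its runtime has been initialised.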

Finally, run:

   python nameofyourfile.py

If you wish to run jobs through the scheduler, there are scripts on the login node that can help you set up a PBS submission script.

Once you are on the login node, just type:

   module purge
   module load chpc/easy_scripts

followed by either

   qtensorflow_cpu

for the CPU version of TensorFlow, or

   qtensorflow_gpu

for the GPU version.

Examples of the prompts, and of the answers you need to give when running these scripts, are shown below:

          EXAMPLE1
 Enter research programme name
 CSCI1234
 Enter python script name (with .py extension)
 test.py
 Enter total walltime (hour:minute)
 2:00
 Enter email address
 testing@gmail.com
 Generated pbs file for test
 Do you wish to submit job to cluster (y/n)
 y
          EXAMPLE2
 Enter research programme name
 CSCI1234
 Enter python script name (with .py extension)
 test.py
 Enter total walltime (hour:minute)
 
 Enter email address
 testing@gmail.com
 Generated pbs file for test
 Do you wish to submit job to cluster (y/n)
 y

Note the empty line after the walltime prompt in EXAMPLE2. It corresponds to pressing the Enter key without typing a value.
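
For orientation, the sketch below shows how answers like those in the examples could map onto a PBS submission script for a single-GPU job. It is not the actual output of qtensorflow_gpu: the function name `render_pbs_script`, the layout of the script, the mail options and the handling of an empty walltime are all assumptions. The queue, resource request and module names are the ones used in the interactive GPU section above.

```python
def render_pbs_script(programme, script, walltime, email):
    """Build the text of a PBS submission script for a single-GPU job.

    walltime is given as "hour:minute", as in the prompts above; an empty
    string leaves the walltime line out (assumed behaviour, so that the
    queue default applies).
    """
    lines = [
        "#!/bin/bash",
        f"#PBS -P {programme}",
        "#PBS -q gpu_1",
        "#PBS -l select=1:ncpus=10:ngpus=1",
    ]
    if walltime:
        hours, minutes = walltime.split(":")
        # PBS expects HH:MM:SS
        lines.append(f"#PBS -l walltime={int(hours):02d}:{int(minutes):02d}:00")
    lines += [
        f"#PBS -M {email}",
        "#PBS -m abe",  # send mail on abort, begin and end
        "",
        "module purge",
        "module load chpc/python/anaconda/3-2021.05",
        "module load chpc/cuda/11.2/PCIe/11.2",
        "cd $PBS_O_WORKDIR",  # directory the job was submitted from
        f"python {script}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_pbs_script("CSCI1234", "test.py", "2:00", "testing@gmail.com"))
```

A script of this shape would be submitted with qsub from the directory that contains the .py file.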

Updating TensorFlow on GPU nodes

First get onto a GPU node, replacing YOURPROGRAMME with your research programme code (e.g. CSCI1234):

  qsub -I -P YOURPROGRAMME -q gpu_1 -l select=1:ncpus=10:ngpus=1

To update TensorFlow, you should use a conda environment:

  module purge
  module load chpc/python/anaconda/3-2021.05

Create your environment

  conda create -n tf-gpu tensorflow-gpu python=3.8
  conda activate tf-gpu

The above only needs to be done once. Thereafter, all you need to do is:

  module purge 
  module load chpc/python/anaconda/3-2021.05    
  conda activate tf-gpu

in order to use your updated version of TensorFlow.
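
To confirm that the shell is really using the conda environment and its TensorFlow, a short check such as the following can help. It is a minimal sketch: the function name `describe_environment` is hypothetical, it needs Python 3.8 or later for importlib.metadata (the environment created above uses 3.8), and it only reads the installed package metadata without importing TensorFlow.

```python
import os
import sys
from importlib import metadata


def describe_environment():
    """Report which Python and which TensorFlow the current shell uses."""
    tf_version = None
    # The distribution may be registered under either name
    for dist in ("tensorflow", "tensorflow-gpu"):
        try:
            tf_version = metadata.version(dist)
            break
        except metadata.PackageNotFoundError:
            continue
    return {
        "python": sys.executable,
        # conda sets CONDA_DEFAULT_ENV when an environment is activated
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "tensorflow": tf_version,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If conda_env does not show tf-gpu, the environment has not been activated in the current shell.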

/app/dokuwiki/data/pages/guide/tensorflow.txt · Last modified: 2021/12/09 16:42 (external edit)