
Using the Intel Xeon Phi nodes at CHPC

Jargon

The Phi cards are known officially as “Intel® Xeon Phi™” and will be referred to simply as “Phi”.

The architecture behind the Phi is called MIC (Many Integrated Core), and many programming tools therefore refer to “mic” in file names and other identifiers.

Introduction

The Intel Phi coprocessor card is a separate compute node in the form of a PCIe card that is installed in a regular compute node in the C8000 cluster of the CHPC. The card is referred to as the “device” node and the node housing the card is the “host” node.

Because it is effectively a compute node within a compute node, there are two ways to run programs on the Intel Phi card:

  • Native execution means that your entire Phi program runs on the card, using the memory of the card only, and accessing other nodes (via MPI calls) over the network interface that operates through the PCIe bus.
  • Offload execution mode has your program run on the host node (using its Intel Xeon CPUs and RAM) while designated parts of it run on the Phi card. Code running on the Phi has access only to the card's RAM, but the Intel compiler makes shared variables available on both sides.

The preferred way to run Intel Phi code at the CHPC is offload.
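
To give an idea of what offload code looks like, here is a minimal sketch in C (a hypothetical file offload_sum.c, not CHPC-supplied) using the Intel compiler's offload directives; names and sizes are illustrative only:

#include <stdio.h>

#define N 1024

int main(void)
{
    double a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* This block runs on the Phi card; the in/out clauses copy the
       statically sized arrays over the PCIe bus before and after it. */
    #pragma offload target(mic) in(a, b) out(c)
    {
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    printf("c[%d] = %f\n", N - 1, c[N - 1]);
    return 0;
}

On the host this is built without the -mmic flag, as described in the next section.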

Intel Compilers

Once logged in, please load the appropriate module to make the Intel compilers, which support compiling for the Xeon Phi, available:

module load intel-XE/c8000

This enables, for example, icc and ifort to produce usable executables.
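
A quick way to check that the module has taken effect is to ask the compilers for their version (a hypothetical session; the exact version string depends on the installation):

which icc
icc -V
ifort -V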

Compiling a C program for native execution is done using

icc -o Hello -O2 hello.c -mmic

and a Fortran program similarly using

ifort -o Hello -O2 hello.f90 -mmic

For offload execution you do not include the -mmic flag.
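
As an illustration (a hypothetical hello.c, not CHPC-supplied), the same source can be built either way; the __MIC__ macro, which the Intel compiler defines when compiling for the coprocessor, reveals where the code is running:

#include <stdio.h>

int main(void)
{
#ifdef __MIC__
    /* Defined when building with -mmic (and inside offload regions) */
    printf("Hello from the Phi coprocessor\n");
#else
    printf("Hello from the host Xeon\n");
#endif
    return 0;
}

Built with the -mmic command above it prints the coprocessor message when run on the card; built without -mmic it prints the host message.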

PBSPro Job Script Examples

Native Execution

This is not recommended; you are better off using offload mode, running your program on the host node and offloading work to the card.

Job script file native.pbs:

#!/bin/bash
#PBS -q intel_mic
#PBS -e OUTPUT/error.txt
#PBS -o OUTPUT/output.txt
#PBS -l walltime=0:02:00
#PBS -l select=1:ncpus=1
#PBS -l place=excl
#PBS -W group_list=mic_user
#PBS -V
 
###############################################################################
# Set executable name here 
export EXECUTABLE=cheby_omp.exe
export PARAMETERS="500000"
 
export OMP_NUM_THREADS=240
export KMP_AFFINITY='granularity=thread,balanced'
 
###############################################################################
 
# Set up environment for using MIC
. /opt/gridware/compilers/intel/icsxe/bin/compilervars.sh intel64
export MIC_LD_LIBRARY_PATH=/opt/intel.1.117/compiler/lib:/opt/intel.1.117/mkl/lib:/opt/papi/lib
 
# Change to current directory and report which node we are on
cd $PBS_O_WORKDIR
pwd
hostname
 
# Use approved identity to copy executable to MIC
scp -i ~/.ssh/mic_id_rsa $EXECUTABLE chpcmic@mic0:
 
# Report starting time
date
 
# Run executable
ssh -i ~/.ssh/mic_id_rsa chpcmic@mic0  "export LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH ; export OMP_NUM_THREADS=$OMP_NUM_THREADS ; export KMP_AFFINITY=$KMP_AFFINITY ; ./$EXECUTABLE $PARAMETERS"
 
# Clean up
ssh -i ~/.ssh/mic_id_rsa chpcmic@mic0 rm $EXECUTABLE
 
# Finish by reporting final time
date
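
Assuming the script lives in the same directory as the executable, the job is submitted in the usual way (a hypothetical session; note that the OUTPUT directory named in the -e and -o directives must already exist):

mkdir -p OUTPUT
qsub native.pbs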

Offload Execution

This is the preferred way to run Phi code: execute on the host and offload the compute-intensive parts to the Phi device.

Job script file offload.pbs:

#!/bin/bash
#PBS -q intel_mic
#PBS -e OUTPUT/error.txt
#PBS -o OUTPUT/output.txt
#PBS -l walltime=0:02:00
#PBS -l select=1:ncpus=1
#PBS -l place=excl
#PBS -W group_list=mic_user
#PBS -V

###############################################################################
# Set executable name here 
export EXECUTABLE=cheby_omp.exe
export PARAMETERS="50000"

export OMP_NUM_THREADS=24
export KMP_AFFINITY='granularity=thread,compact'
export MIC_OMP_NUM_THREADS=236
export MIC_KMP_AFFINITY='granularity=thread,balanced'

###############################################################################

# Set up environment for using MIC
. /opt/gridware/compilers/intel/icsxe/bin/compilervars.sh intel64
export MIC_LD_LIBRARY_PATH=/opt/intel.1.117/compiler/lib:/opt/intel.1.117/mkl/lib:/opt/papi/lib

# Change to current directory and report which node we are on
cd $PBS_O_WORKDIR
pwd
hostname

# Report starting time
date

# Run executable
./$EXECUTABLE $PARAMETERS

# Finish by reporting final time
date

As you can see above, offload execution is much simpler.
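
To confirm that work is actually being offloaded, the Intel offload runtime can print a report for each offload region; adding the line below to the job script before the executable is run (an optional extra, not part of the script above) sends timing and data-transfer information to the job's output:

# 1 = offload timing only, 2 = timing plus the amount of data transferred
export OFFLOAD_REPORT=2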

Warning

The MIC architecture is designed for parallel code with many threads. The Phi cards at the CHPC have 60 cores each, but they are only really effective if your code uses multiple threads per core: at minimum two and up to four threads per core. Unless you have code that performs well with 120 to 240 threads in a shared-memory environment with 8GiB of memory, the Phi card will disappoint.

Cons

  • Only 8GiB of memory
  • Only 1GHz CPU clock
  • A core can issue instructions from a given thread only every second clock cycle, so a single thread effectively runs at 500MHz.
  • You must run at least two threads per core to use every clock cycle.
  • Code must scale well from 120 to 240 threads.

Pros

  • 512-bit wide vector processing unit: can execute z[i] = a[i]*x[i] + y[i] on 8 double-precision elements per clock cycle (see the sketch after this list).
  • 240 threads in one chip. (But you have to give them all enough work.)
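
As a rough illustration of the kind of loop the vector unit is built for (a hypothetical kernel, not CHPC-supplied), the Intel compiler will vectorize the inner loop below at -O2 while OpenMP spreads the iterations over the card's threads:

#include <stdio.h>

#define N (1 << 20)

static double a[N], x[N], y[N], z[N];

int main(void)
{
    int i;

    for (i = 0; i < N; i++) { a[i] = 2.0; x[i] = i; y[i] = 1.0; }

    /* Each 512-bit vector register holds 8 doubles, so the compiler can
       process 8 iterations of this loop per vector instruction. */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        z[i] = a[i] * x[i] + y[i];

    printf("z[%d] = %f\n", N - 1, z[N - 1]);
    return 0;
}

Built natively with icc -o Vec -O2 -openmp -mmic vec.c (use -qopenmp on newer compiler versions), or placed inside an offload region as in the earlier sketch.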