Mellanox has recently released a driver that supports SR-IOV on its ConnectX-3 HCA family, which means it is now possible to use InfiniBand verbs from inside a VM. I developed an OpenNebula VMM driver for SR-IOV to support these InfiniBand HCAs and any other SR-IOV network device.
1. Extract kvm-sriov.tar.gz to /var/lib/one/remotes/vmm/kvm-sriov
2. Edit /etc/one/oned.conf and add:
VM_MAD = [
    name       = "kvm_sriov",
    executable = "one_vmm_exec",
    arguments  = "-t 15 -r 0 kvm-sriov",
    default    = "vmm_exec/vmm_exec_kvm.conf",
    type       = "kvm" ]
3. Use this guide to find the bus, slot and function addresses of the virtual functions: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Virtualization_Host_Configuration_and_Guest_Installation_Guide/sect-Virtualization_Host_Configuration_and_Guest_Installation_Guide-SR_IOV-How_SR_IOV_Libvirt_Works.html
4. In the /var/lib/one/remotes/vmm/kvm-sriov/vf_maps directory, create a file named after the VFs' root device, e.g. “ib0”. In this file, write the bus, slot and function address of each VF you want the driver to use. Each address must be in hexadecimal. Each line in the file represents one VF, and its bus, slot and function addresses are separated by a single space character. There must be no additional text, spaces or lines in the file.
Example with four virtual functions:
/var/lib/one/remotes/vmm/kvm-sriov/vf_maps/ib0:
0x07 0x00 0x01
0x07 0x00 0x02
0x07 0x00 0x03
0x07 0x00 0x04
5. Edit /var/lib/one/remotes/vmm/kvm-sriov/kvmrc and set “DUMMY_BRIDGE” and “DUMMY_MAC_PREFIX”. “DUMMY_MAC_PREFIX” must be different from the MAC prefix that OpenNebula generates. The contextualisation scripts use this prefix to tell SR-IOV interfaces apart from libvirt network interfaces.
6. On the VM hosts that use this driver, edit /etc/libvirt/qemu.conf, change the user and group to “root”, then restart libvirtd.
7. On the head node, restart OpenNebula.
8. Create a virtual network in OpenNebula. The “Bridge” field must contain “sriov_” prepended to the name of the file that contains the VF mappings, e.g. “sriov_ib0”.
9. Update the VMs' contextualisation scripts with the modified version provided here.
10. Restart OpenNebula.
11. Create hosts that use the “kvm_sriov” driver.
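Steps 3 and 4 can be partly scripted. The sketch below assumes that the VF addresses come in the DOMAIN:BUS:SLOT.FUNCTION form printed by `lspci -D` (e.g. 0000:07:00.1). The helper names `pci_to_vf_map_line` and `check_vf_map` are hypothetical and are not part of the kvm-sriov driver. The first helper turns one address into a vf_maps line. The second rejects a file containing anything other than one "bus slot function" entry per line.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for steps 3 and 4; they are not part of kvm-sriov.

# pci_to_vf_map_line ADDRESS
# Converts a PCI address such as 0000:07:00.1 (the DOMAIN: part is optional)
# into the "bus slot function" hexadecimal form used in vf_maps files.
pci_to_vf_map_line() {
    local re='^([0-9a-fA-F]{4}:)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$'
    if [[ ! $1 =~ $re ]]; then
        echo "invalid PCI address: $1" >&2
        return 1
    fi
    printf '0x%02x 0x%02x 0x%02x\n' \
        "0x${BASH_REMATCH[2]}" "0x${BASH_REMATCH[3]}" "0x${BASH_REMATCH[4]}"
}

# check_vf_map FILE
# Succeeds only if FILE holds at least one entry and every line is exactly
# "0xBB 0xSS 0xFF" with single spaces and no other text or blank lines.
check_vf_map() {
    local file=$1 line n=0
    local re='^0x[0-9a-fA-F]{2} 0x[0-9a-fA-F]{2} 0x[0-9a-fA-F]{2}$'
    while IFS= read -r line || [[ -n $line ]]; do
        n=$((n + 1))
        if [[ ! $line =~ $re ]]; then
            echo "$file:$n: malformed entry '$line'" >&2
            return 1
        fi
    done < "$file"
    ((n > 0))
}
```

The following would produce and then verify the example file above:

    for a in 0000:07:00.1 0000:07:00.2 0000:07:00.3 0000:07:00.4; do pci_to_vf_map_line "$a"; done > /var/lib/one/remotes/vmm/kvm-sriov/vf_maps/ib0
    check_vf_map /var/lib/one/remotes/vmm/kvm-sriov/vf_maps/ib0

The PCI domain 0000 in these addresses is an assumption.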
The “save” command writes the state of the VM's memory to the host's disk. This operation fails when a VM contains an SR-IOV device, because the device is passed through the hypervisor directly to the VM rather than emulated by it. As a result, the device's memory state cannot be read and the save operation fails.
Libvirt does not provide functionality to hot plug SR-IOV devices, so hot plugging cannot be implemented in this driver.
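Both limitations come from the passthrough device, so a save or attach script could detect the case and fail early with a clear message instead of letting libvirt fail. This is a hypothetical sketch and not code from the driver:

- `has_passthrough_device` reads libvirt domain XML (for example the output of `virsh dumpxml`) on stdin. It reports whether the XML contains a `<hostdev>` element or an `<interface type='hostdev'>`.
- `is_sriov_bridge` checks for the “sriov_” bridge-name convention from step 8.

```bash
#!/usr/bin/env bash
# Hypothetical guards; the function names are illustrative, not the driver's.

# Succeeds if the libvirt domain XML on stdin contains a PCI passthrough
# device, either as a <hostdev> element or as an <interface type='hostdev'>.
has_passthrough_device() {
    grep -q -e '<hostdev' -e "<interface type=['\"]hostdev['\"]"
}

# Succeeds if an OpenNebula bridge name follows the "sriov_<root device>"
# convention used for SR-IOV virtual networks.
is_sriov_bridge() {
    case $1 in
        sriov_*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A save script could, for example, run `virsh dumpxml "$DOMAIN" | has_passthrough_device` and exit with an error when it succeeds. Here `$DOMAIN` is assumed to hold the libvirt domain name.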
Libvirt does provide a built-in method for tracking VF usage, but it does not support the Mellanox OFED 2 drivers, so I had to create my own usage-tracking mechanism. If you are using this driver with an SR-IOV card that supports VF assignment from a pool, you can update the scripts by following these instructions: http://wiki.libvirt.org/page/Networking#Assignment_with_.3Cinterface_type.3D.27hostdev.27.3E_.28SRIOV_devices_only.29
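For cards whose drivers do work with libvirt's pool, the linked page describes a libvirt network with `<forward mode='hostdev'>` that hands out VFs of a physical function. The following is a minimal sketch of generating that network definition and the matching domain `<interface>` from a script. The function names, the network name "vf_pool" and the PF name "eth3" are placeholders, not values from this driver.

```bash
#!/usr/bin/env bash
# Placeholder helpers for libvirt's VF pool; names are illustrative only.

# vf_pool_network_xml NETWORK_NAME PF_DEVICE
# Prints a network definition that assigns VFs of PF_DEVICE from a pool.
vf_pool_network_xml() {
    local name=$1 pf=$2
    cat <<EOF
<network>
  <name>$name</name>
  <forward mode='hostdev' managed='yes'>
    <pf dev='$pf'/>
  </forward>
</network>
EOF
}

# vf_pool_interface_xml NETWORK_NAME
# Prints the domain <interface> element that takes one VF from the pool.
vf_pool_interface_xml() {
    local name=$1
    cat <<EOF
<interface type='network'>
  <source network='$name'/>
</interface>
EOF
}
```

The network would then be registered with the following commands, after which libvirt itself tracks which VFs are in use:

    vf_pool_network_xml vf_pool eth3 > /tmp/vf_pool.xml
    virsh net-define /tmp/vf_pool.xml
    virsh net-start vf_pool
    virsh net-autostart vf_pool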
The driver creates a vf_interfaces directory in the host's /tmp directory. Inside it, the driver creates one directory for each SR-IOV root device. Inside each root device's directory, it creates a file for every VF that is in use, named after the VF's zero-based index. For example, if the first, third and fourth VFs of ib0 are in use, you will see:
/tmp/vf_interfaces/ib0:
0 2 3
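The following is a minimal sketch of how marker-file tracking like this can work. It is not the driver's own code. It assumes that marker N corresponds to the VF on line N+1 of the root device's vf_maps file. The names `claim_vf` and `release_vf` are hypothetical. `TRACK_ROOT` defaults to /tmp/vf_interfaces and can be pointed elsewhere for testing.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of marker-file VF tracking; not the driver's own code.
# Assumption: marker file N under $TRACK_ROOT/<root device> means that the VF
# on line N+1 of that root device's vf_maps file is in use.
TRACK_ROOT=${TRACK_ROOT:-/tmp/vf_interfaces}

# claim_vf ROOT_DEV NUM_VFS
# Marks the lowest free VF index as in use and prints it; fails if none is free.
claim_vf() {
    local dev=$1 num_vfs=$2 dir i
    dir="$TRACK_ROOT/$dev"
    mkdir -p "$dir" || return 1
    for ((i = 0; i < num_vfs; i++)); do
        # With noclobber set, ">" refuses to open an existing file, so the
        # check and the creation of the marker happen in a single step.
        if (set -o noclobber; : > "$dir/$i") 2>/dev/null; then
            echo "$i"
            return 0
        fi
    done
    echo "no free VF left on $dev" >&2
    return 1
}

# release_vf ROOT_DEV INDEX
# Removes the marker so the VF can be assigned to another VM.
release_vf() {
    rm -f "$TRACK_ROOT/$1/$2"
}
```

On deployment, a script could do the following, where `$VF_MAP` is a hypothetical variable holding the path of the root device's vf_maps file:

1. Run `idx=$(claim_vf ib0 "$(wc -l < "$VF_MAP")")` to reserve a VF.
2. Run `sed -n "$((idx + 1))p" "$VF_MAP"` to read that VF's bus, slot and function.
3. On shutdown, run `release_vf ib0 "$idx"` to free it.

Because the marker is created exclusively, two concurrent deployments on the same host cannot claim the same VF.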
Edit the 00-network script (in our case: /srv/one-context.d/00-network). Update the gen_network_configuration() and gen_iface_conf() functions.
gen_iface_conf() {
    cat <<EOT
DEVICE=$DEV
BOOTPROTO=none
ONBOOT=yes
####################
#   Update this    #
####################
TYPE=$TYPE
#-------------------
NETMASK=$MASK
IPADDR=$IP
EOT

    if [ -n "$GATEWAY" ]; then
        echo "GATEWAY=$GATEWAY"
    fi

    echo ""
}
gen_network_configuration() {
    IFACES=`get_interfaces`

    for i in $IFACES; do
        MAC=`get_mac $i`

        ###################################
        #           Update this           #
        ###################################
        ib_vf=$(echo "$MAC" | cut -c 1-2)
        ib_pos=$(echo "$MAC" | cut -c 5)

        if [ "$ib_vf" == "AA" ]; then
            DEV="ib"$ib_pos
            UPCASE_DEV=`upcase $DEV`
            TYPE="Infiniband"
        else
            DEV=`get_dev $i`
            UPCASE_DEV=`upcase $DEV`
            TYPE="Ethernet"
        fi
        #-----------------------------------

        IP=$(get_ip)
        NETWORK=$(get_network)
        MASK=$(get_mask)
        GATEWAY=$(get_gateway)

        ###################################
        #           Update this           #
        ###################################
        if [ "$ib_vf" == "AA" ]; then
            gen_iface_conf > /etc/sysconfig/network-scripts/ifcfg-ib$ib_pos
        else
            gen_iface_conf > /etc/sysconfig/network-scripts/ifcfg-${DEV}
        fi
        #-----------------------------------
    done
}
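The device-name selection added to gen_network_configuration() can be pulled into a standalone function to check the MAC-prefix logic outside a VM. This is an illustrative extraction, not part of the contextualisation package. `sriov_dev_for_mac` is a hypothetical name, and the fallback device name is passed in instead of calling get_dev. As in the script, a MAC whose first octet is "AA" is treated as an SR-IOV dummy interface, and its fifth character selects the ibN device.

```bash
#!/usr/bin/env bash
# Illustrative copy of the device-name selection in gen_network_configuration().
# Usage: sriov_dev_for_mac MAC FALLBACK_DEV
sriov_dev_for_mac() {
    local mac=$1 fallback=$2 ib_vf ib_pos
    ib_vf=$(echo "$mac" | cut -c 1-2)   # first octet, e.g. "AA"
    ib_pos=$(echo "$mac" | cut -c 5)    # 5th character: second digit of octet 2
    if [ "$ib_vf" = "AA" ]; then
        echo "ib$ib_pos"
    else
        echo "$fallback"
    fi
}
```

For example, `sriov_dev_for_mac AA:01:0a:00:00:05 eth0` prints `ib1`, while a MAC with OpenNebula's usual 02:00 prefix falls through to the given device. The comparison is case-sensitive, so this logic assumes that get_mac returns upper-case hex digits and that DUMMY_MAC_PREFIX begins with "AA".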
Take note of: