1. Install all required packages
<code>
[root@oss01 ~]# yum install device-mapper device-mapper-devel device-mapper-multipath kpartx
[root@oss01 ~]# yum install sg3_utils sg3_utils-devel
</code>
2. Start multipath service
<code>
[root@oss01 ~]# service multipathd start
Starting multipathd daemon:                                [  OK  ]
[root@oss01 ~]# chkconfig multipathd on
[root@oss01 ~]#
</code>
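The multipathd defaults already recognize the MD34xx on this system (note hwhandler='1 rdac' in the `multipath -ll` output in step 4), but an explicit /etc/multipath.conf stanza makes the intent visible. The fragment below is an illustrative sketch, not copied from this system; check the built-in values with `multipathd -k'show config'` before overriding anything:

```
devices {
    device {
        vendor                "DELL"
        product               "MD34xx"
        path_grouping_policy  group_by_prio
        prio                  rdac
        path_checker          rdac
        hardware_handler      "1 rdac"
        failback              immediate
        no_path_retry         30
    }
}
```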
3. Scan for devices and partitions
<code>
[root@oss01 ~]# rescan-scsi-bus.sh
Scanning SCSI subsystem for new devices
Scanning host 0 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
 Scanning for device 0 2 0 0 ...
OLD: Host: scsi0 Channel: 02 Id: 00 Lun: 00
      Vendor: DELL     Model: PERC H330 Mini   Rev: 4.27
      Type:   Direct-Access                    ANSI SCSI revision: 05
Scanning host 1 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
Scanning host 2 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
Scanning host 3 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
Scanning host 4 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
Scanning host 5 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
Scanning host 6 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
Scanning host 7 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
Scanning host 8 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
Scanning host 9 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
Scanning host 10 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
 Scanning for device 10 0 0 0 ...
OLD: Host: scsi10 Channel: 00 Id: 00 Lun: 00
      Vendor: PLDS     Model: DVD-ROM DU-8D5LH Rev: 3D51
      Type:   CD-ROM                           ANSI SCSI revision: 05
Scanning host 11 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
 Scanning for device 11 0 1 31 ...
OLD: Host: scsi11 Channel: 00 Id: 01 Lun: 31
      Vendor: DELL     Model: Universal Xport  Rev: 0825
      Type:   Direct-Access                    ANSI SCSI revision: 05
 Scanning for device 11 0 1 1 ...
NEW: Host: scsi11 Channel: 00 Id: 01 Lun: 01
      Vendor: DELL     Model: MD34xx           Rev: 0825
      Type:   Direct-Access                    ANSI SCSI revision: 05
sg6 changed: LU not available (PQual 1)
OLD: Host: scsi11 Channel: 00 Id: 01 Lun: 00
      Vendor: DELL     Model: MD34xx           Rev: 0825
      Type:   Direct-Access                    ANSI SCSI revision: 05
 Scanning for device 11 0 6 31 ...
OLD: Host: scsi11 Channel: 00 Id: 06 Lun: 31
      Vendor: DELL     Model: Universal Xport  Rev: 0825
      Type:   Direct-Access                    ANSI SCSI revision: 05
 Scanning for device 11 0 6 1 ...
NEW: Host: scsi11 Channel: 00 Id: 06 Lun: 01
      Vendor: DELL     Model: MD34xx           Rev: 0825
      Type:   Direct-Access                    ANSI SCSI revision: 05
sg4 changed: LU not available (PQual 1)
OLD: Host: scsi11 Channel: 00 Id: 06 Lun: 00
      Vendor: DELL     Model: MD34xx           Rev: 0825
      Type:   Direct-Access                    ANSI SCSI revision: 05
Scanning host 12 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
 Scanning for device 12 0 0 1 ...
NEW: Host: scsi12 Channel: 00 Id: 00 Lun: 01
      Vendor: DELL     Model: MD34xx           Rev: 0825
      Type:   Direct-Access                    ANSI SCSI revision: 05
 Scanning for device 12 0 0 31 ...
NEW: Host: scsi12 Channel: 00 Id: 00 Lun: 31
      Vendor: DELL     Model: Universal Xport  Rev: 0825
      Type:   Direct-Access                    ANSI SCSI revision: 05
sg2 changed: LU not available (PQual 1)
OLD: Host: scsi12 Channel: 00 Id: 00 Lun: 00
      Vendor: DELL     Model: MD34xx           Rev: 0825
      Type:   Direct-Access                    ANSI SCSI revision: 05
 Scanning for device 12 0 1 1 ...
NEW: Host: scsi12 Channel: 00 Id: 01 Lun: 01
      Vendor: DELL     Model: MD34xx           Rev: 0825
      Type:   Direct-Access                    ANSI SCSI revision: 05
 Scanning for device 12 0 1 31 ...
NEW: Host: scsi12 Channel: 00 Id: 01 Lun: 31
      Vendor: DELL     Model: Universal Xport  Rev: 0825
      Type:   Direct-Access                    ANSI SCSI revision: 05
sg3 changed: LU not available (PQual 1)
OLD: Host: scsi12 Channel: 00 Id: 01 Lun: 00
      Vendor: DELL     Model: MD34xx           Rev: 0825
      Type:   Direct-Access                    ANSI SCSI revision: 05
6 new or changed device(s) found.
	[11:0:1:1]
	[11:0:6:1]
	[12:0:0:1]
	[12:0:0:31]
	[12:0:1:1]
	[12:0:1:31]
0 remapped or resized device(s) found.
0 device(s) removed.
</code>
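The tail of the scan can be checked mechanically rather than by eye. A small sketch, operating on a saved fragment of the output above (not re-running the scan); the `rescan_tail` variable is just a stand-in for that saved output:

```shell
# Count the [host:channel:target:lun] entries that rescan-scsi-bus.sh
# reported as new, from a saved fragment of its output.
rescan_tail='[11:0:1:1]
[11:0:6:1]
[12:0:0:1]
[12:0:0:31]
[12:0:1:1]
[12:0:1:31]'
found=$(printf '%s\n' "$rescan_tail" | grep -c '^\[')
echo "$found new LUN entries"   # 6 new LUN entries
```

Two of the six are the "Universal Xport" management LUNs (Lun 31), which multipath should not use for data.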
4. Display Device mappings
<code>
[root@oss01 ~]# multipath -ll
mpathc (3600a098000b63269000001bf5991053a) dm-1 DELL,MD34xx
size=4.9T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=14 status=active
| `- 12:0:1:1 sde 8:64 active ready running
`-+- policy='round-robin 0' prio=9 status=enabled
  `- 12:0:0:1 sdd 8:48 active ready running
mpathb (3600a098000b62bc70000018b599106b1) dm-0 DELL,MD34xx
size=4.9T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=14 status=active
| `- 11:0:1:1 sdb 8:16 active ready running
`-+- policy='round-robin 0' prio=9 status=enabled
  `- 11:0:6:1 sdc 8:32 active ready running
[root@oss01 ~]#
</code>
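The alias-to-WWID mapping matters here: the aliases (mpathb/mpathc) are assigned per host, while the WWID identifies the LUN itself, so the same LUN can get different aliases on different servers. A quick sketch that pulls the pairs out of a saved copy of the header lines above:

```shell
# Extract "alias WWID" pairs from saved `multipath -ll` header lines.
mp_out='mpathc (3600a098000b63269000001bf5991053a) dm-1 DELL,MD34xx
mpathb (3600a098000b62bc70000018b599106b1) dm-0 DELL,MD34xx'
pairs=$(printf '%s\n' "$mp_out" | awk '/^mpath/ { gsub(/[()]/, "", $2); print $1, $2 }')
printf '%s\n' "$pairs"
# mpathc 3600a098000b63269000001bf5991053a
# mpathb 3600a098000b62bc70000018b599106b1
```

Comparing these pairs between oss01 and oss02 (step 5) shows the aliases swapped while the WWIDs stayed put.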
5. Grab the image of oss01 and reboot oss02 from the new image just to check if all is well.
<code>
[root@oss02 ~]# multipath -ll
mpathc (3600a098000b62bc70000018b599106b1) dm-0 DELL,MD34xx
size=4.9T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=14 status=active
| `- 11:0:0:1 sdb 8:16 active ready running
`-+- policy='round-robin 0' prio=9 status=enabled
  `- 11:0:1:1 sdc 8:32 active ready running
mpathb (3600a098000b63269000001bf5991053a) dm-1 DELL,MD34xx
size=4.9T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=14 status=active
| `- 12:0:1:1 sde 8:64 active ready running
`-+- policy='round-robin 0' prio=9 status=enabled
  `- 12:0:0:1 sdd 8:48 active ready running
[root@oss02 ~]#
</code>
6. HA crossover cable for Corosync communications between HA peers [mds01-mds02, oss01-oss02]
++ Insert crossover cable on mds servers em2 and oss servers em2 ++
++ Create Lustre HA Network ++
++ Configure interfaces and IP addresses for all lustre nodes ++
<code>
[bright1->network]% add Lustre-HANet
[bright1->network*[Lustre-HANet*]]% set baseaddress 10.0.0.0
[bright1->network*[Lustre-HANet*]]% set broadcastaddress 10.0.0.255
[bright1->network*[Lustre-HANet*]]% set domainname lustre-ha.cluster
[bright1->network*[Lustre-HANet*]]% commit
[bright1->device[oss01]->interfaces]% add physical eth1
[bright1->device*[oss01*]->interfaces*[eth1*]]% set network lustre-hanet
[bright1->device*[oss01*]->interfaces*[eth1*]]% set ip 10.0.0.2
[bright1->device*[oss01*]->interfaces*[eth1*]]% commit
[bright1->device[oss02]->interfaces]% add physical eth1
[bright1->device*[oss02*]->interfaces*[eth1*]]% set network lustre-hanet
[bright1->device*[oss02*]->interfaces*[eth1*]]% set ip 10.0.0.3
[bright1->device*[oss02*]->interfaces*[eth1*]]% commit
[bright1->device[mds01]->interfaces]% add physical eth1
[bright1->device*[mds01*]->interfaces*[eth1*]]% set network lustre-hanet
[bright1->device*[mds01*]->interfaces*[eth1*]]% set ip 10.0.0.4
[bright1->device*[mds01*]->interfaces*[eth1*]]% commit
[bright1->device[mds02]->interfaces]% add physical eth1
[bright1->device*[mds02*]->interfaces*[eth1*]]% set network lustre-hanet
[bright1->device*[mds02*]->interfaces*[eth1*]]% set ip 10.0.0.5
[bright1->device*[mds02*]->interfaces*[eth1*]]% commit
</code>
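Once the interfaces are committed, each peer should be able to reach its partner over the crossover link (e.g. `ping -c1 10.0.0.3` from oss01). The Lustre-HANet address plan above, restated as a small sketch for reference:

```shell
# The Lustre-HANet address plan configured above (10.0.0.0/24).
# Requires bash 4+ for associative arrays.
declare -A hanet=(
    [oss01]=10.0.0.2
    [oss02]=10.0.0.3
    [mds01]=10.0.0.4
    [mds02]=10.0.0.5
)
for host in oss01 oss02 mds01 mds02; do
    echo "$host -> ${hanet[$host]}"
done
```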
7. IML, MDS and OSS node preparation
<code>
[root@imlnode ~]# yum remove epel-release
[root@imlnode ~]# ssh-copy-id -i .ssh/id_dsa.pub mds01
[root@imlnode ~]# ssh-copy-id -i .ssh/id_dsa.pub oss01
</code>
8. Install Intel Manager for Lustre
<code>
[root@imlnode iml]# tar -xvf ee-2.4.1.2.tar.gz
[root@imlnode iml]# cd ee-2.4.1.2
[root@imlnode ee-2.4.1.2]# ./install
.
.
.
Installing Intel(R) Manager for Lustre*
Starting setup...
Setting up PostgreSQL service...
Creating database owner 'chroma'...
Creating database 'chroma'...
Creating database tables...
Loaded 13 default power device types.
Creating groups...
An administrative user account will now be created using the credentials which you provide.
Username: admin
Email: smasoka@csir.co.za
Password:
Confirm password:
User 'admin' successfully created.
Building static directory...
NTP Server [localhost]: 172.18.0.1
Writing ntp configuration: 172.18.0.1
Closing firewall for ntp
Opening firewall for ntp
Restarting ntp
Starting RabbitMQ...
Restarting rabbitmq-server: RabbitMQ is not running
SUCCESS rabbitmq-server.
Creating RabbitMQ user...
Creating RabbitMQ vhost...
Enabling daemons
Starting daemons
Checking service configuration...
Setup complete.
Registering profiles
chroma-manager 2.4.1.2-4273 is currently installed
Intel(R) Manager for Lustre* software installation completed successfully
[root@imlnode ee-2.4.1.2]#
</code>
9. Open a web browser and log in using the admin account
https://10.128.24.157
10. Errors adding servers [oss and mds] using the web portal
libcom_err errors (but these are caused by other things):
<code>
[root@mds01 ~]# yum remove libcom_err-devel
</code>
During the installation, the wrong e2fsprogs packages would be installed from the base repo if the "e2fsprogs" repo is not enabled:
<code>
[root@mds01 ~]# yum install -y --enablerepo=iml-agent,e2fsprogs chroma-agent-management
</code>
Make sure the servers [oss and mds] are diskless:
<code>
[bright1->device[oss02]]% set disksetup /cm/shared/apps/cmgui/disk-setup/slave-diskless.xml
</code>
11. Servers successfully added
12. Configure primary and secondary servers
13. Create new Lustre file system
Results:
14. Lustre clients [all clients are CentOS 7.3, so I'm installing Lustre Client 2.8]
* Get the source RPM
<code>
[root@login01 shared]# rpm -hiv lustre-client-2.8.0-3.10.0_327.3.1.el7.x86_64.src.rpm
Updating / installing...
   1:lustre-client-2.8.0-3.10.0_327.3.################################# [100%]
[root@login01 shared]#
</code>
<code>
[root@login01 ~]# cd /root/rpmbuild/SOURCES/
[root@login01 SOURCES]# ls
lustre-2.8.0.tar.gz
[root@login01 SOURCES]# tar xvf lustre-2.8.0.tar.gz -C /root/
[root@login01 SOURCES]# cd /root/lustre-2.8.0/
[root@login01 lustre-2.8.0]# yum install python-docutils
[root@login01 lustre-2.8.0]# ./configure --with-o2ib=/usr/src/ofa_kernel/default
[root@login01 lustre-2.8.0]# make rpms
[root@login01 lustre-2.8.0]# ls -l lustre-client*
-rw-r--r-- 1 root root   477684 Sep 18 15:38 lustre-client-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm
-rw-r--r-- 1 root root 18090092 Sep 18 15:38 lustre-client-debuginfo-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm
-rw-r--r-- 1 root root  2188464 Sep 18 15:38 lustre-client-modules-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm
-rw-r--r-- 1 root root 12515736 Sep 18 15:38 lustre-client-source-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm
-rw-r--r-- 1 root root  8791148 Sep 18 15:38 lustre-client-tests-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm
[root@login01 lustre-2.8.0]#
</code>
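The client modules only load on the kernel they were built against, so it is worth confirming that the kernel release embedded in the rpm file name matches `uname -r` on the clients before pushing the packages out. A sketch of that check against one of the file names above (rpm versions cannot contain "-", so the build replaces it with "_"):

```shell
# Recover the target kernel release from a lustre-client-modules rpm
# file name by stripping the package prefix and arch suffix, then
# restoring the "-" that rpm naming turned into "_".
pkg='lustre-client-modules-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm'
kver=$(echo "$pkg" \
    | sed -e 's/^lustre-client-modules-2\.8\.0-//' \
          -e 's/\.x86_64\.rpm$//' \
          -e 's/_/-/')
echo "$kver"    # should match `uname -r` on the client
```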
<code>
[root@login01 lustre-client-rpms]# cp /root/lustre-2.8.0/lustre-client* /cm/shared/lustre-client-rpms
[root@login01 ~]# rpm -hiv /cm/shared/lustre-client-rpms/lustre-client-modules-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm /cm/shared/lustre-client-rpms/lustre-client-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm /cm/shared/lustre-client-rpms/lustre-client-source-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm
login01: Preparing...                          ########################################
login01: Updating / installing...
login01: lustre-client-modules-2.8.0-3.10.0_514########################################
login01: lustre-client-2.8.0-3.10.0_514.2.2.el7########################################
login01: lustre-client-source-2.8.0-3.10.0_514.########################################
</code>
<code>
[root@login01 ~]# cat /etc/modprobe.d/lustre.conf
options lnet networks=o2ib0(ib0)
[root@login01 ~]# modprobe -v lustre
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/net/lustre/libcfs.ko
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/net/lustre/lnet.ko networks=o2ib0(ib0)
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/fs/lustre/obdclass.ko
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/fs/lustre/ptlrpc.ko
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/fs/lustre/lov.ko
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/fs/lustre/fid.ko
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/fs/lustre/mdc.ko
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/fs/lustre/fld.ko
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/fs/lustre/lmv.ko
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/fs/lustre/lustre.ko
[root@login01 ~]# mkdir /mnt/lustre
[root@login01 ~]# mount -t lustre 172.20.0.221@o2ib0:172.20.0.222@o2ib0:/lustre /mnt/lustre
</code>
Small Fix
Solution (hours of googling):
<code>
[root@mds02 ~]# lctl conf_param lustre.llite.lazystatfs=1
</code>
Check mounting manually
<code>
[root@login01 ~]# mount -t lustre 172.20.0.221@o2ib0:172.20.0.222@o2ib0:/lustre /mnt/lustre
</code>
lazystatfs - allows statfs() calls to return without blocking when some of the servers are down. In my case, the lustre-OST0000 OSC is permanently unavailable (DOWN). statfs() returns statistics about the mounted file system.
<code>
[root@sched ~]# echo "mount -t lustre -o flock 172.20.0.222@o2ib0:172.20.0.221@o2ib0:/lustre /mnt/lustre" >> /etc/rc.local
</code>
This makes the Lustre file system mount automatically when the node boots up.
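Two caveats with the rc.local approach, both worth checking on each node: re-running the echo appends a duplicate mount line, and on CentOS 7 /etc/rc.local is not executable by default, so it silently does nothing until it is chmod +x'd. A sketch of a safer append, using a temp file as a stand-in for /etc/rc.local:

```shell
# Idempotently append the mount line and make rc.local executable.
rc=$(mktemp)    # stand-in for /etc/rc.local in this sketch
line='mount -t lustre -o flock 172.20.0.222@o2ib0:172.20.0.221@o2ib0:/lustre /mnt/lustre'
grep -qxF "$line" "$rc" || printf '%s\n' "$line" >> "$rc"
grep -qxF "$line" "$rc" || printf '%s\n' "$line" >> "$rc"   # re-run: no duplicate
chmod +x "$rc"
grep -cxF "$line" "$rc"    # prints 1
```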
Ref: https://jira.hpdd.intel.com/browse/LUDOC-153 https://jira.hpdd.intel.com/browse/LU-4397