Intel Enterprise Edition for Lustre

1. Install the required packages

[root@oss01 ~]# yum install device-mapper device-mapper-devel device-mapper-multipath kpartx
[root@oss01 ~]# yum install sg3_utils sg3_utils-devel 

2. Start multipath service

[root@oss01 ~]# service multipathd start
Starting multipathd daemon: [ OK ]
[root@oss01 ~]# chkconfig multipathd on
[root@oss01 ~]#
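
No custom /etc/multipath.conf was needed here; the MD34xx array is picked up with the rdac handler automatically (see the multipath -ll output in step 4). If friendly names or a blacklist need adjusting, a minimal sketch of /etc/multipath.conf (the sda blacklist entry is an assumption, excluding the local PERC-attached boot disk):

# /etc/multipath.conf (minimal sketch)
defaults {
    user_friendly_names yes
}
blacklist {
    # keep the local PERC-attached boot disk out of multipath
    devnode "^sda$"
}

Reload the configuration with "service multipathd reload" after editing.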

3. Scan for devices and partitions

[root@oss01 ~]# rescan-scsi-bus.sh 
Scanning SCSI subsystem for new devices
Scanning host 0 for SCSI target IDs 0 1 2 3 4 5 6 7, all LUNs
 Scanning for device 0 2 0 0 ... 
OLD: Host: scsi0 Channel: 02 Id: 00 Lun: 00
  Vendor: DELL Model: PERC H330 Mini Rev: 4.27
  Type: Direct-Access ANSI SCSI revision: 05
Scanning host 1 for SCSI target IDs 0 1 2 3 4 5 6 7, all LUNs
Scanning host 2 for SCSI target IDs 0 1 2 3 4 5 6 7, all LUNs
Scanning host 3 for SCSI target IDs 0 1 2 3 4 5 6 7, all LUNs
Scanning host 4 for SCSI target IDs 0 1 2 3 4 5 6 7, all LUNs
Scanning host 5 for SCSI target IDs 0 1 2 3 4 5 6 7, all LUNs
Scanning host 6 for SCSI target IDs 0 1 2 3 4 5 6 7, all LUNs
Scanning host 7 for SCSI target IDs 0 1 2 3 4 5 6 7, all LUNs
Scanning host 8 for SCSI target IDs 0 1 2 3 4 5 6 7, all LUNs
Scanning host 9 for SCSI target IDs 0 1 2 3 4 5 6 7, all LUNs
Scanning host 10 for SCSI target IDs 0 1 2 3 4 5 6 7, all LUNs
 Scanning for device 10 0 0 0 ...          
OLD: Host: scsi10 Channel: 00 Id: 00 Lun: 00
  Vendor: PLDS Model: DVD-ROM DU-8D5LH Rev: 3D51
  Type: CD-ROM ANSI SCSI revision: 05
Scanning host 11 for SCSI target IDs 0 1 2 3 4 5 6 7, all LUNs
 Scanning for device 11 0 1 31 ... 
OLD: Host: scsi11 Channel: 00 Id: 01 Lun: 31
  Vendor: DELL Model: Universal Xport Rev: 0825
  Type: Direct-Access ANSI SCSI revision: 05
 Scanning for device 11 0 1 1 ... 
NEW: Host: scsi11 Channel: 00 Id: 01 Lun: 01
  Vendor: DELL Model: MD34xx Rev: 0825
  Type: Direct-Access ANSI SCSI revision: 05
sg6 changed: LU not available (PQual 1)    
OLD: Host: scsi11 Channel: 00 Id: 01 Lun: 00
  Vendor: DELL Model: MD34xx Rev: 0825
  Type: Direct-Access ANSI SCSI revision: 05
 Scanning for device 11 0 6 31 ... 
OLD: Host: scsi11 Channel: 00 Id: 06 Lun: 31
  Vendor: DELL Model: Universal Xport Rev: 0825
  Type: Direct-Access ANSI SCSI revision: 05
 Scanning for device 11 0 6 1 ... 
NEW: Host: scsi11 Channel: 00 Id: 06 Lun: 01
  Vendor: DELL Model: MD34xx Rev: 0825
  Type: Direct-Access ANSI SCSI revision: 05
sg4 changed: LU not available (PQual 1)    
OLD: Host: scsi11 Channel: 00 Id: 06 Lun: 00
  Vendor: DELL Model: MD34xx Rev: 0825
  Type: Direct-Access ANSI SCSI revision: 05
Scanning host 12 for SCSI target IDs 0 1 2 3 4 5 6 7, all LUNs
 Scanning for device 12 0 0 1 ... 
NEW: Host: scsi12 Channel: 00 Id: 00 Lun: 01
  Vendor: DELL Model: MD34xx Rev: 0825
  Type: Direct-Access ANSI SCSI revision: 05
 Scanning for device 12 0 0 31 ... 
NEW: Host: scsi12 Channel: 00 Id: 00 Lun: 31
  Vendor: DELL Model: Universal Xport Rev: 0825
  Type: Direct-Access ANSI SCSI revision: 05
sg2 changed: LU not available (PQual 1)    
OLD: Host: scsi12 Channel: 00 Id: 00 Lun: 00
  Vendor: DELL Model: MD34xx Rev: 0825
  Type: Direct-Access ANSI SCSI revision: 05
 Scanning for device 12 0 1 1 ... 
NEW: Host: scsi12 Channel: 00 Id: 01 Lun: 01
  Vendor: DELL Model: MD34xx Rev: 0825
  Type: Direct-Access ANSI SCSI revision: 05
 Scanning for device 12 0 1 31 ... 
NEW: Host: scsi12 Channel: 00 Id: 01 Lun: 31
  Vendor: DELL Model: Universal Xport Rev: 0825
  Type: Direct-Access ANSI SCSI revision: 05
sg3 changed: LU not available (PQual 1)    
OLD: Host: scsi12 Channel: 00 Id: 01 Lun: 00
  Vendor: DELL Model: MD34xx Rev: 0825
  Type: Direct-Access ANSI SCSI revision: 05
6 new or changed device(s) found.          
 [11:0:1:1]
 [11:0:6:1]
 [12:0:0:1]
 [12:0:0:31]
 [12:0:1:1]
 [12:0:1:31]
0 remapped or resized device(s) found.
0 device(s) removed.                 
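
Before building the maps, it is worth confirming that the new LUNs actually appeared as block devices. A quick check (lsscsi is a small separate package; install it with yum if missing):

[root@oss01 ~]# lsscsi
[root@oss01 ~]# cat /proc/partitions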

4. Display device mappings

[root@oss01 ~]# multipath -ll
mpathc (3600a098000b63269000001bf5991053a) dm-1 DELL,MD34xx
size=4.9T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=14 status=active
| `- 12:0:1:1 sde 8:64 active ready running
`-+- policy='round-robin 0' prio=9 status=enabled
  `- 12:0:0:1 sdd 8:48 active ready running
mpathb (3600a098000b62bc70000018b599106b1) dm-0 DELL,MD34xx
size=4.9T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=14 status=active
| `- 11:0:1:1 sdb 8:16 active ready running
`-+- policy='round-robin 0' prio=9 status=enabled
  `- 11:0:6:1 sdc 8:32 active ready running
[root@oss01 ~]#
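
The WWIDs in parentheses can also be read directly from a path device, which is useful for checking that both nodes see the same LUNs. On RHEL/CentOS 6 the helper lives in /lib/udev; sdb here is one of the mpathb paths shown above:

[root@oss01 ~]# /lib/udev/scsi_id --whitelisted --device=/dev/sdb
3600a098000b62bc70000018b599106b1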

5. Grab the image of oss01 and reboot oss02 from the new image to check that all is well.

[root@oss02 ~]# multipath -ll
mpathc (3600a098000b62bc70000018b599106b1) dm-0 DELL,MD34xx
size=4.9T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=14 status=active
| `- 11:0:0:1 sdb 8:16 active ready running
`-+- policy='round-robin 0' prio=9 status=enabled
  `- 11:0:1:1 sdc 8:32 active ready running
mpathb (3600a098000b63269000001bf5991053a) dm-1 DELL,MD34xx
size=4.9T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=14 status=active
| `- 12:0:1:1 sde 8:64 active ready running
`-+- policy='round-robin 0' prio=9 status=enabled
  `- 12:0:0:1 sdd 8:48 active ready running
[root@oss02 ~]#
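
Note that the automatic names are swapped between the two nodes: mpathc is ...5991053a on oss01 but ...599106b1 on oss02, because user-friendly names are assigned per node in discovery order. For an HA pair it is safer to pin an alias to each WWID in /etc/multipath.conf on both nodes; a sketch (the lustre-ost0/lustre-ost1 aliases are purely illustrative):

multipaths {
    multipath {
        wwid  3600a098000b62bc70000018b599106b1
        alias lustre-ost0
    }
    multipath {
        wwid  3600a098000b63269000001bf5991053a
        alias lustre-ost1
    }
}

After "service multipathd reload", both nodes then address the same LUN by the same /dev/mapper name.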

6. Connect HA crossover cables for Corosync communication between the HA peers [mds01-mds02, oss01-oss02]

++ Insert the crossover cable into port em2 on the MDS servers and port em2 on the OSS servers ++
++ Create the Lustre HA network ++
++ Configure interfaces and IP addresses for all Lustre nodes ++

[bright1->network]% add Lustre-HANet
[bright1->network*[Lustre-HANet*]]% set baseaddress 10.0.0.0
[bright1->network*[Lustre-HANet*]]% set broadcastaddress 10.0.0.255
[bright1->network*[Lustre-HANet*]]% set domainname lustre-ha.cluster
[bright1->network*[Lustre-HANet*]]% commit 


[bright1->device[oss01]->interfaces]% add physical eth1
[bright1->device*[oss01*]->interfaces*[eth1*]]% set network lustre-hanet 
[bright1->device*[oss01*]->interfaces*[eth1*]]% set ip 10.0.0.2
[bright1->device*[oss01*]->interfaces*[eth1*]]% commit

[bright1->device[oss02]->interfaces]% add physical eth1
[bright1->device*[oss02*]->interfaces*[eth1*]]% set network lustre-hanet 
[bright1->device*[oss02*]->interfaces*[eth1*]]% set ip 10.0.0.3
[bright1->device*[oss02*]->interfaces*[eth1*]]% commit

[bright1->device[mds01]->interfaces]% add physical eth1
[bright1->device*[mds01*]->interfaces*[eth1*]]% set network lustre-hanet 
[bright1->device*[mds01*]->interfaces*[eth1*]]% set ip 10.0.0.4
[bright1->device*[mds01*]->interfaces*[eth1*]]% commit

[bright1->device[mds02]->interfaces]% add physical eth1
[bright1->device*[mds02*]->interfaces*[eth1*]]% set network lustre-hanet 
[bright1->device*[mds02*]->interfaces*[eth1*]]% set ip 10.0.0.5
[bright1->device*[mds02*]->interfaces*[eth1*]]% commit
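
Once the interfaces are up, the crossover link can be sanity-checked from one peer to the other (oss02 is 10.0.0.3 per the assignments above):

[root@oss01 ~]# ip addr show eth1
[root@oss01 ~]# ping -c 3 10.0.0.3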

7. IML, MDS, and OSS node preparations

[root@imlnode ~]# yum remove epel-release
[root@imlnode ~]# ssh-copy-id -i .ssh/id_dsa.pub mds01
[root@imlnode ~]# ssh-copy-id -i .ssh/id_dsa.pub oss01
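
The transcript only copies the key to mds01 and oss01; the HA peers presumably need it as well. A sketch that generates the key if it does not exist yet and pushes it to all four servers (DSA matches the transcript; on newer systems RSA would be the usual choice):

[root@imlnode ~]# ssh-keygen -t dsa -f ~/.ssh/id_dsa -N ""
[root@imlnode ~]# for h in mds01 mds02 oss01 oss02; do ssh-copy-id -i ~/.ssh/id_dsa.pub $h; done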

8. Install Intel Manager for Lustre

[root@imlnode iml]# tar -xvf ee-2.4.1.2.tar.gz
[root@imlnode iml]# cd ee-2.4.1.2
[root@imlnode ee-2.4.1.2]# ./install
.
.
.
Installing Intel(R) Manager for Lustre*
|
Starting setup...

Setting up PostgreSQL service...
Creating database owner 'chroma'...

Creating database 'chroma'...

Creating database tables...
Loaded 13 default power device types.
Creating groups...
An administrative user account will now be created using the credentials which you provide.
Username: admin
Email: smasoka@csir.co.za
Password: 
Confirm password: 
User 'admin' successfully created.
Building static directory...
NTP Server [localhost]: 172.18.0.1
Writing ntp configuration: 172.18.0.1 
Closing firewall for ntp
Opening firewall for ntp
Restarting ntp
Starting RabbitMQ...
Restarting rabbitmq-server: RabbitMQ is not running
SUCCESS
rabbitmq-server.
Creating RabbitMQ user...
Creating RabbitMQ vhost...
Enabling daemons
Starting daemons
Checking service configuration...

Setup complete.
Registering profiles
chroma-manager 2.4.1.2-4273 is currently installed

Intel(R) Manager for Lustre* software installation completed successfully

[root@imlnode ee-2.4.1.2]#

9. Open a web browser and log in with the admin account

https://10.128.24.157

10. Errors when adding servers [oss and mds] via the web portal

libcom_err errors (though these are caused by other things):

[root@mds01 ~]# yum remove libcom_err-devel

During the installation, the wrong e2fsprogs packages would be installed from the base repo if the "e2fsprogs" repo is not enabled:

[root@mds01 ~]# yum install -y --enablerepo=iml-agent,e2fsprogs chroma-agent-management

Make sure the servers [oss and mds] are diskless:

[bright1->device[oss02]]% set disksetup /cm/shared/apps/cmgui/disk-setup/slave-diskless.xml
[bright1->device*[oss02*]]% commit
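
The stored value can be read back from cmsh before the nodes are re-provisioned (get dumps the disk setup that was just committed):

[bright1->device[oss02]]% get disksetup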

11. Servers successfully added

12. Configure primary and secondary servers

13. Create a new Lustre file system


14. Lustre clients [all clients run CentOS 7.3, so I'm installing the Lustre 2.8 client]

  • Get the source RPM
[root@login01 shared]# rpm -hiv lustre-client-2.8.0-3.10.0_327.3.1.el7.x86_64.src.rpm
Updating / installing...
  1:lustre-client-2.8.0-3.10.0_327.3.################################# [100%]
[root@login01 shared]# 
  • Compile the lustre-client
[root@login01 ~]# cd /root/rpmbuild/SOURCES/
[root@login01 SOURCES]# ls
lustre-2.8.0.tar.gz
[root@login01 SOURCES]# tar xvf lustre-2.8.0.tar.gz -C /root/
[root@login01 SOURCES]# cd /root/lustre-2.8.0/
[root@login01 lustre-2.8.0]# yum install python-docutils
[root@login01 lustre-2.8.0]# ./configure --with-o2ib=/usr/src/ofa_kernel/default
[root@login01 lustre-2.8.0]# make rpms
[root@login01 lustre-2.8.0]# ls -l lustre-client*
-rw-r--r-- 1 root root 477684 Sep 18 15:38 lustre-client-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm
-rw-r--r-- 1 root root 18090092 Sep 18 15:38 lustre-client-debuginfo-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm
-rw-r--r-- 1 root root 2188464 Sep 18 15:38 lustre-client-modules-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm
-rw-r--r-- 1 root root 12515736 Sep 18 15:38 lustre-client-source-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm
-rw-r--r-- 1 root root 8791148 Sep 18 15:38 lustre-client-tests-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm
[root@login01 lustre-2.8.0]#
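
Note that make rpms names the packages after the kernel of the build host (3.10.0_514.2.2.el7, not the 327 kernel of the source RPM), so the modules only load on clients running exactly that kernel. Quick check before installing:

[root@login01 ~]# uname -r
3.10.0-514.2.2.el7.x86_64
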
  • Copy the built RPMs to the shared directory and install on all nodes
[root@login01 lustre-client-rpms]# cp /root/lustre-2.8.0/lustre-client*  /cm/shared/lustre-client-rpms
[root@login01 ~]# rpm -hiv /cm/shared/lustre-client-rpms/lustre-client-modules-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm /cm/shared/lustre-client-rpms/lustre-client-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm /cm/shared/lustre-client-rpms/lustre-client-source-2.8.0-3.10.0_514.2.2.el7.x86_64.x86_64.rpm
login01: Preparing... ########################################
login01: Updating / installing...
login01: lustre-client-modules-2.8.0-3.10.0_514########################################
login01: lustre-client-2.8.0-3.10.0_514.2.2.el7########################################
login01: lustre-client-source-2.8.0-3.10.0_514.########################################
  • Configure LNET and load the Lustre kernel modules
[root@login01 ~]# cat /etc/modprobe.d/lustre.conf 
options lnet networks=o2ib0(ib0)
[root@login01 ~]# modprobe -v lustre
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/net/lustre/libcfs.ko 
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/net/lustre/lnet.ko networks=o2ib0(ib0) 
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/fs/lustre/obdclass.ko 
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/fs/lustre/ptlrpc.ko 
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/fs/lustre/lov.ko 
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/fs/lustre/fid.ko 
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/fs/lustre/mdc.ko 
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/fs/lustre/fld.ko 
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/fs/lustre/lmv.ko 
insmod /lib/modules/3.10.0-514.2.2.el7.x86_64/extra/kernel/fs/lustre/lustre.ko 
[root@login01 ~]# mkdir /mnt/lustre
[root@login01 ~]# mount -t lustre 172.20.0.221@o2ib0:172.20.0.222@o2ib0:/lustre /mnt/lustre
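
Once the mount succeeds (see the fix below if it blocks), the client can be sanity-checked; lctl and lfs ship with the client RPMs:

[root@login01 ~]# lctl list_nids
[root@login01 ~]# lfs df -h /mnt/lustre
[root@login01 ~]# mount -t lustre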

Small Fix

The client mount would block in statfs() because one of the OSTs is permanently unavailable. Solution (hours of googling):

[root@mds02 ~]# lctl conf_param lustre.llite.lazystatfs=1
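
conf_param has to be run on the MGS node (mds02 here); the live value can then be read back on any client:

[root@login01 ~]# lctl get_param llite.*.lazystatfs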

Check by mounting manually

[root@login01 ~]# mount -t lustre 172.20.0.221@o2ib0:172.20.0.222@o2ib0:/lustre /mnt/lustre

lazystatfs - allows statfs() to return without blocking when some of the servers are down. In this case, the lustre-OST0000 OSC is permanently unavailable (DOWN). statfs() returns statistics about the mounted file system.

[root@sched ~]# echo "mount -t lustre -o flock 172.20.0.222@o2ib0:172.20.0.221@o2ib0:/lustre /mnt/lustre" >> /etc/rc.local

This mounts the Lustre file system automatically when the node boots up. (On CentOS 7, /etc/rc.local must also be marked executable for this to run.)
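
An alternative to rc.local is an /etc/fstab entry; the _netdev option defers the mount until the network is up. A sketch using the same NIDs and the flock option:

172.20.0.222@o2ib0:172.20.0.221@o2ib0:/lustre /mnt/lustre lustre flock,_netdev 0 0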

Ref: https://jira.hpdd.intel.com/browse/LUDOC-153 https://jira.hpdd.intel.com/browse/LU-4397
