MDT Mirroring with ZFS and SRP

Original information provided by Jesse Stroik, January 2014.

At SSEC we tested mirrored metadata targets for our Lustre file systems. The idea is to use ZFS to mirror the storage targets across two different MDS nodes, so the data is always available on both servers without using iSCSI or other technologies. The basis of this idea comes from Charles Taylor's LUG 2012 presentation, "High Availability Lustre Using SRP-Mirrored LUNs".

Instead of LVM, we will use ZFS to provide the mirror. SCST providing the targets over InfiniBand RDMA, with ZFS mirrors on top, performed well in our testing. We did not have a chance to test more thoroughly for production.
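For illustration, here is a minimal sketch of the resulting layout. The pool and device names are hypothetical: /dev/sdb is a local disk (or RAID LUN) and /dev/sdc is the other MDS's LUN as presented over SRP.

# build the metadata pool as a ZFS mirror of a local device and the SRP-imported device
zpool create meta-pool mirror /dev/sdb /dev/sdc
zpool status meta-pool   # both sides of the mirror should show ONLINE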

Below are notes from our investigation. The system worked very well in our brief testing, but was not tested enough to put into production.

Terminology

  • Target - The device to which data will be written. Usually it controls a group of LUNs (think OSS, not OST or individual disk).
  • Initiator - The system or device attempting to access the target. Client system in our case.

Protocols

SRP - SCSI RDMA Protocol.

Despite its name, SRP can be implemented without RDMA. As we would likely deploy it, it is a protocol for communicating with SCSI devices directly over RDMA.

iSER - iSCSI Extensions for RDMA

A layer of abstraction on the iSCSI protocol, implemented by a "Datamover Architecture" with RDMA support. The basic idea is simple: RDMA allows devices to reach each other's memory directly. When an initiator begins an unsolicited write, the disk uses the protocol to read the data from the initiator directly while writing to itself. So the target effectively goes and reads the data off of the initiator.

SRP Implementations

  1. LIO - SRP implementation by Datera, a Silicon Valley startup from 2011.

It appears that Datera got this thing into the Linux kernel, but deployment and usage documentation is nonexistent or very hard to actually find.

targetcli is a Python CLI management interface for the targets.

  2. SCST - a generic SCSI target subsystem for Linux (kernel-space, not in the official tree).

This framework includes a few components:

  • Core/Engine software
  • Target "Drivers" - I put drivers in quotes because this part is implemented as a kernel module and they call it a driver, but it is software that controls the Target (think OSS) and doesn't really provide a hardware driver as far as I understand.
  • Storage Drivers - This is the part that implements the SCSI commands on the LUN (in our case, attached OSTs).

We will likely need to compile/link the target and storage drivers against a kernel version, and install only with that kernel version. We already link kernel versions with Lustre, so this may not be unreasonable.

iSER Implementations

  • LIO - See information in SRP implementation. This also implements iSER.
  • STGT - SCSI Target Framework (userspace)
    • This doesn't perform as well as SCST according to the research. It's considered obsolete, but I included the definition because it may be mentioned in a lot of documentation.


Technologies Available with Summary

  1. TGTD/iSER via scsi-target-utils
  2. LIO
  3. SCST
  4. Snapshots

LIO

This seems like it has disadvantages.

It appears not to be 'safe' for writes: you'd have to disable all caching, and they want you to use their OS. As far as we could tell, it's effectively not open. We probably cannot implement it without their OS and their support. It lags behind SCST.

SCST

We tested this. We had to grab their source, compile and link against our kernel, and install. This may be a temporary issue if we need to link against a newer kernel than Lustre currently supports, but Lustre is being accepted into the kernel as well, so we can plan on this being our future technology even if we cannot use it now.

Installing SCST

NOTE BEFORE IMPLEMENTING: THIS APPEARS TO USE 128KB INODE SIZES WHEN EXPORTED VIA ZPOOLS

This requires the OFA OFED stack and links against it.

  1. Download and install the SCST package and scstadmin: http://sourceforge.net/projects/scst/files/
  2. Extract the SCST package, and verify your Makefile lines if necessary:
       export KDIR=/usr/src/kernels/2.6.32-431.el6.x86_64/
       export KVER=2.6.32-431.el6.x86_64
  3. Build and install SCST and SRPT:
       make scst && make scst_install
       make srpt && make srpt_install
  4. Set SCST up to start on boot, then load the modules into the kernel:
       /usr/lib/lsb/install_initd scst
       chkconfig --add scst
       modprobe scst
       modprobe ib_srpt
       modprobe scst_vdisk
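As a quick sanity check after the steps above (just a verification sketch using the modules and init script installed above):

lsmod | egrep 'scst|ib_srpt'   # scst, scst_vdisk and ib_srpt should all be loaded
chkconfig --list scst          # scst should be on for the default runlevels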

Setting up the Devices

NOTE: We ended up using hardware RAID on the bottom, exported that via SRP, and used a ZFS mirror at the top level. This very first step can therefore be skipped.

If you are using ZFS underneath, you need to create a ZFS volume (zvol). For example, let's say we have the pool shps-meta, which is made up of some disks.

    1. zfs create -V 300G shps-meta/MDT
       zfs set canmount=off shps-meta
    2. Now you have the device /dev/zvol/shps-meta/MDT
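To confirm the zvol is in place before handing it to SCST (a verification sketch using the dataset created above):

ls -l /dev/zvol/shps-meta/MDT    # the block device SCST will export
zfs get volsize shps-meta/MDT    # should report 300G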


On each system, once you have your LUN prepared (with the RAID controller or zpool) it's time to register that device:

scstadmin -open_dev MDT1 -handler vdisk_blockio -attributes filename=/dev/zvol/shps-meta/MDT

Then list the device and target:

scstadmin -list_device
scstadmin -list_target
ls -l /sys/kernel/scst_tgt/devices

You should get some info from each: the MDT1 device you just created, and also ib_srpt_target_0. If you don't get that, reload the ib_srpt module.
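If ib_srpt_target_0 does not show up, reloading the module typically looks like this (sketch only):

modprobe -r ib_srpt
modprobe ib_srpt
scstadmin -list_target   # ib_srpt_target_0 should now be listed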

Define a security group (the hosts that can write):

scstadmin --add_group MDS -driver ib_srpt -target ib_srpt_target_0
scstadmin -list_group

Add initiators to the group:

(for testing, leave this open)
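If you later want to restrict the group to specific hosts, adding a single initiator uses the same scstadmin syntax as the wildcard shown further down; the initiator ID below is only a placeholder for whatever ID your MDS actually presents:

scstadmin -add_init '<initiator_id>' -driver ib_srpt -target ib_srpt_target_0 -group MDS
scstadmin -list_group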


Assign the LUNs to the target:

scstadmin -add_lun 0 -driver ib_srpt -target ib_srpt_target_0 -group MDS -device MDT1

Now enable the target:

scstadmin -enable_target ib_srpt_target_0 -driver ib_srpt

And enable the driver:

scstadmin -set_drv_attr ib_srpt -attributes enabled=1
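To double-check that the target really is enabled, the attribute can also be read back from SCST's sysfs tree (a check sketch; the path follows the /sys/kernel/scst_tgt layout referenced above):

cat /sys/kernel/scst_tgt/targets/ib_srpt/ib_srpt_target_0/enabled   # should print 1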

Modprobe options for the driver. This example configures one target per HCA port:

cat /etc/modprobe.d/ib_srpt.conf
options ib_srpt one_target_per_port=1

Set up permissions for the LUN (necessary)

scstadmin -add_init '*' -driver ib_srpt -target ib_srpt_target_0 -group MDS

Initiator setup

On the target, ensure that this initiator has permission to access the disk (via the security group and -add_init steps above).


First, load the module ib_srp:

modprobe ib_srp

Note: This module is part of OFED. OFED also includes the ib_srpt (target) module which is used to host the FS.


Now, search for the available targets:

srp_daemon -o -a -c -d /dev/infiniband/umad0

Note: There could be multiple /dev/infiniband/umad devices. (umad bro?)
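If you do have several umad devices, the same query can simply be run against each one (a small sketch):

for d in /dev/infiniband/umad*; do
    srp_daemon -o -a -c -d "$d"
done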


Add the target(s). On the initiator, this is done by echoing the target description reported by srp_daemon into the SRP driver's add_target file in sysfs:



find /sys -iname add_target -print

echo "id_ext=0002c90300b77f40,ioc_guid=0002c90300b77f40,dgid=fe800000000000000002c90300b77f41,pkey=ffff,service_id=0002c90300b77f40" > \
  /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0/infiniband_srp/srp-mlx4_0-1/add_target

Note: in the above example, the part echoed is the result of the previous srp_daemon command (there may be multiple devices to add this way), and the redirection is into the result of the find command.
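Putting the two pieces together, the discovery output can be fed straight into the add_target file. This is only a sketch, assuming a single HCA/port and the first add_target file that find reports:

ADD_TARGET=$(find /sys -iname add_target | head -n 1)
srp_daemon -o -a -c -d /dev/infiniband/umad0 | while read -r target; do
    echo "$target" > "$ADD_TARGET"
done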


Be sure to write the config:


scstadmin -write_config /etc/scst.conf

And then ensure the startup script is in chkconfig:

chkconfig --list scst
chkconfig --list srpd
chkconfig --list rdma

/etc/rdma/rdma.conf must contain the line:

SRP_LOAD=yes
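If any of those services are off, enabling them is the usual chkconfig call (sketch; service names as listed above):

chkconfig scst on
chkconfig srpd on
chkconfig rdma on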

Metadata Backup

Snapshots

ZFS Snapshots for MDT Backup

This is our backup. We can just use ZFS features to keep the metadata reasonably in sync. This won't be perfectly up to date, sadly, but a sync from an hour or two ago means an hour or two of data may have been lost, which is very acceptable in many cases.

We always back up the metadata as well. This is necessary even if we have a mirrored MDT.
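As a rough illustration of the snapshot-based approach (the dataset and backup host names here are hypothetical, and the real procedure is on the page linked above), an hourly cron job might look something like:

# snapshot the dataset that holds the MDT and replicate it to a backup host
SNAP="shps-meta/MDT@backup-$(date +%Y%m%d%H)"
zfs snapshot "$SNAP"
zfs send "$SNAP" | ssh backup-host zfs receive -F backup/shps-meta-MDT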