Creating Pacemaker Resources for Lustre Storage Services

From Lustre Wiki

Introduction

In order for Pacemaker to manage Lustre services, they must be represented within the HA framework as resources that can be started, stopped, monitored, and, if need be, moved between nodes within the cluster. Resources are managed by software applications, called resource agents (RA), that have well-defined interfaces for integration with the Pacemaker cluster resource manager.

Remember that a Lustre service is started by mounting an object storage device (OSD) volume formatted as either an MGT, MDT or OST, and stopped by unmounting the OSD. Therefore, in order to start and stop a Lustre service, Pacemaker needs a resource agent that can mount and unmount storage volumes.

There are currently two resource agents capable of managing the Lustre OSDs:

  1. ocf:heartbeat:Filesystem: distributed by ClusterLabs in the resource-agents package, the Filesystem RA is a very mature and stable application, and has been part of the Pacemaker project and its predecessor, Linux-HA Heartbeat, for several years. Filesystem provides generic support for mounting and unmounting storage devices, which indirectly includes Lustre. The main drawback is that it is not tailored to Lustre itself, and may not be able to anticipate error modes specific to Lustre storage. Some of its features could have a potentially adverse effect on Lustre OSDs if not carefully managed.
  2. ocf:lustre:Lustre: developed specifically for Lustre OSDs, this RA is distributed by the Lustre project and is available in Lustre releases from version 2.10.0 onwards. Because of its narrower scope, it is less complex than ocf:heartbeat:Filesystem and therefore better suited to managing Lustre storage resources.

OSDs that have been created using ZFS also require a resource agent that is capable of importing and exporting ZFS pools. The ZFS on Linux project does not provide a Pacemaker resource agent, but a ZFS resource agent, copied from the stmf-ha project (also on GitHub), has been merged into the resource-agents project managed by ClusterLabs on GitHub. A separate resource agent for ZFS pools has been developed by LLNL and is also available on GitHub.

In the examples that follow, the Lustre project's ocf:lustre:Lustre and the ClusterLabs ocf:heartbeat:ZFS resource agents will be used.

Defining Resources in Pacemaker

The basic syntax for defining a resource in Pacemaker using pcs is:

pcs resource create <resource name> <resource agent> \
[resource options] [...]

There are many options available, depending on the complexity of the resource being created and the cluster environment, details of which can be found in the pcs(8) man page, and in the documentation provided by the OS distribution vendor.

Creating resources for Lustre LDISKFS and Lustre ZFS OSDs will be covered later in the article, but before jumping to those sections, take the time to review Defining Constraints for Resources, as this will help to provide a more complete picture of how resources are managed within the Pacemaker framework.

Defining Constraints for Resources

Constraints in Pacemaker are used to determine where a given resource is permitted to run.

There are three forms of constraint:

  1. Location constraints define the nodes that a resource can run on, and nodes that it cannot run on. Location constraints can also define weighting to indicate the preferred node for a resource to run on.
  2. Colocation constraints define dependencies between resources to indicate that the resources should be run from the same node, or conversely, should never be run on the same node.
  3. Ordering constraints define the startup and shutdown sequence for a set of resources in a cluster.

In an ordinary HA Lustre cluster, it is usually the case that any Lustre OSD can be mounted on any node. This is referred to as a symmetric cluster: each node is equally capable of hosting any of the resources. In a symmetric cluster, if no constraints are defined, Pacemaker will decide where each resource will run, and will start and stop each resource in the cluster more or less simultaneously without consideration of any possible dependencies that might exist between the resources.

For small environments, letting Pacemaker decide how resources are distributed across nodes might be acceptable, but the resulting placement is unpredictable. A location constraint lets the administrator state the preferred nodes for each resource and explicitly balance the distribution of resources across the nodes.

When working with ZFS-based Lustre storage volumes, there are two resources defined for each Lustre OSD: one for ZFS, and one for Lustre itself. These resources must run on the same node, so a colocation constraint is required. Furthermore, the ZFS resource must be started before attempting to start the Lustre resource, so an ordering constraint is also required.

Location Constraints

Location constraints are defined using the following syntax:

pcs constraint location <resource name> prefers|avoids <node>=<score>
  • <resource name> is the name of the Pacemaker resource
  • prefers|avoids: whether the constraint expresses a preference for the resource to run on the node or to avoid the node
  • <node> is the name of the cluster node upon which the constraint will take effect
  • <score> is the weighting to apply. A higher score creates a stronger preference to run on the node or to avoid the node. The special value INFINITY changes the preference from "should" to "must".

A resource should have a location constraint defined for each node in the cluster. Location constraint scores usually reflect the order in which the --servicenode NIDs are listed on the mkfs.lustre command line when formatting an OSD.

For a Lustre cluster, location constraints are normally defined using the prefers syntax. For example:

pcs constraint location demo-OST0000 prefers rh7-oss01.lfs.intl=100
pcs constraint location demo-OST0000 prefers rh7-oss02.lfs.intl=50

This states that Lustre resource demo-OST0000 has a higher preference to run on node rh7-oss01.lfs.intl, and should normally be started on that node. If the resource cannot run on the node with the highest score, the node with the next highest score (rh7-oss02.lfs.intl in the example) will be chosen to run the resource instead.
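
As a rough sketch of how these commands might be generated for a list of nodes given in order of preference, the following shell function prints one location constraint per node. The function name print_location_constraints and the scoring scheme (100 for the first node, halved for each later node) are assumptions made for illustration. The commands are only printed, not run.

```shell
# Print (not run) "pcs constraint location" commands for one resource.
# Nodes are listed in order of preference. The first node scores 100 and
# each later node scores half of the previous one (illustrative scheme only).
print_location_constraints() {
    resource=$1
    shift
    score=100
    for node in "$@"; do
        printf 'pcs constraint location %s prefers %s=%d\n' \
            "$resource" "$node" "$score"
        score=$((score / 2))
    done
}

print_location_constraints demo-OST0000 rh7-oss01.lfs.intl rh7-oss02.lfs.intl
```

For the two nodes shown, this prints the same two commands as the example above.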

Colocation Constraints

Colocation constraints are defined as relative dependencies: the location of one resource is dependent upon the location of another resource. For example, resource B needs to run on the same node as resource A. This has a side-effect that Pacemaker must determine where resource A should be located before the location of resource B can be determined.

Be careful when defining location constraints for resources that have a colocation dependency. The location constraint scores of each resource are used to determine initial placement of the first resource.

For example, suppose that the location of resource B depends on the location of resource A, with the following conditions:

  • Resource A prefers node 1 with a score of 20 and node 2 with a score of 10.
  • Resource B prefers node 1 with a score of 0 and node 2 with a score of 100.

When determining the initial placement of resource A, the scores for each resource on each node are added together.

  • The location score for node 1 is 20+0=20 and node 2 is 10+100=110.

Since the highest score determines placement of the resources, they will both be placed on node 2 by Pacemaker.
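
The arithmetic in this example can be reproduced with a few lines of shell. This is only a sketch of the worked example, not of Pacemaker's full placement algorithm. The variable names are hypothetical.

```shell
# Location scores from the example: resource A and resource B on each node.
a_node1=20; a_node2=10
b_node1=0;  b_node2=100

# Because B is colocated with A, the scores on each node are added together.
total_node1=$((a_node1 + b_node1))
total_node2=$((a_node2 + b_node2))

# The node with the highest combined score is chosen
# (a tie is resolved in favour of node 1 here, purely for simplicity).
if [ "$total_node1" -ge "$total_node2" ]; then
    chosen="node 1"
else
    chosen="node 2"
fi
echo "node 1: $total_node1, node 2: $total_node2, placement: $chosen"
```

Changing the score of resource B on node 2 to a value below 10 makes node 1 the preferred location.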

The syntax for creating a colocation constraint using pcs is:

pcs constraint colocation add \
<source resource name> with <target resource name> \
[score=<score>]
  • The command states that <source resource name> should be located on the same node as <target resource name>.
  • Setting the score to a value of INFINITY enforces a mandatory placement constraint: if <target resource name> cannot be run anywhere in the cluster, then <source resource name> cannot run either.
  • Setting the value to -INFINITY states that the resources cannot be colocated on the same cluster node.
  • Setting the score to a finite value greater than -INFINITY and less than INFINITY creates an advisory placement rule. Pacemaker will normally try to honour the rule but may override the placement if it would compromise the running state of cluster resources.

Ordering Constraints

An ordering constraint is used to specify the startup sequence of cluster resources. Ordering can be applied to any resources defined in the cluster: resources can be subject to ordering constraints even if they are running on different nodes within the cluster.

The syntax for creating an ordering constraint using pcs is:

pcs constraint order \
start <resource1 name> then \
start <resource2 name> \
[options]

By default, ordering is mandatory, meaning that if the first resource cannot be started, the second resource cannot be started either. Similarly, if the first resource is stopped, the second resource must also be stopped.

The ordering is also symmetrical by default: the resources will be stopped in reverse order.
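
The defaults can be changed through the [options] field. The sketch below prints three variants of the same ordering constraint, using the kind and symmetrical options described in the pcs(8) man page. The function name print_order_examples and the resource names resource-a and resource-b are hypothetical. The commands are printed rather than run.

```shell
# Print (not run) three variants of an ordering constraint.
print_order_examples() {
    first=$1
    second=$2
    # Default: mandatory and symmetrical (stop order is the reverse).
    echo "pcs constraint order start $first then start $second"
    # Advisory ordering only.
    echo "pcs constraint order start $first then start $second kind=Optional"
    # Start ordering without the implied reverse stop ordering.
    echo "pcs constraint order start $first then start $second symmetrical=false"
}

print_order_examples resource-a resource-b
```

For a ZFS pool and its Lustre OSD, the default mandatory, symmetrical behaviour is normally what is wanted.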

Using Resource Groups to Simplify Constraints

If the resources in a group are always colocated and must be started sequentially and stopped in the reverse sequential order, then the definition of the constraint relationship can be simplified by defining a resource group.

The pcs syntax for creating a group is:

pcs resource group add <group name> \
<resource1 name> \
<resource2 name> \
[...]

Resources can be inserted into existing groups and optionally placed at specific points in the sequence using --before <resource name> or --after <resource name>. Resources can also be removed using the following command:

pcs resource group remove <group name> <resource name>

Adding Lustre LDISKFS OSD Resources to a Pacemaker Cluster

The syntax for creating a Lustre LDISKFS resource using pcs is:

pcs resource create <resource name> ocf:lustre:Lustre \
target=<device> \
mountpoint=<directory path>
  • <resource name> is an arbitrary label used to identify the resource within the cluster framework. It is recommended to include the OSD label in the name, for example demo-OST0000 or lustre-demo-OST0000.
  • target=<device>: <device> is the path to the block device for the storage, usually a multipath device.
  • mountpoint=<directory path>: <directory path> is the mountpoint for the OSD. The directory must exist prior to creating and starting the resource. Make sure that the directory exists on each server in the Pacemaker cluster.
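
Since the mountpoint directory must exist on every server, it can be convenient to create it with a small helper before defining the resource. The following is a minimal sketch. The function name ensure_mountpoint is hypothetical, and the function needs to be run on each server in the Pacemaker cluster.

```shell
# Create the mountpoint directory for an OSD if it does not already exist.
# Run on every server in the Pacemaker cluster before creating the resource.
ensure_mountpoint() {
    dir=$1
    if [ -d "$dir" ]; then
        echo "exists: $dir"
    else
        mkdir -p "$dir" && echo "created: $dir"
    fi
}

# Example, using a path from this article:
#   ensure_mountpoint /lustre/demo/OST0000
```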

Add the location constraints for the resource:

pcs constraint location <resource name> prefers <node 1>=<score 1>
pcs constraint location <resource name> prefers <node 2>=<score 2>

One resource is created for each OSD in the HA cluster. Each resource has a location constraint defined for each cluster node. For example, if there are two cluster nodes, each resource in that cluster will have two location constraints.

Lustre LDISKFS Pacemaker Resource Example

Create the resource:

pcs resource create demo-OST0000 ocf:lustre:Lustre \
target=/dev/mapper/35000c5005f2e7edf \
mountpoint=/lustre/demo/OST0000

Add the location constraints for the resource, one constraint for each node in the cluster:

pcs constraint location demo-OST0000 prefers rh7-oss01.lfs.intl=100
pcs constraint location demo-OST0000 prefers rh7-oss02.lfs.intl=50
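
When an HA pair hosts several OSTs, the same commands are repeated for each one, usually with the preferred node alternating so that the load is balanced. The sketch below prints the commands for a list of OSTs read from standard input. The function name print_ldiskfs_resources, the second OST, and its device path /dev/mapper/mpathb are hypothetical, and the scores 100 and 50 are taken from the example above. Nothing is executed.

```shell
# Print (not run) the pcs commands for several LDISKFS OSTs, alternating the
# preferred node so that the OSTs are balanced across an HA pair.
# Reads "<resource name> <device> <mountpoint>" lines from standard input.
print_ldiskfs_resources() {
    node_a=$1
    node_b=$2
    n=0
    while read -r name device mountpoint; do
        [ -n "$name" ] || continue
        if [ $((n % 2)) -eq 0 ]; then
            primary=$node_a; secondary=$node_b
        else
            primary=$node_b; secondary=$node_a
        fi
        echo "pcs resource create $name ocf:lustre:Lustre target=$device mountpoint=$mountpoint"
        echo "pcs constraint location $name prefers $primary=100"
        echo "pcs constraint location $name prefers $secondary=50"
        n=$((n + 1))
    done
}

# The second OST and its device path are hypothetical placeholders.
print_ldiskfs_resources rh7-oss01.lfs.intl rh7-oss02.lfs.intl <<'EOF'
demo-OST0000 /dev/mapper/35000c5005f2e7edf /lustre/demo/OST0000
demo-OST0001 /dev/mapper/mpathb /lustre/demo/OST0001
EOF
```

Reviewing the printed commands before running them gives a chance to check that each device is paired with the intended mountpoint.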

Adding Lustre ZFS OSD Resources to a Pacemaker Cluster

Before creating a ZFS resource in Pacemaker, make sure that each Lustre server has a resource agent capable of managing a ZFS pool.

  1. Use the following command to see if the ZFS resource agent is installed:
    pcs resource list ocf:heartbeat:ZFS
    
  2. If no match is found, then download the ZFS agent from the ClusterLabs resource-agents project. For example:
    wget -P /usr/lib/ocf/resource.d/heartbeat \
    https://raw.githubusercontent.com/ClusterLabs/resource-agents/master/heartbeat/ZFS
    chmod 755 /usr/lib/ocf/resource.d/heartbeat/ZFS
    chown root:root /usr/lib/ocf/resource.d/heartbeat/ZFS
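
After a manual download, it may be worth confirming on each server that the agent script is present and executable. A minimal sketch follows. The function name agent_installed is hypothetical, and the default path is the one used in the download commands above.

```shell
# Report whether a resource agent script is present and executable.
# The default path matches the download location used in this article.
agent_installed() {
    agent=${1:-/usr/lib/ocf/resource.d/heartbeat/ZFS}
    if [ -x "$agent" ]; then
        echo "installed: $agent"
    else
        echo "missing or not executable: $agent"
        return 1
    fi
}

# Example: agent_installed
```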
    

The syntax for creating a ZFS resource using pcs is:

pcs resource create <ZFS pool resource name> ocf:heartbeat:ZFS \
pool="<pool name>"

To create the Lustre resource:

pcs resource create <Lustre OSD resource name> ocf:lustre:Lustre \
target=<zfs pool>/<dataset name> \
mountpoint=<directory path>

Note that for ZFS OSDs, the target parameter is set to the ZFS pool and dataset, not a device path.

Set the location constraints for both the ZFS pool and Lustre OSD resources. Each resource should be assigned the same score:

pcs constraint location <ZFS pool resource name> prefers <node 1>=<score 1>
pcs constraint location <Lustre OSD resource name> prefers <node 1>=<score 1>

pcs constraint location <ZFS pool resource name> prefers <node 2>=<score 2>
pcs constraint location <Lustre OSD resource name> prefers <node 2>=<score 2>

To create a resource group for the ZFS pool and its associated Lustre OSD, use the following pcs command:

pcs resource group add <group name> \
<ZFS pool resource name> \
<Lustre OSD resource name>

The ordering is significant: the ZFS pool resource must start before the Lustre OSD can be started.

By way of comparison, the following commands create the constraints equivalent to the resource group definition.

  • First, set the colocation constraint, such that the Lustre OSD starts on the same node as the ZFS pool:
    pcs constraint colocation add \
    <Lustre OSD resource name> with <ZFS pool resource name> \
    score=INFINITY
    
  • Next, set the ordering constraint to ensure that the ZFS pool starts before the Lustre OSD resource:
    pcs constraint order \
    start <ZFS pool resource name> then \
    start <Lustre OSD resource name>
    

The pcs resource group add syntax is simpler and will meet most requirements. It is therefore recommended unless there is a specific need to define the colocation and ordering constraints separately.

Every ZFS pool and Lustre OSD in the HA cluster must be added as resources to Pacemaker.

Lustre ZFS OSD Pacemaker Resource Example

Create the ZFS pool resource:

pcs resource create zfs-demo-ost0pool ocf:heartbeat:ZFS \
pool="demo-ost0pool"

Create the Lustre resource:

pcs resource create demo-OST0000 ocf:lustre:Lustre \
target="demo-ost0pool/ost0" \
mountpoint="/lustre/demo/OST0000"

Add the location constraints for the resource, one constraint for each node in the cluster:

pcs constraint location zfs-demo-ost0pool prefers rh7-oss01.lfs.intl=100
pcs constraint location demo-OST0000 prefers rh7-oss01.lfs.intl=100

pcs constraint location zfs-demo-ost0pool prefers rh7-oss02.lfs.intl=50
pcs constraint location demo-OST0000 prefers rh7-oss02.lfs.intl=50

Create a resource group to control colocation and startup ordering of the resources:

pcs resource group add group-demo-OST0000 \
zfs-demo-ost0pool \
demo-OST0000
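
The complete sequence for one ZFS-backed OSD can be collected into a single helper. The sketch below prints the commands used in this example. The function name print_zfs_osd_resources and its argument order are assumptions, and the zfs- and group- name prefixes simply follow the naming used in this article. Nothing is executed.

```shell
# Print (not run) the pcs commands for one ZFS-backed OSD.
# Arguments: <pool> <dataset> <OSD label> <mountpoint> <node 1> <node 2>
print_zfs_osd_resources() {
    pool=$1; dataset=$2; osd=$3; mountpoint=$4; node1=$5; node2=$6
    zfs_res="zfs-$pool"
    echo "pcs resource create $zfs_res ocf:heartbeat:ZFS pool=\"$pool\""
    echo "pcs resource create $osd ocf:lustre:Lustre target=\"$pool/$dataset\" mountpoint=\"$mountpoint\""
    # Both resources are given the same score on each node.
    for res in "$zfs_res" "$osd"; do
        echo "pcs constraint location $res prefers $node1=100"
        echo "pcs constraint location $res prefers $node2=50"
    done
    # Group order matters: the pool is listed first, so it starts first.
    echo "pcs resource group add group-$osd $zfs_res $osd"
}

print_zfs_osd_resources demo-ost0pool ost0 demo-OST0000 \
    /lustre/demo/OST0000 rh7-oss01.lfs.intl rh7-oss02.lfs.intl
```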

Lustre Server Health Monitoring in Pacemaker

Pacemaker can be configured to monitor aspects of the cluster servers to help determine overall system health. This provides additional data points that are used to make decisions about where to run resources. Lustre version 2.10 introduced two monitoring resource agents:

  • ocf:lustre:healthLNET – used to monitor LNet connectivity.
  • ocf:lustre:healthLUSTRE – used to monitor Lustre's health status.

Setting up Pacemaker to use these monitoring resource agents is a two-step process:

  1. The monitoring resource is created in the cluster framework. The resource creates and updates an attribute that contains a score for the resource's health.
  2. A location constraint is created that reads the monitoring attribute, and moves the cluster resources if the value falls below a defined threshold.

Creating an LNET Monitoring Resource

ocf:lustre:healthLNET is a modification of the standard ocf:pacemaker:ping resource agent and uses lctl ping in place of the system ping command.

The syntax used to create the healthLNET resource is:

pcs resource create healthLNET ocf:lustre:healthLNET \
  lctl=true \
  multiplier=<m> \
  device=<network device> \
  host_list="<NID> [<NID> ...]" \
  --clone
  • lctl=true tells the resource agent to monitor an LNet NID using lctl ping. If this is not set, the regular system ping command is used.
  • multiplier is a positive integer value that is multiplied by the number of machines that respond to the ping. The multiplier needs to be larger than the resource stickiness value.
  • device is the network device to monitor e.g., eth1, ib0.
  • host_list is a space-separated list of LNet NIDs to try and ping. If lctl=false, host_list should contain regular host names or IP addresses.
  • --clone tells Pacemaker to start an instance of the resource on every node in the cluster.

Only one healthLNET resource needs to be defined per Pacemaker cluster. Because it is defined as a cloned resource, instances will be started by Pacemaker on each individual node.

The NIDs in host_list should be for machines that are known to be reliable and are normally expected to be online (usually servers). It is also highly recommended that at least two NIDs are listed, each one representing a different machine. If only one NID is listed and the ping command fails, the agent cannot determine if the fault is local to the originating machine (where the ping script is running) or if the remote machine is at fault. Adding more NIDs to host_list increases the diagnostic accuracy of the agent: if the ping fails to connect to any of the NIDs, there is a high probability that the problem is local to the originating machine, rather than one of the remote machines.

Where possible, choose NIDs for servers that are not members of the Pacemaker cluster. For example, OSS HA pairs could be set up to ping the MGS and MDS nodes, or nodes from other OSS HA pairs. Similarly, the MGS / MDS nodes could ping OSSs.

The healthLNET resource updates an attribute called pingd. When the ping resource fails to get a ping reply from the NIDs in host_list, the attribute has a value 0 (zero). Pacemaker can be configured with a location constraint that forces resources to move to another cluster node when this happens.
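
Based on the description of multiplier given earlier, the value of pingd can be worked out by hand. The sketch below only illustrates that arithmetic. The function name pingd_value is hypothetical and is not part of the resource agent.

```shell
# pingd value as described in this article:
# multiplier x number of hosts that replied to the ping.
pingd_value() {
    multiplier=$1
    responding=$2
    echo $((multiplier * responding))
}

# With multiplier=1001 and two NIDs in host_list:
pingd_value 1001 2   # both NIDs reply
pingd_value 1001 1   # one NID replies
pingd_value 1001 0   # no replies: the attribute is 0
```

The three calls print 2002, 1001 and 0 respectively.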

The following rule will force a resource to move when the pingd attribute is zero, or undefined:

pcs constraint location <resource name> \
rule score=-INFINITY pingd lte 0 or not_defined pingd

This rule should be applied to every resource in the cluster.
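
Because the same rule is repeated for every resource, the commands can be generated in a loop. The sketch below prints the rule for each resource name supplied. The function name print_health_rules is hypothetical, and the attribute name is passed as the first argument so that the same helper could be reused for other health attributes. The commands are printed rather than run.

```shell
# Print (not run) the health rule for each named resource.
# Arguments: <attribute> <resource name> [<resource name> ...]
print_health_rules() {
    attribute=$1
    shift
    for resource in "$@"; do
        echo "pcs constraint location $resource rule score=-INFINITY $attribute lte 0 or not_defined $attribute"
    done
}

print_health_rules pingd zfs-demo-ost0pool demo-OST0000
```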

Example LNET Health Check Resource

pcs resource create healthLNET ocf:lustre:healthLNET \
  lctl=true \
  multiplier=1001 \
  device=ib0 \
  host_list="10.10.227.11@o2ib0 10.10.227.12@o2ib0" \
  --clone

# Add monitoring constraint to ZFS Pool
pcs constraint location zfs-demo-ost0pool \
rule score=-INFINITY pingd lte 0 or not_defined pingd

# Add monitoring constraint to Lustre OSD
pcs constraint location demo-OST0000 \
rule score=-INFINITY pingd lte 0 or not_defined pingd

Creating a Lustre Health Monitoring Resource

ocf:lustre:healthLUSTRE follows the same implementation model as ocf:lustre:healthLNET, except that instead of monitoring LNet NIDs, ocf:lustre:healthLUSTRE monitors the content of Lustre's health_check file and maintains an attribute called lustred.

If the health_check file contains status healthy, then the resource agent sets the value of lustred to 1 (one). Otherwise, the attribute is set to 0 (zero).
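
The mapping from health status to attribute value can be sketched as follows. This is an illustration of the rule described above, not the resource agent's own code. The function name lustred_value is hypothetical, and the status is read from a file path passed as an argument. On a live server the status is usually read with lctl get_param -n health_check.

```shell
# Map the content of a health_check file to a lustred value (1 or 0).
# The file path is passed as an argument for illustration only.
lustred_value() {
    status=$(cat "$1" 2>/dev/null)
    if [ "$status" = "healthy" ]; then
        echo 1
    else
        echo 0
    fi
}

# Example: lustred_value /path/to/health_check
```

Any content other than healthy, including a missing or unreadable file, yields 0 in this sketch.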

The syntax used to create the healthLUSTRE resource is:

pcs resource create healthLUSTRE ocf:lustre:healthLUSTRE --clone
  • --clone tells Pacemaker to start an instance of the resource on every node in the cluster.

Only one healthLUSTRE resource needs to be defined per Pacemaker cluster. Because it is defined as a cloned resource, instances will be started by Pacemaker on each individual node.

Pacemaker can be configured with a location constraint that forces resources to move to another cluster node when the Lustre health check fails.

The following rule will force a resource to move when the lustred attribute is zero, or undefined:

pcs constraint location <resource name> \
rule score=-INFINITY lustred lte 0 or not_defined lustred

This rule should be applied to every resource in the cluster.

Example Lustre Health Check Resource

pcs resource create healthLUSTRE ocf:lustre:healthLUSTRE --clone

# Add monitoring constraint to ZFS Pool
pcs constraint location zfs-demo-ost0pool \
rule score=-INFINITY lustred lte 0 or not_defined lustred

# Add monitoring constraint to Lustre OSD
pcs constraint location demo-OST0000 \
rule score=-INFINITY lustred lte 0 or not_defined lustred