Creating Pacemaker Resources for Lustre Storage Services

Introduction
In order for Pacemaker to manage Lustre services, they must be represented within the HA framework as resources that can be started, stopped, monitored, and, if need be, moved between nodes within the cluster. Resources are managed by software applications, called resource agents (RA), that have well-defined interfaces for integration with the Pacemaker cluster resource manager.

Remember that a Lustre service is started by mounting an object storage device (OSD) volume formatted as either an MGT, MDT or OST, and stopped by unmounting the OSD. Therefore, in order to start and stop a Lustre service, Pacemaker needs a resource agent that can mount and unmount storage volumes.

There are currently two resource agents capable of managing the Lustre OSDs:


 * ocf:heartbeat:Filesystem: distributed by ClusterLabs in the resource-agents package, the Filesystem RA is a very mature and stable application, and has been part of the Pacemaker project and its predecessor, Linux-HA Heartbeat, for several years. Filesystem provides generic support for mounting and unmounting storage devices, which indirectly includes Lustre. The main drawback is that it is not tailored to Lustre itself, and may not be able to anticipate error modes specific to Lustre storage. Some of its features could have a potentially adverse effect on Lustre OSDs if not carefully managed.
 * ocf:lustre:Lustre: developed specifically for Lustre OSDs, this RA is distributed by the Lustre project and is available in Lustre releases from version 2.10.0 onwards. As a result of its narrower scope, it is less complex than ocf:heartbeat:Filesystem and, as a consequence, better suited to managing Lustre storage resources.

OSDs that have been created using ZFS also require a resource agent that is capable of importing and exporting ZFS pools. The ZFS on Linux project does not provide a Pacemaker resource agent, but there is a ZFS resource agent being integrated into the resource-agents project managed by ClusterLabs on GitHub. This resource agent was copied from the stmf-ha project (also on GitHub) and merged into resource-agents. A separate resource agent for ZFS pools has been developed by LLNL and is also available on GitHub.

In the examples that follow, the Lustre project's ocf:lustre:Lustre and the ClusterLabs ocf:heartbeat:ZFS resource agents will be used.

Defining Resources in Pacemaker
The basic syntax for defining a resource in Pacemaker, using pcs, is:

 pcs resource create <resource name> <resource agent> \
   [resource options] [...]

There are many options available, depending on the complexity of the resource being created and the cluster environment. Details can be found in the pcs man page and in the documentation provided by the OS distribution vendor.

Creating resources for Lustre LDISKFS and Lustre ZFS OSDs will be covered later in the article, but before jumping to those sections, take the time to review Defining Constraints for Resources, as this will help to provide a more complete picture of how resources are managed within the Pacemaker framework.

Defining Constraints for Resources
Constraints in Pacemaker are used to determine where a given resource is permitted to run.

There are three forms of constraint:


 1) Location constraints define the nodes that a resource can run on, and nodes that it cannot run on. Location constraints can also define weighting to indicate the preferred node for a resource to run on.
 2) Colocation constraints define dependencies between resources to indicate that the resources should be run from the same node, or conversely, should never be run on the same node.
 3) Ordering constraints define the startup and shutdown sequence for a set of resources in a cluster.

In an ordinary HA Lustre cluster, it is usually the case that any Lustre OSD can be mounted on any node. This is referred to as a symmetric cluster: each node is equally capable of hosting any of the resources. In a symmetric cluster, if no constraints are defined, Pacemaker will decide where each resource will run, and will start and stop each resource in the cluster more or less simultaneously without consideration of any possible dependencies that might exist between the resources.

For small environments, letting Pacemaker manage the distribution of resources across nodes might be acceptable, but the behaviour is unpredictable. Specifying a location constraint allows the administrator to specify the preferred locations for resources to run, and explicitly balance the distribution of resources across the nodes.

When working with ZFS-based Lustre storage volumes, there are two resources defined for each Lustre OSD: one for ZFS, and one for Lustre itself. These resources must run on the same node, so a colocation constraint is required. Furthermore, the ZFS resource must be started before attempting to start the Lustre resource, so an ordering constraint is also required.

Location Constraints
Location constraints are defined using the following syntax:

 pcs constraint location <resource name> prefers|avoids <node>=<score>

 * <resource name> is the name of the Pacemaker resource
 * prefers|avoids states whether the constraint expresses a preference to run on the node or to avoid the node
 * <node> is the name of the cluster node upon which the constraint will take effect
 * <score> is the weighting to apply. A higher score creates a stronger preference to run on the node or to avoid the node. The special value INFINITY changes the preference from "should" to "must".

A resource should have a location constraint defined for each node in the cluster. Location constraint scores usually reflect the order in which the service node NIDs are listed on the mkfs.lustre command line when formatting an OSD.

For a Lustre cluster, location constraints are normally defined using the prefers syntax. For example:

 pcs constraint location demo-OST0000 prefers rh7-oss01.lfs.intl=100
 pcs constraint location demo-OST0000 prefers rh7-oss02.lfs.intl=50

This states that Lustre resource demo-OST0000 has a higher preference to run on node rh7-oss01.lfs.intl, and should normally be started on that node. If the resource cannot run on the node with the highest score, the node with the next highest score (rh7-oss02.lfs.intl in the example) will be chosen to run the resource instead.
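When a resource can run on several nodes, the per-node constraints can be generated rather than typed by hand. The following is a minimal sketch and not part of pcs or Lustre: the helper name location_constraints and the scoring scheme (50 points per position, highest first) are assumptions chosen to reproduce the scores in the example above. It only prints the commands, so they can be reviewed before being run.

```shell
#!/bin/sh
# Print one "pcs constraint location ... prefers" command per node.
# Nodes are listed in order of preference. The first node receives
# 50 x (number of nodes) and each later node 50 less (assumed scheme).
location_constraints() {
    resource=$1
    shift
    score=$(( $# * 50 ))
    for node in "$@"; do
        echo "pcs constraint location $resource prefers $node=$score"
        score=$((score - 50))
    done
}

location_constraints demo-OST0000 rh7-oss01.lfs.intl rh7-oss02.lfs.intl
```

With the two nodes shown, the sketch prints the same two commands as the example, with scores of 100 and 50.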

Colocation Constraints
Colocation constraints are defined as relative dependencies: the location of one resource is dependent upon the location of another resource. For example, resource B needs to run on the same node as resource A. This has a side-effect that Pacemaker must determine where resource A should be located before the location of resource B can be determined.

Be careful when defining location constraints for resources that have a colocation dependency. The location constraint scores of each resource are used to determine initial placement of the first resource.

For example, suppose that the location of resource B depends on the location of resource A, with the following conditions:


 * Resource A prefers node 1 with a score of 20 and node 2 with a score of 10.
 * Resource B prefers node 1 with a score of 0 and node 2 with a score of 100.

When determining the initial placement of resource A, the scores for each resource on each node are added together.


 * The location score for node 1 is 20+0=20 and node 2 is 10+100=110.

Since the highest score determines placement of the resources, they will both be placed on node 2 by Pacemaker.
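The arithmetic in this example can be checked with a few lines of shell. This is only an illustration of the score addition described above, using the scores from the example. It is not how Pacemaker is implemented, and a real cluster also takes other factors, such as resource stickiness, into account.

```shell
#!/bin/sh
# Location scores from the example above.
a_node1=20;  a_node2=10     # resource A
b_node1=0;   b_node2=100    # resource B, which must be colocated with A

# Scores of the colocated resources are added together for each node.
node1=$((a_node1 + b_node1))
node2=$((a_node2 + b_node2))

# The node with the highest combined score is chosen.
# (Tie-breaking here is arbitrary and only for the illustration.)
if [ "$node1" -ge "$node2" ]; then
    chosen=node1
else
    chosen=node2
fi

echo "node1=$node1 node2=$node2 chosen=$chosen"
```

Running the sketch prints node1=20 node2=110 chosen=node2, which matches the placement described above.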

The syntax for creating a colocation constraint using pcs is:

 pcs constraint colocation add \
   <source resource> with <target resource> \
   score=<score>

This states that <source resource> should be located on the same node as <target resource>.

 * Setting the score to a value of INFINITY enforces a mandatory placement constraint: if <target resource> cannot be run anywhere in the cluster, then <source resource> cannot run either.
 * Setting the value to -INFINITY states that the resources cannot be colocated on the same cluster node.
 * Setting the score to a finite value greater than -INFINITY and less than INFINITY creates an advisory placement rule. Pacemaker will normally try to honour the rule but may override the placement if it would compromise the running state of cluster resources.

Ordering Constraints
An ordering constraint is used to specify the startup sequence of cluster resources. Ordering can be applied to any resources defined in the cluster: resources can be subject to ordering constraints even if they are running on different nodes within the cluster.

The syntax for creating an ordering constraint using pcs is:

 pcs constraint order \
   start <first resource> then \
   start <second resource> \
   [options]

By default, ordering is mandatory, meaning that if the first resource cannot be started, the second resource cannot be started either. Similarly, if the first resource is stopped, the second resource must also be stopped.

The ordering is also symmetrical by default: the resources will be stopped in reverse order.
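The two defaults described above can also be written out explicitly using the kind and symmetrical options of pcs constraint order. The sketch below is a dry run under stated assumptions: the helper name order_constraint and the PCS variable are hypothetical conveniences, and the resource names are borrowed from the examples elsewhere in this article. By default the command is printed rather than executed.

```shell
#!/bin/sh
# Dry run by default: the command is printed, not executed.
# Set PCS=pcs in the environment to apply it to a real cluster.
PCS=${PCS:-echo pcs}

# Create an ordering constraint: start $1 first, then start $2.
# Any further arguments are passed through as constraint options.
order_constraint() {
    first=$1
    second=$2
    shift 2
    $PCS constraint order start "$first" then start "$second" "$@"
}

# Mandatory, symmetrical ordering (the defaults, stated explicitly).
order_constraint zfs-demo-ost0pool demo-OST0000 kind=Mandatory symmetrical=true
```

Passing kind=Optional or symmetrical=false instead would relax the corresponding default. Whether that is appropriate depends on the resources involved.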

Using Resource Groups to Simplify Constraints
If the resources in a group are always colocated and must be started sequentially and stopped in the reverse sequential order, then the definition of the constraint relationship can be simplified by defining a resource group.

The pcs syntax for creating a group is:

 pcs resource group add <group name> \
   <resource 1> \
   <resource 2> \
   [...]

Resources can be inserted into existing groups and optionally placed at specific points in the sequence using --before or --after. Resources can also be removed using the following command:

 pcs resource group remove <group name> <resource name>

Adding Lustre LDISKFS OSD Resources to a Pacemaker Cluster
The syntax for creating a Lustre LDISKFS resource using pcs is:

 pcs resource create <resource name> ocf:lustre:Lustre \
   target=<device> \
   mountpoint=<directory path>

 * <resource name> is an arbitrary label used to identify the resource within the cluster framework. It is recommended to include the OSD label in the name, for example demo-OST0000.
 * target: <device> is the path to the block device for the storage, usually a multipath device.
 * mountpoint: <directory path> is the mountpoint for the OSD. The directory must exist on each server in the Pacemaker cluster before the resource is created and started.
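As an illustration of the syntax above, the following sketch builds the command for a single LDISKFS OSD. The device path /dev/mapper/mpatha is a hypothetical multipath device, the helper name create_ldiskfs_osd and the PCS variable are assumptions made for this example, and the resource name and mountpoint follow the naming used elsewhere in this article. The command is printed rather than executed unless PCS is set to pcs.

```shell
#!/bin/sh
# Dry run by default: the command is printed, not executed.
# Set PCS=pcs in the environment to apply it to a real cluster.
PCS=${PCS:-echo pcs}

# Create a Lustre resource for an LDISKFS OSD.
#   $1  resource name (ideally includes the OSD label)
#   $2  block device path (hypothetical in the call below)
#   $3  mountpoint, which must already exist on every cluster node
create_ldiskfs_osd() {
    name=$1
    device=$2
    mountpoint=$3
    $PCS resource create "$name" ocf:lustre:Lustre \
        target="$device" mountpoint="$mountpoint"
}

create_ldiskfs_osd demo-OST0000 /dev/mapper/mpatha /lustre/demo/OST0000
```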

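Add the location constraints for the resource:

 pcs constraint location <resource name> prefers <node>=<score>

Adding Lustre ZFS OSD Resources to a Pacemaker Cluster
The syntax for creating a ZFS pool resource using pcs is:

 pcs resource create <ZFS pool resource name> ocf:heartbeat:ZFS \
   params pool="<pool name>"

To create the Lustre resource:

 pcs resource create <Lustre OSD resource name> ocf:lustre:Lustre \
   target=<pool name>/<dataset name> \
   mountpoint=<directory path>

Note that the target parameter is set to the ZFS pool and dataset, not a device path.

Set the location constraints for both the ZFS pool and Lustre OSD resources. Each resource should be assigned the same score:

 pcs constraint location <resource name> prefers <node>=<score>

The ZFS pool and Lustre OSD resources must run on the same node and be started in sequence. The simplest way to achieve this is to create a resource group using the pcs resource group add command:

 pcs resource group add <group name> \
   <ZFS pool resource name> \
   <Lustre OSD resource name>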

The ordering is significant: the ZFS pool resource must start before the Lustre OSD can be started.

By way of comparison, the following commands create the constraints equivalent to the resource group definition.

 * First, set the colocation constraint, such that the Lustre OSD starts on the same node as the ZFS pool:

 pcs constraint colocation add \
   <Lustre OSD resource name> with <ZFS pool resource name> \
   score=INFINITY

 * Next, set the ordering constraint to ensure that the ZFS pool starts before the Lustre OSD resource:

 pcs constraint order \
   start <ZFS pool resource name> then \
   start <Lustre OSD resource name>

The resource group syntax is simpler and will meet most requirements. It is therefore recommended unless there is a specific need to define the colocation and ordering constraints separately.

Every ZFS pool and Lustre OSD in the HA cluster must be added as resources to Pacemaker.

Lustre ZFS OSD Pacemaker Resource Example
Create the ZFS pool resource:

 pcs resource create zfs-demo-ost0pool ocf:heartbeat:ZFS \
   params pool="demo-ost0pool"

Create the Lustre resource:

 pcs resource create demo-OST0000 ocf:lustre:Lustre \
   target="demo-ost0pool/ost0" \
   mountpoint="/lustre/demo/OST0000"

Add the location constraints for the resource, one constraint for each node in the cluster:

 pcs constraint location zfs-demo-ost0pool prefers rh7-oss01.lfs.intl=100
 pcs constraint location demo-OST0000 prefers rh7-oss01.lfs.intl=100

 pcs constraint location zfs-demo-ost0pool prefers rh7-oss02.lfs.intl=50
 pcs constraint location demo-OST0000 prefers rh7-oss02.lfs.intl=50

Create a resource group to control colocation and startup ordering of the resources:

 pcs resource group add group-demo-OST0000 \
   zfs-demo-ost0pool \
   demo-OST0000

Lustre Server Health Monitoring in Pacemaker
Pacemaker can be configured to monitor aspects of the cluster servers to help determine overall system health. This provides additional data points that are used to make decisions about where to run resources. Lustre version 2.10 introduced two monitoring resource agents:


 * ocf:lustre:healthLNET – used to monitor LNet connectivity.
 * ocf:lustre:healthLUSTRE – used to monitor Lustre's health status.

Setting up Pacemaker to use these monitoring resource agents is a two-step process:


 1) The monitoring resource is created in the cluster framework. The resource creates and updates an attribute that contains a score for the resource's health.
 2) A location constraint is created that reads the monitoring attribute, and moves the cluster resources if the value falls below a defined threshold.

Creating an LNET Monitoring Resource
ocf:lustre:healthLNET is a modification of the standard ocf:pacemaker:ping resource agent and uses lctl ping in place of the system ping command.

The syntax used to create the healthLNET resource is:

 pcs resource create healthLNET ocf:lustre:healthLNET \
   lctl=true \
   multiplier=<m> \
   device=<network device> \
   host_list="<NID> [<NID> ...]" \
   --clone


 * lctl=true tells the resource agent to monitor an LNet NID using lctl ping. If this is not set, the regular system ping command is used.
 * multiplier: <m> is a positive integer value that is multiplied by the number of machines that respond to the ping. The multiplier needs to be larger than the resource stickiness value.
 * device: <network device> is the network device to monitor, e.g. ib0.
 * host_list is a space-separated list of LNet NIDs to try to ping. If lctl is not set to true, host_list should contain regular host names or IP addresses.
 * --clone tells Pacemaker to start an instance of the resource on every node in the cluster.
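The effect of the multiplier can be seen with a small calculation. The sketch below only reproduces the multiplication described in the list above. The function name pingd_score is hypothetical, and the multiplier of 1001 with two hosts is an assumed configuration that matches the example later in this article.

```shell
#!/bin/sh
# Value of the health attribute = multiplier x number of hosts that replied.
pingd_score() {
    multiplier=$1
    responding=$2
    echo $((multiplier * responding))
}

pingd_score 1001 2   # both hosts in host_list replied
pingd_score 1001 1   # only one host replied
pingd_score 1001 0   # no host replied
```

The three calls print 2002, 1001 and 0. Only the last case leaves the attribute at zero, which is the condition the location rule acts upon.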

Only one healthLNET resource agent needs to be defined per Pacemaker cluster. Because it is defined as a cloned resource, instances will be started by Pacemaker on each individual node.

The NIDs in host_list should be for machines that are known to be reliable and are normally expected to be online (usually servers). It is also highly recommended that at least two NIDs are listed, each one representing a different machine. If only one NID is listed and the ping command fails, the agent cannot determine if the fault is local to the originating machine (where the ping script is running) or if the remote machine is at fault. Adding more NIDs to host_list increases the diagnostic accuracy of the agent: if the ping fails to connect to any of the NIDs, there is a high probability that the problem is local to the originating machine, rather than one of the remote machines.

Where possible, choose NIDs for servers that are not members of the Pacemaker cluster. For example, OSS HA pairs could be set up to ping the MGS and MDS nodes, or nodes from other OSS HA pairs. Similarly, the MGS / MDS nodes could ping OSSs.

The healthLNET resource updates an attribute called pingd. When the ping resource fails to get a ping reply from any of the NIDs in host_list, the attribute has a value of 0 (zero). Pacemaker can be configured with a location constraint that forces resources to move to another cluster node when this happens.

The following rule will force a resource to move when the pingd attribute is zero, or undefined:

 pcs constraint location <resource name> \
   rule score=-INFINITY pingd lte 0 or not_defined pingd

This rule should be applied to every resource in the cluster.
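Because the rule has to be repeated for every resource, it can be convenient to generate the commands in a loop. The following is a sketch under stated assumptions: the helper name add_health_rule and the PCS variable are hypothetical, and the resource names are taken from the examples in this article. The attribute name is passed as a parameter so that the same helper can be reused for other health attributes. By default the commands are printed rather than executed.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set PCS=pcs in the environment to apply them to a real cluster.
PCS=${PCS:-echo pcs}

# Add a "move away when unhealthy" location rule to each resource.
#   $1      name of the health attribute (for example pingd)
#   $2...   names of the resources to constrain
add_health_rule() {
    attribute=$1
    shift
    for resource in "$@"; do
        $PCS constraint location "$resource" \
            rule score=-INFINITY "$attribute" lte 0 or not_defined "$attribute"
    done
}

add_health_rule pingd zfs-demo-ost0pool demo-OST0000
```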

Example LNET Health Check Resource
 pcs resource create healthLNET ocf:lustre:healthLNET \
   lctl=true \
   multiplier=1001 \
   device=ib0 \
   host_list="10.10.227.11@o2ib0 10.10.227.12@o2ib0" \
   --clone

 # Add monitoring constraint to ZFS Pool
 pcs constraint location zfs-demo-ost0pool \
   rule score=-INFINITY pingd lte 0 or not_defined pingd

 # Add monitoring constraint to Lustre OSD
 pcs constraint location demo-OST0000 \
   rule score=-INFINITY pingd lte 0 or not_defined pingd

Creating a Lustre Health Monitoring Resource
ocf:lustre:healthLUSTRE follows the same implementation model as ocf:lustre:healthLNET, except that instead of monitoring LNet NIDs, healthLUSTRE monitors the content of Lustre's health_check file and maintains an attribute called lustred.

If the health_check file contains the status healthy, then the resource agent sets the value of lustred to 1 (one). Otherwise, the attribute is set to 0 (zero).

The syntax used to create the healthLUSTRE resource is:

 pcs resource create healthLUSTRE ocf:lustre:healthLUSTRE --clone
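The decision described above amounts to a simple test on the content of the file. The sketch below is an illustration only and is not the resource agent's own code: the function name lustre_health_value is hypothetical, and the file path is passed in as an argument so that the example can run against a temporary file instead of a live Lustre server.

```shell
#!/bin/sh
# Print 1 if the given file contains exactly "healthy", otherwise print 0.
lustre_health_value() {
    file=$1
    if [ -r "$file" ] && [ "$(cat "$file")" = "healthy" ]; then
        echo 1
    else
        echo 0
    fi
}

# Demonstration using a temporary file as a stand-in for health_check.
tmp=$(mktemp)
echo "healthy" > "$tmp"
lustre_health_value "$tmp"      # healthy status
echo "NOT HEALTHY" > "$tmp"     # stand-in for any other content
lustre_health_value "$tmp"
rm -f "$tmp"
```

The demonstration prints 1 for the healthy status and 0 for any other content, mirroring the two values the attribute can take.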


 * --clone tells Pacemaker to start an instance of the resource on every node in the cluster.

Only one healthLUSTRE resource agent needs to be defined per Pacemaker cluster. Because it is defined as a cloned resource, instances will be started by Pacemaker on each individual node.

Pacemaker can be configured with a location constraint that forces resources to move to another cluster node when the Lustre health check fails.

The following rule will force a resource to move when the lustred attribute is zero, or undefined:

 pcs constraint location <resource name> \
   rule score=-INFINITY lustred lte 0 or not_defined lustred

This rule should be applied to every resource in the cluster.

Example Lustre Health Check Resource
 pcs resource create healthLUSTRE ocf:lustre:healthLUSTRE --clone

 # Add monitoring constraint to ZFS Pool
 pcs constraint location zfs-demo-ost0pool \
   rule score=-INFINITY lustred lte 0 or not_defined lustred

 # Add monitoring constraint to Lustre OSD
 pcs constraint location demo-OST0000 \
   rule score=-INFINITY lustred lte 0 or not_defined lustred