Create a Virtual HPC Storage Cluster with Vagrant


Introduction

Vagrant is a powerful platform for programmatically creating and managing virtual machines. It is easy to install and is capable of supporting multiple virtual machine platforms, including VirtualBox, Hyper-V and VMware. Additional VM providers can be added by way of plugins. Vagrant is also largely platform-neutral: it can run on Windows, macOS and Linux.

More information about Vagrant is available here:

https://www.vagrantup.com/

This article describes how to use Vagrant to quickly establish a virtual HPC cluster on a single host, suitable for use as a testbed for Lustre or as a training environment. It is intended as a platform for learning how Lustre works, evaluating features, developing software, and testing processes and patches.

Pre-requisites

The Vagrant project has comprehensive documentation covering installation requirements, available here:

https://www.vagrantup.com/docs/

The platform used in this document is based on Fedora 25 and Oracle's VirtualBox. The target machine needs enough storage capacity to accommodate the virtual machines that will be deployed, which may be several gigabytes each, as well as RAM to allow for multiple VMs to run concurrently.

The default cluster configuration allocates 2GB RAM to the admin node, and 1GB RAM to each of the metadata servers, object storage servers and clients. The basic cluster has 1 admin node, 2 metadata servers, 2 object storage servers and 2 clients, and so requires 8GB RAM available on the host to run well. Naturally, if more VMs are launched, more resources will be consumed.

Memory is probably the single most important resource to optimise, and systems with 16GB or more are recommended. Plenty of CPU cores certainly helps, but one can be surprisingly frugal: for example, an Intel i5 dual-core NUC with 32GB RAM is capable of supporting 12 VMs concurrently and is perfectly adequate for test purposes.

Installing Vagrant

Vagrant is available for download from the project's site:

https://www.vagrantup.com/downloads.html

The CentOS download on the page will also work for Red Hat Enterprise Linux and Fedora OS distributions.

Note: Several OS distributions carry Vagrant in their package repositories. The Vagrant project discourages their use in favour of its own canonical releases, but the distribution version may be easier to install.

Install the package. For Fedora, RHEL 7, or CentOS 7:

dnf install https://releases.hashicorp.com/vagrant/1.9.4/vagrant_1.9.4_x86_64.rpm

Note: there is a known defect in the official distribution provided by the Vagrant project that can cause the command line tool to crash whenever it is run. The issue affects multiple platforms and is documented here:

https://github.com/mitchellh/vagrant/issues/8519

The workaround is to run the following command after the installation is complete:

vagrant plugin install vagrant-share --plugin-version 1.1.8

This will install an updated version of the vagrant-share gem into the user's local configuration, overriding the system version installed by the Vagrant RPM.

Once installed, Vagrant does not require super-user privileges to run.
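
A quick way to confirm that the installation (and the vagrant-share workaround) is working is to print the installed version as an ordinary user:

vagrant --version

If the command returns a version string, for example Vagrant 1.9.4, without crashing, the tool is ready to use.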

Installing VirtualBox

VirtualBox is a virtualization platform available as a free download for multiple operating systems here:

https://www.virtualbox.org/

Instructions for installing VirtualBox are available here:

https://www.virtualbox.org/wiki/Linux_Downloads

For Fedora users, run the following commands to create a definition for the VirtualBox repository and install the software:

sudo dnf config-manager --add-repo \
  http://download.virtualbox.org/virtualbox/rpm/fedora/virtualbox.repo
sudo dnf --disablerepo=rpmfusion*  install VirtualBox-5.1

Note: The RPM Fusion repository carries an unofficial build of VirtualBox, so if RPM Fusion repositories have been configured on the host, disable them before running an install or update.

Note: The version number is part of the package name, so must be included.
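
A simple way to confirm that VirtualBox is installed correctly is to query its version from the command line:

VBoxManage --version

If the VirtualBox kernel modules have not yet been built for the running kernel, the command may print a warning; refer to the VirtualBox documentation to resolve it.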

Creating a Vagrant Project

Download a virtual machine template from Vagrant's online catalogue. These are very basic VM images, called boxes, that will be used as the base from which each VM is created. The full catalogue can be found here:

https://atlas.hashicorp.com/boxes/search

It is also possible to create boxes from scratch but this will not be covered in this document.

Since most Lustre development targets RHEL or CentOS, download the CentOS 7 box:

vagrant box add centos/7

When prompted, select the virtualbox provider.

For example:

[malcolm@mini ~]$ vagrant box add centos/7
==> box: Loading metadata for box 'centos/7'
    box: URL: https://atlas.hashicorp.com/centos/7
This box can work with multiple providers! The providers that it
can work with are listed below. Please review the list and choose
the provider you will be working with.

1) libvirt
2) virtualbox
3) vmware_desktop

Enter your choice: 2
==> box: Adding box 'centos/7' (v1704.01) for provider: virtualbox
    box: Downloading: https://atlas.hashicorp.com/centos/boxes/7/versions/1704.01/providers/virtualbox.box
==> box: Successfully added box 'centos/7' (v1704.01) for 'virtualbox'!

Boxes are only downloaded once, irrespective of how many virtual machines will be created. Each VM instance is created as a clone of the original box, which serves as its root disk.
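
To review the boxes that have already been downloaded to the host, along with the provider and version of each, run:

vagrant box list

After adding the CentOS 7 box above, the output will include an entry similar to centos/7 (virtualbox, 1704.01).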

Create a directory within which to contain a vagrant project:

mkdir <project dir>

For example:

mkdir -p $HOME/vagrant-projects/demo

The project directory will be used to store all of the configuration files for the project, and the vagrant command will look for configuration information relative to the directory from which it is run.

To test the environment, create a very simple configuration file:

cd $HOME/vagrant-projects/demo
cat > Vagrantfile <<\__EOF
Vagrant.configure("2") do |config|
  config.vm.box = "centos/7"
end
__EOF
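
Optionally, check the syntax of the Vagrantfile before booting anything; the validate subcommand parses the configuration and reports any errors:

vagrant validate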

Start the VM:

vagrant up

The virtual machine will be initialised and will automatically boot. For example:

[malcolm@mini demo]$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Importing base box 'centos/7'...
==> default: Matching MAC address for NAT networking...
==> default: Checking if box 'centos/7' is up to date...
==> default: Setting the name of the VM: demo_default_1495012812720_94484
==> default: Fixed port collision for 22 => 2222. Now on port 2206.
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
==> default: Forwarding ports...
    default: 22 (guest) => 2206 (host) (adapter 1)
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2206
    default: SSH username: vagrant
    default: SSH auth method: private key
    default: 
    default: Vagrant insecure key detected. Vagrant will automatically replace
    default: this with a newly generated keypair for better security.
    default: 
    default: Inserting generated public key within guest...
    default: Removing insecure key from the guest if it's present...
    default: Key inserted! Disconnecting and reconnecting using new SSH key...
==> default: Machine booted and ready!
==> default: Checking for guest additions in VM...
    default: No guest additions were detected on the base box for this VM! Guest
    default: additions are required for forwarded ports, shared folders, host only
    default: networking, and more. If SSH fails on this machine, please install
    default: the guest additions and repackage the box to continue.
    default: 
    default: This is not an error message; everything may continue to work properly,
    default: in which case you may ignore this message.
==> default: Rsyncing folder: /home/malcolm/vagrant-projects/demo/ => /vagrant

The messages in the output are informational and do not represent an error in the configuration. In particular, the official CentOS boxes are not distributed with the VirtualBox guest additions, which need to be added separately if required.
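
If the guest additions are required, for example to use VirtualBox shared folders instead of rsync, one option is the third-party vagrant-vbguest plugin, which attempts to build and install the guest additions inside each guest as it boots. Note that the build requires kernel headers and a compiler to be present in the guest:

vagrant plugin install vagrant-vbguest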

When the command completes, login to the VM:

vagrant ssh

The VM is now up and running with a basic OS configured. The guest has a single Ethernet interface with a NAT connection through the host machine.

Note: The vagrant ssh command does not need a hostname argument. If there is only one VM in the project, Vagrant will automatically connect to it. If there are multiple VMs, one can be nominated as the default and will be used for SSH connections when no VM name is specified.
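
In a multi-machine Vagrantfile, the default machine is nominated by marking one of the machine definitions as primary. As a minimal sketch (the machine name adm is taken from the cluster project described later; the downloaded Vagrantfile may already nominate a default):

Vagrant.configure("2") do |config|
  config.vm.define "adm", primary: true do |adm|
    adm.vm.box = "centos/7"
  end
end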

Close the SSH connection to exit from the VM. When you are finished with the VM, it can be deleted as follows:

vagrant destroy

Vagrant Command Summary

To get a complete list of available commands:

vagrant list-commands

For example:

[malcolm@mini demo]$ vagrant list-commands
Below is a listing of all available Vagrant commands and a brief
description of what they do.

box             manages boxes: installation, removal, etc.
cap             checks and executes capability
connect         connect to a remotely shared Vagrant environment
destroy         stops and deletes all traces of the vagrant machine
docker-exec     attach to an already-running docker container
docker-logs     outputs the logs from the Docker container
docker-run      run a one-off command in the context of a container
global-status   outputs status Vagrant environments for this user
halt            stops the vagrant machine
help            shows the help for a subcommand
init            initializes a new Vagrant environment by creating a Vagrantfile
list-commands   outputs all available Vagrant subcommands, even non-primary ones
login           log in to HashiCorp's Atlas
package         packages a running vagrant environment into a box
plugin          manages plugins: install, uninstall, update, etc.
port            displays information about guest port mappings
powershell      connects to machine via powershell remoting
provider        show provider for this environment
provision       provisions the vagrant machine
push            deploys code in this environment to a configured destination
rdp             connects to machine via RDP
reload          restarts vagrant machine, loads new Vagrantfile configuration
resume          resume a suspended vagrant machine
rsync           syncs rsync synced folders to remote machine
rsync-auto      syncs rsync synced folders automatically when files change
share           share your Vagrant environment with anyone in the world
snapshot        manages snapshots: saving, restoring, etc.
ssh             connects to machine via SSH
ssh-config      outputs OpenSSH valid configuration to connect to the machine
status          outputs status of the vagrant machine
suspend         suspends the machine
up              starts and provisions the vagrant environment
validate        validates the Vagrantfile
version         prints current and latest Vagrant version

To start a VM:

vagrant up [<vm name>]

To connect to a VM:

vagrant ssh [<vm name>]

To reboot a VM:

vagrant reload [<vm name>]

Note: Don't reboot a VM from within the guest shell; always use the vagrant reload command from the host. Rebooting from within the guest invalidates the SSH configuration and will prevent a connection being made on the next boot.

To get the current status of all VMs in a project:

vagrant status

To suspend the VM:

vagrant suspend

To restore a suspended VM:

vagrant resume

To review the SSH configuration that was used to establish a connection, run this command from the project directory:

vagrant ssh-config [<vm name>]
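
The output is standard OpenSSH client configuration, so it can be saved to a file and used with the ordinary ssh or scp commands; the host name in the generated configuration is the Vagrant machine name (default for a single-VM project). For example:

vagrant ssh-config > ssh-config.local
ssh -F ssh-config.local default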

To tear down and delete the VM:

vagrant destroy [-f]

Refer to the Vagrant documentation for more information:

https://www.vagrantup.com/docs

Create the Virtual HPC Cluster Project

Create a new project directory:

mkdir $HOME/vagrant-projects/vhpc

Download the following vagrant configuration: File:Vagrantfile.gz

Copy the file into the project directory and uncompress it:

gunzip $HOME/vagrant-projects/vhpc/Vagrantfile.gz

This Vagrantfile contains the information needed to create a virtual HPC storage cluster comprising:

  1. 1 Admin server
    1. Used to host administration and monitoring software
  2. 2 metadata servers with 2 shared storage volumes, one for the MGT and one for the MDT
    1. MGT is 512MB
    2. MDT is 5GB
  3. 2 or 4 object storage servers with 8 shared storage volumes per pair for OSTs
    1. 2 OSS created by default
    2. Each volume is 5GB. The relatively large number of volumes is useful for creating ZFS pools
  4. Up to 8 compute nodes / Lustre clients
    1. 2 nodes created by default

Each node has a NAT Ethernet device on eth0 that is used for communication via the host.

There are two additional networks created across the cluster, one intended to simulate a management network used for system monitoring and maintenance, and one to simulate the data network for Lustre and other application-centric traffic. The networks have identical characteristics, differing only by name, and are private to the cluster nodes.

The MDS, OSS and compute nodes are each connected to both the management network and the application data network. The admin server is connected to the management network only.

In addition, each pair of server nodes (MDS, OSS) share a private interconnect intended to simulate an additional dedicated communication ring for use with the Corosync HA software.

The networks are assigned to the node interfaces as follows:

  • eth0: NAT network to the hypervisor host
    • Present on all nodes
  • eth1: Management network
    • Present on all nodes
  • eth2: Application data network
    • Present on MDS, OSS and compute nodes
  • eth3: "Cross-over" network (corosync ring1)
    • Present on MDS and OSS nodes
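
As a rough sketch of how private networks of this kind are typically declared in a Vagrantfile (the node name and IP addresses below are illustrative only, and are not necessarily those used by the downloaded Vagrantfile):

Vagrant.configure("2") do |config|
  config.vm.define "mds1" do |mds1|
    mds1.vm.box = "centos/7"
    # eth1: management network (illustrative address)
    mds1.vm.network "private_network", ip: "192.168.10.11"
    # eth2: application data network (illustrative address)
    mds1.vm.network "private_network", ip: "192.168.20.11"
    # eth3: cross-over network for corosync ring1 (illustrative address)
    mds1.vm.network "private_network", ip: "192.168.30.11"
  end
end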

Starting and Stopping the Cluster

To create the cluster, run:

vagrant up

To stop the whole cluster and destroy the VMs:

vagrant destroy [-f]

Starting and Stopping Individual Nodes

To start an individual node or a set of nodes:

vagrant up <node name> [<node name> ...]

Note: due to a bug in the Vagrantfile that the author has not yet resolved, Lustre server nodes need to be brought up in pairs. If they are started individually, the second node will fail to start.
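
For example, to bring up the first pair of object storage servers in this project:

vagrant up oss1 oss2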

To stop and destroy individual nodes:

vagrant destroy <node name>

Connecting to the VMs

The admin node is the default cluster node. Connect to it via ssh as follows:

vagrant ssh

One can also connect to other cluster nodes by specifying the vagrant node name, e.g.:

vagrant ssh mds1

The vagrant node names are:

  • adm
  • mds[1-2]
  • oss[1-4]
  • c[1-8]
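
Because each node is a separate Vagrant machine, a simple shell loop on the host is a convenient way to run the same command on several nodes via the -c option of vagrant ssh. The node list below assumes the default configuration of 2 OSS nodes and 2 clients:

for node in adm mds1 mds2 oss1 oss2 c1 c2; do
  vagrant ssh $node -c "hostname"
done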

Note: Users are logged into an account called vagrant when connecting via SSH. This account has sudo privileges for super-user access. While authentication is based on secure keys, the environment is not secure by default, since the keys do not have passphrases.