Create a Virtual HPC Storage Cluster with Vagrant

Introduction
Vagrant is a powerful platform for programmatically creating and managing virtual machines. It is easy to install and supports multiple virtual machine platforms, including VirtualBox, Hyper-V and VMware. Additional VM providers can be added by way of plugins. Vagrant is also largely platform-neutral: it can run on Windows, macOS and Linux.

More information about Vagrant is available here:

https://www.vagrantup.com/

This article describes how to use Vagrant to quickly establish a virtual HPC cluster on a single host, suitable for use as a testbed for Lustre or as a training environment. It is intended as a platform for learning how Lustre works, evaluating features, developing software, and testing processes and patches.

Prerequisites
The Vagrant project has comprehensive documentation covering installation requirements, available here:

https://www.vagrantup.com/docs/

The platform used in this document is based on Fedora 25 and Oracle's VirtualBox. The target machine needs enough storage capacity to accommodate the virtual machines that will be deployed, which may be several gigabytes each, as well as RAM to allow for multiple VMs to run concurrently.

The default cluster configuration allocates 2GB RAM to the admin node, and 1GB RAM to each of the metadata servers, object storage servers and clients. The basic cluster has 1 admin node, 2 metadata servers, 2 object storage servers and 2 clients, and so requires 8GB RAM available on the host to run well. Naturally, if more VMs are launched, more resources will be consumed.
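The arithmetic behind that figure can be sketched as a small shell function. The function name `cluster_ram_gb` is hypothetical, and the per-node allocations are the defaults quoted above rather than values read from the Vagrantfile.

```shell
# Estimate the host RAM (in GB) needed by the virtual cluster.
# Assumes the default allocations described above: 2GB for the single
# admin node and 1GB for every other node.
# cluster_ram_gb is a hypothetical helper, not part of Vagrant.
# Arguments: <metadata servers> <object storage servers> <clients>
cluster_ram_gb() {
    echo $(( 2 + $1 + $2 + $3 ))
}

# Default cluster: 2 metadata servers, 2 object storage servers, 2 clients
cluster_ram_gb 2 2 2    # prints 8
```

By the same arithmetic, a cluster with 2 metadata servers, 4 object storage servers and 8 clients would need 16GB.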

Memory is probably the single most important resource to optimise. Systems with 16GB or more are recommended. While having many CPU cores helps, one can be surprisingly frugal. For example, an Intel i5 dual-core NUC with 32GB RAM is capable of supporting 12 VMs concurrently, and is perfectly suitable for test purposes.

Installing Vagrant
Vagrant is available for download from the project's site:

https://www.vagrantup.com/downloads.html

The CentOS download on the page will also work for Red Hat Enterprise Linux and Fedora OS distributions.

Note: There are versions of Vagrant distributed in the package repositories for several OS distributions. The Vagrant project discourages their use in favour of their canonical release, but it may be easier to use the distribution version.

Install the package as the super-user. For Fedora (on RHEL 7 or CentOS 7, where dnf is not installed by default, use yum instead):

sudo dnf install https://releases.hashicorp.com/vagrant/1.9.4/vagrant_1.9.4_x86_64.rpm

Note: there is a known defect in the official distribution provided by the Vagrant project that can cause the command line tool to crash whenever it is run. The issue affects multiple platforms and is documented here:

https://github.com/mitchellh/vagrant/issues/8519

The workaround is to run the following command after the installation is complete:

vagrant plugin install vagrant-share --plugin-version 1.1.8

This will install an updated version of the vagrant-share gem into the user's local configuration, overriding the system version installed by the Vagrant RPM.

Once installed, Vagrant does not require super-user privileges to run.

Installing VirtualBox
VirtualBox is a virtualization platform available as a free download for multiple operating systems here:

https://www.virtualbox.org/

Instructions for installing VirtualBox are available here:

https://www.virtualbox.org/wiki/Linux_Downloads

For Fedora users, run the following commands to create a definition for the VirtualBox repository and install the software:

sudo dnf config-manager --add-repo \
  http://download.virtualbox.org/virtualbox/rpm/fedora/virtualbox.repo
sudo dnf --disablerepo='rpmfusion*' install VirtualBox-5.1

Note: The RPM Fusion repository carries an unofficial build of VirtualBox. If RPM Fusion repositories have been configured on the host, disable them when installing or updating VirtualBox, as shown in the command above.

Note: The version number is part of the package name, so it must be included.
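Before continuing, it can be useful to confirm that both tools are on the PATH. The following is a minimal sketch; `have_tool` is a hypothetical helper name, and `VBoxManage` is the command line utility installed with VirtualBox.

```shell
# Report whether a command is available on the PATH.
# have_tool is a hypothetical helper name.
have_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found" >&2
        return 1
    fi
}

# Check both tools needed by the rest of this article.
for tool in vagrant VBoxManage; do
    have_tool "$tool" || true
done
```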

Creating a Vagrant Project
Download a virtual machine template from Vagrant's online catalogue. These very basic VM images are called boxes, and are used as the base upon which each VM is created. The full catalogue can be found here:

https://atlas.hashicorp.com/boxes/search

It is also possible to create boxes from scratch but this will not be covered in this document.

Since most Lustre development targets RHEL or CentOS, download the CentOS 7 box:

vagrant box add centos/7

When prompted, select the virtualbox provider.

For example:

[malcolm@mini ~]$ vagrant box add centos/7
==> box: Loading metadata for box 'centos/7'
    box: URL: https://atlas.hashicorp.com/centos/7
This box can work with multiple providers! The providers that it
can work with are listed below. Please review the list and choose
the provider you will be working with.

1) libvirt
2) virtualbox
3) vmware_desktop

Enter your choice: 2
==> box: Adding box 'centos/7' (v1704.01) for provider: virtualbox
    box: Downloading: https://atlas.hashicorp.com/centos/boxes/7/versions/1704.01/providers/virtualbox.box
==> box: Successfully added box 'centos/7' (v1704.01) for 'virtualbox'!

Boxes are only downloaded once, irrespective of how many virtual machines will be created. Each VM instance creates a clone of the original box as its root disk.

Create a directory to contain the Vagrant project:

mkdir <project directory>

For example:

mkdir -p $HOME/vagrant-projects/demo

The project directory is used to store all of the configuration files for the project, and the vagrant command looks for configuration information relative to the directory from which it is run.

To test the environment, create a very simple configuration file:

cd $HOME/vagrant-projects/demo
cat > Vagrantfile <<\__EOF
Vagrant.configure("2") do |config|
  config.vm.box = "centos/7"
end
__EOF

Start the VM:

vagrant up

The virtual machine will be initialised and will boot automatically. For example:

[malcolm@mini demo]$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Importing base box 'centos/7'...
==> default: Matching MAC address for NAT networking...
==> default: Checking if box 'centos/7' is up to date...
==> default: Setting the name of the VM: demo_default_1495012812720_94484
==> default: Fixed port collision for 22 => 2222. Now on port 2206.
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
==> default: Forwarding ports...
    default: 22 (guest) => 2206 (host) (adapter 1)
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2206
    default: SSH username: vagrant
    default: SSH auth method: private key
    default:
    default: Vagrant insecure key detected. Vagrant will automatically replace
    default: this with a newly generated keypair for better security.
    default:
    default: Inserting generated public key within guest...
    default: Removing insecure key from the guest if it's present...
    default: Key inserted! Disconnecting and reconnecting using new SSH key...
==> default: Machine booted and ready!
==> default: Checking for guest additions in VM...
    default: No guest additions were detected on the base box for this VM! Guest
    default: additions are required for forwarded ports, shared folders, host only
    default: networking, and more. If SSH fails on this machine, please install
    default: the guest additions and repackage the box to continue.
    default:
    default: This is not an error message; everything may continue to work properly,
    default: in which case you may ignore this message.
==> default: Rsyncing folder: /home/malcolm/vagrant-projects/demo/ => /vagrant

The messages in the output are informational and do not represent an error in the configuration. In particular, the official CentOS boxes are not distributed with the VirtualBox guest additions, which need to be added separately if required.

When the command completes, log in to the VM:

vagrant ssh

The VM is up and running, and has a basic OS configured. The VM guest has a single Ethernet interface which has a NAT connection via the host machine.

Note: The vagrant ssh command does not need a hostname argument. If there is only one VM in the project, Vagrant will automatically connect to it. If there are multiple VMs, one can be nominated as the default, and will be used for SSH connections when no VM name is specified.
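One way to nominate a default VM is the `primary: true` option to `config.vm.define` in the Vagrantfile. The sketch below writes a two-VM configuration in the same heredoc style used earlier; the helper name `write_demo_vagrantfile` and the machine names `adm` and `c1` are illustrative assumptions, and the file is not the cluster Vagrantfile used later in this article.

```shell
# Write a minimal two-VM Vagrantfile into the given directory.
# write_demo_vagrantfile and the machine names are hypothetical.
write_demo_vagrantfile() {
    cat > "$1/Vagrantfile" <<\__EOF
Vagrant.configure("2") do |config|
  config.vm.box = "centos/7"

  # "primary: true" makes this the VM used when no name is given
  config.vm.define "adm", primary: true do |adm|
    adm.vm.hostname = "adm"
  end

  config.vm.define "c1" do |c1|
    c1.vm.hostname = "c1"
  end
end
__EOF
}

# Usage: write_demo_vagrantfile "$HOME/vagrant-projects/demo"
```

With such a file in place, `vagrant ssh` would connect to adm, while `vagrant ssh c1` would connect to the second VM.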

Close the SSH connection to exit from the VM. When you are finished with the VM, it can be deleted as follows:

vagrant destroy

Vagrant Command Summary
To get a complete list of available commands:

vagrant list-commands

For example:

[malcolm@mini demo]$ vagrant list-commands
Below is a listing of all available Vagrant commands and a brief description of what they do.

box             manages boxes: installation, removal, etc.
cap             checks and executes capability
connect         connect to a remotely shared Vagrant environment
destroy         stops and deletes all traces of the vagrant machine
docker-exec     attach to an already-running docker container
docker-logs     outputs the logs from the Docker container
docker-run      run a one-off command in the context of a container
global-status   outputs status Vagrant environments for this user
halt            stops the vagrant machine
help            shows the help for a subcommand
init            initializes a new Vagrant environment by creating a Vagrantfile
list-commands   outputs all available Vagrant subcommands, even non-primary ones
login           log in to HashiCorp's Atlas
package         packages a running vagrant environment into a box
plugin          manages plugins: install, uninstall, update, etc.
port            displays information about guest port mappings
powershell      connects to machine via powershell remoting
provider        show provider for this environment
provision       provisions the vagrant machine
push            deploys code in this environment to a configured destination
rdp             connects to machine via RDP
reload          restarts vagrant machine, loads new Vagrantfile configuration
resume          resume a suspended vagrant machine
rsync           syncs rsync synced folders to remote machine
rsync-auto      syncs rsync synced folders automatically when files change
share           share your Vagrant environment with anyone in the world
snapshot        manages snapshots: saving, restoring, etc.
ssh             connects to machine via SSH
ssh-config      outputs OpenSSH valid configuration to connect to the machine
status          outputs status of the vagrant machine
suspend         suspends the machine
up              starts and provisions the vagrant environment
validate        validates the Vagrantfile
version         prints current and latest Vagrant version

To start a VM:

vagrant up [<vm name>]

To connect to a VM:

vagrant ssh [<vm name>]

To reboot a VM:

vagrant reload [<vm name>]

Note: Don't reboot a VM from within the guest shell. Always use the vagrant reload command from the host. Rebooting from within the guest invalidates the SSH configuration and will prevent a connection being made on the next boot.

To get the current status of all VMs in a project:

vagrant status

To suspend the VM:

vagrant suspend

To restore a suspended VM:

vagrant resume

To review the SSH configuration that was used to establish a connection, run this command from the project directory:

vagrant ssh-config [<vm name>]
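Because the output of vagrant ssh-config is in OpenSSH configuration format, one possible use is to save it to a file and pass that file to the standard OpenSSH tools. A minimal sketch follows; the helper name `save_ssh_config` and the default output file name are assumptions.

```shell
# Save the SSH settings for one VM so that plain ssh and scp can use them.
# save_ssh_config is a hypothetical helper; run it from the project directory.
# Arguments: <vm name> [<output file>]
save_ssh_config() {
    vm=$1
    out=${2:-ssh_config.$vm}
    vagrant ssh-config "$vm" > "$out" && echo "$out"
}

# Usage, assuming a running VM named "default":
#   cfg=$(save_ssh_config default)
#   ssh -F "$cfg" default
```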

To tear down and delete the VM:

vagrant destroy [-f]

Refer to the Vagrant documentation for more information:

https://www.vagrantup.com/docs

Create the Virtual HPC Cluster Project
Create a new project directory:

mkdir $HOME/vagrant-projects/vhpc

Download the following vagrant configuration:

Copy the file into the project directory and uncompress it:

gunzip $HOME/vagrant-projects/vhpc/Vagrantfile.gz

This Vagrantfile contains the information needed to create a virtual HPC storage cluster comprising:


 * 1 admin server
   * Used to host administration and monitoring software
   * The admin server also has a passphraseless SSH key for access to the other nodes in the virtual cluster
 * 2 metadata servers with 2 shared storage volumes, one for the MGT, one for the MDT
   * MGT is 512MB
   * MDT is 5GB
 * 2 or 4 object storage servers with 8 shared storage volumes per pair for OSTs
   * 2 OSS created by default
   * Each volume is 5GB. The relatively large number of volumes is useful for creating ZFS pools
 * Up to 8 compute nodes / Lustre clients
   * 2 nodes created by default

Each node has a NAT Ethernet device on eth0 that is used for communication via the host.

There are two additional networks created across the cluster, one intended to simulate a management network used for system monitoring and maintenance, and one to simulate the data network for Lustre and other application-centric traffic. The networks have identical characteristics, differing only by name, and are private to the cluster nodes.

The MDS, OSS and compute nodes are each connected to both the management network and the application data network. The admin server is connected to the management network only.

In addition, each pair of server nodes (MDS, OSS) share a private interconnect intended to simulate an additional dedicated communication ring for use with the Corosync HA software.

The networks are assigned to the node interfaces as follows:


 * eth0: NAT network to the hypervisor host
   * Present on all nodes
 * eth1: Management network
   * Present on all nodes
 * eth2: Application data network
   * Present on MDS, OSS and compute nodes
 * eth3: "Cross-over" network (corosync ring1)
   * Present on MDS and OSS nodes
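The interface assignments listed above can be captured in a small lookup function, which may help when checking a running node. This is a sketch only: `expected_ifaces` is a hypothetical helper, and it relies on the node naming pattern (adm, mdsN, ossN, cN) given later in this article.

```shell
# Print the interfaces a node should have, based on the list above.
# expected_ifaces is a hypothetical helper; node names follow the
# adm, mdsN, ossN, cN pattern used by this project.
expected_ifaces() {
    case $1 in
        adm)        echo "eth0 eth1" ;;
        mds*|oss*)  echo "eth0 eth1 eth2 eth3" ;;
        c[0-9]*)    echo "eth0 eth1 eth2" ;;
        *)          echo "unknown node: $1" >&2; return 1 ;;
    esac
}

# Usage on a running cluster (not executed here):
#   for i in $(expected_ifaces mds1); do
#       vagrant ssh mds1 -c "ip -o link show $i"
#   done
```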

Starting and Stopping the Cluster
To create the cluster, run:

vagrant up

To stop the whole cluster and destroy the VMs:

vagrant destroy [-f]

Starting and Stopping Individual Nodes
To start an individual node or a set of nodes:

vagrant up [<vm name> ...]

Note: due to a bug in the Vagrantfile that the author has not yet resolved, Lustre server nodes need to be brought up in pairs. If they are started individually, the second node will fail to start.
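One way to avoid starting a server node on its own is to expand each server name to its pair before calling vagrant up. The sketch below assumes that pairs are numbered consecutively (mds1 with mds2, oss1 with oss2, oss3 with oss4), which is an assumption about the Vagrantfile rather than something stated in it; `pair_of` is a hypothetical helper.

```shell
# Expand a server node name to the names of both nodes in its pair.
# Assumes pairs are numbered consecutively (1 with 2, 3 with 4).
# Names that are not mdsN or ossN are printed unchanged.
# pair_of is a hypothetical helper.
pair_of() {
    prefix=${1%%[0-9]*}
    n=${1#"$prefix"}
    case "$prefix:$n" in
        mds:[0-9]*|oss:[0-9]*) ;;
        *) echo "$1"; return 0 ;;
    esac
    if [ $(( $n % 2 )) -eq 1 ]; then
        echo "$prefix$n $prefix$(( $n + 1 ))"
    else
        echo "$prefix$(( $n - 1 )) $prefix$n"
    fi
}

# Usage (not executed here):
#   vagrant up $(pair_of oss3)
```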

To stop and destroy individual nodes:

vagrant destroy [<vm name> ...]

Connecting to the VMs
The admin node is the default cluster node. Connect to it via ssh as follows:

vagrant ssh

In addition, port 8443 on the hypervisor host is mapped via port forwarding to port 443 on the admin VM guest, so that web services running on the admin server can be accessed from outside the VM. Use the hypervisor host network name or IP address and connect to port 8443, and this will redirect a browser or other TCP client to port 443 on the admin VM.

To get a list of forwarded ports, run:

vagrant port

For example:

[malcolm@mini vhpc]$ vagrant port adm
The forwarded ports for the machine are listed below. Please note that
these values may differ from values configured in the Vagrantfile if the
provider supports automatic port collision detection and resolution.

    22 (guest) => 2222 (host)
   443 (guest) => 8443 (host)
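Since Vagrant may pick a different host port when it detects a collision, a script can read the actual mapping from this output instead of hard-coding 8443. A minimal sketch, assuming the output format shown above; `host_port_for` is a hypothetical helper.

```shell
# Read `vagrant port` output on stdin and print the host port that is
# forwarded to the given guest port.
# host_port_for is a hypothetical helper.
host_port_for() {
    awk -v guest="$1" '$1 == guest && $2 == "(guest)" { print $4 }'
}

# Usage on a running cluster (not executed here), assuming a web
# service is listening on port 443 of the admin node:
#   port=$(vagrant port adm | host_port_for 443)
#   curl --insecure "https://localhost:$port/"
```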

One can also connect to other cluster nodes by specifying the vagrant node name, e.g.:

vagrant ssh mds1

The vagrant node names are:


 * adm
 * mds[1-2]
 * oss[1-4]
 * c[1-8]

Note: Users are logged into an account called vagrant when connecting via SSH. This account has sudo privileges for super-user access. While authentication is based on secure keys, the environment is not secure by default, since the keys do not have passphrases.