Testing Setup To Induce Race Conditions

Rationale
Developer-scale testing typically ensures only basic correctness of the code, but as complex products such as Lustre get deployed on really large-scale systems and subjected to really high loads, all sorts of unlikely race conditions and failure modes tend to crop up. This write-up explains how to make testing considerably harsher on regular hardware without involving super-scaled systems. The approach turned out to be a lot more powerful than originally anticipated: for example, early in its life, a rare race condition that took about a week to manifest itself on a top-10-class supercomputer took only about 15 minutes to hit in this setup on a single system.

Opening up race windows: the theory
Race conditions typically have a very small race window, sometimes only one CPU instruction long, so they are hard to hit. This is where virtual machines come to the rescue. While it's feasible to create a full CPU emulator with random delays between every instruction, that is much too labor-intensive. An alternative approach is to create several virtual machines with many CPU cores allocated, such that the total number of virtual cores across these VMs is much larger than the actual number of CPU cores available on the host. When all of these virtual machines run at the same time with CPU-heavy loads, the host kernel preempts them at random intervals, introducing big delays into what one virtual core perceives as a single instruction stream while the other cores in that VM continue at full speed. Additional CPU pressure can be exerted from outside the virtual machines by running additional CPU-heavy loads on the host machine.
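
As a rough illustration of the "additional CPU pressure from the host" idea, the sketch below spins one busy-loop process per host CPU for a fixed time. The function names (burn, pressure_host) and the duration are assumptions made for illustration and are not part of the original setup; any CPU-heavy job on the host would serve the same purpose.

```python
import multiprocessing as mp
import os
import time
from typing import List, Optional


def burn(seconds: float) -> int:
    """Busy-loop for roughly `seconds`; return the number of iterations done."""
    deadline = time.monotonic() + seconds
    spins = 0
    while time.monotonic() < deadline:
        spins += 1
    return spins


def pressure_host(seconds: float, workers: Optional[int] = None) -> List[int]:
    """Run `workers` busy loops in parallel (default: one per host CPU)."""
    workers = workers or os.cpu_count() or 1
    with mp.Pool(processes=workers) as pool:
        return pool.map(burn, [seconds] * workers)


if __name__ == "__main__":
    # Hypothetical duration; in practice this would run as long as the tests do.
    counts = pressure_host(seconds=2.0)
    print(f"{len(counts)} workers, {sum(counts)} total spins")
```

Run on the host while the VMs are busy, this competes with the virtual CPUs for real cores and so makes preemption of the guests more frequent.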

When creating these VMs, another important consideration is memory allocation. We don't really want the virtual machines to dip into swap, as that would make them really slow, so the total memory given to the VMs should fit comfortably within the host's RAM.

My particular setup
Initially I had two systems at my disposal: a desktop with a 4-core i7 with HT (showing 8 cores to the host) and 32G of RAM, and a laptop with a 4-core mobile i7 with HT (also showing 8 cores to the host) and 16G of RAM. I decided that 3G of RAM should be enough for the virtual machines in question, which gave me 7 VMs for the desktop (occupying 21G of RAM) and 4 VMs for the laptop (occupying 12G of RAM). Every VM also got a dedicated "virtual block device" backed by an SSD that it used as swap. Virtual CPU-wise, every VM was configured with 8 CPU cores. Here's the config of one such VM (libvirt on Fedora is used):

To conserve space and effort, all these VMs are network-booted and NFS-rooted off the same source NFS root. Red Hat-based distros make this really easy; I cannot find the exact HOWTO I followed, but it was roughly along the lines of Open Shared Root. The build happens inside a chrooted session in the NFS root, which allows me to then go into that same directory in every VM and run the tests directly out of the build tree, without any need for building RPMs and such. It's also very important to configure kernel crash dumping support.
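
To make the sizing of this setup concrete, here is a small sketch that works out the CPU oversubscription and the RAM committed on each host from the figures given in this section (8 host threads, 8 virtual CPUs and 3G per VM, 7 and 4 VMs). The Host class and its field names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    threads: int            # logical CPUs visible to the host (with HT)
    ram_gb: int
    vms: int                # number of VMs placed on this host
    vcpus_per_vm: int = 8
    ram_per_vm_gb: int = 3

    def vcpu_ratio(self) -> float:
        """Virtual CPUs per host logical CPU."""
        return self.vms * self.vcpus_per_vm / self.threads

    def ram_committed_gb(self) -> int:
        """Total RAM handed out to the VMs on this host."""
        return self.vms * self.ram_per_vm_gb


hosts = [
    Host("desktop", threads=8, ram_gb=32, vms=7),
    Host("laptop", threads=8, ram_gb=16, vms=4),
]

for h in hosts:
    print(f"{h.name}: {h.vms * h.vcpus_per_vm} vCPUs on {h.threads} threads "
          f"({h.vcpu_ratio():.1f}x), {h.ram_committed_gb()}G of {h.ram_gb}G RAM")
```

With these numbers the desktop runs 56 virtual CPUs on 8 host threads (7x oversubscribed) and the laptop 32 on 8 (4x), while the committed RAM (21G and 12G) stays below what each host physically has.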

Additional protections with kernel debug options
Since we are on the correctness path here, another natural thing to do is to enable all the heavy-handed kernel checks that are typically turned off for production deployments. This especially includes really expensive things like unmapping of freed memory, which allows very easy detection of use-after-free errors, but also spinlock checking and other useful checks. This is what the kernel debugging part of my kernel config looks like for the rhel6 kernel (it's even more extensive for the newer ones):
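
The following is not that config listing. It is a small hypothetical helper showing one way to verify that a kernel .config has such checks switched on. The two option names are assumed examples matching the checks named above (CONFIG_DEBUG_PAGEALLOC for unmapping freed pages, CONFIG_DEBUG_SPINLOCK for spinlock checking); a real list would be longer and depends on the kernel version.

```python
import re

# Assumed examples only: the two kinds of checks named in the text.
WANTED = ["CONFIG_DEBUG_PAGEALLOC", "CONFIG_DEBUG_SPINLOCK"]


def enabled_options(config_text: str) -> set:
    """Return the names of all options set to 'y' in kernel .config text."""
    return set(re.findall(r"^(CONFIG_\w+)=y$", config_text, flags=re.MULTILINE))


def missing_options(config_text: str, wanted=WANTED) -> list:
    """Return the wanted options that are not enabled, in the order given."""
    on = enabled_options(config_text)
    return [opt for opt in wanted if opt not in on]


if __name__ == "__main__":
    # Made-up sample text; normally this would be read from the build's .config.
    sample = (
        "CONFIG_DEBUG_KERNEL=y\n"
        "CONFIG_DEBUG_SPINLOCK=y\n"
        "# CONFIG_DEBUG_PAGEALLOC is not set\n"
    )
    print(missing_options(sample))
```

For the made-up sample, the helper reports CONFIG_DEBUG_PAGEALLOC as missing, since a "# ... is not set" line means the option is disabled.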

Running actual tests
Since the majority of outcomes of the kinds of bugs we are aiming at here are crashes, I decided to have groups of tests that run in infinite loops until something crashes (this is where kernel crash dumps become useful) or hangs (this is where the gdb support you can see in my VM config is useful, but you can also forcefully dump the core of a hung VM with the virsh dump command, and the crash tool knows how to use this too, which is convenient at times). Because of these choices, the testing discussed here is not a replacement for regular sanity testing, where you do want to look at regular test failures.
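
The looping idea can be sketched roughly as follows. This is an illustration only and not the actual test lines used: the command being looped is a placeholder, and run_forever and virsh_dump_command are hypothetical helper names. The loop deliberately ignores exit statuses, since only a crash or hang of the whole VM is meant to stop it.

```python
import itertools
import subprocess
import sys
from typing import Optional, Sequence


def run_forever(cmd: Sequence[str], max_iterations: Optional[int] = None) -> int:
    """Re-run `cmd` in a loop, ignoring its exit status.

    A kernel crash or hang takes the whole VM down, so nothing here tries
    to detect it. `max_iterations` exists only so the sketch can be tried
    out safely; leave it as None for an endless loop.
    """
    done = 0
    for i in itertools.count(1):
        if max_iterations is not None and i > max_iterations:
            break
        result = subprocess.run(cmd)
        done = i
        print(f"iteration {i}: exit status {result.returncode}", flush=True)
    return done


def virsh_dump_command(domain: str, core_path: str) -> list:
    """Build the host-side command that dumps the core of a (hung) VM."""
    return ["virsh", "dump", domain, core_path]


if __name__ == "__main__":
    # Placeholder command standing in for a real test script.
    run_forever([sys.executable, "-c", "pass"], max_iterations=3)
    # Hypothetical domain name and output path.
    print(" ".join(virsh_dump_command("testvm1", "/tmp/testvm1.core")))
```

The first helper would run inside a VM; the second only builds the command line that would be run on the host once a VM stops responding.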

So far all my testing starts with:

and then a selection of one of the below actual testing lines:

Typically, with 11 VMs at my disposal, I have 2 running the sanity* loop, and 3 each running the racer loop, the replay* loop and the recovery-small loop.
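
The split of loops across the 11 VMs can be written down as a simple mapping. In the sketch below the loop names and counts come from the text, while the VM names and the assign helper are hypothetical.

```python
# Loop names and counts are taken from the text; VM names are hypothetical.
LOOPS = {"sanity*": 2, "racer": 3, "replay*": 3, "recovery-small": 3}


def assign(vm_names, loops=LOOPS):
    """Map each VM to one loop, filling the loops in the order given."""
    if len(vm_names) != sum(loops.values()):
        raise ValueError("number of VMs must match the number of loop slots")
    slots = [name for name, count in loops.items() for _ in range(count)]
    return dict(zip(vm_names, slots))


if __name__ == "__main__":
    vms = [f"vm{i}" for i in range(1, 12)]   # 11 hypothetical VM names
    for vm, loop in assign(vms).items():
        print(vm, "->", loop)
```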

The conf-sanity loop is rarely used since it's currently broken when run from a build tree. Quite a few extra tests could also be made to run like this; I am just not running them yet.