Testing Setup To Induce Race Conditions

Rationale

Typically, developer-scale testing only ensures basic correctness of the code, but when complex products such as Lustre are deployed on very large systems and subjected to very high loads, all sorts of unlikely race conditions and failure modes tend to crop up. This write-up explains how to achieve extra hardening in testing on regular hardware, without involving super-scaled systems. The approach turned out to be a lot more powerful than originally anticipated: early on, a rare race condition that took about a week to manifest itself on a Top 10 class supercomputer was hit in about 15 minutes with this setup on a single system.

Opening up race windows: theory

Race conditions typically have a very small race window, sometimes only one CPU instruction long, so they are hard to hit. This is where virtual machines come to the rescue. While it would be possible to build a full CPU emulator that inserts random delays between every instruction, that is much too labor intensive. An alternative approach is to create several virtual machines with many CPU cores allocated, such that the total number of cores across these VMs is much larger than the number of CPU cores actually available on the host. When all of these virtual machines run CPU-heavy loads at the same time, the host kernel preempts them at random intervals, introducing large delays into what the guest perceives as a single instruction stream on one core, while the other cores in that VM continue at full speed. Additional CPU pressure can be exerted from outside the virtual machines by running extra CPU-heavy loads on the host itself, as sketched below.
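
For example, a few plain busy loops are enough to keep the host CPUs oversubscribed; this is just a minimal sketch of one way to do it, not part of the original setup:

# One busy-looping subshell per host CPU; kill them (or close the shell)
# when the test run is done.
for i in $(seq 1 $(nproc)); do
    ( while :; do :; done ) &
done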

When creating these VMs, another important consideration is memory allocation. Unlike CPU, memory should not be overcommitted: we do not want the virtual machines to end up in swap, as that would make them really slow.

My particular setup

Initially I had two systems at my disposal: a desktop with a 4-core i7 with HT (showing 8 cores to the host) and 32G of RAM, and a laptop with a 4-core mobile i7 with HT (also showing 8 cores to the host) and 16G of RAM. I decided that 3G of RAM should be enough for each of the virtual machines in question, which gave me 7 VMs on the desktop (occupying 21G of RAM) and 4 VMs on the laptop (occupying 12G of RAM). Every VM also got a dedicated "virtual block device", backed by an SSD, that it used as swap. Virtual CPU-wise, every VM was configured with 8 CPU cores. Here is the config of one such VM (libvirt on Fedora is used):

 <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>centos6-0</name>
  <uuid>c80b11ad-552b-aaaa-9bdf-48807db09054</uuid>
  <memory unit='KiB'>3097152</memory>
  <currentMemory unit='KiB'>3097152</currentMemory>
  <vcpu>8</vcpu>
  <os>
    <type arch='x86_64' machine='pc'>hvm</type>
    <boot dev='network'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/vg_intelbox/centos6-0'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>
    <controller type='usb' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:a1:ce:de'/>
      <source bridge='br1'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='spicevmc'>
      <target type='virtio' name='com.redhat.spice.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'/>
    <input type='mouse' bus='ps2'/>
    <graphics type='spice' autoport='yes'/>
    <video>
      <model type='qxl' vram='65536' heads='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </memballoon>
  </devices>
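  <!-- Expose a QEMU gdb stub on TCP port 1200, so a wedged guest kernel
       can be inspected from the host with gdb (see "Running actual tests"). -->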
  <qemu:commandline>
    <qemu:arg value='-gdb'/>
    <qemu:arg value='tcp::1200'/>
  </qemu:commandline>
 </domain>
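
To put such a config into use, something along these lines works (the XML file name here is illustrative):

virsh define centos6-0.xml
virsh start centos6-0
# The serial console (and the gdb stub on port 1200) are then reachable
# from the host:
virsh console centos6-0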

In order to conserve space and effort, all of these VMs are network booted and NFS-rooted off the same shared NFS root (Red Hat-based distros make this fairly easy; I cannot find the exact HOWTO I followed, but it was roughly along the lines of Open Shared Root). The build happens inside a chrooted session into that NFS root, which lets me go into the same directory in every VM and run the tests straight out of the build tree, without any need for building RPMs and the like. It is also very important to configure kernel crash dump support.
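
On a RHEL6-style guest, the crash dump part boils down to reserving memory for the capture kernel on the kernel command line and enabling the kdump service. A minimal sketch, with the reservation size and the PXE details being illustrative rather than taken from my setup:

# In the PXE/boot loader entry for the guests, append a reservation for the
# capture kernel, e.g.:
#   ... crashkernel=128M
# Then, inside the guest:
chkconfig kdump on
service kdump start
# After a crash, the vmcore ends up under /var/crash/ by default and can be
# opened with the crash utility against a matching debuginfo vmlinux.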

Additional protections with kernel debug options

Since we are chasing correctness here, another natural thing to do is to enable all of the heavy-handed kernel checks that are typically turned off for production deployments. This especially means really expensive things such as unmapping of freed memory, which makes use-after-free errors very easy to detect, but also spinlock checking and other useful checks. Here is what the kernel debugging part of my kernel config looks like for the RHEL6 kernel (it is even more extensive for newer ones):

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SHIRQ=y
CONFIG_DETECT_SOFTLOCKUP=y
CONFIG_LOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y
CONFIG_BOOTPARAM_HARDLOCKUP_ENABLED=y
CONFIG_BOOTPARAM_HARDLOCKUP_ENABLED_VALUE=1
CONFIG_DETECT_HUNG_TASK=y
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_DEBUG_NMI_TIMEOUT=30
CONFIG_TIMER_STATS=y
CONFIG_DEBUG_OBJECTS=y
# CONFIG_DEBUG_OBJECTS_SELFTEST is not set
CONFIG_DEBUG_OBJECTS_FREE=y
# CONFIG_DEBUG_OBJECTS_TIMERS is not set
CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_STACKTRACE=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
CONFIG_DEBUG_WRITECOUNT=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_DEBUG_LIST=y
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
CONFIG_DEBUG_PAGEALLOC=y
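
A quick sanity check that these options actually made it into the running guest kernel (assuming the usual /boot/config-* file is installed):

grep -E 'DEBUG_PAGEALLOC|DEBUG_SLAB|DEBUG_SPINLOCK|DEBUG_OBJECTS' /boot/config-$(uname -r)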

Running actual tests

Since the majority of the bugs we are aiming at here manifest as crashes, I decided to have groups of tests that run in infinite loops until something crashes (this is where kernel crash dumps become useful) or hangs (this is where the gdb support visible in my VM config comes in handy; you can also forcibly dump the core of a hung VM with the virsh dump command, and the crash tool knows how to work with such dumps, which is convenient at times). Because of these choices, the testing discussed here is not a replacement for regular sanity testing, where you do want to look at individual test failures.
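
For hung VMs, both approaches look roughly like this; the domain name, core file path, and debuginfo vmlinux location are illustrative, and virsh dump --memory-only needs a reasonably recent libvirt:

# Attach to the gdb stub exposed in the VM config (tcp::1200) from the host,
# using a vmlinux with debug info that matches the guest kernel:
gdb /path/to/guest/vmlinux -ex 'target remote localhost:1200'

# Or force a core dump of the hung guest and inspect it with crash:
virsh dump --memory-only centos6-0 /var/tmp/centos6-0.core
crash /path/to/guest/vmlinux /var/tmp/centos6-0.core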

So far all my testing starts with:

slogin root@$VMTESTNODE
cd /home/green/git/lustre-release/lustre/tests

and then one of the actual test loops below:

while :; do rm -rf /tmp/* ; SLOW=yes REFORMAT=yes sh sanity.sh ; sh llmountcleanup.sh ; rm -rf /tmp/* ; SLOW=yes REFORMAT=yes sh sanityn.sh ; sh llmountcleanup.sh ;done
while :; do rm -rf /tmp/* ; SLOW=yes REFORMAT=yes DURATION=$((900*3)) PTLDEBUG="vfstrace rpctrace dlmtrace neterror ha config ioctl super cache" DEBUG_SIZE=100 sh racer.sh ; sh llmountcleanup.sh ; done
SLOW=yes TSTID=500 TSTID2=499 TSTUSR=green TSTUSR2=saslauth sh sanity-quota.sh
while :; do rm -rf /tmp/* ; SLOW=yes REFORMAT=yes sh replay-single.sh ; sh llmountcleanup.sh ; SLOW=yes REFORMAT=yes sh replay-ost-single.sh ; sh llmountcleanup.sh ; SLOW=yes REFORMAT=yes sh replay-dual.sh ; sh llmountcleanup.sh ; done
while :; do rm -rf /tmp/* ; SLOW=yes REFORMAT=yes sh recovery-small.sh ; sh llmountcleanup.sh ; done
while :; do rm -rf /tmp/* ; SLOW=yes REFORMAT=yes sh conf-sanity.sh ; sh llmountcleanup.sh ; for i in `seq 0 7` ; do losetup -d /dev/loop$i ; done ; rm -rf /tmp/* ; done

Typically, with 11 VMs at my disposal, I have 2 running the sanity* loop, and 3 each running the racer loop, the replay* loop, and the recovery-small loop.

The conf-sanity loop is rarely used, since it is currently broken when run directly out of a build tree. As you can see, quite a few more test suites could also be run this way that I am not yet covering.