Testing Setup To Induce Race Conditions

From Lustre Wiki
=== Rationale ===

Typically, developer-scale testing only ensures basic correctness of the code, but as complex products such as Lustre get deployed on really large systems and subjected to really high loads, all sorts of unlikely race conditions and failure modes tend to crop up.

This write-up explains how to achieve extra-harsh testing on regular hardware, without involving super-scaled systems.

The approach turned out to be a lot more powerful than originally anticipated: in its early life, a rare race condition that took about a week to manifest itself on a Top 10-class supercomputer took only about 15 minutes to hit in this setup.
  
=== Opening up race windows: the theory ===

Race conditions typically have a very small race window, sometimes only one CPU instruction long, so they are hard to hit. This is where virtual machines come to the rescue. While it would be feasible to create a full CPU emulator with random delays between every instruction, that is too labor-intensive.

An alternative approach is to create several virtual machines with many CPU cores allocated, such that the total number of cores across these VMs is much larger than the number of CPU cores actually available on the host. When all of these virtual machines run CPU-heavy loads at the same time, the host kernel preempts them at random intervals, introducing big delays into what each guest core perceives as a single instruction stream. Additional CPU pressure can be exerted from outside the virtual machines by running other CPU-heavy loads on the host.

When creating these VMs, another important consideration is memory allocation: we do not want the virtual machines to dip into swap, as that would make them really slow.
=== My particular setup ===

Initially I had two systems at my disposal: a desktop with a 4-core i7 with HT (showing 8 cores to the host) and 32G of RAM, and a laptop with a 4-core mobile i7 with HT (also showing 8 cores to the host) and 16G of RAM.

I decided that 3G of RAM should be enough for each of the virtual machines in question, which gave me 7 VMs on the desktop (occupying 21G of RAM) and 4 VMs on the laptop (occupying 12G of RAM). Every VM also got a dedicated "virtual block device", backed by an SSD, that it used as swap. Virtual-CPU-wise, every VM got 8 CPU cores allocated.
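The sizing above is simple arithmetic, and it is worth sanity-checking a budget like this before creating the VMs; a sketch with the numbers from this setup hard-coded (adjust to your hardware):

```shell
# Sanity-check the memory budget: per-VM RAM times VM count must leave
# headroom for the host itself, or the VMs dip into swap and slow to a crawl.
VM_RAM_GB=3
DESKTOP_VMS=7; DESKTOP_RAM_GB=32
LAPTOP_VMS=4;  LAPTOP_RAM_GB=16

echo "desktop: $((DESKTOP_VMS * VM_RAM_GB))G of ${DESKTOP_RAM_GB}G committed"
echo "laptop:  $((LAPTOP_VMS * VM_RAM_GB))G of ${LAPTOP_RAM_GB}G committed"

# CPU overcommit per host: 8 vCPUs per VM against 8 logical host cores.
# A large ratio here is exactly what opens up the race windows.
echo "desktop vCPU overcommit: $((DESKTOP_VMS * 8)) vCPUs on 8 cores"
```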
Here is the config of one such VM (libvirt on Fedora is used):
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>centos6-0</name>
  <uuid>c80b11ad-552b-aaaa-9bdf-48807db09054</uuid>
  <memory unit='KiB'>3097152</memory>
  <currentMemory unit='KiB'>3097152</currentMemory>
  <vcpu>8</vcpu>
  <os>
    <type arch='x86_64' machine='pc'>hvm</type>
    <boot dev='network'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/vg_intelbox/centos6-0'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>
    <controller type='usb' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:a1:ce:de'/>
      <source bridge='br1'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='spicevmc'>
      <target type='virtio' name='com.redhat.spice.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'/>
    <input type='mouse' bus='ps2'/>
    <graphics type='spice' autoport='yes'/>
    <video>
      <model type='qxl' vram='65536' heads='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </memballoon>
  </devices>
  <qemu:commandline>
    <qemu:arg value='-gdb'/>
    <qemu:arg value='tcp::1200'/>
  </qemu:commandline>
</domain>
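With the XML saved to a file, the VM is registered and started with the standard libvirt tools; a sketch (the XML file name is an assumption):

```shell
# Register the domain with libvirt and start it (file name is an example).
virsh define centos6-0.xml
virsh start centos6-0

# The qemu:commandline stanza above exposes a gdb stub on TCP port 1200,
# so a hung guest kernel can be inspected live from the host with a
# vmlinux that has debug info:
gdb ./vmlinux -ex 'target remote localhost:1200'
```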
To conserve space and effort, all these VMs are network-booted and NFS-rooted off the same NFS root (Red Hat-based distros allow this really easily).

The build happens inside a chrooted session into the NFS root, which lets me go into that same directory in every VM and run the tests straight out of the build tree, without any need for building RPMs and such.
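The chroot build flow sketched below shows the general idea; the NFS root path is an assumption, and the build commands are the usual lustre-release ones rather than the exact invocation used here:

```shell
# Build inside the shared NFS root; every VM that mounts this tree as /
# then sees the finished build tree directly. The NFSROOT path is an example.
NFSROOT=/srv/nfsroot/centos6
mount --bind /proc "$NFSROOT/proc"   # give the chroot a working /proc
chroot "$NFSROOT" /bin/bash -c '
  cd /home/green/git/lustre-release &&
  sh autogen.sh && ./configure && make
'
```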
It is also very important to configure kernel crash dump support.
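On a RHEL 6-era guest, crash dump support boils down to reserving memory for a capture kernel and enabling the kdump service; a minimal sketch (the reservation size is a typical example, not a recommendation):

```shell
# 1. Reserve memory for the capture kernel by adding, e.g.,
#      crashkernel=128M
#    to the kernel command line in the bootloader config, then reboot.

# 2. Enable and start kdump (RHEL 6 style; use 'systemctl enable --now kdump'
#    on systemd-based systems).
chkconfig kdump on
service kdump start

# After a crash, the vmcore lands under /var/crash/ by default and can be
# examined with the crash utility against a matching debug-info vmlinux.
```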
=== Additional protections with kernel options ===

Since we are after correctness here, another natural step is to enable all the heavy-handed kernel checks that are typically left off in production deployments: especially really expensive things like unmapping of freed memory, which allows very easy detection of use-after-free errors, but also spinlock checking and other useful checks.

This is what the kernel-debugging part of my kernel config looks like for the RHEL 6 kernel (it is even more extensive for newer ones):
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SHIRQ=y
CONFIG_DETECT_SOFTLOCKUP=y
CONFIG_LOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y
CONFIG_BOOTPARAM_HARDLOCKUP_ENABLED=y
CONFIG_BOOTPARAM_HARDLOCKUP_ENABLED_VALUE=1
CONFIG_DETECT_HUNG_TASK=y
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_DEBUG_NMI_TIMEOUT=30
CONFIG_TIMER_STATS=y
CONFIG_DEBUG_OBJECTS=y
# CONFIG_DEBUG_OBJECTS_SELFTEST is not set
CONFIG_DEBUG_OBJECTS_FREE=y
# CONFIG_DEBUG_OBJECTS_TIMERS is not set
CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_STACKTRACE=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
CONFIG_DEBUG_WRITECOUNT=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_DEBUG_LIST=y
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
CONFIG_DEBUG_PAGEALLOC=y
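If you build your own kernel, options like these can be flipped with the kernel tree's own scripts/config helper instead of editing .config by hand; a sketch, assuming a reasonably recent, already-configured source tree:

```shell
# Run from the top of a configured kernel source tree; scripts/config
# edits .config in place (option names are given without the CONFIG_ prefix).
scripts/config --enable DEBUG_SLAB \
               --enable DEBUG_SPINLOCK \
               --enable DEBUG_MUTEXES \
               --enable DEBUG_PAGEALLOC

# Resolve any dependent options that the changes exposed.
make olddefconfig
```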
=== Running actual tests ===

Since the majority of the bugs we are aiming at here manifest as crashes, I decided to have groups of tests run in infinite loops until something crashes (this is where kernel crash dumps become useful) or hangs (this is where the gdb support you can see in my VM config is useful; you can also forcibly dump core of a hung VM with the virsh dump command, and the crash tool knows how to use such dumps too, which is convenient at times).
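Capturing a core of a hung VM from the host looks roughly like this (the domain name matches the config above; the output and vmlinux paths are examples):

```shell
# Forcibly dump the guest's memory without shutting it down; --memory-only
# produces a dump the crash utility understands.
virsh dump centos6-0 /var/tmp/centos6-0.core --memory-only

# Inspect the dump against a guest vmlinux built with debug info.
crash /path/to/vmlinux /var/tmp/centos6-0.core
```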
Because of these choices, the testing discussed here is not a replacement for regular sanity testing, where you do want to look at ordinary test failures.
So far, all my testing starts with:

slogin [email protected]$VMTESTNODE
cd /home/green/git/lustre-release/lustre/tests
and then a selection of one of the actual testing lines below:

while :; do rm -rf /tmp/* ; SLOW=yes REFORMAT=yes sh sanity.sh ; sh llmountcleanup.sh ; rm -rf /tmp/* ; SLOW=yes REFORMAT=yes sh sanityn.sh ; sh llmountcleanup.sh ; done

while :; do rm -rf /tmp/* ; SLOW=yes REFORMAT=yes DURATION=$((900*3)) PTLDEBUG="vfstrace rpctrace dlmtrace neterror ha config ioctl super cache" DEBUG_SIZE=100 sh racer.sh ; sh llmountcleanup.sh ; done

SLOW=yes TSTID=500 TSTID2=499 TSTUSR=green TSTUSR2=saslauth sh sanity-quota.sh

while :; do rm -rf /tmp/* ; SLOW=yes REFORMAT=yes sh replay-single.sh ; sh llmountcleanup.sh ; SLOW=yes REFORMAT=yes sh replay-ost-single.sh ; sh llmountcleanup.sh ; SLOW=yes REFORMAT=yes sh replay-dual.sh ; sh llmountcleanup.sh ; done

while :; do rm -rf /tmp/* ; SLOW=yes REFORMAT=yes sh recovery-small.sh ; sh llmountcleanup.sh ; done

while :; do rm -rf /tmp/* ; SLOW=yes REFORMAT=yes sh conf-sanity.sh ; sh llmountcleanup.sh ; for i in `seq 0 7` ; do losetup -d /dev/loop$i ; done ; rm -rf /tmp/* ; done
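For bookkeeping, a loop like these can be wrapped in a small helper that timestamps each pass, so that after a crash the log tells you how many iterations the VM survived; a sketch (the helper is made up, not part of the Lustre test suite):

```shell
# run_loop: repeat a command until it fails, printing a timestamped pass count.
# A variant of the bare 'while :' loops above for when you want iteration logs.
run_loop() {
  local i=0
  while "$@"; do
    i=$((i + 1))
    echo "$(date '+%F %T') pass $i of: $*"
  done
  echo "command failed on pass $((i + 1)): $*"
}

# Example:
#   run_loop sh -c 'SLOW=yes REFORMAT=yes sh sanity.sh; sh llmountcleanup.sh'
```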
Typically, with 11 VMs at my disposal, I have 2 running the sanity* loop, and 3 each running the racer loop, the replay* loop, and the recovery-small loop.

The conf-sanity loop is rarely used, since it is currently broken when run out of a build tree.

As you can see, quite a number of extra tests could also be made to run like this that I am not yet running.

Latest revision as of 14:23, 3 July 2020
