Lustre Quota Troubleshooting
Introduction
In the past, Lustre quotas frequently caused problems. One reason is the large number of implementation changes, e.g. adaptations to the underlying file systems.
Recently, the bug described in https://jira.hpdd.intel.com/browse/LU-4345 (fix for the b2_5 branch at http://review.whamcloud.com/11435) caused quotas to become incorrect. If you have been running file systems with Lustre 2.4, your quotas are probably still (slightly) incorrect.
During the first part of his talk at LAD'15, Roland Laifer explained tools for investigating incorrect quotas. Links to the tools and further information are given below.
Adding to This Guide
If you have improvements, corrections, or more information to share on this topic, please contribute to this page. Ideally, this would become a community resource.
Quota Troubleshooting Tools
du
If you know that a user or group has all of their data below a single directory, you can simply run 'du -hks <directory>' and compare the output with that of 'lfs quota -u <user> <directory>' or 'lfs quota -g <group> <directory>'. Small differences are normal if there is I/O activity while the commands are running. Note that du counts a hard-linked file only once, no matter how many links point to it.
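The comparison above can be sketched in Python. This is a minimal illustration, not one of the tools from the talk: the function names and the 1% tolerance are assumptions made for the example, and the sketch calls 'du -ks' (without -h) so that the output is a plain number of kilobytes. The quota value is expected to be the kbytes figure read from the 'lfs quota' output by hand.

```python
import subprocess


def parse_du_kbytes(du_output):
    """Extract the size from 'du -ks' output, which is '<kbytes><TAB><path>'."""
    return int(du_output.split()[0])


def du_kbytes(directory):
    """Run 'du -ks <directory>' and return the reported total in kilobytes."""
    result = subprocess.run(
        ["du", "-ks", directory],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_du_kbytes(result.stdout)


def relative_difference(du_kb, quota_kb):
    """Return |du - quota| relative to the larger value (0.0 if both are 0)."""
    largest = max(du_kb, quota_kb)
    if largest == 0:
        return 0.0
    return abs(du_kb - quota_kb) / largest


def looks_consistent(du_kb, quota_kb, tolerance=0.01):
    """True if both values agree within the (hypothetical) tolerance."""
    return relative_difference(du_kb, quota_kb) <= tolerance
```

For example, looks_consistent(du_kbytes("/lustre/home/alice"), 1048576) would compare the du result for a hypothetical directory with a kbytes value of 1048576 taken from 'lfs quota'. A suitable tolerance depends on how much I/O is running during the check.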
check_quotas.pl
The Perl script check_quotas.pl can be used to scan a complete file system, execute a stat for each file, add up the used capacity for each user or group, and compare the result with the output of 'lfs quota'. To avoid counting hard-linked files more than once, the tool stores all inode numbers internally and therefore needs a lot of main memory. You can download the tool here: https://www.scc.kit.edu/scc/sw/lustre_tools/check_quotas.tgz
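The scanning approach described above can be illustrated with a short Python sketch. It is not the code of check_quotas.pl; the function name scan_usage is hypothetical, and the sketch assumes a Linux system where st_blocks is given in 512-byte units. It counts directories as well as files, and it keeps every (device, inode) pair in memory, which shows why such a scan needs a lot of main memory on large file systems.

```python
import os
from collections import defaultdict


def scan_usage(root):
    """Return {uid: (kbytes, inodes)} for root and everything below it."""
    blocks = defaultdict(int)   # uid -> allocated 512-byte blocks
    inodes = defaultdict(int)   # uid -> number of distinct inodes
    seen = set()                # (st_dev, st_ino) pairs already counted

    def account(path):
        try:
            st = os.lstat(path)  # lstat: do not follow symbolic links
        except OSError:
            return               # entry vanished or is unreadable
        key = (st.st_dev, st.st_ino)
        if key in seen:
            return               # hard link to an inode counted earlier
        seen.add(key)
        blocks[st.st_uid] += st.st_blocks
        inodes[st.st_uid] += 1

    account(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            account(os.path.join(dirpath, name))

    # Two 512-byte blocks make one kilobyte.
    return {uid: (blocks[uid] // 2, inodes[uid]) for uid in inodes}
```

The per-user kbytes and inode totals can then be compared with the kbytes and files columns of 'lfs quota -u <user> <directory>'. Grouping by st_gid instead of st_uid would give the corresponding group totals.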
compare_user_group_ost_quotas.pl
If both user and group quotas are activated, you can add up all user quotas and all group quotas reported by the OSS for each OST. (On the OSS you can get this information with 'lctl get_param osd-ldiskfs.<ost_name>.quota_slave.acct_user' and 'lctl get_param osd-ldiskfs.<ost_name>.quota_slave.acct_group'.) The two sums should be nearly identical, but our investigation showed that they are not, especially for older file systems. We do not yet understand the reason for this difference. The Perl script compare_user_group_ost_quotas.pl makes this comparison easy. It can be downloaded here: https://www.scc.kit.edu/scc/sw/lustre_tools/compare_user_group_ost_quotas.tgz
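The summation described above can be sketched as follows. This is an illustration only, not the code of compare_user_group_ost_quotas.pl. It assumes that the acct_user and acct_group output lists one "kbytes: <number>" field per ID; check this against the output of your Lustre version before relying on it. The function names are hypothetical.

```python
import re

# Assumption: each accounting entry carries a field of the form "kbytes: <number>".
KBYTES_RE = re.compile(r"kbytes:\s*(\d+)")


def total_kbytes(acct_output):
    """Add up all kbytes values found in one acct_user or acct_group dump."""
    return sum(int(value) for value in KBYTES_RE.findall(acct_output))


def compare_totals(acct_user_output, acct_group_output):
    """Return (user_kbytes, group_kbytes, difference) for one OST."""
    user_kb = total_kbytes(acct_user_output)
    group_kb = total_kbytes(acct_group_output)
    return user_kb, group_kb, user_kb - group_kb
```

To use it, save the output of the two lctl commands for one OST to files, read them into strings, and pass them to compare_totals. A difference that stays large on an idle file system points to the inconsistency described above.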
References and Links
- Roland Laifer, "Lustre tools for ldiskfs investigation and lightweight I/O statistics", LAD2015. http://www.eofs.eu/fileadmin/lad2015/slides/13_Roland_Laifer_kit_20150922.pdf