Lustre Quota Troubleshooting

From Lustre Wiki
Revision as of 07:16, 23 December 2015 by Rlaifer (talk | contribs)

Introduction

In the past, Lustre quotas frequently caused problems. One reason is the large number of implementation changes, e.g. adaptations to the underlying file systems.

Recently, the bug described in https://jira.hpdd.intel.com/browse/LU-4345 (fix for the b2_5 branch at http://review.whamcloud.com/11435) caused quotas to become incorrect. If you have been running file systems with Lustre 2.4, your quotas are probably still (slightly) incorrect.

During the first part of his talk at LAD'15, Roland Laifer explained tools for investigating incorrect quotas. Links to the tools and further information are given below.

Adding to This Guide

If you have improvements, corrections, or more information to share on this topic, please contribute to this page. Ideally this would become a community resource.

Quota Troubleshooting Tools

du

If you know that all data of a user or group is located below one directory, you can simply run 'du -hks <directory>' and compare the output with that of 'lfs quota -u <user> <directory>' or 'lfs quota -g <group> <directory>'. Small differences are normal if there is I/O activity while the commands are running. Note that du counts the size of hard-linked files only once.
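
The comparison can also be scripted. The following Python sketch is only an illustration and not one of the tools from the talk. It assumes that 'lfs quota' prints a header line containing the word 'kbytes', followed by the file system name and then the kbytes value (possibly on the next line if the name is long); this layout may differ between Lustre versions. All function names are hypothetical.

```python
import subprocess


def du_kbytes(directory):
    """Return the usage of `directory` in KiB as reported by 'du -ks'."""
    out = subprocess.run(
        ["du", "-ks", directory], check=True, capture_output=True, text=True
    ).stdout
    # du prints "<kbytes><TAB><path>"; only the first field is needed.
    return int(out.split()[0])


def parse_lfs_quota_kbytes(text):
    """Extract the 'kbytes' value from 'lfs quota' output.

    Assumption: a header line containing 'kbytes' is followed by the
    file system name and then the kbytes value, possibly on the next
    line when the name is long. A trailing '*' (over quota) is dropped.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if "kbytes" in line.split():
            tokens = " ".join(lines[i + 1:]).split()
            return int(tokens[1].rstrip("*"))
    raise ValueError("no 'kbytes' header found in lfs quota output")


def lfs_quota_kbytes(flag, name, directory):
    """Run 'lfs quota <flag> <name> <directory>'; flag is '-u' or '-g'."""
    out = subprocess.run(
        ["lfs", "quota", flag, name, directory],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_lfs_quota_kbytes(out)


def relative_difference(du_kb, quota_kb):
    """Difference between both values relative to the larger one."""
    largest = max(du_kb, quota_kb)
    return 0.0 if largest == 0 else abs(du_kb - quota_kb) / largest
```

For a user 'alice' with all data below '/lustre/home/alice' (both names are made up), relative_difference(du_kbytes("/lustre/home/alice"), lfs_quota_kbytes("-u", "alice", "/lustre/home/alice")) gives the deviation as a fraction. Which deviation is still acceptable depends on the I/O activity during the measurement.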

check_quotas.pl

The Perl script check_quotas.pl scans a complete file system, executes a stat for each file, adds up the used capacity for each user or group, and compares the result with the output of 'lfs quota'. To avoid counting the size of hard-linked files more than once, the tool stores all inode numbers internally and therefore needs a lot of main memory. You can download the tool here: https://www.scc.kit.edu/scc/sw/lustre_tools/check_quotas.tgz
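
The scanning technique described above can be sketched in a few lines of Python. This is not the code of check_quotas.pl but a minimal illustration of the same idea, under the assumptions that allocated space is taken from st_blocks (512-byte units on Linux), that directories and symbolic links are counted as well, and that the start directory itself is not counted. The function name scan_usage is hypothetical.

```python
import os
from collections import defaultdict


def scan_usage(root):
    """Walk `root` and add up allocated bytes and inode counts per owner.

    Returns (by_uid, by_gid); each maps an id to [bytes, inodes].
    """
    by_uid = defaultdict(lambda: [0, 0])
    by_gid = defaultdict(lambda: [0, 0])
    seen = set()  # (device, inode) pairs that were already counted

    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                # lstat does not follow symbolic links
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished while scanning
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # another hard link to an inode counted before
            seen.add(key)
            # st_blocks is given in 512-byte units on Linux
            allocated = st.st_blocks * 512
            for table, owner in ((by_uid, st.st_uid), (by_gid, st.st_gid)):
                table[owner][0] += allocated
                table[owner][1] += 1
    return by_uid, by_gid
```

The set of seen inodes is what makes the memory consumption grow with the number of files, which matches the remark above about main memory. The per-user byte totals divided by 1024 can then be compared with the kbytes value of 'lfs quota'.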

compare_user_group_ost_quotas.pl

If both user and group quotas are activated, you can add up all user quotas and all group quotas reported by the OSS for each OST. (On the OSS you can get this information with 'lctl get_param osd-ldiskfs.<ost_name>.quota_slave.acct_user' and 'lctl get_param osd-ldiskfs.<ost_name>.quota_slave.acct_group'.) The two sums should be nearly identical, but our investigation showed that this is not the case, especially for older file systems. We do not yet understand the reason for this difference. The Perl script compare_user_group_ost_quotas.pl makes this comparison easy. It can be downloaded here: https://www.scc.kit.edu/scc/sw/lustre_tools/compare_user_group_ost_quotas.tgz
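
The comparison itself is simple, as the following Python sketch shows. It is not the code of compare_user_group_ost_quotas.pl. It assumes that the saved output of the two lctl commands contains one line with the parameter name ending in '=' per OST, followed by entries with a 'kbytes: <number>' field for each ID; this layout is an assumption and may differ between Lustre versions. The function names are hypothetical.

```python
import re

# Assumed usage line layout: "usage: { inodes: 262, kbytes: 1092 }"
KBYTES_RE = re.compile(r"kbytes:\s*(\d+)")
# Parameter name line, e.g. "osd-ldiskfs.<ost_name>.quota_slave.acct_user="
PARAM_RE = re.compile(r"^osd-ldiskfs\.(\S+)\.quota_slave\.acct_(user|group)=")


def sum_kbytes_per_ost(text):
    """Return {(ost_name, 'user' or 'group'): total kbytes} from lctl output."""
    totals = {}
    current = None
    for line in text.splitlines():
        header = PARAM_RE.match(line.strip())
        if header:
            current = (header.group(1), header.group(2))
            totals[current] = 0
            continue
        if current is not None:
            for value in KBYTES_RE.findall(line):
                totals[current] += int(value)
    return totals


def compare_user_group(totals):
    """Return {ost_name: (user_kb, group_kb, difference)} if both sums exist."""
    result = {}
    for (ost, kind), user_kb in totals.items():
        if kind != "user" or (ost, "group") not in totals:
            continue
        group_kb = totals[(ost, "group")]
        result[ost] = (user_kb, group_kb, user_kb - group_kb)
    return result
```

Feeding the concatenated output of both lctl commands into sum_kbytes_per_ost and the result into compare_user_group yields one line of figures per OST; a non-zero difference indicates the inconsistency described above.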

References and Links