Handling Full OSTs
Latest revision as of 16:28, 24 May 2023
(Updated: Mar 2018)
Sometimes the OSTs in a file system have unbalanced usage, either due to the addition of new OSTs, or because of user error such as explicitly specifying the same starting OST index (e.g. -i 0) for a large number of files, or when creating a single large file on one OST. If an OST is full and an attempt is made to write more information to that OST (e.g. extending an existing file), an error may occur.
If file system usage is always growing and old files are not regularly deleted, plan to add OST capacity before the file system reaches 80% usage, and add at least 25% more capacity each time, so that the MDS can effectively balance space usage before free space becomes critically low on any OST.
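The 80% and 25% guidance above is simple arithmetic. The following is a minimal sketch, assuming total and used capacity are supplied as plain integers in the same unit (for example GiB). The function names needs_expansion and min_expansion are hypothetical and not part of any Lustre tool.

```shell
#!/bin/sh
# Sketch of the capacity-planning rule: expand at 80% usage, by at least 25%.
# Arguments are integers in one consistent unit (e.g. GiB). Names are hypothetical.

# needs_expansion TOTAL USED
# Succeeds (exit status 0) when usage is at or above 80% of TOTAL.
needs_expansion() {
    total=$1
    used=$2
    [ $((used * 100)) -ge $((total * 80)) ]
}

# min_expansion TOTAL
# Prints 25% of TOTAL, rounded up to the next whole unit.
min_expansion() {
    total=$1
    echo $(( (total * 25 + 99) / 100 ))
}
```

For example, `needs_expansion 12000 9700 && min_expansion 12000` prints the minimum amount to add for a 12000 GiB file system that holds 9700 GiB of data.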
The MDS will automatically reduce allocations on OSTs that have less free space than other OSTs (free space difference over 17%, controlled by lod.*.qos_threshold_rr). In many cases, this will be sufficient to put more new files on OSTs with more free space. This works well for files that are striped over a subset of OSTs (within a pool, if any), because the MDS can skip the OSTs that are low on space. However, if something is forcing the MDS to continue to allocate objects on those OSTs (e.g. using -c -1 to stripe over all available OSTs), then the MDS will continue to use the more-full OSTs until they are almost totally full (up to 99.9% by default, controlled by osp.*.reserved_mb_low).
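To get a rough idea of whether the free space difference is past the threshold described above, the gap can be estimated from lfs df output. This is a sketch under stated assumptions: it expects plain lfs df output (block counts, without -h) on standard input, and the function name free_space_gap is hypothetical. It only approximates the comparison and is not the exact calculation the MDS performs. The current threshold can be read on the MDS with lctl get_param lod.*.qos_threshold_rr.

```shell
#!/bin/sh
# free_space_gap
# Reads `lfs df` output (numeric block counts, no -h) on stdin and prints the
# difference between the largest and smallest OST "Available" value as a
# whole-number percentage of the largest. Hypothetical helper, approximation only.
free_space_gap() {
    awk '
        # Only consider OST lines, e.g. testfs-OST0002_UUID
        $1 ~ /-OST[0-9a-fA-F]+_UUID$/ {
            avail = $4 + 0
            if (n == 0 || avail > max) max = avail
            if (n == 0 || avail < min) min = avail
            n++
        }
        END {
            if (n == 0 || max == 0) { print 0; exit }
            printf "%d\n", (max - min) * 100 / max
        }'
}
```

A typical use would be `lfs df /mnt/testfs | free_space_gap`, comparing the printed value with the configured threshold.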
The procedures below describe options on how to manually reduce the usage of a full OST.
Checking File System Usage
The example below shows an unbalanced file system:
<pre>
[root@client01 ~]# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
testfs-MDT0000_UUID         4.4G      214.5M        3.9G   4% /mnt/testfs[MDT:0]
testfs-MDT0001_UUID         4.4G      144.5M        4.0G   4% /mnt/testfs[MDT:1]
testfs-OST0000_UUID         2.0T      751.3G        1.1T  37% /mnt/testfs[OST:0]
testfs-OST0001_UUID         2.0T      755.3G        1.1T  37% /mnt/testfs[OST:1]
testfs-OST0002_UUID         2.0T        1.9T       55.1M  99% /mnt/testfs[OST:2] <-
testfs-OST0003_UUID         2.0T      751.3G        1.1T  37% /mnt/testfs[OST:3]
testfs-OST0004_UUID         2.0T      747.3G        1.1T  37% /mnt/testfs[OST:4]
testfs-OST0005_UUID         2.0T      743.3G        1.1T  36% /mnt/testfs[OST:5]

filesystem summary:        11.8T        5.5T        5.7T  46% /mnt/testfs
</pre>
In this case, OST0002 is almost full and when an attempt is made to write additional data to the file system (with uniform striping over all the OSTs), the write command fails after the OST0002 stripe uses the remaining 100MiB on that OST:
<pre>
[root@client01 ~]# lfs setstripe -c -1 /mnt/testfs
[root@client01 ~]# dd if=/dev/zero of=/mnt/testfs/test_3 bs=10M count=100
dd: writing `/mnt/testfs/test_3': No space left on device
98+0 records in
97+0 records out
1017192448 bytes (1.0 GB) copied, 23.2411 seconds, 43.8 MB/s
</pre>
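Spotting a nearly full OST by eye becomes tedious on file systems with many OSTs. The following is a minimal sketch, assuming lfs df output (with or without -h) is piped in on standard input and that Use% is the fifth column, as in the example above. The function name full_osts and the threshold argument are hypothetical.

```shell
#!/bin/sh
# full_osts THRESHOLD
# Reads `lfs df` output on stdin and prints each OST whose Use% is at or
# above THRESHOLD (an integer percentage). Hypothetical helper.
full_osts() {
    awk -v limit="$1" '
        # Match OST lines only; MDT lines and the summary line are skipped.
        $1 ~ /-OST[0-9a-fA-F]+_UUID$/ {
            pct = $5
            sub(/%/, "", pct)
            if (pct + 0 >= limit + 0) print $1, pct "%"
        }'
}
```

For example, `lfs df /mnt/testfs | full_osts 90` would list only the OSTs that are at least 90% full.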
Disabling MDS Object Creation on OST
To enable continued use of the file system, the full OST has to have object creation disabled. This needs to be done on all MDS nodes, since the MDS allocates OST objects for new files.
1. As the root user on the MDS, use the lctl set_param command to disable object creation on the OST:
<pre>
[root@mds01 ~]# lctl set_param osp.testfs-OST0002-*.max_create_count=0
osp.testfs-OST0002-MDT0000.max_create_count=0
osp.testfs-OST0002-MDT0001.max_create_count=0
</pre>
If an OST is going to be removed permanently, use lctl set_param -P ... (run on the MGS) to set the parameter permanently; otherwise the max_create_count setting will be reset if the MDS is restarted or the MDT is unmounted.
The MDS connections to OST0002 will no longer create objects there. This process should be repeated for other MDS nodes and MDTs, if present. If a new file is now written to the file system, the write will succeed because the stripes are allocated across the remaining active OSTs.
2. Once the OST is no longer full (e.g. objects deleted or migrated off the full OST), it should be enabled again:
<pre>
[root@mds01 ~]# lctl set_param osp.testfs-OST0002-*.max_create_count=20000
osp.testfs-OST0002-MDT0000.max_create_count=20000
osp.testfs-OST0002-MDT0001.max_create_count=20000
</pre>
Note: for releases 2.10.6 and earlier, the create_count must also be set to a non-zero value after max_create_count is restored:
<pre>
[root@mds01 ~]# lctl set_param osp.testfs-OST0002-*.create_count=128
osp.testfs-OST0002-MDT0000.create_count=128
osp.testfs-OST0002-MDT0001.create_count=128
</pre>
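The parameter names in the steps above embed the OST name, which uses the OST index written as four hexadecimal digits (index 10 is OST000a). The following is a minimal sketch that builds the command text from a file system name and a decimal OST index. The function names ost_name and create_count_cmd are hypothetical, and the helper only prints the lctl command rather than running it.

```shell
#!/bin/sh
# ost_name FSNAME INDEX
# Prints the OST name for a decimal index, e.g. "testfs 2" -> testfs-OST0002.
ost_name() {
    printf '%s-OST%04x\n' "$1" "$2"
}

# create_count_cmd FSNAME INDEX COUNT
# Prints (does not execute) the lctl command that sets max_create_count for
# that OST on every MDT of the local MDS. COUNT=0 disables object creation;
# 20000 is the value used above to enable it again.
create_count_cmd() {
    printf 'lctl set_param osp.%s-*.max_create_count=%s\n' \
        "$(ost_name "$1" "$2")" "$3"
}
```

For example, `create_count_cmd testfs 2 0` prints the disable command from step 1, which can then be reviewed and run as root on each MDS.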
Migrating Data within a File System
Data from existing files can be migrated to other OSTs using the lfs_migrate command. This can be done either while object creation on the full OST is disabled, as described above, or while the OST is still active (in which case the full OST will have a reduced, but not zero, chance of being used for new files).
1. Identify the file(s) to be moved. In the example below, lfs find lists files larger than 1T that have objects on OST index 2, and the output from the getstripe command indicates that the file test_2 is located entirely on OST2:
<pre>
[root@client01 ~]# lfs find /mnt/testfs -size +1T --ost 2
/mnt/testfs/test_2
[root@client01 ~]# lfs getstripe /mnt/testfs/test_2
/mnt/testfs/test_2
lmm_stripe_count:   1
lmm_stripe_size:    4194304
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  2
        obdidx           objid           objid           group
             2         1424032        0x15baa0               0
</pre>
2. If the file is very large, it should also be striped across multiple OSTs. Use lfs_migrate to move the file(s) to new OSTs:
[root@client01 ~]# lfs_migrate -c 4 /mnt/testfs/test_2
3. Check the file system balance. The df output in the example below shows a more balanced system than the df output in Checking File System Usage above.
<pre>
[root@client01 ~]# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
testfs-MDT0000_UUID         4.4G      214.5M        3.9G   4% /mnt/testfs[MDT:0]
testfs-MDT0001_UUID         4.4G      144.5M        4.0G   4% /mnt/testfs[MDT:1]
testfs-OST0000_UUID         2.0T        1.3T      598.1G  65% /mnt/testfs[OST:0]
testfs-OST0001_UUID         2.0T        1.3T      594.1G  65% /mnt/testfs[OST:1]
testfs-OST0002_UUID         2.0T      913.4G     1000.0G  45% /mnt/testfs[OST:2]
testfs-OST0003_UUID         2.0T        1.3T      602.1G  65% /mnt/testfs[OST:3]
testfs-OST0004_UUID         2.0T        1.3T      606.1G  64% /mnt/testfs[OST:4]
testfs-OST0005_UUID         2.0T        1.3T      610.1G  64% /mnt/testfs[OST:5]

filesystem summary:        11.8T        7.3T        3.9T  61% /mnt/testfs
</pre>
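When many files need to move, the identify and migrate steps can be combined by piping lfs find into lfs_migrate. The following is a sketch under stated assumptions: the function name migrate_cmd is hypothetical, the helper only prints the pipeline so that it can be reviewed first, and the -y option (skip the confirmation prompt) should be checked against the lfs_migrate version installed on the client.

```shell
#!/bin/sh
# migrate_cmd MOUNT OST_INDEX MIN_SIZE STRIPE_COUNT
# Prints (does not execute) a pipeline that finds regular files larger than
# MIN_SIZE with objects on the given OST and passes them to lfs_migrate to be
# restriped over STRIPE_COUNT OSTs. Hypothetical helper.
migrate_cmd() {
    printf 'lfs find %s -type f -size +%s --ost %s | lfs_migrate -y -c %s\n' \
        "$1" "$3" "$2" "$4"
}
```

For example, `migrate_cmd /mnt/testfs 2 1T 4` prints a pipeline equivalent to steps 1 and 2 above for every matching file, not only test_2.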