OST Pool Quotas Test Report

Regression testing
All issues found during work on "LU-11023 quota: quota pools for OSTs" were fixed before landing.

Below are links to the test results from the latest patchset before landing (https://review.whamcloud.com/#/c/35615/51):
 * Passed enforced test review-ldiskfs on CentOS 7.0/x86_64 uploaded by Trevis Autotest2 from trevis-47vm1: https://testing.whamcloud.com/test_sessions/bcfa089d-338e-4d33-9a1d-c73d053f072a ran 5 tests.
 * Passed enforced test review-zfs on CentOS 7.0/x86_64 uploaded by Trevis Autotest from trevis-9vm1: https://testing.whamcloud.com/test_sessions/4d96f1b7-c651-4a50-b4b7-a8d51cb7ffcb ran 8 tests.
 * Passed enforced test review-dne-part-1 on CentOS 7.0/x86_64 uploaded by Trevis Autotest from trevis-10vm6: https://testing.whamcloud.com/test_sessions/4794d3c7-f2ba-4eb6-91e5-d1d8dd1d1d0b ran 6 tests.
 * Passed enforced test review-dne-part-2 on CentOS 7.0/x86_64 uploaded by Trevis Autotest from trevis-5vm5: https://testing.whamcloud.com/test_sessions/f91519d8-1cd0-403a-9972-43f30ac39629 ran 11 tests.
 * Passed enforced test review-dne-selinux on CentOS 7.0/x86_64 uploaded by Trevis Autotest2 from trevis-38vm1: https://testing.whamcloud.com/test_sessions/70e2ccc7-5c9b-45bf-b02e-946b26a67832 ran 5 tests.
 * Passed enforced test review-dne-zfs-part-2 on CentOS 7.0/x86_64 uploaded by Trevis Autotest from trevis-19vm1: https://testing.whamcloud.com/test_sessions/2890691e-2230-44cc-bfbe-c4b8eef434b0 ran 11 tests.
 * Passed enforced test review-dne-zfs-part-3 on CentOS 7.0/x86_64 uploaded by Trevis Autotest2 from trevis-40vm1: https://testing.whamcloud.com/test_sessions/9f1fd8a4-e676-45a3-b9cd-3dd020c56e3e ran 3 tests.
 * Passed enforced test review-dne-part-3 on CentOS 7.0/x86_64 uploaded by Trevis Autotest from trevis-3vm1: https://testing.whamcloud.com/test_sessions/4b48f1bf-b7db-406f-96be-6509b33e17b5 ran 3 tests.
 * Passed enforced test review-dne-part-4 on CentOS 7.0/x86_64 uploaded by Onyx Autotest from onyx-61vm6: https://testing.whamcloud.com/test_sessions/19adeb6b-7658-4c08-9da8-b7474a328dfc ran 10 tests.
 * Passed enforced test review-dne-part-4 on CentOS 7.0/x86_64 uploaded by Onyx Autotest from onyx-49vm1: https://testing.whamcloud.com/test_sessions/885887f7-e0e0-486c-a830-8993e5f284f5 ran 10 tests.
 * Passed enforced test review-dne-zfs-part-1 on CentOS 7.0/x86_64 uploaded by Trevis Autotest from trevis-6vm6: https://testing.whamcloud.com/test_sessions/75bec5ab-37ed-4f46-b70b-b4204c539a76 ran 6 tests.
 * Passed enforced test review-ldiskfs-arm on CentOS 7.0/x86_64, CentOS 8.0/aarch64 uploaded by Onyx Autotest from onyx-90vm27: https://testing.whamcloud.com/test_sessions/de70c50f-fe3e-44cb-8961-e205ee6a3d1c ran 5 tests.
 * Passed enforced test review-dne-zfs-part-4 on CentOS 7.0/x86_64 uploaded by Trevis Autotest from trevis-5vm5: https://testing.whamcloud.com/test_sessions/32aa0561-a287-437f-8ac3-577f351d571e ran 10 tests.

There are now no known issues related to OST Pool Quotas.

New feature testing
To test the new feature, the following tests were added to sanity-quota.sh: 1b,1c,1d,1e,1f,1g,3b,3c,67,68,69,70,71a,71b,72. See the test descriptions in the OST Pool Quota Test Plan.

71a and 71b sometimes failed, resulting in https://jira.whamcloud.com/browse/LU-13677.

This ticket is now closed after the landing of "LU-13677 quota: qunit sorting doesn't work".

The last test to land was sanity-quota_1g. See the results of https://review.whamcloud.com/#/c/39469/7:

 * review-zfs on CentOS 7.0/x86_64: https://testing.whamcloud.com/test_sets/8909687c-6505-4df8-ac3b-ee5060698872
 * review-dne-part-4 on CentOS 7.0/x86_64: https://testing.whamcloud.com/test_sets/c9b86b3a-619e-4eff-9266-2c999ec4552c
 * review-dne-zfs-part-4 on CentOS 7.0/x86_64: https://testing.whamcloud.com/test_sets/5798be5c-701e-40a7-a62a-b8f97621d381

Cluster used for performance, stress and failover testing
For these types of testing, an internal HPE cluster with the following configuration was used:

Hostname   Role       Power State  Service State  Targets  HA Partner  HA Resources
cslmo1700  MGMT       On           N/a            0 / 0    cslmo1701   None
cslmo1701  (MGMT)     On           N/a            0 / 0    cslmo1700   None
cslmo1702  MGS,(MDS)  On           Started        1 / 1    cslmo1703   Local
cslmo1703  MDS,(MGS)  On           Started        1 / 1    cslmo1702   Local
cslmo1704  OSS        On           Started        1 / 1    cslmo1705   Local
cslmo1705  OSS        On           Started        1 / 1    cslmo1704   Local
cslmo1706  OSS        On           Started        1 / 1    cslmo1707   Local
cslmo1707  OSS        On           Started        1 / 1    cslmo1706   Local

Lustre version on the server side: lustre-2.13.56_3.10.0_957.1.3957.1.3.x4.4.35.x86_64.

All server nodes had about 62GB of total memory.

Disk sizes:

cslmo1702: /dev/md65   806G  5.2M  798G   1% /data/cslmo1703:md65
cslmo1703: /dev/md66   2.9T  117M  2.8T   1% /data/cslmo1703:md66
cslmo1706: /dev/md0    112T   11G  111T   1% /data/cslmo1706:md0
cslmo1707: /dev/md1    112T   11G  111T   1% /data/cslmo1706:md1
cslmo1704: /dev/md0    112T   17G  111T   1% /data/cslmo1704:md0
cslmo1705: /dev/md1    112T   16G  111T   1% /data/cslmo1704:md1

Load was generated from 5 clients running lustre-client-2.12.4.1_cray_180_gee19431_3.10.0_957.5.1.el7.x86_64.

Each client had about 62GB of total memory.

max_dirty_mb was set to 10MB on each client:

lctl set_param osc.*.max_dirty_mb=10

10MB was chosen to avoid storing too much data in the client cache. For example, if a user generates IO load in a system with 4 OSTs from 5 clients with max_dirty_mb set to 100, the clients may cache up to 2GB (100MB per OST on each of the 5 clients). As cached data may be written over quota, max_dirty_mb shouldn't be high, to avoid unexpected EDQUOT. 1MB was rejected so that caches wouldn't be flushed too often, so 10MB was finally picked as a middle value.
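The sizing argument above can be checked with simple shell arithmetic (a sketch; the OST and client counts are the example values from this report):

```shell
# max_dirty_mb is a per-OSC limit, i.e. per OST on each client, so the
# worst-case amount of dirty data cached cluster-wide is
# max_dirty_mb * number_of_OSTs * number_of_clients.
max_dirty_mb=100
n_osts=4
n_clients=5
cached_mb=$((max_dirty_mb * n_osts * n_clients))
echo "${cached_mb} MB"                    # prints: 2000 MB (~2GB)

# With the chosen value of 10MB the bound drops to 200MB total.
echo "$((10 * n_osts * n_clients)) MB"    # prints: 200 MB
```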

Performance testing
d0452cf, the patch just before "LU-11023 quota: quota pools for OSTs" (09f9fb32), was used as a baseline.

It was compared against 09f9fb32 + 8704d14c ("LU-13677 quota: qunit sorting doesn't work"). 8704d14c was applied directly on top of 09f9fb32, as this patch solves a problem introduced in the main pool quotas patch.

1 user from 5 clients generated load with mdtest and IOR. Pool configuration:

global0: OST0000, OST0001
global1: OST0002, OST0003
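The pool layout above can be reproduced with standard lctl pool commands (a configuration sketch, not taken from the report; the filesystem name cslmo17 and the OST indices match the cluster described earlier):

```shell
# Run on the MGS: create both pools and add two OSTs to each.
lctl pool_new cslmo17.global0
lctl pool_add cslmo17.global0 cslmo17-OST[0000-0001]
lctl pool_new cslmo17.global1
lctl pool_add cslmo17.global1 cslmo17-OST[0002-0003]

# List pools and their members to verify.
lctl pool_list cslmo17
lctl pool_list cslmo17.global0
```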

max_dirty_mb was set to 10MB on each client:

lctl set_param osc.*.max_dirty_mb=10

Quota limits:
 * Each user: block hard limit 1T to enforce quota.
 * Each user: block hard limit 200G per pool.

Stress testing
5 users from 5 clients generated load with mdtest and IOR. Pool configuration:

global0: OST0000, OST0001
global1: OST0002, OST0003

max_dirty_mb was set to 10MB on each client:

lctl set_param osc.*.max_dirty_mb=10

Quota limits:
 * Each user: block hard limit 1T to enforce quota.
 * Each user: block hard limit 1G per pool.

Typical quota output for one of the users:

Disk quotas for usr quota15_1 (uid 61501):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
   /mnt/cslmo17 1527304       0 1073741824     -       0       0       0       -
cslmo17-MDT0000_UUID      0       -       0       -       0       -       0       -
cslmo17-OST0000_UUID 773000*      -  773000       -       -       -       -       -
cslmo17-OST0001_UUID 754304*      -  754304       -       -       -       -       -
cslmo17-OST0002_UUID      0       -       0       -       -       -       -       -
cslmo17-OST0003_UUID      0       -       0       -       -       -       -       -
Total allocated inode limit: 0, total allocated block limit: 1527304
uid 61501 is using default file quota setting

Disk quotas for usr quota15_1 (uid 61501):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
   /mnt/cslmo17 1527304*      0 1048576       -       0       0       0       -
Pool: cslmo17.global0
cslmo17-OST0000_UUID 773000*      -  773000       -       -       -       -       -
cslmo17-OST0001_UUID 754304*      -  754304       -       -       -       -       -
Total allocated inode limit: 0, total allocated block limit: 1527304

Disk quotas for usr quota15_1 (uid 61501):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
   /mnt/cslmo17      0       0 1048576       -       0       0       0       -
Pool: cslmo17.global1
cslmo17-OST0002_UUID      0       -       0       -       -       -       -       -
cslmo17-OST0003_UUID      0       -       0       -       -       -       -       -
Total allocated inode limit: 0, total allocated block limit: 0
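Limits like those above can be set with lfs setquota; pool quotas take the --pool option (a configuration sketch, assuming the user and mount point shown in the quota output above):

```shell
# Global block hard limit of 1T so quota enforcement is active for the user.
lfs setquota -u quota15_1 -B 1T /mnt/cslmo17

# Per-pool block hard limits of 1G.
lfs setquota -u quota15_1 --pool global0 -B 1G /mnt/cslmo17
lfs setquota -u quota15_1 --pool global1 -B 1G /mnt/cslmo17

# Show usage and limits for one of the pools.
lfs quota -u quota15_1 --pool global0 /mnt/cslmo17
```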

When IOR generates load from the odd-numbered clients (1, 3, 5), it writes 7.8GB, which is over the pool limit (1G).

When IOR generates load from the even-numbered clients (2, 4), it writes 400MB.

Pass criteria:
 * clients 1,3,5 should fail with EDQUOT on each IOR iteration
 * no write failures on clients 2,4
 * no kernel panics
 * used space for each user should be correct at the end of testing, i.e. 0 when all files are removed

Results: passed.

Failover testing
The same quota and pool configuration as in stress testing was used, with the addition of periodic failover/failback of the servers. Depending on the test, the "victim node" was MDS only, MDS and OSS, or 2 OSS nodes.

LU-14033 was encountered during failover testing. It is an FS corruption caused by an IO failure at the ldiskfs layer. Currently there is no evidence that Pool Quotas are connected with the issue. Furthermore, LU-14033 was also hit in a test where IOR and mdtest generated load from a user without quota limits (ha-ost-ior-mdtest).

ha-mds-ior-QP
-v cslmo1703 (MDS) IORP="$(echo '"-a POSIX -b 1G -t 1M -v -C -w -r -W -i 1 -T 15 -k" "-a POSIX -b 50M -t 1M -v -C -w -r -W -i 25 -T 15 -k"')" ha_mpi_loads="ior" POWER_DOWN="sysrqcrash"

ha-mdsostConcurrent-ior-mdtest-QP
-v cslmo1703 (MDS),cslmo1706(OSS) ha_mpi_loads="ior mdtest" IORP="$(echo '"-a POSIX -b 3G -t 1M -v -C -w -r -W -i 1 -T 15 -k" "-a POSIX -b 1G -t 1M -v -C -w -r -W -i 1 -T 15 -k"')" MDTESTP="$(echo '"-r -u -L -V 2 -i 10 -I 50 -z 1 -b 10" "-T -F -r -C -u -L -V 2 -i 10 -I 50 -z 1 -b 10"')" POWER_DOWN="sysrqcrash"

Failed with LU-14033.

ha-ost-ior-QP
-v cslmo1704(OSS),cslmo1706(OSS) ha_mpi_loads="ior" IORP="$(echo '"-a POSIX -b 1G -t 1M -v -C -w -r -W -i 1 -T 15 -k" "-a POSIX -b 50M -t 1M -v -C -w -r -W -i 20 -T 15 -k"')" POWER_DOWN="sysrqcrash"

Results: passed.

ha-ost-ior-mdtest
Note that this test was started from a user without any quota limits.

-v cslmo1704(OSS),cslmo1706(OSS) IORP="$(echo '"-f /storage/shared/cslmo17/ssf" "-f /storage/shared/cslmo17/fpp"')" MDTESTP="$(echo '"-T -F -r -C -u -L -V 1 -i 500 -I 50 -z 1 -b 10"')" ha_mpi_loads="ior mdtest" POWER_DOWN="sysrqcrash"

Failed with LU-14033.

ha-ost-ior-mdtest-QP-no-FAIL
-v cslmo1704(OSS),cslmo1706(OSS) ha_mpi_loads="ior mdtest" IORP="$(echo '"-f /storage/shared/cslmo17/ssf" "-f /storage/shared/cslmo17/fpp"')" MDTESTP="$(echo '"-T -F -r -C -u -L -V 1 -i 500 -I 50 -z 1 -b 10"')" POWER_DOWN="sysrqcrash"

Failed with LU-14033.

Interoperability testing
To check interoperability, https://review.whamcloud.com/#/c/40175/ was submitted with the following test parameters:

Test-Parameters: clientversion=2.12.3 testlist=sanity-quota clientdistro=el7.6
Test-Parameters: serverversion=2.12.3 testlist=sanity-quota serverdistro=el7.6
Test-Parameters: clientversion=2.10.8 testlist=sanity-quota clientdistro=el7.6
Test-Parameters: serverversion=2.10.8 testlist=sanity-quota serverdistro=el7.6

If the server doesn't support Pool Quotas, the newly added sanity-quota tests (1b,1c,1d,1e,1f,1g,3b,3c,67,68,69,71a,71b,72) are skipped. Only sanity-quota 70 runs even when the server doesn't support Pool Quotas: it checks that an "old" server works correctly with a client that supports Pool Quotas.

Below are links to the results:
 * server 2.13.56.7, client 2.10.8 - https://testing.whamcloud.com/test_sets/016a537c-cf20-4f57-a7b9-e13c7624438f
 * server 2.13.56.7, client 2.12.3 - https://testing.whamcloud.com/test_sets/32dc544b-df98-4df0-9669-b87130166af4
 * server 2.12.3, client 2.13.56.7 - https://testing.whamcloud.com/test_sets/15326174-13f4-4970-873c-dc830e1ea1d1
 * server 2.10.8, client 2.13.56.7 - https://testing.whamcloud.com/test_sets/0b1c0133-2040-41f4-bc35-68a035aa60bd

Similar testing was also performed while preparing the patch for landing; see patch set 39 (https://review.whamcloud.com/#/c/39469/39).

No new issues were found during testing.

OST Pool Quotas with PFL
sanity-quota_71a was added to check Pool Quotas with PFL.
 * Passed custom-101 on CentOS 7.0/x86_64, sanity-quota_71a: https://testing.whamcloud.com/sub_tests/6418f188-5602-4036-87ae-5f7dda454c1d
 * Passed review-dne-part-4 on RHEL 7.8/x86_64, sanity-quota_71a: https://testing.whamcloud.com/sub_tests/b92aa52f-d3d5-4ced-b15a-820ef5da8b16

OST Pool Quotas with SEL
sanity-quota_71b was added to check Pool Quotas with SEL.
 * Passed custom-101 on CentOS 7.0/x86_64, sanity-quota_71b: https://testing.whamcloud.com/sub_tests/d8667a2d-2ca4-4e90-8844-916a323aaebf
 * Passed review-dne-part-4 on RHEL 7.8/x86_64, sanity-quota_71b: https://testing.whamcloud.com/sub_tests/0e822ed2-1308-4d02-9b97-9bd549878c82

OST Pool Quotas with DOM
sanity-quota_69 was added to check Pool Quotas with DOM.
 * Passed enforced test review-zfs on CentOS 7.0/x86_64, sanity-quota_69: https://testing.whamcloud.com/sub_tests/7283ef98-fc5d-4cbc-a2c2-890ae742b4d5
 * Passed enforced test review-dne-part-4 on CentOS 7.0/x86_64, sanity-quota_69: https://testing.whamcloud.com/sub_tests/81fed51d-f50b-4bc5-ad02-80bc5cff9570
 * Passed enforced test review-dne-part-4 on CentOS 7.0/x86_64, sanity-quota_69: https://testing.whamcloud.com/sub_tests/ba00ae21-17af-4cdb-94c0-1a078b9f5aa2
 * Passed enforced test review-dne-zfs-part-4 on CentOS 7.0/x86_64, sanity-quota_69: https://testing.whamcloud.com/sub_tests/59439aec-fa4b-4cd4-a658-3f77136f7362

OST Pool Quotas with DNE
The goal of the DNE feature is distributing metadata between MDTs. As Pool Quotas currently work only for OSTs and can't control metadata, DNE test cases are not needed. From the OST Pool Quotas point of view it doesn't matter where metadata is stored: only quota acquiring requests from OSTs are taken into account.