Configuring Lustre File Striping

From Lustre Wiki

One of the main factors leading to the high performance of Lustre™ file systems is the ability to stripe data over multiple OSTs. The stripe count can be set on a file system, directory, or file level. An example showing the use of striping is provided below.

The Lustre Striping Guide provides a good overview of how Lustre file striping works.

For more detailed information, see Chapter 19: Managing File Striping and Free Space in the Lustre Operations Manual.

Setting Up Striping

To see the layout of a particular file, or the default layout to be used for new files in a particular directory, use the command lfs getstripe {file|directory|root}. If run against the filesystem root directory, it will show the default layout for all files created in the filesystem that do not otherwise specify a layout at creation time or inherit it from a layout on the parent directory. For example, running the command on the filesystem root directory shows the following (the -d option limits the output to the specified directory):

root@client# lfs getstripe -d /mnt/testfs
stripe_count:  2 stripe_size:   4194304 pattern:       0 stripe_offset: -1

In this example, the default stripe_count is 2 (that is, data blocks are striped alternately over two OSTs), the default stripe_size is 4 MB (that is, each OST reads or writes 4MB of data for the first file stripe before going to the second file stripe on the next OST), and files do not start on a specific OST index (that is, the MDS will balance new file creations across all OSTs in the filesystem for maximum performance).

Note that the stripe_size does not affect the allocation size of the file on disk (which is controlled by the underlying OST filesystem, typically 4KB for ldiskfs), but only the distribution of file data across OSTs.
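The round-robin distribution described above can be sketched in a few lines of Python. This is only an illustration of the usual RAID-0 style mapping implied by the raid0 pattern, not code taken from Lustre; the function name locate and its return convention are made up for this example. The stripe index it returns is the position within the file's layout, not the OST index.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a file byte offset to (stripe index, offset inside that stripe's object)."""
    chunk = offset // stripe_size        # which stripe_size-sized chunk of the file
    stripe_index = chunk % stripe_count  # chunks rotate round-robin over the stripes
    row = chunk // stripe_count          # full rounds completed before this chunk
    object_offset = row * stripe_size + offset % stripe_size
    return stripe_index, object_offset


MiB = 1 << 20
STRIPE_SIZE = 4 * MiB   # matches stripe_size: 4194304 in the example above
STRIPE_COUNT = 2        # matches stripe_count: 2 in the example above

for off in (0, 4 * MiB, 8 * MiB, 9 * MiB):
    print(off // MiB, "MiB ->", locate(off, STRIPE_SIZE, STRIPE_COUNT))
```

With these settings, the first 4 MiB of the file land on the first stripe, the next 4 MiB on the second stripe, and the data at 8 MiB wraps back to the first stripe at object offset 4 MiB.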

The command to set the above default layout on the file system looked like this:

root@client# lfs setstripe -S 4M -c 2 /mnt/testfs

If a new 2000MB file is created with these filesystem default settings, the following results are seen:

root@client# dd if=/dev/zero of=/mnt/testfs/test1 bs=20M count=100
root@client# lfs df -h
UUID                  bytes     Used  Available   Use%   Mounted on
testfs-MDT0000_UUID    4.4G   214.5M       3.9G     4%   /mnt/testfs[MDT:0]
testfs-OST0000_UUID    2.0G     1.1G     830.1M    53%   /mnt/testfs[OST:0]
testfs-OST0001_UUID    2.0G     1.1G     830.1M    53%   /mnt/testfs[OST:1]
testfs-OST0002_UUID    2.0G    83.3M       1.8G     4%   /mnt/testfs[OST:2]
testfs-OST0003_UUID    2.0G    83.3M       1.8G     4%   /mnt/testfs[OST:3]
testfs-OST0004_UUID    2.0G    83.3M       1.8G     4%   /mnt/testfs[OST:4]
testfs-OST0005_UUID    2.0G    83.3M       1.8G     4%   /mnt/testfs[OST:5]

filesystem_summary:   11.8G     2.5G       8.8G    20%   /mnt/testfs

In this example, the entire file was written to the first two OSTs (1000MB per OST) and the other four OSTs were not used. Using only two OSTs is the expected result of the requested stripe_count of 2; other files will be created on the remaining OSTs, which balances space and bandwidth usage. Note that the layout (stripe_count, stripe_size, OSTs) of a file is fixed when the file is first created (opened). To change the layout of a file after it is created, the lfs migrate command (which takes the same parameters as lfs setstripe) is needed to move the file data to different OSTs.

Continuing with this example, a new 1000MB file can be created with an explicit stripe_count of -1, which stripes it over all available OSTs instead of using the filesystem default:

root@client# lfs setstripe -c -1 /mnt/testfs/test2

Now, when this file is written, the new stripe setting distributes the file data evenly over the available OSTs, about 167MB (1000MB / 6) on each. A widely-striped file is a good choice if the file is very large or many clients will access it concurrently, but typically it is best to stripe individual files over only 1 or 2 OSTs for minimal overhead, and to let multiple processes creating separate files provide the parallelism across different OSTs.

root@client# dd if=/dev/zero of=/mnt/testfs/test2 bs=10M count=100
root@client# lfs df -h
UUID                  bytes     Used  Available   Use%   Mounted on
testfs-MDT0000_UUID    4.4G   214.5M       3.9G     4%  /mnt/testfs[MDT:0]
testfs-OST0000_UUID    2.0G     1.3G     670.2M    61%  /mnt/testfs[OST:0]
testfs-OST0001_UUID    2.0G     1.3G     670.2M    61%  /mnt/testfs[OST:1]
testfs-OST0002_UUID    2.0G   251.3M       1.6G    12%  /mnt/testfs[OST:2]
testfs-OST0003_UUID    2.0G   251.3M       1.6G    12%  /mnt/testfs[OST:3]
testfs-OST0004_UUID    2.0G   247.3M       1.6G    12%  /mnt/testfs[OST:4]
testfs-OST0005_UUID    2.0G   247.3M       1.6G    12%  /mnt/testfs[OST:5]

filesystem_summary:   11.8G     3.5G       7.7G    12%  /mnt/testfs
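The per-OST usage in the two lfs df listings can be estimated with a small calculation. The following is a rough sketch that assumes plain round-robin striping and ignores any filesystem overhead; the helper name bytes_per_stripe is hypothetical. The result is listed in layout order (first stripe first), which is not necessarily OST index order.

```python
def bytes_per_stripe(file_size, stripe_size, stripe_count):
    """Estimate how many bytes of a file land on each stripe, in layout order."""
    full_chunks, tail = divmod(file_size, stripe_size)
    rounds, extra = divmod(full_chunks, stripe_count)
    usage = [rounds * stripe_size] * stripe_count
    for i in range(extra):       # leftover full chunks go to the first stripes
        usage[i] += stripe_size
    if tail:                     # a final partial chunk goes to the next stripe
        usage[extra] += tail
    return usage


MiB = 1 << 20

# test1: 2000 MiB written with the default layout (2 stripes of 4 MiB)
print([u // MiB for u in bytes_per_stripe(2000 * MiB, 4 * MiB, 2)])

# test2: 1000 MiB written with stripe_count -1 on a filesystem with 6 OSTs
print([u // MiB for u in bytes_per_stripe(1000 * MiB, 4 * MiB, 6)])
```

For test1 this gives 1000 MiB on each of the two stripes. For test2 it gives between 164 and 168 MiB per stripe, which is in line with the growth of the lightly used OSTs from 83.3M to roughly 247M to 251M in the second listing.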

Displaying Layout Information for a File

The lfs getstripe command shows which specific OSTs a file is distributed over. For example, the output of the following command (the -y option prints the output in YAML format) indicates that the file test2 is striped over all six active OSTs in the filesystem: the lmm_stripe_count: line reports 6, and six separate l_fid: objects are allocated for the file. The objects start at OST0002 because the test1 file had just allocated objects on OST0000 and OST0001:

root@client# lfs getstripe -y /mnt/testfs/test2
lmm_stripe_count:  6
lmm_stripe_size:   4194304
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 2
lmm_objects:
      - l_ost_idx: 2
        l_fid:     0x100020000:0x2:0x0
      - l_ost_idx: 3
        l_fid:     0x100030000:0x2:0x0
      - l_ost_idx: 4
        l_fid:     0x100040000:0x2:0x0
      - l_ost_idx: 5
        l_fid:     0x100050000:0x2:0x0
      - l_ost_idx: 0
        l_fid:     0x100000000:0x3:0x0
      - l_ost_idx: 1
        l_fid:     0x100010000:0x3:0x0

In contrast, the output of the following command shows an lmm_stripe_count: of 2 and lists only two l_fid: entries, which indicates that the file test1 is stored on two OSTs, namely OST0000 and OST0001:

root@client# lfs getstripe -y /mnt/testfs/test1
lmm_stripe_count:  2
lmm_stripe_size:   4194304
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 0
lmm_objects:
      - l_ost_idx: 0
        l_fid:     0x100000000:0x2:0x0
      - l_ost_idx: 1
        l_fid:     0x100010000:0x2:0x0

See the lfs-getstripe(1) and lfs-setstripe(1) man pages for full details of what options are available.
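When the layout needs to be checked from a script, the YAML-style output shown above is easy to pick apart. The sketch below works on an abridged copy of the test1 output pasted in as a string, so it runs without a Lustre client; the function name summarize_layout is hypothetical, and a real tool might prefer a proper YAML parser if the output of the installed lfs version is confirmed to be valid YAML.

```python
import re

# Abridged copy of the "lfs getstripe -y /mnt/testfs/test1" output shown above.
SAMPLE = """\
lmm_stripe_count:  2
lmm_stripe_size:   4194304
lmm_pattern:       raid0
lmm_objects:
      - l_ost_idx: 0
        l_fid:     0x100000000:0x2:0x0
      - l_ost_idx: 1
        l_fid:     0x100010000:0x2:0x0
"""


def summarize_layout(text):
    """Extract stripe count, stripe size, and OST indices from getstripe -y text."""
    # Top-level "lmm_name: value" pairs; "lmm_objects:" has no value and is skipped.
    fields = dict(re.findall(r"^(lmm_\w+):[ \t]+(\S+)[ \t]*$", text, flags=re.M))
    ost_indices = [int(i) for i in re.findall(r"l_ost_idx:\s*(\d+)", text)]
    return {
        "stripe_count": int(fields["lmm_stripe_count"]),
        "stripe_size": int(fields["lmm_stripe_size"]),
        "ost_indices": ost_indices,
    }


print(summarize_layout(SAMPLE))
```

Comparing stripe_count with the length of ost_indices is a quick way to confirm that every stripe has an object allocated.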

Setting Up Progressive File Layouts

With Lustre 2.10 and later, it is possible to configure Progressive File Layouts (PFL) on a file, which avoids much of the need to explicitly specify layouts for files of different sizes. A PFL file can have different layout parameters for different regions of a single file, and as the file size increases it activates the later parts of the file layout. This can allow lower overhead for small files that only need a single stripe, increased bandwidth for larger files, and wide distribution of space usage for a very large file.

To create a PFL file layout, the lfs setstripe -E <size> option is used to specify the layout for each extent of the file up to the specified size, and the parameters following -E can be set arbitrarily for each extent of the file. Typically, small files should have a lower stripe_count (for low overhead) and as the file size increases the stripe_count should also increase (to distribute space usage and increase bandwidth):

root@client# lfs setstripe -E 256M -c 1 -E 4G -c 4 -E -1 -c -1 -S 4M /mnt/testfs/test3
root@client# lfs getstripe /mnt/testfs/test3
/mnt/testfs/test3
  lcm_layout_gen:    3
  lcm_mirror_count:  1
  lcm_entry_count:   3
    lcme_id:             1
    lcme_mirror_id:      0
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   268435456
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x4:0x0] }

    lcme_id:             2
    lcme_mirror_id:      0
    lcme_flags:          0
    lcme_extent.e_start: 268435456
    lcme_extent.e_end:   4294967296
      lmm_stripe_count:  4
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1

    lcme_id:             3
    lcme_mirror_id:      0
    lcme_flags:          0
    lcme_extent.e_start: 4294967296
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  -1
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1

This example shows a file with 3 components: the first component has a single stripe and covers the file up to 256MB, the second component has 4 stripes and covers the file up to 4GB, and the last component extends to the end of the file (-E -1) and stripes over all OSTs (-c -1) with a stripe size of 4MB. The first component of a file is always initialized (has objects allocated), while the later components only have objects allocated once the file grows into them. For example, writing 300MB crosses the 256MB boundary and initializes the second component:

root@client# dd if=/dev/zero of=/mnt/testfs/test3 bs=10M count=30
root@client# lfs getstripe /mnt/testfs/test3
/mnt/testfs/test3
  lcm_layout_gen:    4
  lcm_mirror_count:  1
  lcm_entry_count:   3
    lcme_id:             1
    lcme_mirror_id:      0
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   268435456
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x4:0x0] }

    lcme_id:             2
    lcme_mirror_id:      0
    lcme_flags:          init
    lcme_extent.e_start: 268435456
    lcme_extent.e_end:   4294967296
      lmm_stripe_count:  4
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 2
      lmm_objects:
      - 0: { l_ost_idx: 2, l_fid: [0x100020000:0x4:0x0] }
      - 1: { l_ost_idx: 3, l_fid: [0x100030000:0x4:0x0] }
      - 2: { l_ost_idx: 4, l_fid: [0x100040000:0x4:0x0] }
      - 3: { l_ost_idx: 5, l_fid: [0x100050000:0x4:0x0] }

    lcme_id:             3
    lcme_mirror_id:      0
    lcme_flags:          0
    lcme_extent.e_start: 4294967296
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  -1
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1
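The relationship between file size and initialized components can be modeled with a short Python sketch. The extents are copied from the getstripe output above; the helper names and the assumption that a component is initialized as soon as a sequential write from offset 0 reaches its extent are simplifications made for this illustration.

```python
MiB = 1 << 20
GiB = 1 << 30
EOF = float("inf")

# (extent start, extent end, stripe_count) for the three components of test3
COMPONENTS = [
    (0, 256 * MiB, 1),
    (256 * MiB, 4 * GiB, 4),
    (4 * GiB, EOF, -1),
]


def component_for(offset):
    """Return the 1-based number of the component whose extent holds this offset."""
    for number, (start, end, _count) in enumerate(COMPONENTS, start=1):
        if start <= offset < end:
            return number
    raise ValueError("offset must be non-negative")


def components_initialized(bytes_written):
    """Components reached by a sequential write of bytes_written from offset 0."""
    # The first component is always initialized, even for an empty file.
    return [
        number
        for number, (start, _end, _count) in enumerate(COMPONENTS, start=1)
        if number == 1 or start < bytes_written
    ]


print(component_for(100 * MiB))             # offset inside the first 256 MiB
print(components_initialized(300 * MiB))    # the dd example: bs=10M count=30
```

For the 300 MiB write in the dd example, the model reports components 1 and 2, matching the two components that carry the init flag in the second getstripe output.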