Lreflink High Level Design

Introduction
This document describes the design of reflink for Lustre (Lreflink). Reflink is a feature that exists in Btrfs and XFS (but not in Ext4 or ZFS). When a file is copied with cp --reflink=always, the file system takes advantage of copy-on-write to make a quick copy by merely adding another reference to the shared data blocks, rather than reading all the data and writing it out again.

Introduction slides can be found here: https://www.eofs.eu/_media/events/devsummit19/lustre_reflink.pdf

Client side read/write to Lreflink file is as normal
When a client reads from or writes to a reflink file, it does so in the same way as for a normal file. However, the client needs to provide interfaces that show the Lreflink information of a file.

Lreflink creation for striped files
The Lustre client needs to support ioctl(FICLONERANGE/FICLONE) properly. The client needs to handle striping correctly and send the clone ioctl to the corresponding OST objects. If the file being cloned has multiple stripes, each of the file's objects needs to be cloned separately on its own OST.
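To illustrate what "handling the striping" means for FICLONERANGE, the following minimal sketch maps a byte range of a file to per-object byte ranges. It assumes a plain RAID0 layout with a fixed stripe size and stripe count; the function name split_range_by_stripe and its parameters are hypothetical and do not correspond to any existing Lustre interface.

```python
def split_range_by_stripe(offset, length, stripe_size, stripe_count):
    """Map a file byte range to per-object ranges under RAID0 striping.

    Returns {stripe_index: [(object_offset, length), ...]}.
    Assumption: chunk k of the file lives on object k % stripe_count.
    """
    pieces = {}
    end = offset + length
    while offset < end:
        chunk_no = offset // stripe_size        # chunk number within the file
        index = chunk_no % stripe_count         # object that holds this chunk
        in_chunk = offset % stripe_size         # position inside the chunk
        chunk_len = min(stripe_size - in_chunk, end - offset)
        # Each object stores its chunks back to back.
        obj_off = (chunk_no // stripe_count) * stripe_size + in_chunk
        pieces.setdefault(index, []).append((obj_off, chunk_len))
        offset += chunk_len
    return pieces
```

Under these assumptions, a clone request for one file range would turn into one clone request per entry in the returned dictionary, each sent to the OST that holds that object.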

Lreflink doesn't clone data across OSTs
When a data object is cloned, the newly created object should be located on the same OST. Lreflink won't clone the object to another OST. That means that if an OST is full, Lreflink won't be able to clone any object on it, even if other OSTs still have free space or inodes.

Lreflink won't change data structure of backend file system
Lreflink will be implemented at the OSD level and the levels above it. Little will be changed at the Ext4/ZFS level, and no disk-level modification of Ext4/ZFS is needed at all.

All internal information needed by reflink will be saved as extended attributes (EAs) of inodes/objects. Lreflink requires a few Ext4/ZFS APIs to read and write EAs efficiently.

Binary tree of reflinks
Creating a reflink (target reflink) of a Lustre file (source file) is essentially mapped to creating reflink(s) of the object(s) that belong to the source file.

When a reflink of an existing OST object (father object) is created, two new OST objects are created. One object (son object) is the object for the new file. The other one (brother object) is a new object for the old file. Both the son object and the brother object are empty and point to the father object. The target reflink uses the son object as its data object. By contrast, the source file needs to change its layout: its object on that OST is no longer the father object but is swapped to the brother object.

When a client tries to read data from the son object or the brother object, the data needs to be fetched from the father object because the object itself is empty. The father object could itself be a reflink of another object (grandpa object). In that case, only part (possibly none) of the data comes from the father object, and the rest comes from the grandpa object, then from the father of the grandpa object if any, and so on.

Thus, the reflink objects that derive from the same root object form a tree. Each leaf in the tree is always pointed to by exactly one Lustre file, and none of the leaves is an orphan, meaning a leaf always has its corresponding Lustre file. By contrast, a non-leaf node of the tree shouldn't be pointed to directly by any Lustre file. When reading data of a Lustre file, the OST starts from the corresponding leaf and walks through the reflink tree towards the root until it gets the needed data.

Whenever a reflink is created on a leaf, the leaf becomes a non-leaf node, and two new leaves are added as the children of that node.

Each leaf node (except a normal object that has no reflinks) always has a brother object, which is created when the reflink is created. Thus the reflink tree is a binary tree, and each non-leaf node always has two children.
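The father/son/brother scheme can be sketched in a few lines. This is only an in-memory illustration under stated assumptions: ReflinkObject, create_reflink and the layout dictionary (file name to leaf object on one OST) are hypothetical names, not part of the design.

```python
class ReflinkObject:
    """Hypothetical stand-in for one OST object in a reflink tree."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # father object, None for the root
        self.children = []        # empty for a leaf, two entries otherwise


def create_reflink(layout, source_file, target_file):
    """Clone the object of source_file; layout maps file name -> leaf object."""
    father = layout[source_file]
    brother = ReflinkObject(father.name + "/brother", parent=father)
    son = ReflinkObject(father.name + "/son", parent=father)
    father.children = [brother, son]   # the father becomes a non-leaf node
    layout[source_file] = brother      # layout swap for the source file
    layout[target_file] = son          # the new file uses the son object
    return son


def count_nodes(node):
    """Count all objects in the tree rooted at node."""
    return 1 + sum(count_nodes(child) for child in node.children)
```

After each call, every file in layout still points at a leaf, and no file points at a non-leaf node, which matches the invariants described above.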

Bitmaps of Lreflink
Each node in the reflink tree, whether leaf or non-leaf, maintains a bitmap that records which parts of the data are saved in that node. When a read happens on the node, the bitmap is checked to see whether the data is saved in the current object. If the data can't be read from the current object, the traversal of the reflink tree goes on: the bitmap of the parent node is checked, and so on.

The bitmap of an OST object is saved as an EA of the object. Each bit of the bitmap covers one range of a fixed block size, assumed here to be 1 MiB. For example, the bitmap 0b001001 indicates that the updated ranges [0, 1 MiB) and [3 MiB, 4 MiB) are saved in the current object.
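A minimal sketch of the bitmap lookup follows. It assumes that bit 0 (the least significant bit) covers the first 1 MiB block, which is consistent with the 0b001001 example, and that the root is never checked because its bitmap is all-one by design. Node, stored_ranges and find_holder are hypothetical names used only for illustration.

```python
BLOCK_SIZE = 1 << 20  # 1 MiB per bit, as assumed in the text


class Node:
    """Hypothetical stand-in for an OST object and its bitmap EA."""

    def __init__(self, bitmap, parent=None):
        self.bitmap = bitmap    # bit i set => block i is stored in this object
        self.parent = parent    # None for the root of the reflink tree


def stored_ranges(bitmap):
    """List the [start, end) byte ranges whose bits are set."""
    ranges = []
    block = 0
    while bitmap >> block:
        if (bitmap >> block) & 1:
            ranges.append((block * BLOCK_SIZE, (block + 1) * BLOCK_SIZE))
        block += 1
    return ranges


def find_holder(leaf, offset):
    """Walk from a leaf toward the root; return (node, bitmaps_checked)."""
    block = offset // BLOCK_SIZE
    node, checked = leaf, 0
    while node.parent is not None:
        checked += 1
        if (node.bitmap >> block) & 1:
            return node, checked
        node = node.parent
    # The root is assumed to hold every block, so its bitmap is not read.
    return node, checked
```

With a three-level chain, a block that only the root holds is found after two bitmap checks, one per non-root node on the path.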

Reading/Writing interfaces of bitmap in Ext4/ZFS
For 1 PiB of data, the bitmap of each object is 128 MiB in size. The Ext4 feature ea_inode should be enabled to support such large EAs.

In order to improve the efficiency of reading and writing bitmap EAs, interfaces should be added to Ext4/ZFS. These interfaces should enable reading or writing a small part of an EA at any given offset.
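The 128 MiB figure and the need for partial EA access can be checked with a little arithmetic. The sketch below assumes one bit per 1 MiB block and a least-significant-bit-first order inside each byte; both function names are hypothetical.

```python
BLOCK_SIZE = 1 << 20  # 1 MiB per bit, as assumed in the text


def bitmap_size_bytes(data_bytes, block_size=BLOCK_SIZE):
    """Size of the bitmap EA needed to cover data_bytes of data."""
    blocks = -(-data_bytes // block_size)   # ceiling division
    return -(-blocks // 8)                  # 8 bits per byte


def bit_location(offset, block_size=BLOCK_SIZE):
    """Return (byte offset inside the EA, bit mask) for a file offset."""
    block = offset // block_size
    return block // 8, 1 << (block % 8)
```

Testing or setting the bit for one block touches a single byte of the EA, which is why an interface that reads or writes at a given EA offset avoids transferring the whole 128 MiB bitmap.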

Immutable protection of non-leaf node of reflink tree
Writes should always be applied to the leaf objects of the reflink tree, never to non-leaf nodes. Thus, the non-leaf objects can be set to immutable so that no modification can be made to them.

Root of reflink tree
When an object is created through the normal I/O path (i.e. file creation, not reflink creation), the object is, and will always be, the root of the potential reflink tree. The root of a reflink tree should always have an all-one bitmap. That means that whenever the traversal of the reflink tree reaches the root, the data can always be obtained from the root object.

Copy on write of Lreflink
Whenever there is a write to a leaf object, its bitmap needs to be checked. If the write falls within a range that is already saved in the current object, the data can be written to the object directly.

Otherwise, that part of the data needs to be read from the object's ancestor(s), and the corresponding bit(s) are set. The modification is applied to the data, and the modified data is written to the current object. Future reads and writes to the same range can then be applied to the current object directly, since the bitmap carries enough information for that.
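The copy-on-write path can be sketched as follows. This is an in-memory illustration, not the OSD implementation: Obj, read_block and write are hypothetical names, the presence of a key in blocks stands in for a set bit in the bitmap EA, and the block size is shrunk to 4 bytes so the example stays readable (the design assumes 1 MiB).

```python
BLOCK = 4  # tiny block size for the demo; the design assumes 1 MiB


class Obj:
    """Hypothetical OST object: blocks maps block index -> bytes held here."""

    def __init__(self, parent=None, blocks=None):
        self.parent = parent
        self.blocks = dict(blocks or {})


def read_block(obj, idx):
    """Return block idx from obj or from its nearest ancestor that holds it."""
    node = obj
    while node is not None:
        if idx in node.blocks:
            return node.blocks[idx]
        node = node.parent
    raise KeyError(idx)  # cannot happen if the root holds every block


def write(leaf, offset, data):
    """Write data at offset, copying whole blocks from ancestors on demand."""
    pos = 0
    while pos < len(data):
        idx, start = divmod(offset + pos, BLOCK)
        n = min(BLOCK - start, len(data) - pos)
        if idx not in leaf.blocks:
            # Copy-on-write: fetch the full block first, which sets the "bit".
            leaf.blocks[idx] = read_block(leaf.parent, idx)
        old = leaf.blocks[idx]
        leaf.blocks[idx] = old[:start] + data[pos:pos + n] + old[start + n:]
        pos += n
```

A two-byte write that straddles a block boundary copies both affected blocks into the leaf and leaves the ancestor untouched; a later write to the same blocks skips the copy.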

Removal of reflinks
Removing a reflink removes a leaf from the reflink tree. When the last child leaf of a parent node is removed, the parent node/object can be removed too. When the last leaf of the reflink tree is removed, the entire reflink tree is removed.

Optimization of reflink tree
There are some ways to optimize the reflink tree to improve I/O performance and space efficiency.

1. If any leaf node of the tree has an all-one (or almost all-one) bitmap, the leaf can be detached from the tree and become a stand-alone object.
2. When step 1 happens, if any non-leaf node is left with no children, that node can be removed.
3. A non-leaf node can only have one or two children. When step 1 or step 2 happens, or a reflink is removed, a non-leaf node may lose one of its children and be left with a single child. In that case, the child leaf can be merged with the node, and the node becomes a leaf node.
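Leaf removal followed by the merge described in step 3 might look like the sketch below. It is an assumption-laden illustration: Obj, remove_leaf and merge_only_child are hypothetical names, the remaining child is assumed to be a leaf as in the text, and the caller is assumed to repoint the file layout at the merged node afterwards.

```python
class Obj:
    """Hypothetical OST object: blocks maps block index -> bytes held here."""

    def __init__(self, parent=None, blocks=None):
        self.parent = parent
        self.children = []
        self.blocks = dict(blocks or {})
        if parent is not None:
            parent.children.append(self)


def merge_only_child(node):
    """Fold the single remaining child leaf into node; node becomes a leaf."""
    child = node.children[0]
    node.blocks.update(child.blocks)   # the child's newer data wins
    node.children = []
    child.parent = None
    return node


def remove_leaf(leaf):
    """Remove a leaf; merge the parent with its last child if one is left.

    Returns the merged node, or None if no merge took place.
    """
    parent = leaf.parent
    if parent is None:
        return None                    # the leaf was the whole tree
    parent.children.remove(leaf)
    leaf.parent = None
    if len(parent.children) == 1:
        return merge_only_child(parent)
    return None
```

The merge keeps the parent's blocks and overlays the blocks the child had already copied, so the merged node holds exactly what the child's file would have read.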

Space efficiency of Lreflink
For normal files, nothing needs to be saved for Lreflink, so there is zero space overhead.

For a newly created reflink, no bitmap needs to be written to the leaf object. The only space overhead is the non-leaf object. Creating N reflinks of a file costs N objects as overhead.

For data of size S, maintaining a reflink costs the most space if all of the data is modified after the reflink is created. In that case, the bitmap space cost of a reflink is S / (8 * 1 MiB) bytes, i.e. one bit per 1 MiB block.

Because of CoW, a whole 1 MiB block is copied even if only a small part of the block is modified. That means small writes cause duplicated data copies.

For a reflink tree with L leaves, there are L files pointing to the tree, and the total number of nodes/objects in the tree is L * 2 - 1. In the worst case, these L files keep L * 2 - 1 entire copies of the data, meaning the common data is duplicated L * 2 - 1 times. By contrast, copying the files directly would only need L copies. That means that in the worst case, compared with no-reflink copies, the space cost of reflink is no worse than (L * 2 - 1) / L, i.e. approximately 2. This only happens when all the files pointing to the reflink tree are modified entirely, including the original files and the newly created reflinks.

Compared with L reflinks that are never modified, L no-reflink copies cost about L times as much space.

In most cases, reflinks are created as a sequential chain, i.e. create a reflink, modify the reflink, create a reflink of the reflink, and so on. In that case, the worst case is that the data of each reflink is always modified entirely. In that worst case, the space usage of reflink is approximately the same as that of no-reflink copies.
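The worst-case bound can be checked numerically. The helper below is only a sketch of the arithmetic in the paragraph above; the function name is hypothetical.

```python
def worst_case_space_ratio(leaves):
    """Worst-case space of a reflink tree relative to plain copies.

    A full binary tree with `leaves` leaves has 2 * leaves - 1 nodes, and in
    the worst case every node holds a complete copy of the data.
    """
    nodes = 2 * leaves - 1
    return nodes / leaves
```

The ratio is 1 for a single file without reflinks and approaches, but never reaches, 2 as the number of leaves grows.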

Time efficiency of Lreflink
For normal files with no reflink, nothing in the I/O path needs to be changed for Lreflink, so there is zero time overhead.

Reading and writing the bitmap through an EA should be as efficient as reading and writing normal files, since the ea_inode feature is enabled for Ext4.

For data in a reflink tree with M levels, a read is least efficient when the data has to be fetched from the root node. In that case, bitmaps need to be read M - 1 times. There is no other obvious time overhead for reads. IO size ....

For data in a reflink tree with M levels, a write is least efficient when the data has to be fetched from the root node. In that case, bitmaps need to be read M - 1 times. Even if the write is tiny (e.g. 4 KiB), 1 MiB of data still needs to be fetched from the root node, and an entire 1 MiB write needs to be issued to the leaf object too. For sequential small writes, the first write to a 1 MiB block is the least efficient. The following small writes to the same 1 MiB block are more efficient because only one bitmap read is needed and no CoW is needed any more. Fortunately, the CoW happens within the same OST and is thus relatively quick compared with network latency.
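The cost of the first small write to an uncopied block can be put into numbers. This is a rough sketch under the assumptions in the text (1 MiB blocks, data held only by the root, a write that fits inside one block); the function name and the returned keys are hypothetical.

```python
def first_small_write_cost(write_bytes, levels, block_size=1 << 20):
    """Worst-case work for the first small write to an uncopied block."""
    return {
        "bitmap_reads": levels - 1,         # one per non-root node on the path
        "bytes_read_from_root": block_size,
        "bytes_written_to_leaf": block_size,
        "write_amplification": block_size / write_bytes,
    }
```

For a 4 KiB write in a tree with 4 levels, this gives 3 bitmap reads and a write amplification of 256; the next write to the same block needs one bitmap read and no copy.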

Deduplication based on Lreflink
Deduplication is the technique of finding and removing duplicate copies of data. Lreflink enables us to implement deduplication in Lustre.

Snapshot of directory tree based on Lreflink
Simply traversing the directory tree, copying the directory structure and creating reflinks of all files under that directory tree would implement a rough version of directory-level snapshot. Such a directory-level snapshot is efficient in terms of space usage (Lreflink's efficiency) and performance (metadata performance is the same as for a normal directory tree, and data performance is the same as Lreflink's). However, directory inodes need to be copied, so it is less efficient in inode usage.

A more sophisticated directory-level snapshot needs a more complex design.
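The rough snapshot procedure can be sketched as a directory walk. The function snapshot_tree and its clone_file parameter are hypothetical; shutil.copyfile is used here only as a runnable stand-in, and a real snapshot would pass a function that creates a reflink instead (for example via the FICLONE ioctl or cp --reflink=always).

```python
import os
import shutil


def snapshot_tree(src_root, dst_root, clone_file=shutil.copyfile):
    """Recreate the directory structure and clone every file into dst_root.

    clone_file(src_path, dst_path) is a stand-in for reflink creation.
    Returns the number of files cloned.
    """
    cloned = 0
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)   # directories are copied
        for name in filenames:
            clone_file(os.path.join(dirpath, name),
                       os.path.join(target_dir, name))
            cloned += 1
    return cloned
```

Every directory costs a new inode in the snapshot, which is the inode overhead mentioned above, while file data would be shared if clone_file creates reflinks.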

Space usage of Lreflink
A data object could be shared by multiple files, and these files could be owned by different users. The space/inode accounting for quota, as well as du for project quota, should be handled properly in some way.

LFSCK support for Lreflink
LFSCK should be modified to handle Lreflink objects and EAs properly.

Fiemap for bitmap
It might be possible to use fiemap for bitmap of Lreflink.