Data on MDT High Level Design

==Introduction==
The Data-on-MDT project aims to achieve good performance for small file IO. The motivations for this and the current behaviour of small file IO are described in the [[Data on MDT Solution Architecture]] document.


This high-level design document provides more details about the approach to implementing this feature in the Lustre file system.

==Implementation Requirements==
===Create new layout for Data-on-MDT===
While the [[Layout Enhancement]] project is responsible for the new layout itself, we must add appropriate changes in LOV to properly recognize the new layout and use the LMV/MDC `cl_object` interface to work with data. LOV should also understand and properly handle the maximum data size for the new layout. Another part of the related work is to change the `lfs` tool so it is able to set the new layout as a directory default.

===CLIO should be able to work with MDT===
====Make MDC devices part of CLIO====
MDC devices must become `cl_device` instances, become part of the CLIO stack, and be used by LOV for IO with the new layout. Some unification is possible between OSC and MDC methods.

That code is orthogonal to the current MDC and is placed there for simpler access to the import and request-related code. This code will eventually become the generic client used by both MDC and OSC.

===MDT support for IO requests===
====IO request handlers====
The Unified Target made this possible, but the specific handlers do not currently exist on the MDT. They will be added as part of this work.

====LOD must understand new layout====
LOD must be aware of the new layout and handle it properly, passing IO methods and other operations to the local storage instead of an OST.

===Lustre tools changes===
The `lfs setstripe` command must be extended to recognize the new layout setup and to set the maximum size for it.

==Functional Specification==
===New DOM layout===
====Wire/disk changes====

The `lov_stripe_md` (`lsm`) stores information about the DOM layout. The `lw_pattern` field has the `LOV_PATTERN_DOM` flag set, `lw_stripe_size` contains the maximum data size for DOM, `lw_stripe_count` is 0, and no `lsm_oinfo` is allocated. This does not change the wire protocol, because `LOV_PATTERN_DOM` replaces the existing `LOV_PATTERN_FIRST`.
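
As a minimal illustration, a DOM layout could then be detected from the pattern flag alone. This is a sketch using the field names above (the exact struct nesting is an assumption), matching the `lsm_is_dom()` helper referenced later in this document:

 /* Sketch only: detect a DOM layout from the stripe pattern flag.
  * The lsm_wire nesting and field names follow the text above and
  * may differ from the actual tree. */
 static inline bool lsm_is_dom(const struct lov_stripe_md *lsm)
 {
         return !!(lsm->lsm_wire.lw_pattern & LOV_PATTERN_DOM);
 }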

====In-memory structures====

The LOV implements a new interface for the new `LLT_DOM` layout type. This set of methods is closely related to the `LLT_RAID0` interface, and some methods may be reused.


Upon new object allocation, the LOV does the following:


* `lov_layout_type()` checks the `lsm` and returns the related layout type;
* `lov_dispatch()` points to the proper interface for this type;

so the proper stack of objects is built, with sub-objects pointing to the MDC.

 union lov_layout_state {
      struct lov_layout_raid0 {
          ...
      } raid0;
      struct lov_layout_dom {
          struct lovsub_object *lo_dom;
      } dom;
 }


The `lov_layout_dom` state refers to just a single object below the LOV.
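
A hedged sketch of how the type selection described above might look, with `LLT_DOM` added next to the existing types (the `LLT_EMPTY` fallback and exact enum names are assumptions; `lsm_is_dom()` is sketched earlier):

 /* Sketch: lov_layout_type() maps an lsm onto a layout type so that
  * lov_dispatch() can pick the matching method table. */
 static enum lov_layout_type lov_layout_type(const struct lov_stripe_md *lsm)
 {
         if (lsm == NULL)
                 return LLT_EMPTY;   /* no layout set yet */
         if (lsm_is_dom(lsm))
                 return LLT_DOM;     /* Data-on-MDT: single MDT object */
         return LLT_RAID0;           /* regular striped layout */
 }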

===New LOV subdevices===
For DOM we need to pass IO through the MDC device, therefore the LOV should have such a target, similar to the other targets pointing to OSCs. With DNE in mind, we should be able to query the FLDB by FID to get the proper MDC device. The LOV implements its own `lov_fld_lookup()` for that, and may need to set up its own FLD client or use the one from LMV.

 struct lov_device {
      ...
 
      /* Data-on-MDT devices */
      __u32                     ld_md_tgts_nr;
      struct lovsub_device    **ld_md_tgts;
      struct lu_client_fld    *ld_fld;
 };
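
A minimal sketch of such a lookup, mapping a FID to an index into `ld_md_tgts[]` via the FLD client (the `fld_client_lookup()` signature and flag name are assumptions based on the existing FLD client code):

 /* Sketch: resolve which MDT, and hence which MDC target in
  * ld_md_tgts[], serves the object identified by this FID. */
 static int lov_fld_lookup(const struct lu_env *env, struct lov_device *ld,
                           const struct lu_fid *fid, __u32 *mdt_idx)
 {
         int rc;
 
         rc = fld_client_lookup(ld->ld_fld, fid_seq(fid), mdt_idx,
                                LU_SEQ_RANGE_MDT, env);
         if (rc < 0)
                 return rc;
         if (*mdt_idx >= ld->ld_md_tgts_nr)
                 return -EINVAL;
         return 0;
 }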

====Adding MDC====
MDC devices are added in the client LOV config log in the same way they are added to the LMV. Upon a new MDC addition, the device stack may not be ready yet, so the device must be saved and added later in `lov_device_init()`. The LOV device has a new array of MDC targets, `ld_device::ld_md_tgts[]`, with the same type `lovsub_device` as used for OSC but with a different set of methods.


====Deleting MDC====
Unlike OSCs, the MDCs might be cleaned up before the LOV because they are still controlled by the MD stack of devices. The problem can be solved by taking an extra reference on the MDC device at the LOV, or better, by notifying the LOV that an MDC is going to shut down.

====LOV `lovdom_device`====


The `lovdom` device replaces the `lovsub` device used for OSC. A new set of methods is introduced to handle LOV-MDC interactions.

 static const struct lu_device_operations lovdom_lu_ops = {
      .ldo_object_alloc      = lovdom_object_alloc,
      .ldo_process_config    = NULL,
      .ldo_recovery_complete = NULL
 };
 
 static const struct cl_device_operations lovdom_cl_ops = {
      .cdo_req_init = lovsub_req_init
 };
 
 static struct lu_device *lovdom_device_alloc(const struct lu_env *env,
                          struct lu_device_type *t,
                          struct lustre_cfg *cfg)
 {
      struct lu_device *d;
      struct lovsub_device *lsd;
 
      OBD_ALLOC_PTR(lsd);
      if (lsd != NULL) {
          int result;
 
          result = cl_device_init(&lsd->acid_cl, t);
          if (result == 0) {
              d = lovsub2lu_dev(lsd);
              d->ld_ops = &lovdom_lu_ops;
              lsd->acid_cl.cd_ops = &lovdom_cl_ops;
          } else {
              d = ERR_PTR(result);
          }
      } else {
          d = ERR_PTR(-ENOMEM);
      }
      return d;
 }
 
 static const struct lu_device_type_operations lovdom_device_type_ops = {
      .ldto_device_alloc = lovdom_device_alloc,
      .ldto_device_free  = lovsub_device_free,
      .ldto_device_init  = lovsub_device_init,
      .ldto_device_fini  = lovsub_device_fini
 };


The `lovdom` device is much simpler than `lovsub_device` because it is not the top device of a new sub-stack but part of the main stack. That is why the device allocation is different, while the other `lovsub` methods are reused.

===DOM object layering===
A DOM object is referred to by the single FID of the MDS object and has no stripes, so it is presented by the plain stack `ccc->lov->lovdom[mds_number]->mdc`, referenced by the FID of the MDS object.

====DOM object====
Considering there is always a single stripe, the lov+lovdom layers are trivial and have almost no extra functions or checks. All the real code goes to the MDC layer.

====Manage max_stripe_size====
The LOV-specific code is the `max_stripe_size` boundary check and the code to start migration when the file grows.


The `cl_io_operations:cio_io_iter_init()` method for the lovdom layer checks that the operation does not cross the `max_stripe_size` boundary. The Phase I action is just an `-ENOSPC` return code; Phase II starts migration at this point.
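
A sketch of the Phase I behaviour of that check (the `cis_io` back-pointer and `ci_rw` members exist in CLIO, but the `lovdom_max_stripe_size()` helper is hypothetical):

 /* Sketch: iter_init at the lovdom layer refuses IO that would
  * cross the DOM size boundary; Phase I simply returns -ENOSPC. */
 static int lovdom_io_iter_init(const struct lu_env *env,
                                const struct cl_io_slice *ios)
 {
         struct cl_io *io = ios->cis_io;
         loff_t end = io->u.ci_rw.crw_pos + io->u.ci_rw.crw_count;
 
         if (end > lovdom_max_stripe_size(ios))
                 return -ENOSPC;  /* Phase II would start migration */
         return 0;
 }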

===MDC===
====MDC CLIO methods====
The MDC fully reuses the OSC CLIO methods, with several exceptions; e.g., the device initialization methods will differ because they must call `mdc_setup()` instead of `osc_setup()`. That means we can separate that code from OSC and make it generic.
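
To illustrate, the divergence point could be as small as the following sketch (the shared `client_setup()` wrapper is hypothetical; only `mdc_setup()`/`osc_setup()` come from the text above):

 /* Sketch: the single point where a generic client device would
  * diverge between the MDC and OSC stacks during setup. */
 static int client_setup(struct obd_device *obd, struct lustre_cfg *cfg,
                         bool is_mdc)
 {
         return is_mdc ? mdc_setup(obd, cfg) : osc_setup(obd, cfg);
 }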


====Locking====
All IO for a DOM object is done under a `PW LAYOUT` lock, which protects the whole data on the MDT exclusively. The llite layer is responsible for that, and the MDC CLIO lock operations are all lockless to avoid real lock enqueues.


====Glimpse====
With data on the MDT, the `GETATTR` request from the MDT returns both `OBD_MD_FLSIZE|OBD_MD_FLBLOCKS` along with the other attributes. Therefore glimpse is disabled by setting the inode flag `LLIF_MDS_SIZE_LOCK` for DOM objects.
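
A sketch of how llite might mark the inode once those attributes arrive (the helper name is hypothetical; `ll_i2info()`, `lli_lock`, and the `LLIF_MDS_SIZE_LOCK` flag are the pieces named or implied above):

 /* Sketch: after a GETATTR reply carrying valid size/blocks for a
  * DOM file, flag the inode so later stat() skips the glimpse RPC. */
 static void ll_dom_disable_glimpse(struct inode *inode)
 {
         struct ll_inode_info *lli = ll_i2info(inode);
 
         spin_lock(&lli->lli_lock);
         lli->lli_flags |= LLIF_MDS_SIZE_LOCK;
         spin_unlock(&lli->lli_lock);
 }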

===MDT changes===
====IO request handlers====


A new set of handlers is added to the MDT; they are needed to handle the punch/read/write operations.

 static struct tgt_handler mdt_tgt_handlers[] = {
      ...
      TGT_MDT_HDL(HABEO_CORPUS | HABEO_REFERO, OST_BRW_READ, tgt_brw_read),
      TGT_MDT_HDL(HABEO_CORPUS | MUTABOR, OST_BRW_WRITE, tgt_brw_write),
      TGT_MDT_HDL(HABEO_CORPUS | HABEO_REFERO | MUTABOR, OST_PUNCH, mdt_punch_hdl),
 }

====IO object operations====


New operations are added to the MDT that send IO requests directly to the OSD, bypassing MDD and LOD. The Unified Target uses the OBD interface for IO methods, so the MDT OBD operations are extended with the following:

 mdt_obd_ops = {
      ...
      .o_preprw      = mdt_obd_preprw,
      .o_commitrw    = mdt_obd_commitrw,
 }

====DOM support in LOD====
LOD understands the new DOM layout by checking the `lsm` pattern with the `lsm_is_dom(lsm)` helper function. It must not use any OSP subdevices for such a layout, only the local OSD target. All assertions and pattern checks should be extended from "only `RAID0` is supported" to "`RAID0` and `DOM` are supported".
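
For illustration, such an extended pattern check could look like this sketch (the helper name and the `lsm_wire` nesting are assumptions; `lsm_is_dom()` is sketched earlier in this document):

 /* Sketch: a pattern check that previously accepted only RAID0
  * now also accepts DOM layouts. */
 static inline bool lod_pattern_supported(const struct lov_stripe_md *lsm)
 {
         return lsm->lsm_wire.lw_pattern == LOV_PATTERN_RAID0 ||
                lsm_is_dom(lsm);
 }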

The MDT stack doesn't filter the `BLOCKS|SIZE` attributes from a DOM object and returns them to the client.

====Quota on MDT====
As a first implementation, a user limit for the DOM blocks can be provided without any code changes: for instance, if a user wants to set a 1GB limit for DOM and the maximum size of each file on the MDT is 1MB, then the user can just set the inode limit to 1024. This approach conflicts with a real inode limit but provides an initial path to quota control.


A more robust quota for Data on MDT can be implemented in two phases:

=====Phase I: Support block quota on MDTs, and MDTs share the same block limit with OSTs=====
In the current quota framework, each MDT has only one MD `qsd_instance` associated with its OSD device:

 struct osd_device {
          ...
          /* quota slave instance */
          struct qsd_instance      *od_quota_slave;
 };


This will be expanded to a list, and two `qsd_instance` entries will be linked in the list for each MDT: one is MD, for metadata operations; the other is DT, for data operations. Writes on the MDT will use the DT `qsd_instance` to enforce block quota, and metadata operations will have to enforce both block and inode quota with these two instances. For the ZFS OSD, once block quota is enabled on the MDT, the approach of estimating inode accounting from used blocks will not work; using our own ZAP to track inode usage will be the only option. Ideally, OpenZFS will support inode accounting directly.
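
One possible shape for that change, as a sketch (the list member name is an assumed addition; linking `qsd_instance` entries would likewise need an assumed list hook in that struct):

 struct osd_device {
          ...
          /* quota slave instances: an MD entry for metadata
           * operations and a DT entry for DOM data operations */
          struct list_head          od_quota_slaves;
 };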

=====Phase II: Support different block limits for DOM and OSTs=====
Because space on an MDT is often limited compared to an OST, an administrator may want to set a more restrictive block limit for MDTs rather than sharing the same limits with OSTs. To support private limits for DOM, the following changes are required:


* Introduce a pre-defined DT pool for DOM to manage the block limit for all MDTs;
* Add an additional `qsd_instance` associated with each MDT, which acquires limits from the DOM pool (some code will probably need to be revised to support a non-default pool ID);
* Enhance the quota utilities to set and show quota limits for a specified pool (packing the pool ID into the quota request requires client-to-server protocol changes);
* Write on MDT must enforce two block quotas: the default DT quota and the DOM quota (see the sketch after this list).
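
A sketch of that double enforcement on the DOM write path (the wrapper is hypothetical; the `qsd_op_begin()` usage is an assumption based on the existing quota slave API):

 /* Sketch: a DOM write must pass both the default DT quota shared
  * with OSTs and the private DOM-pool quota. */
 static int mdt_dom_quota_begin(const struct lu_env *env,
                                struct qsd_instance *qsd_dt,
                                struct qsd_instance *qsd_dom,
                                struct lquota_trans *trans,
                                struct lquota_id_info *qi, int *flags)
 {
         int rc;
 
         rc = qsd_op_begin(env, qsd_dt, trans, qi, flags);
         if (rc != 0)
                 return rc;
         return qsd_op_begin(env, qsd_dom, trans, qi, flags);
 }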

====Grants on MDT====
Grant support on the MDT requires major MDT stack changes, because we need to account for all changes, including metadata operations on directories, extended attributes, and llogs. Initially, grant support will be implemented as basic MDT target support for grants: the MDT reports grants and declares their support during connection, but returns zero values (meaning IO is to be synchronous) or some roughly estimated values. Further work will account for all operations with data and report real values. This can be done during the declaration phase of a transaction.


Operations to take into account:


* directory create/destroy
* EA add/delete
* llog accounting (changelogs mostly)
* write/truncate of DOM objects
* DOM object destroy

===Lustre tools changes===
====`lfs setstripe`====
The `lfs` tool introduces a new option, `--layout=mdt | -L mdt`. It selects the DOM layout, setting the layout pattern to `LOV_PATTERN_DOM` and the stripe count to 0. The `--stripe-size` option used together with `--layout` sets the DOM `maxsize`. The `--stripe-count` option will cause an error if not 0. A pool name also cannot be set for this layout at the current time.
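
For illustration, usage might look like the following (paths and sizes are examples; the option syntax is as described above):

 # create a file with the DOM layout and a 1MB maximum size
 lfs setstripe --layout=mdt --stripe-size=1M /mnt/lustre/smallfile
 
 # set the DOM layout as the default for a directory
 lfs setstripe -L mdt /mnt/lustre/smalldir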

==Logic Specification==
===`MDS_GETATTR` logic===

The optimization is to avoid glimpse requests by returning the `SIZE` and `BLOCKS` attributes directly.

[[media:dom_getattr.png]]

* `MDS_GETATTR` is enqueued as usual, but with a `PR LAYOUT` lock
* MDT returns the `BLOCKS` and `SIZE` attributes as valid
* llite sets a flag to indicate that glimpse is not needed (this flag already exists for SOM)

===`MDS_OPEN` logic and optimizations===
====`OPEN` with `O_RDWR`====

[[media:dom_open_rw.png]]


This optimization might help with partial page updates, when the client has to read the page first, add new data to it, and finally flush it back.


* `OPEN` is enqueued with a `PW LAYOUT` lock on the object
* MDT returns the partial page back if it fits into the EA buffer
* client updates the partial page and flushes pages


====`OPEN` with `O_TRUNC`====


[[media:dom_open_trunc.png]]


This optimization truncates the file during open if the `O_TRUNC` flag is set.


* `OPEN` is invoked with a `PW LAYOUT` lock
* MDS checks that `O_TRUNC` is set and truncates the file
* client gets the reply with updated attributes


===Data Readahead logic===


All readahead optimizations are based on the fact that it is possible to return some data from a small file in the EA buffer, which is quite big now.


[[media:dom_readahead.png]]


A typical case is `stat()`:


* `MDS_GETATTR` is enqueued as usual
* MDT returns basic attributes and checks the amount of free space in the EA reply buffer
* MDT reads data from the file into the EA buffer
* client fills pages with data from the EA buffer

==Tests Specification==
===`sanity` tests===
The set of tests to make sure the feature works as expected.

====`test_1a` - file with DOM layout====
# create a file with `lfs setstripe -L`, check that its LOV EA attribute has the `LOV_PATTERN_DOM` pattern and that it has no stripes on OSTs
# write to the file; check the data is written and valid, check the file attributes are valid, check the filesystem free blocks decrease according to the file size
# delete the file; check the free blocks are returned back

====`test_1b` - DOM layout as default directory layout====
# set the new default layout on a directory with `lfs setstripe -L` and create a file in it
# all steps from `test_1a`

====`test_2a` - check size limit and migration====
# create a file with the DOM layout
# write more than `max_stripe_size` to the file
## Phase I: check that the `-ENOSPC` error is returned
## Phase II: check that file expansion to OSTs occurred
# (only Phase II below)
# check that the file layout is changed to the filesystem default layout and has stripes on OSTs
# check that the file size is valid and the filesystem free-blocks data is valid as well


Recovery tests
===`sanityn` tests===
====`test_1{a,...}` - parallel access to the small files====
# open DOM file from client1 in different modes and pause it.
# perform various operations on file from client2
# make sure client2 is waiting client1 to finish first


    fail MDS while performing operations like create/open/write/destroy on DOM file
===Recovery tests===
    make sure operations are replayed back and finished as expected
# fail MDS while performing operations like create/open/write/destroy on DOM file
 
# make sure operations are replayed back and finished as expected

===Functional tests===
Check all the cases we optimize.

====Stat====
# create files with the DOM layout vs a 2-stripe layout
# fill them with data
# perform `ls -l` on them
# output results for both cases


====Open with `O_TRUNC`====
# create files with the DOM and the default layout
# fill them with data
# perform open with `O_TRUNC`
# output results


====Write at the end of a file with a partial page====
# create many small files and fill them with data so the last page is filled partially
# perform a write to the end of each file
# output results for DOM files vs normal files


====Readahead of small files====
# create small files with the DOM and the normal layout
# fill them with data
# perform `grep` on them
# output results

===Generic performance tests===
====`mdsrate`====
Regular MD operations should benefit from Data-on-MDT because there are no OST requests. Basically, only stat should benefit noticeably, because open uses precreated objects and destroy is not blocked by the destruction of OST objects.


Use mdsrate to perform the following operations:

* file creation with a single stripe vs DOM
* `stat`
* `unlink`
* output results for both cases


'''Files with data'''

The mdsrate tool must be modified to create DoM files containing some amount of data, to test DoM file performance in the common case.

====`FIO`====
Check the generic IO performance of small files.


* run `FIO` with DoM files vs normal files
* output results

====postmark====
The `postmark` utility is designed to emulate applications such as software development, email, newsgroup servers, and web applications. It is an industry-standard benchmark for small-file and metadata-intensive workloads.

* Run postmark over a DoM-striped directory and a default one.
* Output results.
