Data on MDT High Level Design

Introduction

The Data-on-MDT project aims to achieve good performance for small file IO. The motivations for this, and the current behaviour of small file IO, are described in the Data on MDT Solution Architecture document.

This high-level design document provides more detail about the approach used to implement this feature in the Lustre* file system.

Implementation Requirements

Create new layout for Data-on-MDT

While the Layout Enhancement project is responsible for the new layout itself, the LOV must be changed to properly recognize the new layout and to use the LMV/MDC cl_object interface to work with data. The LOV must also understand and properly handle the maximum data size for the new layout. A related piece of work is to change the lfs tool so that it can set the new layout as a directory default.

CLIO should be able to work with MDT

Make MDC devices part of CLIO

MDC devices must become cl_devices, join the CLIO stack, and be used by the LOV for IO when the new layout is present. Some unification is possible between OSC and MDC methods.

That code is orthogonal to the current MDC and is placed there for simpler access to the import and request-related code. It will eventually become a generic client used by both MDC and OSC.

MDT support for IO requests

IO request handlers

The Unified Target made this possible, but the specific handlers do not currently exist for the MDT. They will be added as part of this work.

LOD must understand new layout

The LOD must be aware of the new layout and handle it properly, passing IO methods and other operations through to the local storage instead of an OST.

Lustre tools changes

The lfs setstripe command must be extended to recognize the new layout and to set its maximum size.

Functional Specification

New DOM layout

Wire/disk changes

The lov_stripe_md (lsm) stores information about the DOM layout: lw_pattern has the new LOV_PATTERN_DOM flag set, lw_stripe_size contains the maximum data size for DOM, lw_stripe_count is 0, and no lsm_oinfo is allocated. This does not change the wire protocol because LOV_PATTERN_DOM replaces the existing LOV_PATTERN_FIRST value.
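For illustration, a DOM layout could be requested through the standard lov_user_md wire structure roughly as follows (a sketch; the 1MB maximum size is only an example value):

    /* Request a DOM layout via the user-level striping descriptor:
     * the pattern flag marks DOM, the stripe size field carries the
     * maximum data size on the MDT, and no OST stripes are requested. */
    struct lov_user_md lum = {
        .lmm_magic        = LOV_USER_MAGIC_V1,
        .lmm_pattern      = LOV_PATTERN_DOM,  /* new DOM pattern flag */
        .lmm_stripe_size  = 1048576,          /* DOM maximum data size */
        .lmm_stripe_count = 0,                /* no objects on OSTs */
    };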

In-memory structures

The LOV implements a new interface for the new LLT_DOM layout type. This set of methods is closely related to the LLT_RAID0 interface, and some of the methods may be reused.

Upon new object allocation the LOV does the following:

    lov_layout_type() checks the lsm and returns the related layout type;
    lov_dispatch() points to the proper interface for that type;

so the proper stack of objects is built, with sub-objects pointing to the MDC.

union lov_layout_state {
    struct lov_layout_raid0 {
        ...
    } raid0;
    struct lov_layout_dom {
        struct lovsub_object *lo_dom;
    } dom;
};

The lov_layout_dom state refers to just a single object below the LOV.
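A minimal sketch of the type-selection step from the list above (assuming an LLT_DOM value is added to the existing lov_layout_type enum; lsm_is_dom() is the helper described later in this document):

    /* Pick the layout method table for an object: DOM objects get the
     * new LLT_DOM interface, striped objects keep using LLT_RAID0. */
    static enum lov_layout_type lov_layout_type(struct lov_stripe_md *lsm)
    {
        if (lsm == NULL)
            return LLT_EMPTY;
        if (lsm_is_dom(lsm))
            return LLT_DOM;
        return LLT_RAID0;
    }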

New LOV subdevices

For DOM we need to pass IO through the MDC device, so the LOV should have such a target, similar to the other targets that point to OSCs. With DNE in mind, we must be able to query the FLDB by FID to find the proper MDC device. The LOV implements its own lov_fld_lookup() for that, and may need to set up its own fld client or reuse the one from the LMV.

struct lov_device {
    ...

    /* Data-on-MDT devices */
    __u32                     ld_md_tgts_nr;
    struct lovsub_device    **ld_md_tgts;
    struct lu_client_fld     *ld_fld;
};
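A sketch of the FID-to-MDC lookup (assuming the standard fld_client_lookup() interface; the function body is illustrative only):

    /* Resolve the MDT index for a DOM object FID so that the matching
     * ld_md_tgts[] entry can be selected for IO. */
    static int lov_fld_lookup(const struct lu_env *env,
                              struct lov_device *ld,
                              const struct lu_fid *fid, __u32 *mds)
    {
        return fld_client_lookup(ld->ld_fld, fid_seq(fid), mds,
                                 LU_SEQ_RANGE_MDT, env);
    }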

Adding MDC

MDC devices are added in the client LOV config log the same way they are added to the LMV. When a new MDC is added, the device stack may not be ready yet, so the device must be saved and added later in lov_device_init(). The LOV device has a new array of MDC targets, lov_device:ld_md_tgts[], with the same lovsub_device type as used for OSCs but with a different set of methods.

Deleting MDC

Unlike an OSC, an MDC might be cleaned up before the LOV because MDCs are still controlled by the MD stack of devices. The problem can be solved by taking an extra reference on the MDC device at the LOV, or better, by notifying the LOV that an MDC is about to shut down.

LOV lovdom_device

The lovdom device plays the same role for an MDC that the lovsub device plays for an OSC. A new set of methods is introduced to handle LOV-MDC interactions.

static const struct lu_device_operations lovdom_lu_ops = {
    .ldo_object_alloc      = lovdom_object_alloc,
    .ldo_process_config    = NULL,
    .ldo_recovery_complete = NULL
};

static const struct cl_device_operations lovdom_cl_ops = {
    .cdo_req_init = lovsub_req_init
};


static struct lu_device *lovdom_device_alloc(const struct lu_env *env,
                                             struct lu_device_type *t,
                                             struct lustre_cfg *cfg)
{
    struct lu_device *d;
    struct lovsub_device *lsd;

    OBD_ALLOC_PTR(lsd);
    if (lsd != NULL) {
        int result;

        result = cl_device_init(&lsd->acid_cl, t);
        if (result == 0) {
            d = lovsub2lu_dev(lsd);
            d->ld_ops = &lovdom_lu_ops;
            lsd->acid_cl.cd_ops = &lovdom_cl_ops;
        } else {
            d = ERR_PTR(result);
        }
    } else {
        d = ERR_PTR(-ENOMEM);
    }
    return d;
}

static const struct lu_device_type_operations lovdom_device_type_ops = {

   .ldto_device_alloc = lovdom_device_alloc,
   .ldto_device_free = lovsub_device_free,
   .ldto_device_init = lovsub_device_init,
   .ldto_device_fini = lovsub_device_fini

};

The lovdom device is much simpler than the lovsub_device because it is not the top device of a new sub-stack but just part of the existing stack. That is why device allocation is different, while the other lovsub methods are reused.

DOM object layering

A DOM object is referred to by the single FID of the MDS object and has no stripes, so it is represented by the plain stack ccc->lov->lovdom[mds_number]->mdc.

DOM object

Since there is always a single stripe, the lov+lovdom layers are trivial and have almost no extra functions or checks. All the real code goes to the MDC layer.

Manage max_stripe_size

The LOV-specific code is the max_stripe_size boundary check and the code to start migration when the file grows.

The cl_io_operations:cio_io_iter_init() method for the lovdom layer checks that the operation does not cross the max_stripe_size boundary. In Phase I the action is just an -ENOSPC return code; in Phase II migration starts at this point, as sketched below.
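A minimal Phase I sketch of that check (the function body and field accesses are assumptions based on the existing CLIO structures):

    /* Fail any write that would grow a DOM object past the maximum
     * size stored in the layout; in Phase II migration would be
     * triggered here instead of returning -ENOSPC. */
    static int lovdom_io_iter_init(const struct lu_env *env,
                                   const struct cl_io_slice *ios)
    {
        struct cl_io *io = ios->cis_io;
        struct lov_object *lov = cl2lov(ios->cis_obj);
        __u64 max_size = lov->lo_lsm->lsm_stripe_size;

        if (io->ci_type == CIT_WRITE &&
            io->u.ci_rw.crw_pos + io->u.ci_rw.crw_count > max_size)
            return -ENOSPC; /* Phase I: no migration yet */
        return 0;
    }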

MDC

MDC CLIO methods

The MDC fully reuses the OSC CLIO methods with a few exceptions; e.g., the device initialization methods will be different because the MDC must call mdc_setup() instead of osc_setup(). That means we can separate that code from the OSC and make it generic.

Locking

All IO for a DOM object is done under a PW LAYOUT lock, which protects the whole data on the MDT exclusively. The llite layer is responsible for that, and the MDC CLIO lock operations are all lockless to avoid real lock enqueues.

Glimpse

With Data on MDT, the GETATTR reply from the MDT returns both OBD_MD_FLSIZE|OBD_MD_FLBLOCKS along with the other attributes. Therefore glimpse is disabled by setting the inode flag LLIF_MDS_SIZE_LOCK for DOM objects, as sketched below.
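A sketch of the resulting client-side check (the helper name is hypothetical; the flag and accessors are the existing llite ones):

    /* Glimpse can be skipped when the MDT has already returned
     * authoritative SIZE/BLOCKS attributes for a DOM file. */
    static bool ll_need_glimpse(struct inode *inode)
    {
        struct ll_inode_info *lli = ll_i2info(inode);

        return !(lli->lli_flags & LLIF_MDS_SIZE_LOCK);
    }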

MDT changes

IO request handlers

A new set of handlers is added to the MDT; they are needed to handle the punch/read/write operations.

static struct tgt_handler mdt_tgt_handlers[] = {
    ...
    TGT_MDT_HDL(HABEO_CORPUS | HABEO_REFERO,           OST_BRW_READ,  tgt_brw_read),
    TGT_MDT_HDL(HABEO_CORPUS | MUTABOR,                OST_BRW_WRITE, tgt_brw_write),
    TGT_MDT_HDL(HABEO_CORPUS | HABEO_REFERO | MUTABOR, OST_PUNCH,     mdt_punch_hdl),
};

IO object operations

New operations are added to the MDT that send IO requests directly to the OSD, bypassing MDD and LOD. The Unified Target uses the OBD interface for IO methods, so the MDT OBD operations are extended with the following:

mdt_obd_ops = {
    ...
    .o_preprw   = mdt_obd_preprw,
    .o_commitrw = mdt_obd_commitrw,
};

DOM support in LOD

The LOD understands the new DOM layout by checking the lsm pattern with the lsm_is_dom(lsm) helper function, sketched below. It must not use any OSP subdevices for such a layout, only the local OSD target. All assertions and checks on the pattern should be extended from "only RAID0 is supported" to "RAID0 and DOM are supported".
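A plausible implementation of that helper (a sketch, assuming the usual lsm_pattern accessor):

    /* True if the layout keeps file data on the MDT. */
    static inline bool lsm_is_dom(const struct lov_stripe_md *lsm)
    {
        return !!(lsm->lsm_pattern & LOV_PATTERN_DOM);
    }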

The MDT stack doesn't filter out the BLOCKS | SIZE attributes of a DOM object and returns them to the client.

Quota on MDT

As a first implementation, a user limit on DOM blocks can be provided without any code changes: for instance, if the user wants a 1GB limit for DOM and the maximum size of each file on the MDT is 1MB, the user can simply set the inode limit to 1024. This approach conflicts with a real inode limit but provides an initial path to quota control.
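For example, with the standard quota tool (a usage sketch; "bob" is a placeholder user):

    lfs setquota -u bob -I 1024 /mnt/lustre   # 1024 inodes x 1MB max file size = ~1GB of DOM data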

A more robust quota for Data on MDT can be implemented in two phases.

Phase I: Support block quota on MDTs, where MDTs share the same block limit with OSTs

In the current quota framework, each MDT has only one MD qsd_instance associated with its OSD device:

struct osd_device {
        ...
        /* quota slave instance */
        struct qsd_instance      *od_quota_slave;
};

This will be expanded to a list, and two qsd_instances will be linked into the list for each MDT: one MD instance for metadata operations and one DT instance for data operations, as sketched below. A write on the MDT will use the DT qsd_instance to enforce block quota, and metadata operations will have to enforce both block and inode quota with the two qsd_instances. For the ZFS OSD, once block quota is enabled on the MDT, the approach of estimating inode accounting from used blocks will no longer work; using our own ZAP to track inode usage will be the only option. Ideally, OpenZFS will support inode accounting directly.
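A sketch of the corresponding structure change (the list field name is an assumption):

    struct osd_device {
            ...
            /* quota slave instances: one MD (inode) and one DT
             * (block) qsd_instance are linked here per MDT */
            struct list_head         od_quota_slaves;
    };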

Phase II: Support different block limits for DOM and OSTs

Because space on an MDT is often limited compared to an OST, an administrator may want to set a more restrictive block limit for MDTs rather than sharing the same limits with OSTs. To support private limits for DOM, the following changes are required:

    Introduce a pre-defined DT pool for DOM to manage the block limit for all MDTs;
    Add an additional qsd_instance associated with each MDT, which acquires limits from the DOM pool (some code probably needs to be revised to support a non-default pool ID);
    Enhance the quota utilities to set and show quota limits for a specified pool (packing the pool ID in the quota request requires client-to-server protocol changes);
    Write on MDT must enforce two block quotas: the default DT quota and the DOM quota.

Grants on MDT

Grant support on the MDT requires major MDT stack changes because we need to account for all changes, including metadata operations on directories, extended attributes, and llogs. Initial grant support will be implemented as basic MDT target support for grants: the MDT is able to report grants and declare grant support during connection, but returns zero values (meaning IO is to be synchronous) or rough estimates. Further work will account for all operations with data and report real values; this can be done during the declaration phase of a transaction.
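A sketch of the initial connect-time behaviour (the flag and fields exist in the current connect protocol; their exact use here is an assumption):

    /* Declare grant support on connect but grant nothing yet, so
     * client IO to DOM objects stays synchronous. */
    data->ocd_connect_flags |= OBD_CONNECT_GRANT;
    data->ocd_grant = 0;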

Operations to take into account:

   directory create/destroy
   EA add/delete
   llog accounting (changelogs mostly)
   Write/truncate of DOM objects
   DOM object destroy

Lustre tools changes

LFS setstripe

The lfs tool introduces a new option, '--layout=mdt | -L mdt'. It selects the DOM layout, setting the layout pattern to LOV_PATTERN_DOM and the stripe count to 0. The --stripe-size option used together with --layout sets the DOM maximum size. The --stripe-count option causes an error if it is not 0. A pool name also cannot be set for this layout at the current time.
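For example, setting DOM as the default layout on a directory with a 1MB maximum data size (a usage sketch; the size is an example value):

    lfs setstripe --layout=mdt --stripe-size=1M /mnt/lustre/dir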

Migration

Migration happens during an IO operation and consists of several steps:

   determine a migration event
   create a file with new layout
   read data from old file and write it to the new one
   swap layout between files
   continue current operation

That means the migration manager should either run in llite in the context of the current operation, or be a separate thread, to avoid a possible conflict over the CLIO thread info used for the second object.

Migration start

The LOV determines whether migration is needed. Migration is evaluated when an ongoing WRITE request exceeds the max_stripe_size value of the DOM object; the cl_io_operations:cio_io_iter_init() implementation for the lovdom layer performs this check.

Layout change

A new object is created with a RAID0 layout; its OST objects are taken from the pre-allocated pool, as usual for file creation on the MDT. Once the data has been moved from one file to the other, the layouts are swapped. All operations are naturally done under the PW LAYOUT lock, so no additional locking is needed.

Data migration from MDT to OST

The whole data of the DOM file is read from the MDT by the client and flushed back to the new RAID0 file. One possible optimization is to re-assign the pages to the new file when flushing them from CLIO; this optimization is beyond the scope of this project.

Logic Specification

MDS_GETATTR logic

The optimization is to avoid glimpse requests by returning the SIZE and BLOCKS attributes directly.

dom_getattr

    MDS_GETATTR is enqueued as usual but with a PR LAYOUT lock
    the MDT returns the BLOCKS and SIZE attributes as valid
    llite sets a flag to indicate that glimpse is not needed (the flag already exists for SOM)

MDS_OPEN logic and optimizations

OPEN with O_RDWR

dom_open_rw

This optimization might help with partial page updates, when the client has to read a page first, add new data to it, and finally flush it back.

    OPEN is enqueued with a PW LAYOUT lock on the object
    the MDT returns the partial page if it fits into the EA buffer
    the client updates the partial page and flushes its pages

OPEN with O_TRUNC

dom_open_trunc

This optimization truncates the file during open if the O_TRUNC flag is set.

    OPEN is invoked with a PW LAYOUT lock
    the MDS checks that O_TRUNC is set and truncates the file
    the client gets a reply with updated attributes

Data Readahead logic

All readahead optimizations are based on the fact that it is possible to return some data from a small file in the EA reply buffer, which is quite big now.

dom_readahead

A typical case is stat():

    MDS_GETATTR is enqueued as usual
    the MDT returns the basic attributes and checks the amount of free space in the EA reply buffer
    the MDT reads data from the file into the EA buffer
    the client fills pages with the data from the EA buffer

Tests Specification

Sanity tests

A set of tests to make sure the feature works as expected.

test_1a - file with DOM layout

    create a file with lfs setstripe -L, check that its LOV EA has the LOV_PATTERN_DOM pattern and that there are no stripes on OSTs
    write to the file, check that the data is written and valid, that the file attributes are valid, and that the filesystem free blocks decreased according to the file size
    delete the file, check that the free blocks are returned

test_1b - DOM layout as default directory layout

    set the new layout as the default on a directory with lfs setstripe -L and create a file in it
    repeat all steps from test_1a

test_2a - check size limit and migration

    create a file with DOM layout
    write more than max_stripe_size to the file
        Phase I: check that the -ENOSPC error is returned
        Phase II: check that migration occurred
    (Phase II only below)
    check that the file layout changed to the filesystem default and has stripes on OSTs
    check that the file size is valid and the filesystem free-blocks count is valid as well

Sanityn tests

test_1{a,...} - parallel access to small files

    open a DOM file from client1 in different modes and pause it
    perform various operations on the file from client2
    make sure client2 waits for client1 to finish first

Recovery tests

    fail the MDS while performing operations like create/open/write/destroy on DOM files
    make sure the operations are replayed and finish as expected

Functional tests

Check all the cases we optimize.

Stat

    create files with DOM layout vs. a 2-stripe layout
    fill them with data
    perform ls -l on them
    output results for both cases

Open with O_TRUNC

    create files with DOM and default layouts
    fill them with data
    perform open with O_TRUNC
    output results

Write at the end of file with partial page

    create many small files and fill them with data so the last page is filled partially
    perform a write to the end of each file
    output results for DOM files vs. normal files

Readahead of small files

    create small files with DOM and normal layouts
    fill them with data
    perform grep on them
    output results

Generic performance tests

mdsrate

Regular MD operations should benefit from Data-on-MDT because there are no OST requests. In practice mainly stat should benefit noticeably, because open already uses precreated objects and destroy is not blocked by OST object destruction.

Use mdsrate to perform the following operations:

    file creation with a single stripe vs. DOM
    stat
    unlink
    output results for both cases

files with data

The mdsrate tool must be modified to create DoM files containing some amount of data, to test DoM file performance in the common case.

FIO

Check generic IO performance of small files

   run FIO with DoM files vs normal files
   output results

postmark

The Postmark utility is designed to emulate applications such as software development, email, newsgroup servers and Web applications. It is an industry-standard benchmark for small file and metadata-intensive workloads.

    run Postmark over a DOM-striped directory and a default one
    output results