Architecture - ZFS TinyZAP


Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.

Definitions

ZAP 
ZFS Attribute Processor, a hashed name=value lookup table that can be used to do efficient and scalable attribute storage
MicroZAP 
a form of ZAP used by ZFS that only allows each value to be a single __u64
TinyZAP 
a compact ZAP format that allows arbitrary values to be stored
FatZAP 
a form of ZAP used by ZFS that allows arbitrary name/value pairs but (as the name implies) consumes a lot of space

Use Cases

id | quality attribute | summary
fast_ea | performance | TinyZAP must be flexible enough to allow storage in arbitrary-sized containers, such as a large dnode or an ancillary dbuf, without wasting too much space per attribute
mdt_fid | performance | ZAP must allow FID storage for MDT directories in a manner that is compatible with ZFS directories
compatible-zfs | usability | integration should be done at the DMU level so that any ZAP user (including ZFS) can use TinyZAPs to store name/value data

Efficient EA storage

Scenario: Efficient EA storage in large dnode
Business Goals: Fast access to Lustre EA values
Relevant QA's: Performance
details Stimulus: EA needs to be stored in large dnode
Stimulus source: Lustre OSD (MDT/OST) storing EA data to object
Environment: EA being stored on a specific dnode within a transaction
Artifact: EA is stored in the dnode in ZAP
Response: size of EA in ZAP
Response measure: overhead should not be more than approximately 128 bytes + 48 bytes/EA
Questions:
Issues: None.
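
The response measure above can be turned into a simple size check. The following is a minimal sketch, not part of the design: the function names and the idea of checking against a container size are assumptions for illustration, and only the 128-byte and 48-byte figures come from the scenario.

```c
#include <stddef.h>

/* Budget figures taken from the response measure above. */
#define TINYZAP_FIXED_OVERHEAD   128u  /* bytes per TinyZAP */
#define TINYZAP_PER_EA_OVERHEAD   48u  /* bytes per stored EA */

/* Hypothetical helper: upper bound on overhead for nr_eas attributes. */
static size_t tinyzap_overhead_budget(size_t nr_eas)
{
        return TINYZAP_FIXED_OVERHEAD + TINYZAP_PER_EA_OVERHEAD * nr_eas;
}

/*
 * Hypothetical helper: would the EA payload (names plus values) and the
 * budgeted overhead fit in a container of container_bytes?
 */
static int tinyzap_fits(size_t container_bytes, size_t nr_eas,
                        size_t payload_bytes)
{
        return tinyzap_overhead_budget(nr_eas) + payload_bytes
                <= container_bytes;
}
```

For example, four EAs are budgeted 128 + 4 * 48 = 320 bytes of overhead, so 700 bytes of payload would still fit in a 1024-byte container under this budget.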

MDT FIDs in directories

Scenario: Storage of MDT FIDs in directories
Business Goals: Able to store FID data along with name and (in some cases) local DMU objid in a single directory entry to avoid extra FID lookup overhead
Relevant QA's: Performance
details Stimulus: Storing MDT FID in a directory entry
Stimulus source: Lustre MDT creating new directory entry
Environment: Directory entry being created within a transaction with objid and FID
Artifact: FID is stored in directory entry after __u64 objid value
Response: size of EA in ZAP
Response measure: overhead should not be more than approximately 128 bytes + 48 bytes/EA
Questions:
Issues: None.

ZFS compatible directories

Scenario: ZFS reading TinyZAP directory with FID
Business Goals: Compatible with (possibly modified) ZFS code
Relevant QA's: Usability
details Stimulus: ZFS reading TinyZAP entry
Stimulus source: ZFS doing name lookup in directory
Environment: Lustre MDT mounted with ZFS for diagnostic reasons
Artifact: ZFS is able to read directory and find objid (if local)
Response: ZFS does not fail in object lookup
Response measure:
Questions: What should be stored in the objid in a CMD environment where the object is on a remote MDT?
Issues: There will be some small amount of code change needed in ZFS to only access the first __u64 of the value, because it currently only provides enough space to return a single __u64 of data. We will also need to handle the case for a remote MDT by either storing objid=0 or some other well-defined value.
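
The change described in the issue above amounts to reading only the leading 64 bits of a longer value. A minimal sketch, assuming the value is available as a byte buffer in host byte order; the function names and the ZAP_OBJID_REMOTE marker are hypothetical, with objid=0 being one of the options named above:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical marker for "object lives on a remote MDT" (objid=0 option). */
#define ZAP_OBJID_REMOTE 0

/*
 * Copy only the leading 64-bit objid out of a directory entry value and
 * ignore any trailing FID bytes. Returns 0 on success, or -1 if the value
 * is too short to hold an objid.
 */
static int dirent_value_objid(const void *value, size_t value_len,
                              uint64_t *objid)
{
        if (value_len < sizeof(*objid))
                return -1;
        memcpy(objid, value, sizeof(*objid)); /* host byte order assumed */
        return 0;
}

/* Nonzero if the entry refers to an object with no local objid. */
static int dirent_is_remote(uint64_t objid)
{
        return objid == ZAP_OBJID_REMOTE;
}
```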

Implementation constraints

TinyZAP needs to be flexible enough to store arbitrary name/value data, including both Lustre LOV EAs and MDT directories with extended FID data. A MicroZAP cannot be used because it only allows a single __u64 value to be stored with each entry. A FatZAP is wasteful because it requires a full block just for the header and a separate block for the leaf data.

A preferred implementation would use a structure similar to the existing zap_leaf_{phys_t,chunk} for the TinyZAP, since the leaf structure is reasonably compact and reusing it may avoid duplicating a large amount of almost-identical code in the ZAP.
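
To give a feel for how compact a leaf-chunk layout is, the sketch below estimates the space one entry would take if TinyZAP reused the FatZAP leaf chunk scheme. The 24-byte chunk size and 21 usable bytes per array chunk are recalled from the ZFS zap_leaf.h header and should be treated as assumptions to verify against the source. The helper names are hypothetical.

```c
#include <stddef.h>

/* Assumed to match ZFS zap_leaf.h; verify before relying on them. */
#define LEAF_CHUNKSIZE   24                    /* bytes per leaf chunk */
#define LEAF_ARRAY_BYTES (LEAF_CHUNKSIZE - 3)  /* payload per array chunk */

/* Number of array chunks needed to hold `bytes` of name or value data. */
static size_t leaf_array_chunks(size_t bytes)
{
        return (bytes + LEAF_ARRAY_BYTES - 1) / LEAF_ARRAY_BYTES;
}

/* One entry chunk plus the array chunks for the name and the value. */
static size_t leaf_chunks_for_entry(size_t name_bytes, size_t value_bytes)
{
        return 1 + leaf_array_chunks(name_bytes)
                 + leaf_array_chunks(value_bytes);
}

/* Total bytes consumed in the leaf by one entry. */
static size_t leaf_bytes_for_entry(size_t name_bytes, size_t value_bytes)
{
        return leaf_chunks_for_entry(name_bytes, value_bytes) * LEAF_CHUNKSIZE;
}
```

Under these assumptions, a 12-byte name with a 56-byte value needs 1 + 1 + 3 = 5 chunks, or 120 bytes.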

ZFS should be adapted (if necessary) to handle directories created with the TinyZAP layout, so that it can get the objid from the first __u64 and ignore the FID component of the directory entry.

The current ZAP implementation takes an object set and an object number as parameters, whereas TinyZAP will need to operate on a buffer that might be located in the dnode or in an external block. This might require some refactoring of the ZAP code.

This needs to handle endian swabbing issues correctly, as does all ZFS code.
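
As a reminder of what swabbing involves for the __u64 objid, here is a minimal sketch in portable C. ZFS itself has its own byte-swap macros; the function names here are hypothetical, and the `foreign` flag stands in for however the caller learns that the data was written with the opposite byte order.

```c
#include <stdint.h>

/* Reverse the byte order of a 64-bit value. */
static uint64_t swab64(uint64_t x)
{
        x = ((x & 0x00000000ffffffffULL) << 32) | (x >> 32);
        x = ((x & 0x0000ffff0000ffffULL) << 16) |
            ((x >> 16) & 0x0000ffff0000ffffULL);
        x = ((x & 0x00ff00ff00ff00ffULL) << 8) |
            ((x >> 8) & 0x00ff00ff00ff00ffULL);
        return x;
}

/*
 * Hypothetical helper: convert an on-disk objid to host order. `foreign`
 * is nonzero when the writer's byte order differs from the reader's.
 */
static uint64_t objid_to_native(uint64_t raw, int foreign)
{
        return foreign ? swab64(raw) : raw;
}
```

Any additional fields stored after the objid, such as the FID, would need the same treatment field by field, since each field is swapped at its own width.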

Questions and Issues

Should we "wrap" the FID data after the DMU object id in an MDT directory so that it is possible in the future to add other extra data in a directory without compromising compatibility? Something like:

    #define ZAP_LUSTRE_FID 0x110f1d0f1d0f1d10

    struct zap_dir_fid {
            __u64 zdf_magic;
            struct lu_fid zdf_fid; /* or other data as appropriate */
    };

    #define zdf_len (zdf_magic & 0xff)

This means we can skip (possibly unknown) extra directory info by skipping (zdf_len) bytes at a time looking for zdf_magic == ZAP_LUSTRE_FID.
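
The skipping scheme can be sketched as a small scanner over the directory entry value. This is only an illustration under stated assumptions: the function name is hypothetical, the value is assumed to be in host byte order, records are assumed to start right after the leading objid, and zdf_len is assumed to count the payload bytes that follow the magic (the low byte of ZAP_LUSTRE_FID is 0x10, which matches a 16-byte FID).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ZAP_LUSTRE_FID 0x110f1d0f1d0f1d10ULL

/*
 * Walk the wrapped records that follow the leading objid. Returns the
 * byte offset of the payload belonging to the record whose magic equals
 * `want`, or -1 if no such record exists or a record is truncated.
 */
static long zap_dir_find(const unsigned char *val, size_t len, uint64_t want)
{
        size_t off = sizeof(uint64_t);          /* skip the leading objid */

        while (off + sizeof(uint64_t) <= len) {
                uint64_t magic;
                size_t rec_len;

                memcpy(&magic, val + off, sizeof(magic));
                rec_len = (size_t)(magic & 0xff);  /* zdf_len */
                off += sizeof(magic);
                if (rec_len > len - off)
                        return -1;              /* truncated record */
                if (magic == want)
                        return (long)off;
                off += rec_len;                 /* skip unknown payload */
        }
        return -1;
}
```

Because every record advances the offset by at least the size of the magic, the loop terminates even when an unknown record declares a zero-length payload.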

References

ZFS large dnodes

http://www.opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf