Lustre File System

Operations Manual - Version 1.8

821-0035-12



Contents

Preface

Part I Lustre Architecture

1. Introduction to Lustre

1.1 Introducing the Lustre File System

1.1.1 Lustre Key Features

1.2 Lustre Components

1.2.1 Lustre Networking (LNET)

1.2.2 Management Server (MGS)

1.3 Lustre Systems

1.4 Files in the Lustre File System

1.4.1 Lustre File System and Striping

1.4.2 Lustre Storage

1.4.2.1 OSS Storage

1.4.2.2 MDS Storage

1.4.3 Lustre System Capacity

1.5 Lustre Configurations

1.6 Lustre Networking

1.7 Lustre Failover and Rolling Upgrades

2. Understanding Lustre Networking

2.1 Introduction to LNET

2.2 Supported Network Types

2.3 Designing Your Lustre Network

2.3.1 Identify All Lustre Networks

2.3.2 Identify Nodes to Route Between Networks

2.3.3 Identify Network Interfaces to Include/Exclude from LNET

2.3.4 Determine Cluster-wide Module Configuration

2.3.5 Determine Appropriate Mount Parameters for Clients

2.4 Configuring LNET

2.4.1 Module Parameters

2.4.1.1 Using Usocklnd

2.4.1.2 OFED InfiniBand Options

2.4.2 Module Parameters - Routing

2.4.2.1 LNET Routers

2.4.3 Downed Routers

2.5 Starting and Stopping LNET

2.5.1 Starting LNET

2.5.1.1 Starting Clients

2.5.2 Stopping LNET

Part II Lustre Administration

3. Installing Lustre

3.1 Preparing to Install Lustre

3.1.1 Supported Operating System, Platform and Interconnect

3.1.2 Required Lustre Software

3.1.3 Required Tools and Utilities

3.1.4 (Optional) High-Availability Software

3.1.5 Debugging Tools

3.1.6 Environmental Requirements

3.1.7 Memory Requirements

3.1.7.1 MDS Memory Requirements

3.1.7.2 OSS Memory Requirements

3.2 Installing Lustre from RPMs

3.3 Installing Lustre from Source Code

3.3.1 Patching the Kernel

3.3.1.1 Introducing the Quilt Utility

3.3.1.2 Get the Lustre Source and Unpatched Kernel

3.3.1.3 Patch the Kernel

3.3.2 Create and Install the Lustre Packages

3.3.3 Installing Lustre with a Third-Party Network Stack

4. Configuring Lustre

4.1 Configuring the Lustre File System

4.1.0.1 Simple Lustre Configuration Example

4.1.0.2 Module Setup

4.1.1 Scaling the Lustre File System

4.2 Additional Lustre Configuration

4.3 Basic Lustre Administration

4.3.1 Specifying the File System Name

4.3.2 Starting up Lustre

4.3.3 Mounting a Server

4.3.4 Unmounting a Server

4.3.5 Working with Inactive OSTs

4.3.6 Finding Nodes in the Lustre File System

4.3.7 Mounting a Server Without Lustre Service

4.3.8 Specifying Failout/Failover Mode for OSTs

4.3.9 Running Multiple Lustre File Systems

4.3.10 Setting and Retrieving Lustre Parameters

4.3.10.1 Setting Parameters with mkfs.lustre

4.3.10.2 Setting Parameters with tunefs.lustre

4.3.10.3 Setting Parameters with lctl

4.3.10.4 Reporting Current Parameter Values

4.3.11 Regenerating Lustre Configuration Logs

4.3.12 Changing a Server NID

4.3.13 Removing and Restoring OSTs

4.3.13.1 Removing an OST from the File System

4.3.13.2 Restoring an OST in the File System

4.3.14 Aborting Recovery

4.3.15 Determining Which Machine is Serving an OST

4.4 More Complex Configurations

4.4.1 Failover

4.5 Operational Scenarios

4.5.1 Changing the Address of a Failover Node

5. Service Tags

5.1 Introduction to Service Tags

5.2 Using Service Tags

5.2.1 Installing Service Tags

5.2.2 Discovering and Registering Lustre Components

5.2.3 Service Tag Registration Information

6. Configuring Lustre - Examples

6.1 Simple TCP Network

6.1.1 Lustre with Combined MGS/MDT

6.1.1.1 Installation Summary

6.1.1.2 Configuration Generation and Application

6.1.2 Lustre with Separate MGS and MDT

6.1.2.1 Installation Summary

6.1.2.2 Configuration Generation and Application

6.1.2.3 Configuring Lustre with a CSV File

7. More Complicated Configurations

7.1 Multihomed Servers

7.1.1 Modprobe.conf

7.1.2 Start Servers

7.1.3 Start Clients

7.2 Elan to TCP Routing

7.2.1 Modprobe.conf

7.2.2 Start Servers

7.2.3 Start Clients

7.3 Load Balancing with InfiniBand

7.3.1 Setting Up modprobe.conf for Load Balancing

7.4 Multi-Rail Configurations with LNET

8. Failover

8.1 What is Failover?

8.1.1 Failover Capabilities

8.1.2 Types of Failover Configurations

8.2 Failover Functionality in Lustre

8.2.1 MDT Failover Configuration (Active/Passive)

8.2.2 OST Failover Configuration (Active/Active)

8.2.3 Lustre Failover and MMP

8.2.3.1 Working with MMP

8.3 Configuring and Using Heartbeat with Lustre Failover

8.3.1 Creating a Failover Environment

8.3.1.1 Power Management Software

8.3.1.2 Power Equipment

8.3.2 Setting up the Heartbeat Software

8.3.2.1 Installing Heartbeat

8.3.2.2 Configuring Heartbeat

8.3.2.3 (Optional) Migrating a Heartbeat Configuration (v1 to v2)

8.3.3 Working with Heartbeat

8.3.3.1 Starting Heartbeat

8.3.3.2 Switching Resources Between Nodes

9. Configuring Quotas

9.1 Working with Quotas

9.1.1 Enabling Disk Quotas

9.1.1.1 Administrative and Operational Quotas

9.1.2 Creating Quota Files and Quota Administration

9.1.3 Quota Allocation

9.1.4 Known Issues with Quotas

9.1.4.1 Granted Cache and Quota Limits

9.1.4.2 Quota Limits

9.1.4.3 Quota File Formats

9.1.5 Lustre Quota Statistics

9.1.5.1 Interpreting Quota Statistics

10. RAID

10.1 Considerations for Backend Storage

10.1.1 Selecting Storage for the MDS or OSTs

10.1.2 Reliability Best Practices

10.1.3 Performance Tradeoffs

10.1.4 Formatting Options for RAID Devices

10.1.4.1 Creating an External Journal

10.1.5 Handling Degraded RAID Arrays

10.2 Insights into Disk Performance Measurement

10.3 Lustre Software RAID Support

10.3.0.1 Enabling Software RAID on Lustre

11. Kerberos

11.1 What is Kerberos?

11.2 Lustre Setup with Kerberos

11.2.1 Configuring Kerberos for Lustre

11.2.1.1 Kerberos Distributions Supported on Lustre

11.2.1.2 Preparing to Set Up Lustre with Kerberos

11.2.1.3 Configuring Lustre for Kerberos

11.2.1.4 Configuring Kerberos

11.2.1.5 Setting the Environment

11.2.1.6 Building Lustre

11.2.1.7 Running GSS Daemons

11.2.2 Types of Lustre-Kerberos Flavors

11.2.2.1 Basic Flavors

11.2.2.2 Security Flavor

11.2.2.3 Customized Flavor

11.2.2.4 Specifying Security Flavors

11.2.2.5 Mounting Clients

11.2.2.6 Rules, Syntax and Examples

11.2.2.7 Authenticating Normal Users

12. Network Interface Bonding

12.1 Network Bonding

12.2 Requirements

12.3 Using Lustre with Multiple NICs versus Bonding NICs

12.4 Bonding Module Parameters

12.5 Setting Up Bonding

12.5.1 Examples

12.6 Configuring Lustre with Bonding

12.6.1 Bonding References

13. Upgrading and Downgrading Lustre

13.1 Supported Upgrades

13.2 Lustre Interoperability

13.3 Upgrading Lustre 1.6.x to 1.8.x

13.3.1 Performing a Complete File System Upgrade

13.3.2 Performing a Rolling Upgrade

13.4 Upgrading Lustre 1.8.x to the Next Minor Version

13.5 Downgrading from Lustre 1.8.x to 1.6.x

13.5.1 Performing a Complete File System Downgrade

13.5.2 Performing a Rolling Downgrade

14. Lustre SNMP Module

14.1 Installing the Lustre SNMP Module

14.2 Building the Lustre SNMP Module

14.3 Using the Lustre SNMP Module

15. Backup and Restore

15.1 Backing up a File System

15.2 Backing up a Device (MDS or OST)

15.2.1 Backing Up the MDS

15.2.2 Backing Up an OST

15.3 Backing up Files

15.3.1 Backing up Extended Attributes

15.4 Restoring from a File-level Backup

15.5 Using LVM Snapshots with Lustre

15.5.1 Creating an LVM-based Backup File System

15.5.2 Backing up New/Changed Files to the Backup File System

15.5.3 Creating Snapshot Volumes

15.5.4 Restoring the File System From a Snapshot

15.5.5 Deleting Old Snapshots

15.5.6 Changing Snapshot Volume Size

16. POSIX

16.1 Introduction to POSIX

16.2 Installing POSIX

16.2.1 POSIX Installation Using a Quick Start Version

16.3 Building and Running a POSIX Compliance Test Suite on Lustre

16.3.1 Building the Test Suite from Scratch

16.3.2 Running the Test Suite Against Lustre

16.4 Isolating and Debugging Failures

17. Benchmarking

17.1 Bonnie++ Benchmark

17.2 IOR Benchmark

17.3 IOzone Benchmark

18. Lustre I/O Kit

18.1 Lustre I/O Kit Description and Prerequisites

18.1.1 Downloading an I/O Kit

18.1.2 Prerequisites to Using an I/O Kit

18.2 Running I/O Kit Tests

18.2.1 sgpdd_survey

18.2.1.1 Tuning sgpdd_survey

18.2.2 obdfilter_survey

18.2.2.1 Running obdfilter_survey Against a Local Disk

18.2.2.2 Running obdfilter_survey Against a Network

18.2.2.3 Running obdfilter_survey Against a Network Disk

18.2.2.4 Output Files

18.2.2.5 Script Output

18.2.2.6 Visualizing Results

18.2.3 ost_survey

18.2.4 stats-collect

18.3 PIOS Test Tool

18.3.1 Synopsis

18.3.2 PIOS I/O Modes

18.3.3 PIOS Parameters

18.3.4 PIOS Examples

18.4 LNET Self-Test

18.4.1 Basic Concepts of LNET Self-Test

18.4.1.1 Modules

18.4.1.2 Utilities

18.4.1.3 Session

18.4.1.4 Console

18.4.1.5 Group

18.4.1.6 Test

18.4.1.7 Batch

18.4.1.8 Sample Script

18.4.2 LNET Self-Test Commands

18.4.2.1 Session

18.4.2.2 Group

18.4.2.3 Batch and Test

18.4.2.4 Other Commands

19. Lustre Recovery

19.1 Recovery Overview

19.1.1 Client Failure

19.1.2 Client Eviction

19.1.3 MDS Failure (Failover)

19.1.4 OST Failure (Failover)

19.1.5 Network Partition

19.1.6 Failed Recovery

19.2 Metadata Replay

19.2.1 XID Numbers

19.2.2 Transaction Numbers

19.2.3 Replay and Resend

19.2.4 Client Replay List

19.2.5 Server Recovery

19.2.6 Request Replay

19.2.7 Gaps in the Replay Sequence

19.2.8 Lock Recovery

19.2.9 Request Resend

19.3 Reply Reconstruction

19.3.1 Required State

19.3.2 Reconstruction of Open Replies

19.4 Version-based Recovery

19.4.1 Delayed Recovery

19.4.2 Working with VBR

19.4.3 Tips for Using VBR

19.5 Recovering from Errors or Corruption on a Backing File System

19.6 Recovering from Corruption in the Lustre File System

19.6.1 Working with Orphaned Objects

Part III Lustre Tuning, Monitoring and Troubleshooting

20. Lustre Tuning

20.1 Module Options

20.1.1 OSS Service Thread Count

20.1.1.1 Optimizing the Number of Service Threads

20.1.2 MDS Service Thread Count

20.2 LNET Tunables

20.2.1 Transmit and Receive Buffer Size

20.2.2 irq_affinity

20.3 Options for Formatting the MDT and OSTs

20.3.1 Planning for Inodes

20.3.2 Sizing the MDT

20.4 Overriding Default Formatting Options

20.4.1 Number of Inodes for the MDS

20.4.2 Inode Size for the MDS

20.4.3 Number of Inodes for an OST

20.5 Large-Scale Tuning for Cray XT and Equivalents

20.5.1 Network Tunables

20.6 Lockless I/O Tunables

20.7 Data Checksums

21. LustreProc

21.1 Proc Entries for Lustre

21.1.1 Locating Lustre File Systems and Servers

21.1.2 Lustre Timeouts

21.1.3 Adaptive Timeouts

21.1.3.1 Configuring Adaptive Timeouts

21.1.3.2 Interpreting Adaptive Timeouts Information

21.1.4 LNET Information

21.1.5 Free Space Distribution

21.1.5.1 Managing Stripe Allocation

21.2 Lustre I/O Tunables

21.2.1 Client I/O RPC Stream Tunables

21.2.2 Watching the Client RPC Stream

21.2.3 Client Read-Write Offset Survey

21.2.4 Client Read-Write Extents Survey

21.2.5 Watching the OST Block I/O Stream

21.2.6 Using File Readahead and Directory Statahead

21.2.6.1 Tuning File Readahead

21.2.6.2 Tuning Directory Statahead

21.2.7 OSS Read Cache

21.2.7.1 Using OSS Read Cache

21.2.8 OSS Asynchronous Journal Commit

21.2.9 mballoc History

21.2.10 mballoc3 Tunables

21.2.11 Locking

21.2.12 Setting MDS and OSS Thread Counts

21.3 Debug Support

21.3.1 RPC Information for Other OBD Devices

21.3.1.1 Interpreting OST Statistics

21.3.1.2 llobdstat

21.3.1.3 Interpreting MDT Statistics

22. Lustre Monitoring

22.1 Lustre Monitoring Tool

22.2 Red Hat Cluster Manager

22.3 SNMP Monitoring

22.4 CollectL

22.5 Other Monitoring Options

23. Lustre Troubleshooting

23.1 Troubleshooting Lustre

23.1.1 Error Numbers

23.1.2 Error Messages

23.1.3 Lustre Logs

23.2 Reporting a Lustre Bug

23.3 Common Lustre Problems and Performance Tips

23.3.1 Recovering from an Unavailable OST

23.3.2 Write Performance Better Than Read Performance

23.3.3 OST Object is Missing or Damaged

23.3.4 OSTs Become Read-Only

23.3.5 Identifying a Missing OST

23.3.6 Improving Lustre Performance When Working with Small Files

23.3.7 Default Striping

23.3.8 Erasing a File System

23.3.9 How to Fix a Bad LAST_ID on an OST

23.3.10 Reclaiming Reserved Disk Space

23.3.11 Considerations in Connecting a SAN with Lustre

23.3.12 Handling/Debugging "Bind: Address already in use" Error

23.3.13 Replacing An Existing OST or MDS

23.3.14 Handling/Debugging Error "-28"

23.3.15 Triggering Watchdog for PID NNN

23.3.16 Handling Timeouts on Initial Lustre Setup

23.3.17 Handling/Debugging "LustreError: xxx went back in time"

23.3.18 Lustre Error: "Slow Start_Page_Write"

23.3.19 Drawbacks in Doing Multi-client O_APPEND Writes

23.3.20 Slowdown Occurs During Lustre Startup

23.3.21 Log Message "Out of Memory" on OST

23.3.22 Number of OSTs Needed for Sustained Throughput

23.3.23 Setting SCSI I/O Sizes

23.3.24 Identifying Which Lustre File an OST Object Belongs To

24. Lustre Debugging

24.1 Lustre Debug Messages

24.1.1 Format of Lustre Debug Messages

24.1.2 Lustre Debug Messages Buffer

24.2 Tools for Lustre Debugging

24.2.1 Debug Daemon Option to lctl

24.2.1.1 lctl Debug Daemon Commands

24.2.2 Controlling the Kernel Debug Log

24.2.3 The lctl Tool

24.2.4 Finding Memory Leaks

24.2.5 Printing to /var/log/messages

24.2.6 Tracing Lock Traffic

24.2.7 Sample lctl Run

24.2.8 Adding Debugging to the Lustre Source Code

24.3 Troubleshooting with strace

24.4 Looking at Disk Content

24.4.1 Determine the Lustre UUID of an OST

24.4.2 Tcpdump

24.5 Ptlrpc Request History

24.6 Using LWT Tracing

Part IV Lustre for Users

25. Striping and I/O Options

25.1 Lustre File Striping

25.1.1 Advantages of Striping

25.1.1.1 Bandwidth

25.1.2 Disadvantages of Striping

25.1.2.1 Increased Overhead

25.1.2.2 Increased Risk

25.1.3 Stripe Size

25.2 Setting and Retrieving Striping Information

25.2.1 Setting File Layouts

25.2.2 Changing Striping for a Subdirectory

25.2.3 Using a Specific Striping Pattern/File Layout for a Single File

25.2.4 Creating a File on a Specific OST

25.3 Managing Free Space

25.3.1 Checking File System Free Space

25.3.2 Using Stripe Allocations

25.3.3 Round-Robin Allocator

25.3.4 Weighted Allocator

25.3.5 Adjusting the Weighting Between Free Space and Location

25.4 Handling Full OSTs

25.4.1 Checking File System Usage

25.4.2 Taking a Full OST Offline

25.4.3 Migrating Data within a File System

25.5 Creating and Managing OST Pools

25.5.1 Working with OST Pools

25.5.1.1 Using the lfs Command with OST Pools

25.5.2 Tips for Using OST Pools

25.6 Performing Direct I/O

25.6.1 Making File System Objects Immutable

25.7 Other I/O Options

25.7.1 Lustre Checksums

25.7.1.1 Changing Checksum Algorithms

25.8 Striping Using llapi

26. Lustre Security

26.1 Using ACLs

26.1.1 How ACLs Work

26.1.2 Using ACLs with Lustre

26.1.3 Examples

26.2 Using Root Squash

26.2.1 Configuring Root Squash

26.2.2 Enabling and Tuning Root Squash

26.2.3 Syntax Error Handling

27. Lustre Operating Tips

27.1 Adding an OST to a Lustre File System

27.2 A Simple Data Migration Script

27.3 Adding Multiple SCSI LUNs on Single HBA

27.4 Failures Running a Client and OST on the Same Machine

27.5 Improving Lustre Metadata Performance While Using Large Directories

Part V Reference

28. User Utilities (man1)

28.1 lfs

28.2 lfs_migrate

28.3 lfsck

28.4 Filefrag

28.5 Mount

28.6 Handling Timeouts

29. Lustre Programming Interfaces (man2)

29.1 User/Group Cache Upcall

29.1.1 Name

29.1.2 Description

29.1.2.1 Primary and Secondary Groups

29.1.3 Parameters

29.1.4 Data Structures

30. Setting Lustre Properties (man3)

30.1 Using llapi

30.1.1 llapi_file_create

30.1.2 llapi_file_get_stripe

30.1.3 llapi_file_open

30.1.4 llapi_quotactl

30.1.5 llapi_path2fid

31. Configuration Files and Module Parameters (man5)

31.1 Introduction

31.2 Module Options

31.2.1 LNET Options

31.2.1.1 Network Topology

31.2.1.2 networks ("tcp")

31.2.1.3 routes ("")

31.2.1.4 forwarding ("")

31.2.2 SOCKLND Kernel TCP/IP LND

31.2.3 QSW LND

31.2.4 RapidArray LND

31.2.5 VIB LND

31.2.6 OpenIB LND

31.2.7 Portals LND (Linux)

31.2.8 Portals LND (Catamount)

31.2.9 MX LND

32. System Configuration Utilities (man8)

32.1 mkfs.lustre

32.2 tunefs.lustre

32.3 lctl

32.4 mount.lustre

32.5 Additional System Configuration Utilities

32.5.1 lustre_rmmod.sh

32.5.2 e2scan

32.5.3 Utilities to Manage Large Clusters

32.5.4 Application Profiling Utilities

32.5.5 More /proc Statistics for Application Profiling

32.5.6 Testing / Debugging Utilities

32.5.7 Flock Feature

32.5.7.1 Example

32.5.8 l_getgroups

32.5.9 llobdstat

32.5.10 llstat

32.5.11 lst

32.5.12 plot-llstat

32.5.13 routerstat

32.5.14 ll_recover_lost_found_objs

33. System Limits

33.1 Maximum Stripe Count

33.2 Maximum Stripe Size

33.3 Minimum Stripe Size

33.4 Maximum Number of OSTs and MDTs

33.5 Maximum Number of Clients

33.6 Maximum Size of a File System

33.7 Maximum File Size

33.8 Maximum Number of Files or Subdirectories in a Single Directory

33.9 MDS Space Consumption

33.10 Maximum Length of a Filename and Pathname

33.11 Maximum Number of Open Files for Lustre File Systems

33.12 OSS RAM Size

Glossary

Index