
The Lustre File System

Eric Barton
Lead Engineer, Lustre Group
Sun Microsystems

Topics

Lustre Today
> What is Lustre
> Deployments
> Community

Lustre Development
> Industry Trends
> Scalability Improvements

Lustre File System


The world's fastest, most scalable file system

Parallel shared POSIX file system

Scalable
> High performance
> Petabytes of storage
> Tens of thousands of clients

Coherent
> Single namespace
> Strict concurrency control

Heterogeneous networking
High availability
GPL open source
Multi-platform, multi-vendor

Lustre File System

Major components
[Diagram: many clients connect over the network to the MGS (configuration), the MDSs (namespace) and the OSSs (data).]

Lustre Networking

Simple
> Message queues
> RDMA
  - Active - Get/Put
  - Passive - Attach
> Asynchronous events
> Error handling - Unlink

Layered
> LNET / LND
> Multiple networks
> Routers

RPC
> Queued requests
> RDMA bulk
> RDMA reply

Recovery
> Resend
> Replay
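
The resend/replay split above is the heart of this recovery scheme: a request the server never saw is simply resent, while requests the server executed but had not committed before restarting are replayed from the client's own log. Below is a minimal, hypothetical Python sketch of that idea; the class and field names are illustrative and this is not Lustre code.

```python
import itertools

class Server:
    """Executes requests in memory; makes them durable only at sync()."""
    def __init__(self):
        self.memory = {}          # xid -> result, executed but not yet on disk
        self.disk = {}            # xid -> result, durably committed
        self.last_committed = 0   # highest xid known to be on disk

    def handle(self, xid, op):
        self.memory[xid] = f"done:{op}"
        return self.last_committed            # piggy-backed on every reply

    def sync(self):
        self.disk.update(self.memory)         # journal flush
        if self.memory:
            self.last_committed = max(self.last_committed, max(self.memory))
        self.memory = {}

    def restart(self):
        self.memory = {}                      # crash: uncommitted state is lost


class Client:
    """Keeps every request until the server reports it committed to disk."""
    def __init__(self, server):
        self.server = server
        self.xids = itertools.count(1)
        self.uncommitted = {}                 # replayed if the server restarts

    def send(self, op):
        # A request that never got a reply would be resent with the same xid.
        xid = next(self.xids)
        self.uncommitted[xid] = op
        last_committed = self.server.handle(xid, op)
        # Forget requests the server has made durable; they never need replay.
        self.uncommitted = {x: o for x, o in self.uncommitted.items()
                            if x > last_committed}

    def recover(self):
        for xid in sorted(self.uncommitted):  # replay in original (xid) order
            self.server.handle(xid, self.uncommitted[xid])


srv = Server()
cli = Client(srv)
cli.send("mkdir /a")
srv.sync()                  # "mkdir /a" is now durable on the server
cli.send("create /a/f")     # executed by the server, but not yet committed
srv.restart()               # crash: the create is lost from server memory
cli.recover()               # client replays "create /a/f" from its own log
print(sorted(srv.memory))   # [2] - the replayed request, awaiting the next sync
```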

A Lustre Cluster

[Diagram: Lustre clients (10s - 10,000s) reach the metadata servers (MDS 1 active, MDS 2 standby) and I/O servers (OSS 1-7) over multiple networks - TCP/IP, QSNet, Myrinet, InfiniBand, iWARP, Cray SeaStar - joined by LNET routers. Commodity storage servers and enterprise-class storage arrays & SAN fabrics back the servers; shared storage enables OSS failover.]

Lustre Today

Lustre is the leading HPC file system
> 7 of the Top 10
> Over 40% of the Top 100

Demonstrated scalability and performance
> 190 GB/sec I/O
> 26,000 clients
> Many systems with over 1,000 nodes

Livermore Blue Gene/L SCF
> 131,072 processor cores
> 3.5 PB storage; 52 GB/s I/O throughput

TACC Ranger
> 62,976 processor cores
> 1.73 PB storage; 40 GB/s I/O throughput

Sandia Red Storm
> 12,960 multi-core compute sockets
> 340 TB storage; 50 GB/s I/O throughput

ORNL Jaguar
> 265,708 processor cores
> 10.5 PB storage; 240 GB/s I/O throughput goal

Center-wide File System

Spider will provide a shared, parallel file system for all systems
> Based on the Lustre file system
> Demonstrated bandwidth of over 190 GB/s
> Over 10 PB of RAID-6 capacity - 13,440 1 TB SATA drives
> 192 storage servers with 3 TB of memory

Available from all systems via our high-performance, scalable I/O network
> Over 3,000 InfiniBand ports
> Over 3 miles of cables
> Scales as storage grows

Undergoing system checkout with deployment expected in summer 2009

Future LCF Infrastructure

[Diagram: the SION scalable I/O network (192x, 48x and 192x links) connects the XT5, the XT4, login nodes, the Everest Powerwall, a remote visualization cluster, an end-to-end cluster, application development systems and a 25 PB data archive to the Spider file system.]

Lustre Success - Media

Customer challenges
> Eliminate data storage bottlenecks resulting from scalability issues NFS can't handle
> Increase system performance and reliability

Lustre value
> Doubled data storage at three times less cost than competing solutions
> The ability to provide a single file system namespace to its production artists
> Easy-to-install open source software with great flexibility on storage and server hardware

"While we were working on The Golden Compass, we faced the most intensive I/O requirements of any project to date. Lustre played a vital role in helping us to deliver this project."
- Daire Byrne, senior systems integrator, Framestore

Lustre Success - Telecommunications

NBC broadcast the 2008 Summer Olympics live online over the Level 3 network using Lustre

Customer challenges
> Provide scalable service
> Ensure continuous availability
> Control costs

Lustre value
> The ability to scale easily
> Works well with commodity equipment from multiple vendors
> High performance and stability

"With Lustre, we can achieve that balancing act of maintaining a reliable network with less costly equipment. It allows us to replace servers and expand the network quickly and easily."
- Kenneth Brookman, Level 3 Communications

Lustre Success - Energy

Customer challenges
> Process huge and growing volumes of data
> Keep hardware costs manageable
> Scale existing cluster easily

Lustre value
> Ability to handle exponential growth in data
> Capability to scale computer clusters easily
> Reduced hardware costs
> Reduced maintenance costs

Open Source Community


Lustre OEM Partners

Open Source Community

Resources

Web - http://www.lustre.org
> News and information
> Operations Manual
> Detailed technical documentation

Mailing lists
> lustre-discuss@lists.lustre.org - general/operational issues
> lustre-devel@lists.lustre.org - architecture and features

Bugzilla - https://bugzilla.lustre.org
> Defect tracking and patch database

CVS repository
> Lustre Internals training material

HPC Trends

Processor performance / RAM growing faster than I/O
> Relative number of I/O devices must grow to compensate

Storage component reliability not increasing with capacity
> Failure is not an option - it's guaranteed

Trend to shared file systems
> Multiple compute clusters
> Direct access from specialized systems

Storage scalability critical

DARPA HPCS

Capacity
> 1 trillion files per file system
> 10 billion files per directory
> 100 PB system capacity
> 1 PB single file size
> 100,000 open files

Performance
> 40,000 file creates/sec
  - Single client node
> 30 GB/sec streaming data
  - Single client node
> 240 GB/sec aggregate I/O
  - >30k client nodes
  - File per process
  - Shared file

Reliability
> End-to-end data integrity
> No performance impact during RAID rebuild

Lustre and the Future

Continued focus on extreme HPC

Capacity
> Exabytes of storage
> Trillions of files
> Many client clusters, each with 100,000s of clients

Performance
> TBs/sec of aggregate I/O
> 100,000s of aggregate metadata ops/sec

Community-driven tools and interfaces
> Management and performance analysis

HPC Center of the Future

[Diagram: a capability system (500,000 nodes), three capacity systems (250,000, 150,000 and 50,000 nodes), a test system (25,000 nodes), two visualization clusters and WAN access all share a 10 TB/sec storage network, backed by a Lustre storage cluster (user data - 1000 MDTs; metadata - 25 MDSs) and an HPSS archive.]

Lustre Scalability

Definition
> Performance / capacity grows nearly linearly with hardware
> Component failure does not have a disproportionate impact on availability

Requirements
> Scalable I/O & MD performance
> Expanded component size/count limits
> Increased robustness to component failure
> Overhead grows sub-linearly with system size
> Timely failure detection & recovery

Lustre Scaling

Architectural Improvements
Clustered Metadata (CMD)

10s - 100s of metadata servers

Distributed inodes
> Files local to parent directory entry / subdirs may be non-local

Distributed directories (see the sketch after this list)
> Hashing
> Striping

Distributed Operation Resilience/Recovery
> Uncommon HPC workload
  - Cross-directory rename
> Short term
  - Sequenced cross-MDS ops
> Longer term
  - Transactional - ACID
  - Non-blocking - deeper pipelines
  - Hard - cascading aborts, synch ops
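
A minimal, hypothetical sketch of the directory hashing idea above: each entry name hashes to one stripe of the directory, so every client agrees on which metadata server owns it without any coordination. The hash function and stripe count here are illustrative, not what Lustre's CMD actually uses.

```python
import zlib

def mds_for_entry(name: str, stripe_count: int) -> int:
    """Pick the directory stripe (and hence the MDS) that owns this entry."""
    return zlib.crc32(name.encode()) % stripe_count

# Every client computes the same placement, so creates and lookups for different
# names in one huge directory spread across metadata servers with no coordination.
for name in ("checkpoint.0001", "checkpoint.0002", "results.dat"):
    print(f"{name} -> MDS stripe {mds_for_entry(name, stripe_count=4)}")
```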

Epochs

[Diagram: clients send operations to servers; updates are tagged with epochs. Each client and server tracks its local oldest volatile epoch, and a reduction network computes the globally known oldest volatile epoch. Epochs older than that boundary are stable (committed); newer epochs are uncommitted and may need redo.]
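
The reduction pictured above can be stated in a few lines. This is a hypothetical sketch of the rule only, not the implementation: each node contributes its local oldest volatile epoch, the network reduces them with a minimum, and everything strictly older than the result is globally stable.

```python
def globally_known_oldest_volatile(local_oldest_volatile):
    """Reduce per-node 'oldest volatile epoch' values to the global boundary."""
    return min(local_oldest_volatile.values())

# Illustrative values: what each node still considers volatile (not yet stable).
nodes = {"server1": 16, "server2": 19, "server3": 15, "client1": 17, "client2": 15}
boundary = globally_known_oldest_volatile(nodes)
print(f"epochs < {boundary} are committed everywhere and can leave the redo logs")
print(f"epochs >= {boundary} are uncommitted and may need redo after a failure")
```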

Architectural Improvements
Fault Detection Today

RPC timeout
> Timeouts must scale O(n) to distinguish death from congestion

Pinger
> No aggregation across clients or servers
> O(n) ping overhead

Routed networks
> Router failure can be confused with end-to-end peer failure

Fully automatic failover scales with the slowest time constant
> Many 10s of minutes on large clusters
> Failover could be much faster if useless waiting were eliminated

Architectural Improvements
Scalable Health Network

Burden of monitoring clients distributed, not replicated
> ORNL: 35,000 clients, 192 OSSs, 7 OSTs/OSS

Fault-tolerant status reduction/broadcast network
> Servers and LNET routers

LNET high-priority small message support
> Health network stays responsive

Prompt, reliable detection
> Time constants in seconds
> Failed servers, clients and routers
> Recovering servers and routers

Interface with existing RAS infrastructure
> Receive and deliver status notifications
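
A minimal, hypothetical sketch of the status reduction idea: servers and routers each summarise the liveness of the nodes beneath them, so no single monitor tracks all 35,000 clients and the summary that reaches the top stays small. The topology and field names are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Monitor:
    """A server or LNET router that watches the nodes directly below it."""
    name: str
    children: list = field(default_factory=list)  # Monitors or (node_name, alive) pairs

    def report(self):
        """Reduce liveness below this monitor to a small summary."""
        summary = {"alive": 0, "dead": []}
        for child in self.children:
            if isinstance(child, Monitor):
                sub = child.report()
                summary["alive"] += sub["alive"]
                summary["dead"] += sub["dead"]
            else:
                node, alive = child
                if alive:
                    summary["alive"] += 1
                else:
                    summary["dead"].append(node)
        return summary

# Two OSS-level monitors watch a few clients each; the primary monitor only
# ever sees two small summaries, however many clients sit underneath.
oss1 = Monitor("oss1", [("client0", True), ("client1", False)])
oss2 = Monitor("oss2", [("client2", True), ("client3", True)])
primary = Monitor("primary", [oss1, oss2])
print(primary.report())   # {'alive': 3, 'dead': ['client1']}
```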

Health Monitoring Network

[Diagram: clients report to a primary health monitor, with a failover health monitor standing by.]

Architectural Improvements
Metadata Writeback Cache

Avoids unnecessary server communications
> Operations logged/cached locally
> Performance of a local file system when uncontended

Aggregated distributed operations
> Server updates batched and transferred using bulk protocols (RDMA)
> Reduced network and service overhead

Sub-tree locking
> Lock aggregation - a single lock protects a whole subtree
> Reduces lock traffic and server load
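
A minimal, hypothetical sketch of the writeback idea: metadata operations are appended to a local log instead of each becoming an RPC, then shipped to the server in one batch (bulk RDMA in the real protocol). Class and function names are illustrative only.

```python
class WritebackMDC:
    """Log metadata operations locally; ship them to the server in batches."""
    def __init__(self, server_apply_batch):
        self.log = []                        # cached locally, not yet on the MDS
        self.server_apply_batch = server_apply_batch

    def mkdir(self, path):
        self.log.append(("mkdir", path))     # no RPC here

    def create(self, path):
        self.log.append(("create", path))    # no RPC here either

    def flush(self):
        # One bulk transfer replaces len(log) individual RPCs.
        batch, self.log = self.log, []
        self.server_apply_batch(batch)

applied = []                                 # stands in for the metadata server
mdc = WritebackMDC(applied.extend)
for i in range(1000):
    mdc.create(f"/job/output.{i:04d}")
mdc.flush()
print(len(applied), "operations shipped in a single batch")
```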

Architectural Improvements
Current - Flat Communications Model

Stateful client/server connection required for coherence and performance
> Every client connects to every server
> O(n) lock conflict resolution

Future - Hierarchical Communications Model

Aggregate connections, locking, I/O, metadata ops
> Caching clients
  - Aggregate local processes (cores)
  - I/O forwarders scale another 32x or more
> Caching proxies
  - Aggregate whole clusters
  - Implicit broadcast - scalable conflict resolution
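
The scaling benefit of the hierarchy can be seen with a little arithmetic; the sketch below is purely illustrative and assumes each forwarder or proxy aggregates a fixed number of clients.

```python
import math

def connections_per_server(clients, aggregation=1):
    """Connections one server maintains when every `aggregation` clients share
    one forwarder/proxy connection (aggregation=1 is today's flat model)."""
    return math.ceil(clients / aggregation)

print(connections_per_server(100_000))                  # flat model: 100000
print(connections_per_server(100_000, aggregation=32))  # 32x I/O forwarding: 3125
```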

Hierarchical Communications

[Diagram: user processes issue system calls to WBC clients or I/O forwarding clients; I/O forwarding servers and proxy servers (themselves WBC clients) aggregate them - potentially across a WAN / security domain - and speak the Lustre protocol to the storage cluster of MDSs and OSSs.]

ZFS

End-to-end data integrity
> Checksums in block pointers
> Ditto blocks
> Transactional mirroring/RAID

Remove ldiskfs size limits
> Immense capacity (128-bit)
> No limits on files, dirents, etc.

COW
> Transactional
> Snapshots
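
A minimal, hypothetical sketch of the block-pointer checksum idea credited to ZFS above: the parent pointer stores a checksum of the child block, so anything corrupted between memory and media is caught on read (and could then be repaired from a ditto block or mirror). This illustrates the concept only; it is not the ZFS on-disk format.

```python
import hashlib

def write_block(data: bytes) -> dict:
    """Return the parent's pointer to a newly written block, checksum included."""
    return {"data": bytearray(data), "checksum": hashlib.sha256(data).hexdigest()}

def read_block(ptr: dict) -> bytes:
    """Read through the pointer, verifying the checksum stored above the block."""
    data = bytes(ptr["data"])
    if hashlib.sha256(data).hexdigest() != ptr["checksum"]:
        raise IOError("checksum mismatch - corrupt block; try a ditto/mirror copy")
    return data

ptr = write_block(b"file data as seen by the application")
ptr["data"][0] ^= 0xFF          # silent corruption somewhere below the file system
try:
    read_block(ptr)
except IOError as err:
    print(err)
```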

Performance Improvements
SMP Scaling

Improve MDS performance / small message handling
> CPU affinity
> Finer-granularity locking

[Charts: RPC throughput versus total client processes for varying numbers of client nodes; request queue depth over time per server, illustrating load (im)balance.]

Network Request Scheduler

Much larger working set than a disk elevator
> Higher-level information - client, object, offset, job/rank

Prototype
> Initial development on a simulator
> Scheduling strategies - quanta, offset, fairness, etc.
> Testing at ORNL pending

Future
> Exchange global information - gang scheduling
> QoS - real time / bandwidth reservation (min/max)
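
A minimal, hypothetical sketch of one NRS-style policy: group incoming I/O RPCs by object, serve each group in offset order, and round-robin between groups with a small quantum so no client is starved. The policy and data layout are invented for illustration and are not the Lustre NRS interface.

```python
from collections import defaultdict

def schedule(requests, quantum=2):
    """requests: (client, object_id, offset) tuples. Return a dispatch order that
    serves each object's requests in offset order, `quantum` at a time."""
    by_object = defaultdict(list)
    for req in requests:
        by_object[req[1]].append(req)
    for reqs in by_object.values():
        reqs.sort(key=lambda r: r[2])              # offset order within one object
    order = []
    while any(by_object.values()):
        for obj in list(by_object):                # round-robin across objects
            burst, by_object[obj] = by_object[obj][:quantum], by_object[obj][quantum:]
            order.extend(burst)
    return order

reqs = [("c1", "objA", 8), ("c2", "objB", 0), ("c1", "objA", 0), ("c3", "objA", 4)]
for r in schedule(reqs):
    print(r)
```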

Metadata Protocol Improvements

Size on MDT (SOM)
> Avoid multiple RPCs for attributes derived from OSTs
> OSTs remain definitive while the file is open
> Compute on close and cache on the MDT

Readdir+
> Aggregation
  - Directory I/O
  - Getattrs
  - Locking
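
A minimal, hypothetical sketch of the SOM rule above: while any process holds the file open the OSTs stay authoritative and must be queried, but once the file is closed the size recorded on the MDT at close time answers a stat with a single RPC. Field names are illustrative.

```python
def stat_size(mdt_record, ost_stripe_sizes):
    """Return (size, source) following the SOM rule sketched above."""
    if mdt_record["open_count"] == 0 and mdt_record["som_size"] is not None:
        return mdt_record["som_size"], "MDT cache - one MDS RPC"
    return sum(ost_stripe_sizes), "queried every OST holding a stripe"

open_file   = {"open_count": 2, "som_size": None}           # still being written
closed_file = {"open_count": 0, "som_size": 3 * (1 << 20)}  # size cached at close
print(stat_size(open_file,   [1 << 20, 1 << 20, 1 << 20]))
print(stat_size(closed_file, []))
```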

Lustre Scalability

Attribute               Today                   Future
Number of clients       10,000s                 1,000,000s
                        Flat comms model        Hierarchical comms model
Server capacity         ext3 - 8 TB             ZFS - petabytes
Metadata performance    Single MDS              CMD, SMP scaling
Recovery time           RPC timeout - O(n)      Health network - O(log n)

THANK YOU
Eric Barton
eeb@sun.com
lustre-discuss@lists.lustre.org
lustre-devel@lists.lustre.org