
Rethinking Ceph Architecture for Disaggregation using NVMe-over-Fabrics


Yi Zou (Research Scientist), Arun Raghunath (Research Scientist)
Tushar Gohad (Principal Engineer)
Intel Corporation
2018 Storage Developer Conference. © Intel Corporation. All Rights Reserved.
Contents
 Ceph and Disaggregated Storage Refresher
 Ceph on Disaggregation – Problem statement
 Replication Flow
 Data Center Tax
 Current Approach
 Our Approach - Decoupling data and control plane
 Architecture Change Detail
 Analytical Results
 Preliminary Evaluation
 Summary



Ceph Refresher
 Open-source, object-based scale-out storage system
 Software-defined, hardware-agnostic – runs on commodity hardware
 Object, Block and File support in a unified storage cluster
 Highly durable, available – replication, erasure coding
 Replicates and re-balances dynamically



Disaggregation Refresher
 Software Defined Storage (SDS): scale-out approach for storage guarantees
 Disaggregates software from hardware
 Numerous SDS offerings and deployments
 Disaggregation: separate servers into resource components (e.g. storage and compute blades)
 Resource flexibility and utilization – TCO benefit
 Provides deployment flexibility – pure disaggregation, hyper-converged, hybrid
 Feasible now for SSDs due to advances in fabric technologies

Trend observed in both academia and industry: "extreme resource modularity" (Gao, USENIX OSDI '16); Open Compute Project; Intel Rack Scale Design (RSD); HP Moonshot; Facebook disaggregated racks; AMD SeaMicro



Ceph and NVMe-oF disaggregation options
 Rationale:
   Share storage tier across multiple SDS options
   Scale compute and storage sleds independently
   Opens new optimization opportunities

 Approaches:
   Host-based NVMe-oF storage backend
     NVMe-oF volume replication in different failure domains
     Not using Ceph for durability guarantees
   Stock Ceph with NVMe-oF storage backend
     OSD-directed replication
   Decouple Ceph control and data flows



Ceph Replication Flow
• SDS reliability guarantees → data copies (replication / erasure coding)
• SDS durability guarantees → long-running tasks to scrub and repair data
• We focus on replication flows in the rest of the talk

(1) Client writes to the primary OSD
(2) Primary identifies the secondary and tertiary OSDs via the CRUSH map
(3) Primary writes to the secondary and tertiary OSDs
(4) Secondary OSD acks the write to the primary
(5) Tertiary OSD acks the write to the primary
(6) When the writes are settled, the primary OSD acks to the client

(Diagram: write/ack flow between the client, primary OSD, secondary OSD, and tertiary OSD)

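To make the relay pattern concrete, here is a minimal, self-contained Python sketch of this primary-directed flow. It models OSDs as in-memory objects and network sends as plain function calls, so it is an illustration of steps (1)-(6) rather than Ceph code.

```python
# Minimal, self-contained sketch of primary-directed replication (steps 1-6).
# Not Ceph code: OSDs are modeled as in-memory dicts and "network sends" are function calls.

class OSD:
    def __init__(self, name):
        self.name, self.store = name, {}

    def write(self, obj_name, data):            # replica-side write handler
        self.store[obj_name] = data
        return "ack"                             # (4)/(5) ack back to the primary

def client_write(obj_name, data, primary, replicas):
    primary.write(obj_name, data)                # (1) client writes to the primary OSD
    # (2) replicas would come from the CRUSH map; here they are passed in directly
    acks = [r.write(obj_name, data) for r in replicas]   # (3) primary relays the full object
    assert all(a == "ack" for a in acks)         # (4)(5) wait for secondary and tertiary acks
    return "ack"                                 # (6) primary acks the client

osds = [OSD("primary"), OSD("secondary"), OSD("tertiary")]
print(client_write("obj1", b"payload", osds[0], osds[1:]))   # -> "ack"
```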


Stock Ceph disaggregation: Datacenter ‘tax’
"Relayed" data placement → extra data transfers → latency cost and bandwidth cost

 Ceph deployments today
   Common to provision a separate cluster network for internal traffic
   Network cost compounded as capacity scales up

 Disaggregating OSD storage
   Exacerbates the data movement problem
   Magnifies performance interference

(Diagram: the client writes data to the primary OSD; the primary relays Replica 1 and Replica 2 to the secondary and tertiary OSDs over the cluster network (IP); each OSD then writes its copy to a remote target data server over the fabric network (NVMe-oF))

OSD: Object Storage Daemon

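As a back-of-the-envelope illustration of the 'tax' (an estimate, not a measurement from the deck), the snippet below counts the full-object transfers that a single client write triggers in this relayed, disaggregated layout.

```python
# Back-of-the-envelope count of full-object network transfers per client write in the
# stock disaggregated layout (illustration of the slide's point, not a measurement).
def stock_transfers(r):
    client_to_primary  = 1
    primary_to_replicas = r - 1        # relayed over the cluster network (IP)
    osd_to_target      = r             # each OSD writes its copy over NVMe-oF
    return client_to_primary + primary_to_replicas + osd_to_target

print(stock_transfers(3))              # 6: the object crosses a network six times
```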


A Different Approach - Decoupling Data and Control Plane
1. Direct data copy to the storage target
   • Issue: need the final landing destination
   • Current consistent hashing maps Object → OSD
   • Maintain a map of the storage target assigned to each OSD
   • Consult the map to find the storage target for each OSD

2. Block ownership
   • Issue: currently the OSD host file system owns the blocks
   • Move block ownership to the remote target (details on the next slide)

3. Control plane
   • Issue: metadata is tightly coupled with data
   • Send only metadata to the replica OSDs (eliminates N-1 data copies!)
   • Unique ID to correlate metadata with data

(Diagram: data from the client flows with its unique ID directly to Storage Target1/2/3, each running block management; the primary and replica OSDs consult the Object→OSD and Storage Target→OSD maps, and the replica OSDs receive metadata + ID only)

Typical 3-way replication: 4 hops total here vs. 6 hops in stock Ceph, end to end from client to target!
* OSD: Object Storage Daemon
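A minimal sketch of the decoupled write path under the assumptions above: the primary pushes the payload straight to each OSD's storage target and hands the replica OSDs only metadata plus the correlating unique ID. Targets, replica OSDs, and the OSD-to-target map are modeled as plain in-memory objects; this illustrates the idea, not the actual implementation.

```python
# Minimal sketch of the decoupled flow: data goes straight to the storage targets,
# replica OSDs receive metadata + unique ID only. Illustrative, not Ceph code.
import uuid

class Target:                                   # storage target with local block management
    def __init__(self): self.blocks = {}
    def put(self, uid, data): self.blocks[uid] = data

class ReplicaOSD:                               # control plane only
    def __init__(self): self.meta = {}
    def put_meta(self, uid, obj_name, size): self.meta[uid] = (obj_name, size)

def primary_write(obj_name, data, osd_to_target, replica_osds):
    uid = str(uuid.uuid4())                     # unique ID correlates metadata with data
    for target in osd_to_target.values():       # 1. direct data copy to every OSD's target
        target.put(uid, data)
    for osd in replica_osds:                    # 3. control plane: metadata only, no payload
        osd.put_meta(uid, obj_name, len(data))
    return uid

targets  = {"osd.0": Target(), "osd.1": Target(), "osd.2": Target()}
replicas = [ReplicaOSD(), ReplicaOSD()]
print(primary_write("obj1", b"payload", targets, replicas))
```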



Stock Ceph Architecture – Control and Data Plane

Ceph OSD Host
• Control plane for the object mapping service: PlacementGroup, ObjectStore
• Data plane toward the storage target: BlueStore, BlockDevice, NVMe-oF initiator

Client
• Storage Service API: Object Storage API (RGW), Block Storage API (RBD)

Storage Target
• Data block management: fabric targets fronting the SSDs

Inefficient, as block allocation occurs in the OSD host, which is remote from the actual storage device

(Diagram: client → Ceph OSD host (PlacementGroup → ObjectStore → BlueStore → BlockDevice → initiator) → fabric → storage target (targets, SSDs))



Architecture Change – Remote Block Management

Block Ownership Mapping Table:

  Data OID        Blocks
  <unique-id-1>   Disk1:1-128,200-300
  <unique-id-2>   Disk3:1000-2000

Ceph OSD Host (control plane only)
• Control plane for the object mapping service: PlacementGroup, ObjectStore

Storage Target (data plane only)
• Data plane for data block management: Block Mgmt Service, BlueStore, BlockDevice, SSDs

Client
• Storage Service API: Object Storage API (RGW), Block Storage API (RBD), over the NVMe-oF initiator/target fabric

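A minimal sketch of how the storage target's block management service could hold this ownership table, assuming it maps each data OID (the unique ID) to the extents that back it; the disk names, block ranges, and method names are illustrative, not the real on-target data structures.

```python
# Illustrative block ownership table kept by the target's block management service.
# Disk names, ranges, and this API are assumptions for the sketch, not real Ceph structures.

class BlockOwnershipTable:
    def __init__(self):
        self.extents = {}                         # data OID -> list of (disk, first, last)

    def record(self, oid, extents):
        self.extents[oid] = list(extents)         # called after the target allocates blocks

    def lookup(self, oid):
        return self.extents.get(oid, [])          # resolve an OID to its on-disk extents

    def release(self, oid):
        return self.extents.pop(oid, [])          # free the extents on delete/overwrite

table = BlockOwnershipTable()
table.record("unique-id-1", [("Disk1", 1, 128), ("Disk1", 200, 300)])
table.record("unique-id-2", [("Disk3", 1000, 2000)])
print(table.lookup("unique-id-1"))
```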


Bandwidth benefits: Remote Block Management

Stock Ceph with NVMe-oF: the primary OSD sends control + data to the Replica1 and Replica2 OSDs over the cluster network, and each OSD then sends the data to its own target (Target1, Target2, Target3) over the fabric.

Ceph optimized for NVMe-oF: the primary OSD sends the data directly to Target1, Target2 and Target3, and control only to the replica OSDs.

Estimated reduction in bandwidth consumption:

  Reduction (bytes) = (r – 1) × (Md – Mm)

  40% reduction for 3-way replication!

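Reading r as the replication factor, Md as the size of a data-carrying message, and Mm as the size of a metadata-only message (our reading of the formula, not stated in the extracted text), a quick check reproduces the 40% figure; the object and metadata sizes below are arbitrary.

```python
# Worked example for: Reduction (bytes) = (r - 1) x (Md - Mm)
# r, Md, Mm interpreted as replication factor, data message size, metadata message size.
r  = 3                         # 3-way replication
Md = 4 * 1024 * 1024           # illustrative 4 MiB object write
Mm = 4 * 1024                  # illustrative ~4 KiB metadata-only message

reduction = (r - 1) * (Md - Mm)       # bytes no longer relayed through the replica OSDs
stock     = (2 * r - 1) * Md          # stock traffic per write: (r-1) relays + r fabric writes
print(reduction, reduction / stock)   # ~8.4 MB saved, ~0.40 -> "40% for 3-way replication"
```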


Latency benefits: Decouple control and data flows

Stock Ceph with NVMe-oF: the client writes new data to the primary OSD; the primary forwards control + data to the Rep1 and Rep2 OSDs; each OSD then writes the data to its target; acks travel back along the same chain before the client receives its Ok.

Ceph optimized for NVMe-oF: the client writes new data to the primary OSD; the primary sends the data to all three targets concurrently and control only to the replica OSDs; acks return and the client receives its Ok.

Estimated latency reduction:

  Reduction (usec) = Nd + m + Na

  1.5X latency improvement!

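One way to see where Nd + m + Na comes from is a toy latency model (an assumption, not the authors' derivation): take Nd as a per-hop data transfer time, m as a per-node processing step, and Na as a per-hop ack time. Removing the replica OSD from the data path drops one of each from the critical path; with the hypothetical numbers below this works out to roughly the quoted 1.5X.

```python
# Toy latency model (hypothetical numbers, usec): Nd = per-hop data transfer,
# m = per-node processing, Na = per-hop ack, plus a device write time.
Nd, m, Na, write = 50.0, 20.0, 10.0, 30.0

# Stock: client -> primary OSD -> replica OSD -> target lies on the critical path.
stock = Nd + m + Nd + m + Nd + write + Na + Na + Na

# Optimized: the primary sends data to the target directly; replica OSDs see control only.
optimized = Nd + m + Nd + write + Na + Na

print(stock - optimized)   # = Nd + m + Na, matching the reduction formula above
print(stock / optimized)   # ~1.47x with these numbers, i.e. roughly the quoted 1.5X
```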


PoC Setup
 Ceph Luminous
 2-way replication
 Ceph ObjectStore as SPDK NVMe-oF initiator
 SPDK RDMA transport
 SPDK NVMe-oF target
 SPDK bdev maps requests to the remote Ceph BlueStore
 Linux Soft RoCE (rxe)

(Diagram: a client issuing rados put, Ceph mon.a, and Ceph osd.0/osd.1 on the Ceph public and cluster networks; each OSD connects over NVMe-oF to an SPDK NVMe-oF target backed by Ceph BlueStore)

Metric: Ceph cluster network rx/tx bytes

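A rough sketch of how the cluster-network rx/tx metric could be captured around each rados put (an assumption about the methodology, not the authors' script): sample the NIC counters from /proc/net/dev before and after each write and take the delta. The interface, pool, and object names are placeholders.

```python
# Rough measurement sketch (placeholder interface/pool/file names): sample the
# cluster-network NIC counters before and after each `rados put`, then take the delta.
import subprocess

IFACE, POOL = "eth1", "testpool"     # cluster-network NIC and pool (placeholders)

def nic_bytes(iface):
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[0]), int(fields[8])   # rx_bytes, tx_bytes
    raise ValueError(f"interface {iface} not found")

for i in range(10):                                      # 10 iterations, as on the results slide
    rx0, tx0 = nic_bytes(IFACE)
    subprocess.run(["rados", "-p", POOL, "put", f"obj-{i}", "/tmp/payload"], check=True)
    rx1, tx1 = nic_bytes(IFACE)
    print(f"iteration {i}: rx={rx1 - rx0} tx={tx1 - tx0}")
```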


Preliminary Results*
(Chart: Ceph network overhead reduction)

 Test: rados put
 10 iterations
 Measure Ceph network rx/tx bytes
 Derive the reduction in bandwidth consumption

* Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Intel is a trademark of Intel Corporation in the U.S. and/or other countries. Other names and brands may be claimed as the property of others. See Trademarks on intel.com for the full list of Intel trademarks or the Trademarks & Brands Names Database.



Summary & Next Steps
Summary
 Eliminate datacenter 'tax'
 Decouple control/data flows
 Remote block management
 Preserve Ceph SDS value proposition
 Reduce TCO for Ceph on disaggregated storage
 Bring NVMe-oF value to Ceph users

Future work
 Validate the new architecture with the Ceph community
 Integrate storage target information with the CRUSH map
 Evaluate performance at scale
 Mechanism to survive OSD node failures
 Explore additional offloads for replication



Q&A

