
Leveraging Flash Translation Layer for Application Acceleration

Ashish Batwara Fusion-io

Flash Memory Summit 2012 Santa Clara, CA

Traditional Storage Stack

User space: Application
Kernel space: Filesystem; Block Device Driver (LBA)
Hardware: Device
The LBA view is enforced by storage protocols (SCSI/SATA, etc.)

Flash is Different From Disk

Area | Hard Disk Drives | Flash Devices
Logical to physical blocks | Nearly 1:1 mapping | Remapped at every write
Read/write performance | Largely symmetrical | Heavily asymmetrical
Sequential vs. random performance | An order of magnitude difference | Minimal difference
Background operations | Rarely impact foreground | Regular occurrence; if unmanaged, can impact foreground
Wear out | Largely unlimited writes | Limited writes
IOPS | 100s to 1,000s | 100Ks to millions
Latency | 10s of ms | 10s to 100s of us
TRIM | Does not benefit | Improves performance

Flash Translation Layer 101

Input: logical block address (LBA)
Flash Translation Layer
Output: commands to NAND flash
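To make the remapping concrete, here is a minimal sketch (not any vendor's implementation) of the core idea: a logical-to-physical table that is updated on every write, so rewriting the same LBA lands on a fresh flash page.

```c
#include <stdint.h>

#define NUM_LBAS (1u << 20)             /* logical blocks exposed to the host */
#define INVALID  UINT32_MAX

static uint32_t l2p[NUM_LBAS];          /* logical-to-physical mapping table  */
static uint32_t next_free_page;         /* trivial log-structured allocator   */

/* A write never updates the old flash page in place: the FTL programs a
 * fresh page and remaps the table entry; the old page becomes garbage to
 * be reclaimed later by garbage collection.                                */
static uint32_t ftl_write(uint32_t lba)
{
    uint32_t ppa = next_free_page++;    /* a real FTL also wear-levels here  */
    /* nand_program(ppa, data); */      /* media write omitted in the sketch */
    l2p[lba] = ppa;
    return ppa;
}

static uint32_t ftl_read(uint32_t lba)
{
    return l2p[lba];                    /* INVALID: never written or trimmed */
}

int main(void)
{
    for (uint32_t i = 0; i < NUM_LBAS; i++)
        l2p[i] = INVALID;

    ftl_write(42);                      /* first write of LBA 42 -> page 0   */
    ftl_write(42);                      /* rewrite of LBA 42     -> page 1   */
    return ftl_read(42) == 1 ? 0 : 1;   /* the table now points at page 1    */
}
```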

Flash in Traditional Storage Stack

User space: Application
Kernel space: Filesystem; Block Device Driver (LBA)
Hardware: Flash Translation Layer (LBA → PBA); Device

Virtual Storage Layer

Fusion-io's host-based FTL: the Virtual Storage Layer (VSL)
Host: CPU and cores; DRAM holds operating system and application memory plus the ioMemory virtualization tables
ioDrive: ioMemory and its data-path controller, reached over PCIe (commands and data transfers), organized as banks across wide channels

Cut-through architecture avoids traditional storage protocols
Scales with multi-core
Provides a large virtual address space
HW/SW functional boundary defined as optimal for flash
Traditional block access methods for compatibility
New access methods, functionality, and primitives natively supported by flash
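Because the VSL keeps traditional block access methods for compatibility, existing applications can reach it with ordinary POSIX direct I/O. A minimal sketch follows; the device path /dev/fioa is assumed for illustration and should be a scratch device.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* O_DIRECT bypasses the page cache; buffers must be sector-aligned.   */
    int fd = open("/dev/fioa", O_RDWR | O_DIRECT);  /* device name assumed */
    if (fd < 0) return 1;

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
    memset(buf, 0xA5, 4096);

    /* Plain block reads/writes: no change needed to existing applications. */
    if (pwrite(fd, buf, 4096, 0) != 4096) return 1;
    if (pread(fd, buf, 4096, 0) != 4096) return 1;

    free(buf);
    close(fd);
    return 0;
}
```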

Fast Forward

Host-based FTLs integrate and scale with applications; examples include:
File systems
Caching
Databases

The power of the FTL is no longer restricted by traditional block interfaces
Opportunity for performance, simplicity, and reliability improvements

Flash Memory Evolution

Traditional SSDs
Flash as a drive: Application → File System → OS block I/O → Block Layer → SAS/SATA, network, or RAID controller (local or remote) → flash layer, accessed via read/write
Flash as a cache: Application → File System → OS block I/O → Block Layer → directCache → VSL, accessed via read/write

Native Access
Flash with direct I/O semantics: Application (plus open-source extensions) → direct-access I/O API family → directFS (native file system service) → VSL, accessed via read/write
Flash with memory semantics: Application (plus open-source extensions) → memory-access API family → VSL, accessed via load/store

Direct-access I/O API family

The same evolution picture, with the direct I/O primitives exposed by the VSL called out:
Sparse addressing
Atomic multi-block operations (write, PTRIM)
Exists and Range Exists
Conditional Write
Read Range
direct Key-Value Store
NVM optimized with transactional semantics

directCache, directFS (the native file system service), and the direct Key-Value Store sit on the direct-access I/O API family, alongside the memory-semantics API family.

Sparse Addressing


Excess work by conventional cache

Conventional block cache:
Application → cache, which maps HDD block → flash LBA and carries its own metadata, persistence, and recovery logic
Cache hit → flash device, whose FTL translates LBA → PBA
Cache miss → backend store (block device)

Two translations, plus additional metadata and logic in the cache layer
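A hedged sketch of the duplicated work: the cache keeps its own HDD-block-to-flash-LBA map (which it must also persist and recover), and the SSD's FTL then translates that LBA a second time. All structures below are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_ENTRIES (1u << 16)

/* Translation #1: the cache's own map from backend (HDD) block to flash LBA,
 * which the cache must also persist, log, and recover on its own.           */
struct cache_entry { uint64_t hdd_block; uint32_t flash_lba; bool valid; };
static struct cache_entry cache_map[CACHE_ENTRIES];

static bool cache_lookup(uint64_t hdd_block, uint32_t *flash_lba)
{
    struct cache_entry *e = &cache_map[hdd_block % CACHE_ENTRIES];
    if (e->valid && e->hdd_block == hdd_block) {
        *flash_lba = e->flash_lba;      /* hit: now hand this LBA to the SSD */
        return true;
    }
    return false;                       /* miss: go to the backend store     */
}

/* Translation #2 happens inside the SSD: its FTL remaps flash_lba -> PBA,
 * duplicating mapping metadata, allocation, and recovery logic that the
 * cache layer above has already paid for.                                   */

int main(void)
{
    uint32_t lba;
    return cache_lookup(123456, &lba) ? 1 : 0;  /* empty cache: miss, exit 0 */
}
```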

Sparse address mapping

Backend-store blocks are mapped directly into the VSL's sparse address space.

VSL based cache

VSL-based cache:
Application → VSL-based cache with minimal, fixed metadata; cache hits leverage the VSL primitives
The VSL's sparse mapping translates HDD LBA → PBA directly
Cache miss → backend store (block device)

Fewer translations, and minimal additional metadata and logic
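For contrast, a sketch of a cache built on a sparse address space: the backend (HDD) LBA is used directly as the flash virtual address, and presence is answered by the FTL itself. The vsl_* and backend_read calls are hypothetical stand-ins, stubbed with an in-memory table (fixed 512 B blocks) so the sketch runs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical VSL primitives, stubbed so the sketch compiles; a real cache
 * would call into the VSL's sparse address space instead.                   */
#define SLOTS 1024
static struct { uint64_t vaddr; char data[512]; bool used; } store[SLOTS];

static bool vsl_exists(uint64_t vaddr)
{
    return store[vaddr % SLOTS].used && store[vaddr % SLOTS].vaddr == vaddr;
}
static int vsl_read(uint64_t vaddr, void *buf, size_t len)
{
    memcpy(buf, store[vaddr % SLOTS].data, len); return 0;
}
static int vsl_write(uint64_t vaddr, const void *buf, size_t len)
{
    store[vaddr % SLOTS].vaddr = vaddr;
    store[vaddr % SLOTS].used  = true;
    memcpy(store[vaddr % SLOTS].data, buf, len); return 0;
}
static int backend_read(uint64_t hdd_lba, void *buf, size_t len)
{
    (void)hdd_lba; memset(buf, 0, len); return 0;   /* pretend HDD read      */
}

/* The HDD LBA *is* the sparse flash virtual address: the VSL keeps the only
 * map (virtual address -> PBA), so the cache needs no translation table of
 * its own and no separate persistence or recovery logic.                    */
int cached_read(uint64_t hdd_lba, void *buf, size_t len)
{
    if (vsl_exists(hdd_lba))
        return vsl_read(hdd_lba, buf, len);         /* cache hit             */
    int rc = backend_read(hdd_lba, buf, len);       /* cache miss            */
    if (rc == 0)
        vsl_write(hdd_lba, buf, len);               /* populate the cache    */
    return rc;
}

int main(void)
{
    char buf[512];
    cached_read(123456, buf, sizeof buf);           /* miss, fills the cache */
    return cached_read(123456, buf, sizeof buf);    /* hit                   */
}
```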

Direct I/O - Atomic Operations

Traditional atomicity (with hard disks): applications and the DBMS provide atomicity through a transaction log; the file system adds metadata journaling or copy-on-write; the block I/O layer issues sector reads/writes to the disk drive.
Traditional atomicity (with SSDs): the same DBMS transaction log and file-system journaling/copy-on-write sit above the block I/O layer, while the SSD's flash translation layer separately performs re-mapping, wear-leveling, block erase, and page reads/writes on NAND flash.
Atomicity in ioMemory: a generalized ioMemory layer provides atomic operations alongside re-mapping, wear-leveling, and page reads/writes, implemented in the ioMemory controller over NAND flash.

Transactional Block Interface

Application issues a call to the transactional block interface:
Write all blocks atomically

A vector of buffers (iov[0] … iov[4]) is written onto ranges (Range[0], Range[1], … Range[n]) of the Virtual Storage Layer in a single atomic operation.

Transactional Block Interface

Application issues a call to the transactional block interface:
Write all blocks atomically
Trim all blocks atomically

The same ranges (Range[0] … Range[n]) in the Virtual Storage Layer are trimmed (marked X) as a single atomic operation.

Transactional Block Interface

Application issues a call to the transactional block interface:
Write all blocks atomically
Trim all blocks atomically
Write and trim atomically

Writes (iov[0] … iov[4] onto Range[0] … Range[n]) and trims (Range[0], Range[1], Range[m], Range[n]) are combined into transaction envelopes handled by the Virtual Storage Layer.
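From the application's side, such a call could look like the sketch below: a vector of write and trim ranges handed to the VSL as one transaction envelope and committed all-or-nothing. The vsl_iov structure and vsl_atomic_batch() entry point are illustrative, not the actual SDK signatures; a trivial stub stands in so the sketch runs.

```c
#include <stddef.h>
#include <stdint.h>

/* One element of a transaction envelope: either write this buffer at the
 * given virtual address, or trim (deallocate) the range.                  */
enum op_type { OP_WRITE, OP_TRIM };

struct vsl_iov {
    enum op_type  type;
    uint64_t      vaddr;   /* virtual block address in the sparse space    */
    const void   *buf;     /* data to write; NULL for a trim               */
    size_t        len;     /* length in bytes                              */
};

/* Hypothetical entry point: on success every element becomes visible; on
 * any failure (including power loss mid-way) none of them do.  Stubbed.   */
static int vsl_atomic_batch(int fd, const struct vsl_iov *iov, int count)
{
    (void)fd; (void)iov; (void)count;
    return 0;
}

/* Example: persist a new copy of a record and invalidate its old location
 * in one atomic step, without a write-ahead log or double write above.    */
static int update_record(int fd, uint64_t new_addr, uint64_t old_addr,
                         const void *rec, size_t len)
{
    struct vsl_iov ops[2] = {
        { OP_WRITE, new_addr, rec,  len },
        { OP_TRIM,  old_addr, NULL, len },
    };
    return vsl_atomic_batch(fd, ops, 2);
}

int main(void)
{
    char rec[512] = "new version of the record";
    return update_record(/*fd=*/-1, 4096, 0, rec, sizeof rec);
}
```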

Sysbench Performance With Atomic-Write

MySQL extension for Atomic-Write:
43% increase in transactions/sec
2x increase in endurance

Test setup: Xeon X5472 @ 3.00 GHz; 16 GB DDR3 DRAM (4x4 GB DIMMs); Fedora 14, Linux kernel 2.6.35; Sysbench config: 1 million inserts into 8 tables of 2 million entries each, using 16 threads

Native raw performance comparison

Significantly more functionality with NO additional performance impact

Test setup: 1U HP blade server with 16 GB RAM, 8 CPU cores (Intel Xeon X5472 @ 3.00 GHz), and a single 1.2 TB ioDrive2 Mono

Direct I/O Primitives - Persistent TRIM and Exists

Persistent TRIM (on a virtual address)
Has all the positive properties of TRIM:
Improves wear leveling
Improves write performance
However, it is well defined with respect to failures:
Deterministic return of zeros on read
Survives power failures (transactional)

EXISTS (on a virtual address)
Queries the existence of a particular element
Enables sparse stores with well-defined presence semantics
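A minimal usage sketch, with placeholder names (vsl_ptrim, vsl_exists) rather than the real SDK calls, showing how an application might maintain a sparse store with these two primitives; trivial stubs stand in so the sketch runs.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder names for the two primitives; the real calls belong to the
 * device SDK.                                                             */
static int vsl_ptrim(int fd, uint64_t vaddr, uint64_t nblocks)
{                                     /* transactional: survives power loss */
    (void)fd; (void)vaddr; (void)nblocks;
    return 0;
}

static bool vsl_exists(int fd, uint64_t vaddr)
{                                     /* presence query for one element     */
    (void)fd; (void)vaddr;
    return false;
}

int main(void)
{
    int fd = -1;                      /* would be the opened VSL device      */
    uint64_t slot = 1000;             /* a virtual address in a sparse store */

    if (vsl_exists(fd, slot))         /* well-defined presence semantics     */
        printf("slot %llu holds data\n", (unsigned long long)slot);

    vsl_ptrim(fd, slot, 1);           /* deallocate; reads now return zeros  */
    return 0;
}
```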

Conventional Key-Value Store

Conventional KV store:
Application → KV store, which maintains key → block mapping (overhead per key), block allocation, metadata, persistence mechanisms, logging, and recovery logic
Block read/write → VSL, which already provides dynamic provisioning, block allocation, persistence mechanisms, logging, and recovery

VSL based Direct Key-Value Store

directKey-Value store:
Application → Key-Value API and library: fixed (zero per-key) metadata, leverages the VSL; atomic write/delete and coordinated garbage collection
→ VSL: dynamic provisioning, block allocation, persistence mechanisms, logging, and recovery

directKey-Value Store API Overview

KV store administration:
kv_create(), kv_destroy()
kv_open(), kv_close()
kv_pool_create(), kv_pool_delete()
kv_get_store_info(), kv_get_pool_info(), kv_get_key_info()
kv_register_notification_handler()

KV store data operations:
kv_put(), kv_batch_put()
kv_get(), kv_batch_get()
kv_delete(), kv_batch_delete(), kv_delete_all()
kv_begin(), kv_next(), kv_get_current()
kv_exists()
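A short usage sketch built from the call names above; the signatures, types, and device path are assumptions for illustration, with trivial stubs so it compiles and runs.

```c
#include <stdint.h>
#include <string.h>

/* Call names come from the API overview; signatures and types are assumed
 * and will differ from the real SDK.  Stub bodies keep the sketch runnable. */
typedef int kv_store_t;
typedef int kv_pool_t;

static kv_store_t kv_open(const char *device) { (void)device; return 1; }
static kv_pool_t  kv_pool_create(kv_store_t s, const char *name)
{ (void)s; (void)name; return 1; }
static int kv_put(kv_store_t s, kv_pool_t p, const void *key, uint32_t klen,
                  const void *val, uint32_t vlen)       /* atomic put        */
{ (void)s; (void)p; (void)key; (void)klen; (void)val; (void)vlen; return 0; }
static int kv_exists(kv_store_t s, kv_pool_t p, const void *key, uint32_t klen)
{ (void)s; (void)p; (void)key; (void)klen; return 0; }
static int kv_get(kv_store_t s, kv_pool_t p, const void *key, uint32_t klen,
                  void *val, uint32_t vlen)
{ (void)s; (void)p; (void)key; (void)klen; (void)val; (void)vlen; return 0; }
static int kv_close(kv_store_t s) { (void)s; return 0; }

int main(void)
{
    /* No key->block map or allocator in the application: keys land directly
     * in the VSL's sparse address space underneath.                         */
    kv_store_t store = kv_open("/dev/fioa");        /* device name assumed   */
    kv_pool_t  pool  = kv_pool_create(store, "users");

    const char *key = "user:42";
    const char *val = "{\"name\":\"ada\"}";
    kv_put(store, pool, key, (uint32_t)strlen(key), val, (uint32_t)strlen(val));

    char buf[64] = {0};
    if (kv_exists(store, pool, key, (uint32_t)strlen(key)))
        kv_get(store, pool, key, (uint32_t)strlen(key), buf, sizeof buf);

    return kv_close(store);
}
```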

directKey-Value Store Sample Performance

Sample performance - GET and PUT: GETs/s and PUTs/s plotted against the number of threads (up to ~128) for value sizes of 512 B, 4 KB, 16 KB, and 64 KB.

Significantly more functionality with NO additional performance impact

Performance relative to ioDrive: OPS/s plotted against the number of threads, comparing 512 B key GET with 1 KB FIO read, and 512 B key PUT with 1 KB FIO write.

Test setup: 1U HP blade server with 16 GB RAM, 8 CPU cores (Intel Xeon X5472 @ 3.00 GHz), and a single 1.2 TB ioDrive2 Mono

Advantages of Native Flash Access

1. Helps accelerate applications
2. Eliminates redundant functionality
3. Leverages FTL mapping and sparse addressing
4. Optimizes garbage collection
5. Delivers transactional properties
6. Provides direct I/O as well as memory semantics

Open Source Enablement and Standardization

MySQL InnoDB extension (GPLv2)
Standardization of primitives in T10 — current standards proposals:
SBC-4 / SPC-5 Atomic Write: http://www.t10.org/cgi-bin/ac.pl?t=d&f=11-229r5.pdf
SBC-4 / SPC-5 Scattered writes, optionally atomic: http://www.t10.org/cgi-bin/ac.pl?t=d&f=12-086r3.pdf
SBC-4 / SPC-5 Gathered reads, optionally atomic: http://www.t10.org/cgi-bin/ac.pl?t=d&f=12-087r3.pdf

Thank you!

Ashish Batwara, Fusion-io, abatwara@fusionio.com


Extended Memory Non-Persistence Path

Fusion-io has developed a subsystem technology called Extended Memory, which enables developers to take advantage of NAND flash memory as an extension of DRAM. The idea is to keep frequently accessed data pages in DRAM while rarely accessed pages are moved from DRAM to flash, indirectly increasing the capacity effectively available as memory. Fusion-io says the technology, created in collaboration with Princeton University researchers, allows software developers to simply assume that their entire data set is kept in memory all the time, since NAND is a much more cost-effective memory solution and can reach much greater capacities than DRAM. "The Fusion ioMemory architecture is uniquely suited to innovation like the Extended Memory subsystem," said Chris Mason, Fusion-io director of kernel engineering and principal author of the Btrfs file system for Linux, in a prepared statement. "Since Fusion ioMemory has moved beyond legacy disk-era protocols, we can integrate new features like the Extended Memory subsystem to truly advance application performance for enterprise computing in ways that are simply not possible with traditional SSDs." Developers can access the Extended Memory feature via Fusion-io's developer community.

http://www.tomshardware.com/news/dram-memory-flash-nand-fusion-io,16254.html?utm_source=dlvr.it&utm_medium=twitter#xtor=RSS-181
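A purely conceptual sketch of the programming model described above: the application allocates a region that may exceed DRAM and touches it with ordinary loads and stores. The xm_alloc name is hypothetical and is stubbed with malloc here; the real subsystem would keep hot pages in DRAM and demand-page cold pages to and from flash underneath.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical allocator standing in for the Extended Memory subsystem.
 * The stub simply uses malloc so the sketch runs.                        */
static void *xm_alloc(size_t bytes) { return malloc(bytes); }

int main(void)
{
    size_t n = 1u << 20;                 /* stand-in for a DRAM-exceeding set */
    uint64_t *table = xm_alloc(n * sizeof *table);
    if (!table) return 1;

    for (size_t i = 0; i < n; i++)       /* ordinary stores: no read()/write() */
        table[i] = (uint64_t)i * i;      /* calls to move data to "storage"    */

    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)       /* ordinary loads                     */
        sum += table[i];

    free(table);
    return sum ? 0 : 1;
}
```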

Auto-Commit Memory: Cutting Latency by Eliminating Block I/O

Our recent demonstration of one billion IOPS showcased a new paradigm for storing data through Fusion-io Auto-Commit Memory (ACM). ACM isn't just about making NAND flash storage devices go faster, although it does that too. It's about introducing a much simpler and faster way for an application to guarantee data persistence.

When Simplicity Meets Speed

For decades, the industry norm for persisting data has been the same: an application manipulates data in memory, and when ready to persist the data, packages it (sometimes called a transaction) for storage. At this point, the application asks the operating system to route the transaction through the kernel block I/O layer. The kernel block layer was built to work with traditional disks. In order to minimize the effect of slow rotational-disk seek times, application I/O is packaged into blocks with sizes matching hard disk sector sizes and sequenced for delivery to the disk drive. As Linux block maintainer (and Fusion-io chief architect) Jens Axboe points out, most real-world I/O patterns are dominated by small, random I/O requests, but are force-fit into 4K block I/Os sequenced by the block layer to match the characteristics of rotating disks. Note the number of steps in this pathway: each one contributes to latency. Even more steps are introduced when the block storage device is at the other end of a network, behind various bus adapters and controllers. As long as memory is volatile, this type of I/O pathway will be the norm.

But what if an application could designate a certain area within its memory space as persistent, and know that data in this area would maintain its state across system reboots? This application would no longer have the burden of following the multi-step block I/O pathway to persist that data. It would no longer need small transactions to be packaged into 4K blocks for storage. It would just place data meant for persistence in this designated memory area, and then continue using it through normal memory-access semantics. If the application or system experienced a failure, the restarted application would find its data persisted in non-volatile memory exactly where it was left.

To illustrate: how much faster could real-world databases go if the in-memory tail of their transaction logs had guaranteed persistence without waiting on synchronous block I/O? How much faster could real-world key-value stores (typical in NoSQL databases) go if their indices could be updated in non-volatile memory without blocking on kernel I/O? That is the simplicity of Auto-Commit Memory: it reduces latency by eliminating entire steps in the persistence pathway.

Addressing Both Halves of the Problem

Block storage benchmarks such as throughput and IOPS are certainly important, but only address half of the problem. The other half is the work the application and kernel I/O subsystems must do to package and route data for storage. Applications can be accelerated by addressing either or both halves of this problem. However, at some point the overhead incurred by packaging and routing data through the kernel block storage subsystem becomes the bottleneck. Breaking through that barrier was the goal of this technology demonstration: give applications the software mechanisms to avoid this block-storage packaging and routing latency and complexity, and let them spend more time processing data in memory and less time packaging and waiting for that data to arrive at a block storage destination. Fusion-io does indeed make very fast block I/O devices. What's most exciting to us about last week's demonstration is looking beyond fast block I/O devices to show what is possible when you address this fast device as memory rather than block storage.

http://www.fusionio.com/blog/auto-commit-memory-cutting-latency-by-eliminating-block-i/o/
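A conceptual sketch of the ACM idea applied to the transaction-log example above: persisting a record becomes a memcpy into a designated memory region plus a barrier, with no block I/O on the critical path. The acm_map() and acm_barrier() names are hypothetical placeholders, stubbed here so the sketch runs.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical placeholders: acm_map() would return a memory range whose
 * contents survive power loss, and acm_barrier() would order/complete
 * outstanding stores.  Stubbed with plain malloc and a compiler/CPU fence
 * so the sketch compiles and runs.                                        */
static void *acm_map(size_t bytes) { return malloc(bytes); }
static void  acm_barrier(void)     { __sync_synchronize(); }

/* The in-memory tail of a transaction log: persisting a record is a memcpy
 * plus a barrier — no packaging into 4K blocks, no syscall, no trip through
 * the kernel block layer.                                                  */
struct log_tail { uint64_t head; char buf[1 << 20]; };

static void log_append(struct log_tail *log, const void *rec, size_t len)
{
    memcpy(log->buf + log->head, rec, len);  /* ordinary store semantics    */
    acm_barrier();                           /* make the record durable     */
    log->head += len;                        /* publish the new tail        */
    acm_barrier();
}

int main(void)
{
    struct log_tail *log = acm_map(sizeof *log);
    if (!log) return 1;
    log->head = 0;

    const char rec[] = "COMMIT txn 17";
    log_append(log, rec, sizeof rec);
    free(log);
    return 0;
}
```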
