
Deduplication School 2010

Presentation Download

Since 2009, there have been a number of changes and advancements in the storage environment. Data deduplication was the #1 storage technology being evaluated by storage professionals last year. This presentation download explains how to leverage data deduplication technology to benefit your organization. Download it to learn the answers to the following questions: How do recent major acquisitions affect the options in the dedupe marketplace? Is everyone doing dedupe now? Are all dedupe products roughly equivalent, or are there advantages to certain approaches? These questions and more are answered by storage expert W. Curtis Preston in this Dedupe School seminar presentation download.

Sponsored by:

Deduplication School 2010


http://searchstorage.techtarget.com/DedupeSchool

W. Curtis Preston, Executive Editor, TechTarget; Founder/CEO, Truth in IT, Inc.

Follow on Twitter @wcpreston

A Little About Me
When I started as backup guy at a $35B company in 1993:
Tape Drive: QIC 80 (80 MB capacity)
Tape Drive: Exabyte 8200 (2.5 GB & 256 KB/s)
Biggest Server: 4 GB ('93), 100 GB ('96)
Entire Data Center: 200 GB ('93), 400 GB ('96)
My TiVo now has 5 times the storage my data center did!
Consulting in backup & recovery since '96
Author of O'Reilly's Backup & Recovery and Using SANs and NAS
Webmaster of BackupCentral.com
Founder/CEO of Truth in IT
Follow me on Twitter @wcpreston

A Little Bit About Truth in IT, Inc.


Inspired by Consumer Reports, but designed for IT
No advertising, no partners = no need to SPIN
No huge consulting fees just to find out which products work and which ones don't (such fees typically start at $10K and go all the way to $100K!)
Funded instead by a $999 annual subscription
Private online community with written research, testing results, podcasts of interviews with users of products, and direct communication with real customers of the products you're interested in, all included
In beta now at http://www.truthinit.com

Agenda
Understanding Deduplication
Using Deduplication in Backup Systems
Using Data Reduction in Primary Systems
Recent Backup Software Advancements
Backing up Virtual Servers
Backups on a Budget
Stump Curtis

Session 1
Understanding Deduplication

Why Disk?
First a little history

History of My World Part I


When I joined the industry (1993):
Disks were 4 MB/s, tapes were 256 KB/s
Networks were 10 Mb shared
Seventeen years later (2010):
Disks are 70 MB/s, tapes are 120 MB/s
Networks are 10 Gb switched
Changes in 17 years:
17x increase in disk speed (luckily, RAID has created virtual disks that are way faster)
500x increase in tape speed!
1000x+ increase in network speed

Pictured: Exabyte 8200 (256 KB/s), QIC 80 (60 KB/s), DECStation 5000

More History
Plan A: Stage to disk, spool to tape
Pioneered by IBM in the '90s, widely adopted in the late '00s
Large, very fast virtual disk as a caching mechanism to tape
Only need enough disk to hold one night's backups
Helps backups; does not help restores
Plan B: Backup to disk, leave on disk
AKA the early VTL craze
Helps backups and restores
Disk was still way too expensive to make this feasible for most people

Plan C: Dedupe
It's perfect for traditional backup:
Fulls back up the same data every day/week/month
Incrementals back up the entire file when only one byte changes
Both back up a file 100 times if it's in 100 locations
Databases are often backed up full every day
Tons of duplicate blocks! Average actual reduction of 10:1 and higher
It's not perfect for everything:
Pre-compressed or encrypted data
File types that don't have versions (multimedia)

Naysayers
Eliminate all but one copy? No, just eliminate duplicates per location
What about hash collisions? More on this later, but this is nothing but FUD. If you're unconvinced, use a delta differential approach
Doesn't this have immutability concerns? Everything that changes the format of the data has immutability concerns (e.g. sector-based storage, tar, etc.); it's the job of backup/archive applications to verify that what comes out matches what went in
What about the dedupe tax? Let's talk more about this one in a bit

Is There a Plan D?
Some pundits/analysts think dedupe (especially target dedupe) is a band-aid, and will eventually be done away with via backup-software-based dedupe, delta backups, etc. Maybe this will happen in a 3-5 year time span, maybe it won't. (In fact, some backup software companies will tell you they don't need no stinking dedupe appliances.) That's still no argument for not moving on what's available to solve your problems now.

How Dedupe Works

Your Mileage WILL Vary


You really can get 10x to 400x. It depends on:
Frequency of full backups (more fulls = more dupes)
How much of a given incremental backup contains versions of other files (multimedia generally doesn't have versions)
Length of retention (longer retention = more dupes)
Redundancy in a single full backup (if your product notices)

Things that confuse dedupe:
Encrypting data before the dedupe process sees it
Compressing data before the dedupe process sees it
Multiplexing to a VTL
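To make the "it depends" concrete, here is a back-of-the-envelope estimator in Python. The model (weekly fulls plus six daily incrementals, a flat daily change rate, and unchanged data deduping perfectly) is an illustrative assumption, not any vendor's formula; the function name and parameters are made up for this sketch.

```python
# Rough dedupe-ratio estimator: weekly fulls + daily incrementals,
# assuming unchanged data dedupes perfectly (illustrative model only).

def estimate_dedupe_ratio(primary_gb, daily_change_pct, retention_weeks):
    fulls = retention_weeks                      # one full per week retained
    incrementals = retention_weeks * 6           # six incrementals per week
    changed_gb = primary_gb * daily_change_pct

    logical_gb = fulls * primary_gb + incrementals * changed_gb   # what the backup app writes
    stored_gb = primary_gb + (fulls + incrementals) * changed_gb  # one baseline + unique changes

    return logical_gb / stored_gb

# Example: 10 TB of data, 1% daily change, 16 weeks of retention
print(round(estimate_dedupe_ratio(10_000, 0.01, 16), 1))   # ~8:1 under these assumptions
```

Real systems also compress the unique data and find redundancy inside a single full, which is how ratios climb well past this simple model; shorter retention, fewer fulls, or non-versioned data push the ratio the other way.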

How Do They Identify Duplicate Data?


Two very different methods:
Chunking/hashing
Asigra, EMC Avamar, Symantec PureDisk, CommVault Simpana
EMC Data Domain, Greenbytes, FalconStor VTL & FDS, NEC, Quantum DXi
Delta differential
Exagrid, IBM ProtecTIER, Ocarina, SEPATON
Some systems may use a hybrid approach

Chunking/Hashing Method
Slice all data into segments or chunks
Run each chunk through a hashing algorithm (SHA-1)
Check the hash value against all other hash values
A chunk with an identical hash value is discarded
Will find redundant blocks between files from different file systems, even different servers
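A minimal sketch of the chunking/hashing idea, assuming fixed-size 8 KB chunks and an in-memory index; real products typically use variable-size chunking, persistent indexes, and much more robust storage, so treat the names here as hypothetical.

```python
import hashlib

CHUNK_SIZE = 8 * 1024          # 8 KB chunks, as in the slides' examples
chunk_store = {}               # hash -> chunk data (stands in for the dedupe pool)

def ingest(path):
    """Slice a file into chunks; store only chunks whose SHA-1 hasn't been seen."""
    recipe = []                # ordered list of hashes needed to rebuild the file
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha1(chunk).hexdigest()
            if digest not in chunk_store:      # new, unique chunk: keep it
                chunk_store[digest] = chunk
            recipe.append(digest)              # duplicate chunk: reference only
    return recipe

def restore(recipe):
    """Reassemble (rehydrate) the original byte stream from stored chunks."""
    return b"".join(chunk_store[d] for d in recipe)
```

Because the lookup is keyed only by the hash, identical chunks are found across files, file systems, and servers, which is the cross-dataset benefit described above.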

Delta Differential Method


Correlate backups (mathematical methods, using metadata)
Compare similar backups byte-by-byte
Example:
Tonight's backup of Exchange instance Elvis is seen as similar to last night's backup of Elvis
Tonight's backup of Elvis is compared byte-by-byte to last night's backup of Elvis & redundant segments are found
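A minimal sketch of the delta-differential idea, assuming the two backup images have already been correlated and using fixed 64 KB segments; the segment size and function names are hypothetical, and real products are far more sophisticated about alignment and metadata.

```python
# Compare tonight's backup image to last night's image of the *same* client,
# segment by segment, and keep only the segments that changed.

SEGMENT = 64 * 1024

def delta(last_night: bytes, tonight: bytes):
    changed = {}                                   # offset -> new segment bytes
    for offset in range(0, len(tonight), SEGMENT):
        new_seg = tonight[offset:offset + SEGMENT]
        old_seg = last_night[offset:offset + SEGMENT]
        if new_seg != old_seg:                     # byte-for-byte comparison
            changed[offset] = new_seg
    return changed                                 # unchanged segments are just references

def rebuild(last_night: bytes, changed: dict, total_len: int) -> bytes:
    """Rehydrate tonight's image from last night's image plus the changed segments."""
    out = bytearray(last_night[:total_len].ljust(total_len, b"\0"))
    for offset, seg in changed.items():
        out[offset:offset + len(seg)] = seg
    return bytes(out)
```

Note that nothing here is hashed and nothing is compared across different clients, which is why delta differential gets no dedupe between dissimilar datasets.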

Hashing & Delta Differential


Hashing

Most used method with the most mileage
Some are concerned about hash collisions (more on this later)
Compares everything to everything, therefore gets more dedupe out of similar data in dissimilar datasets (e.g. production and test copy of the same data)

Delta Differentials

Faster than hashing
No concern about hash collisions
Only compares like backups, so will get no dedupe on similar data in dissimilar datasets, but does get more dedupe on the same data

What will you get? Only testing with your data will answer that question.

Hash Collisions: The real numbers


Number of hashes & amount of data needed to reach a given collision probability (assuming an 8 KB chunk size):

Hash size        | p = 10^-15             | p = 10^-5
128 bits (MD5)   | 8.2 × 10^11 (6.6 PB)   | 8.2 × 10^16 (20.9 YB)
160 bits (SHA-1) | 5.4 × 10^16 (432.5 EB) | 5.4 × 10^21 (1,371,181 YB)

10^-15: the odds of a single disk writing incorrect data and not knowing it (Undetectable Bit Error Rate, or UBER). Even with MD5 we would have to write 6.6 PB to reach those odds; with SHA-1, 432.5 EB.
10^-5: the worst odds of a double-disk RAID5 failure. With SHA-1 we would have to write 1,371,181 YB to reach those odds.
Original formula here: http://en.wikipedia.org/wiki/Birthday_attack
The formula was modified with a MacLaurin series expansion to mitigate Excel's lack of precision; the spreadsheet is at backupcentral.com/hash-odds.xls
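For reference, the standard birthday-bound approximation behind those numbers (applied with the slide's 8 KB chunk size) works out roughly as follows:

```latex
% Birthday-bound approximation for n random b-bit hashes
P(\text{collision}) \;\approx\; \frac{n^{2}}{2^{\,b+1}}
\qquad\Longrightarrow\qquad
n \;\approx\; \sqrt{2^{\,b+1}\, p}

% SHA-1 (b = 160) at the UBER-level probability p = 10^{-15}:
n \approx \sqrt{2^{161}\times 10^{-15}} \approx 5.4\times 10^{16}\ \text{chunks}
\;\approx\; 432\ \text{EB at 8 KB per chunk}

% MD5 (b = 128) at the same probability:
n \approx \sqrt{2^{129}\times 10^{-15}} \approx 8.2\times 10^{11}\ \text{chunks}
\;\approx\; 6.6\ \text{PB}
```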

Where Is the Data Deduped?


Target Dedupe

Data is sent unmodified across the LAN & deduped at the target
No LAN/WAN benefits until you replicate target to target
Cannot compress or encrypt before sending to the target

Source Dedupe
Redundant data is identified at the backup client
Only new, unique data is sent across the LAN/WAN (see the sketch after this slide)
LAN/WAN benefits; can back up remote/mobile data
Allows for compression, encryption at the source

Hybrid
Fingerprint data at the source, dedupe at the target
Allows for compression, encryption at the source
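A minimal sketch of the source-dedupe exchange described above: the client fingerprints chunks locally, asks the server which fingerprints it already holds, and ships only the missing chunks, compressed before they leave the client. The class and function names are hypothetical; this is not any vendor's actual protocol.

```python
import hashlib, zlib

class DedupeServer:
    """Stand-in for the central dedupe repository."""
    def __init__(self):
        self.store = {}                                   # digest -> compressed chunk

    def missing(self, digests):
        return [d for d in digests if d not in self.store]

    def put(self, digest, compressed_chunk):
        self.store[digest] = compressed_chunk

def backup(client_chunks, server):
    digests = [hashlib.sha1(c).hexdigest() for c in client_chunks]
    needed = set(server.missing(digests))                 # only these cross the LAN/WAN
    for digest, chunk in zip(digests, client_chunks):
        if digest in needed:
            server.put(digest, zlib.compress(chunk))      # compression happens at the source
            needed.discard(digest)                        # don't send the same chunk twice
    return digests                                        # recipe kept by the backup catalog
```

The point of the sketch is the traffic pattern: only new, unique chunks cross the wire, and because the client still holds the native data, it can compress (or encrypt) before sending, which target dedupe cannot tolerate.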

Let's Make It More Complicated


Standalone Target Dedupe: dedupe appliance separate from backup software
Integrated Target Dedupe: target dedupe from the backup software vendor that backs up to POD*
Standalone Source Dedupe: full dedupe solution that only does source dedupe
Integrated Source Dedupe: backup software that can dedupe at the client (or not)
Hybrid: also from a backup software company
*Plain Ol' Disk

Name That Dedupe


Standalone Target Dedupe: Data Domain, Exagrid, Greenbytes, IBM, NEC, Quantum, SEPATON
Integrated Target Dedupe: Symantec NetBackup
Integrated Source Dedupe: Asigra, Symantec NetBackup
Standalone Source Dedupe: EMC Avamar, i365 eVault, Symantec NetBackup
Hybrid: CommVault Simpana

Multi-node Deduplication
AKA Global Deduplication AKA Clustered Deduplication

What We're Not Talking About


Remember hashing vs. delta differential dedupe:
Delta compares like to like
Hashing compares everything to everything
Some sales reps from some companies (that don't have multi-node/global dedupe) are calling the latter global dedupe. It's not.
At a minimum this is honest confusion; possibly it is subterfuge to confuse the buyer

Single-node/Local vs. Multi-node/Global


Assume a customer buys multiple nodes of a dedupe system
Suppose, then, that they back up exactly the same client to each of those multiple nodes
If the vendor fails to recognize the duplicate data and stores it multiple times, it has single-node/local dedupe
If the vendor recognizes the duplicate data across multiple nodes and stores it on only one node, it has multi-node/global dedupe

Doctor, It Hurts When I Do This


Single-node/local dedupe vendors say "then don't do that. Why would you do that?"
They tell you to split up your datasets and send a given dataset to only one appliance
Easy to do if:
Your dataset sizes never change
A given dataset never outgrows a node

Some single-node sales reps will point out that this also doesn't harm your dedupe ratio, because most dedupe comes from comparing like to like. They're also the same ones claiming they get better dedupe because they compare all to all. Which is it?

Multi-node Is the Way to Go


Especially for larger environments & budget-conscious environments that buy as they go
With multi-node dedupe you can load-balance & treat it the same as you would a large tape library
Single-node dedupe pushes the vendors to ride the crest of the CPU/RAM wave
Multi-node vendors can ride behind the wave, saving cost without reducing value

Multi/Single Node Dedupe Vendors


Multi-node/global:
EMC Avamar (12 nodes)
Exagrid (10 nodes)
NEC (55 nodes)
SEPATON (8 nodes)
Symantec PureDisk, NetBackup & Backup Exec
Diligent (2 nodes)
Single-node/local (as of Mar 2010):
EMC Data Domain
NetApp ASIS
Quantum

When Is It Deduped?
AKA Inline or Post Process?

Get Out the Swords


We'd have just as much luck trying to settle these arguments:
Apple vs. Windows
Linux vs. either of them
Linux vs. FreeBSD
VMware vs. the mainframe (the original hypervisor)
Cable modem vs. DSL
Initial common sense leans toward inline, but post-process offers a lot of advantages
You cannot pick based on concept; you must pick based on price/performance

What's the Difference?


This only applies to target dedupe
Inline is synchronous dedupe; post-process is asynchronous dedupe
Both are deduping as the data is coming into the device (with most products and configs)
The question is really where the dedupe process reads the native data from. If it reads it from RAM, we're talking inline. If it reads it from disk, we're talking post-process.

Inline & Post-process: An I/O Walkthrough


Disk/RAM operations per step (Ingest = 100% of the data; Match = 90% of segments; No match = 10%):

Step     | Ingest (100%) | Old segment: Match (90%)  | New segment: No match (10%)
IL Hash  | RAM write     | RAM read                  | RAM read, Disk write
IL Delta | RAM write     | RAM read, Disk read       | RAM read, Disk read, Disk write
PP Hash  | Disk write    | Disk read, Disk delete    | Disk read
PP Delta | Disk write    | Disk read ×2, Disk delete | Disk read ×2

For every 100 GB, an inline hash system writes 10 GB to disk
For every 100 GB, an inline delta system writes 10 GB to disk and reads 100 GB from disk
For every 100 GB, a post-process hash system writes 100 GB, reads 100 GB, and deletes 90 GB from disk
For every 100 GB, a post-process delta system writes 100 GB, reads 200 GB, and deletes 90 GB from disk
Common sense suggests inline has a major advantage
Things change when you consider the dedupe tax
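The arithmetic above can be captured in a few lines. This sketch simply encodes the slide's simplified model (90% duplicate segments, 10% new) and reproduces the per-100 GB figures just listed; it is an illustration, not a model of any specific product.

```python
# Disk I/O per 100 GB ingested, using the slide's simplified model
# (90% of segments are duplicates, 10% are new).

def disk_io(ingest_gb=100, dup_ratio=0.9, inline=True, delta=False):
    new_gb = ingest_gb * (1 - dup_ratio)
    if inline:
        writes = new_gb                          # only unique segments hit disk
        reads = ingest_gb if delta else 0        # delta must read old data to compare
        deletes = 0
    else:                                        # post-process: land everything first
        writes = ingest_gb
        reads = ingest_gb + (ingest_gb if delta else 0)
        deletes = ingest_gb * dup_ratio          # duplicate staged data removed later
    return writes, reads, deletes

for name, args in [("inline hash", dict(inline=True, delta=False)),
                   ("inline delta", dict(inline=True, delta=True)),
                   ("post-process hash", dict(inline=False, delta=False)),
                   ("post-process delta", dict(inline=False, delta=True))]:
    print(name, disk_io(**args))   # (writes GB, reads GB, deletes GB) per 100 GB
```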

The Chair Recognizes Inline


When you're done with backups, you're done with dedupe
Backups begin replicating as soon as they arrive
The post-process vendors need a staging area
The post-process vendors don't start deduping until a backup is done; that will make things take longer

The Chair Recognizes Post-process


When backups are done, dedupe is almost done
Replication begins as soon as the first backup is done
We wait until a backup is done, not until all the backups are done (unless you tell us to)
The staging area allows:
Initial backups to be faster
Copies and recent restores to come from native data
Staggered implementation of dedupe
Selectively deduping only what makes sense

You don't need as much staging disk as you might think
Inline vendors may slow down large backups and restores; they always rehydrate, we only rehydrate older data

Inline & Post-process Vendors


Inline: EMC Data Domain, IBM ProtecTIER, NEC HydraStor
Post-process: Exagrid, Greenbytes, Quantum DXi, SEPATON DeltaStor

How Does Replication Work?


Does replication use dedupe?
Can I replicate many-to-one, one-to-many, or cascading?
If deduping many-to-one, will it dedupe globally across those appliances?
Can I control what gets replicated and when (e.g. production vs. development)?

Is There an Index?
What happens if the index is destroyed? How do you protect against that?
Does it need its index to read the data?
What do you do to verify data integrity?
What about malicious people?
Some dedupe vendors aren't very good at answering these questions, partially because they don't get them often enough
Make sure you ask them

Truth in IT Backup Concierge Service


Community of verified but anonymous end-users (no vendors)
Included in the base service:
Billable product & strategy-related questions
Learn from other customers' questions & answers
Much less expensive than traditional consulting
Talk to real people using the products you are interested in
Podcast interviews with end-users and thought leaders
Unbiased product briefings written by experts
Coming soon:
Reports of lab tests by experts
Field test reports designed by us, conducted by end-users
One-year subscription: $999

Session Two
Using Deduplication in Backup Systems Using Data Reduction in Primary Systems

The Dedupe Tax AKA Rehydration Problem


Essentially a read from very fragmented data
Not all dedupe systems are equally adept at reassembling Humpty Dumpty
Especially visible during tape copies & restores of large systems (single-stream performance)
A recent POC of three major vendors showed a 3x difference in performance!
Remember to test both the replica source & destination

Isn't It Cheaper Just to...


Buy tape?
Tape is cheaper than ever & keeps getting cheaper
Must encrypt if you're using tape
Must use D2D2T to stream modern tape drives
Must constantly tweak to ensure you're doing it right
Take all that away and use dedupe: it may not be cheaper, but it's definitely better
Buy JBOD/RAID?
Even if it were free, you still have to power it
The power/cooling bill will be 10-20x more with JBOD/RAID
Replication is not feasible, so you're stuck with tape for offsite (see above)

Let's Talk About What Matters


What are the risks of their approach? Data integrity questions
How big is it? What's my dedupe ratio? How big can it grow (local vs. global)?
How fast is it? How fast can it back up/restore/copy my data? How fast is replication?
How much does it cost? Pricing schemes are all over the board; try to get vendors on an even playing field
Also consider operational costs:
Adding storage
Replacing drives (how long does a rebuild take?)
Monitoring, etc.

Advanced Uses of Deduplication

Eliminate Tape Shipping


Offsite backups w/o shipping tapes
Backups with no human hands on them
Make tapes offsite from the replicated copy and never move them
No tapes shipped = no need to encrypt tapes

Shorter Recovery Point Objectives


Most companies run backups once per day
Even though they back up their transaction logs throughout the day, those logs are only sent offsite once per day
Dedupe and replication could get them offsite immediately, throughout the day

VMware Backup
One of the challenges with typical VMware backup is the I/O load it places on the server
Source dedupe can perform an incremental-forever backup with a much lower I/O load
This could allow you to continue simpler backups without having to invest in VCB

ROBO & Laptop Backups


Dedupe software can protect even the largest laptops over the Internet
It can also protect relatively large remote sites without installing hardware
Restores can be done directly across the WAN (for slower RTOs) or from a local recovery server (for quicker RTOs)

Where to Use Target/Source Dedupe


Laptops, VMware, and Hyper-V are easy: it's got to be source
Small, remote sets of data are also an easy decision. You could do target with a remote backup server, but cost usually pushes people to source.
A medium-sized (<1 TB) remote site could use a remote target system or a remote source-dedupe backup server that replicates to the CO
A medium-large datacenter could also use either
A large datacenter (10 TB+) might start to find things it doesn't like about a source system
Do a POC to decide

Source Dedupe: Remote Backup Server?


If using source dedupe to back up a remote office, should you back up directly to a centralized backup server, or back up to a remote backup server that replicates to a central server?
It's all about the RTO you need. Decide on the RTO, test a totally remote restore, and see if it can meet it. If not, use a remote server.

How Big is Too Big to Replicate Backups?


Whether it's a remote office replicating to a CO or a CO replicating its backups to a DR site, there is a limit to how much you can replicate
Make sure you've done all you can to maximize the deduplication ratio: a 10:1 site will need twice as much bandwidth as a 20:1 site
The limit depends on the daily deduplicated change rate, which is a factor of data types and dedupe ratio
It is now common to protect 1 TB over typical WAN lines, and much more over dedicated lines
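As a rough sizing aid, here is a sketch of the bandwidth arithmetic. The inputs (nightly backup size, the dedupe ratio achieved on the replicated stream, and an 8-hour replication window) are illustrative assumptions, not a vendor sizing tool.

```python
# Back-of-the-envelope WAN sizing for replicating deduplicated backups.

def required_mbps(nightly_backup_gb, dedupe_ratio, window_hours=8):
    replicated_gb = nightly_backup_gb / dedupe_ratio        # only unique data crosses the WAN
    bits = replicated_gb * 1000**3 * 8                      # decimal GB -> bits
    return bits / (window_hours * 3600) / 1e6               # megabits per second

print(round(required_mbps(1000, 10), 1))   # 1 TB nightly at 10:1 -> ~27.8 Mbps
print(round(required_mbps(1000, 20), 1))   # same data at 20:1    -> ~13.9 Mbps (half)
```

The two example runs show the point made above: doubling the dedupe ratio halves the bandwidth you need for the same nightly backup.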

Test, Test, Test!!!

Test Everything
Installation and configuration, including adding additional capacity
Support (call and ask stupid questions)
Dedupe ratio
  Must use your data
  Must use your retention settings
  Must fill up the system
All speeds
  Backup speed
  Copy speed (extremely important to test)
  Restore speed
Aggregate performance
  With all your data types (especially true if using local dedupe)
Single-stream performance
  Backup speed
  Restore and copy speed (especially if going to tape)
Replication
  Performance
  Lag time (if using post-process)
  Dedupe speed (if using post-process)
Loss of physical systems
  Drive rebuild times
  Reverse replication to replace an array?
Unplug things, see how it handles it. Be mean!

Testing Methods: Source Dedupe


Must install on all data types you plan to back up
Must stress the system to the level at which you plan to use it (VMware, anyone?)
It's OK to back up many redundant systems; that's kind of the point
Remember to test the speed of copying to tape if you plan to do so

Testing Methods: Target Dedupe


Copy production backups into the IDT/VTL using your backup software's built-in cloning/migration/duplication features
Use dedicated drives if possible and script it to run 24x7
You must fill up the system, expire some data, then add more data to see steady-state numbers
Copy/back up to one system, replicate to another, record the entire time, then restore/copy data from the replicated copy

Data Reduction in Primary Storage

A Whole New Ball Game


In the primary space, we use the term data reduction, as it's more inclusive than dedupe
A very different access pattern; latency is much more important
The standard in the backup world is tape: just don't be slower than that and you're OK
The standard in the primary world is disk: anything you do to slow it down will kill the project
You will not get the same ratios as backup
Summary: the job is harder and the rewards are fewer
And yet, some are still trying it

Options
Compression
File-level dedupe
Sub-file-level dedupe
Some files compress but don't dedupe; some files dedupe but don't compress well

Vendors
Compression: Storwize, Ocarina
File-level dedupe: EMC Celerra
Sub-file-level dedupe: NetApp ASIS, Ocarina, Greenbytes, Exar/Hifn, Sun/Oracle
Usually you get compression or dedupe; Ocarina & Exar claim to do both compression and sub-file-level dedupe

Pros/Cons of Primary Data Reduction


Saves disk space and power/cooling
Can have a positive or negative impact on performance; you must test to see which
Does not usually help backups: data is rehydrated (re-duplicated) before being read by any app, including backup
The exception to the above rule is NetApp SnapMirror to tape

Contact Me
Email: curtis@backupcentral.com
Websites to which I contribute:
http://www.backupcentral.com
http://www.searchstorage.com
http://www.searchdatabackup.com
Follow me on Twitter: @wcpreston
My upcoming venture: http://www.truthinit.com

RESOURCES FROM OUR SPONSOR

The ROI of Backup Redesign Using Deduplication: An EMC Data Domain User Research Study

IDC Executive Guide: Assess the Value of Deduplication for your Storage Consolidation Initiatives

Why CIOs Should Look To Data Deduplication
