
Cloud Computing

Agenda
• Overview
  Hadoop, Google
• PaaS Techniques
  File System
  • GFS, HDFS
  Programming Model
  • MapReduce, Pregel
  Storage System for Structured Data
  • BigTable, HBase
Hadoop
• Hadoop is
  A distributed computing platform
  A software framework that lets one easily write and run applications that process vast amounts of data
  Inspired from published papers by Google
[stack diagram, bottom to top: A Cluster of Machines → Hadoop Distributed File System (HDFS) → MapReduce / HBase → Cloud Applications]
Google
• Google published the designs of its web-search engine
  SOSP 2003: "The Google File System"
  OSDI 2004: "MapReduce: Simplified Data Processing on Large Clusters"
  OSDI 2006: "Bigtable: A Distributed Storage System for Structured Data"
Google vs. Hadoop
                         Google           Hadoop
  Develop Group          Google           Apache
  Sponsor                Google           Yahoo!, Amazon
  Resource               open document    open source
  File System            GFS              HDFS
  Programming Model      MapReduce        Hadoop MapReduce
  Storage System
  (for structured data)  Bigtable         HBase
  Search Engine          Google           Nutch
  OS                     Linux            Linux / GPL
Agenda
• Overview
  Hadoop, Google
• PaaS Techniques
  File System
  • GFS, HDFS
  Programming Model
  • MapReduce, Pregel
  Storage System for Structured Data
  • BigTable, HBase
FILE SYSTEM
  File System Overview
  Distributed File Systems (DFS)
  Google File System (GFS)
  Hadoop Distributed File System (HDFS)
File System Overview
• System that permanently stores data
• To store data in units called "files" on disks and other media
• Files are managed by the Operating System
• The part of the Operating System that deals with files is known as the "File System"
  A file is a collection of disk blocks
  File System maps file names and offsets to disk blocks
• The set of valid paths forms the "namespace" of the file system
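The mapping above can be sketched as a toy model (the block size, file table, and path here are invented for illustration, not any real filesystem's layout):

```python
# Toy sketch: a file is an ordered list of disk block numbers, and the file
# system maps (file name, byte offset) to the block holding that byte.
BLOCK_SIZE = 4096  # bytes per disk block, a common default

# file name -> ordered list of disk block numbers (invented example)
file_table = {"/var/log/app.log": [17, 3, 42]}

def block_for(path, offset):
    """Return (disk block number, offset inside that block) for a byte."""
    blocks = file_table[path]
    index = offset // BLOCK_SIZE          # which block of the file
    return blocks[index], offset % BLOCK_SIZE

print(block_for("/var/log/app.log", 5000))   # byte 5000 -> (3, 904)
```

Byte 5000 falls in the file's second block (5000 // 4096 = 1), which lives at disk block 3.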
What Gets Stored
• User data itself is the bulk of the file system's contents
• Also includes metadata on a volume-wide and per-file basis
  Volume-wide:
  • Available space
  • Formatting info
  • Character set
  • ...
  Per-file:
  • Name
  • Owner
  • Modification date
  • ...
Design Considerations
• Namespace
  Physical mapping, logical volume
• Consistency
  What to do when more than one user reads/writes on the same file?
• Security
  Who can do what to a file?
  Authentication / Access Control List (ACL)
• Reliability
  Can files survive a power outage or other hardware failures undamaged?
Local FS on Unix-like Systems (1/4)
• Namespace
  root directory "/", followed by directories and files
• Consistency
  "sequential consistency": newly written data are immediately visible to open reads
• Security
  uid/gid, mode of files
  Kerberos: tickets
• Reliability
  journaling, snapshot
Local FS on Unix-like Systems (2/4)
• Namespace
  Physical mapping
  • a directory and all of its subdirectories are stored on the same physical media
    /mnt/cdrom
    /mnt/disk1, /mnt/disk2 when you have multiple disks
  Logical volume
  • a logical namespace that can contain multiple physical media or a partition of a physical media
    still mounted like /mnt/vol1
    dynamic resizing by adding/removing disks without reboot
    splitting/merging volumes as long as no data spans the split
Local FS on Unix-like Systems (3/4)
• Journaling
  Changes to the filesystem are logged in a journal before being committed
  • useful if an atomic action needs two or more writes
    e.g., appending to a file (update metadata + allocate space + write the data)
  • can play back a journal to recover data quickly in case of hardware failure
  What to log?
  • changes to file content: heavy overhead
  • changes to metadata: fast, but data corruption may occur
  Implementations: ext3, ReiserFS, IBM's JFS, etc.
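The log-then-commit idea can be sketched as a toy model (the journal and metadata structures here are invented for illustration, not ext3 or ReiserFS internals):

```python
# Minimal write-ahead journal sketch: every metadata change is appended to
# the journal before it is applied, so after a crash the journal can be
# replayed to rebuild a consistent state.
journal = []       # the on-disk journal (here: a list)
metadata = {}      # the live filesystem metadata

def apply_logged(op, key, value):
    journal.append((op, key, value))   # 1. log the change first
    metadata[key] = value              # 2. then commit it

def replay(journal):
    """Rebuild metadata after a crash by replaying the journal in order."""
    recovered = {}
    for op, key, value in journal:
        recovered[key] = value
    return recovered

apply_logged("create", "/a.txt", {"size": 0})
apply_logged("update", "/a.txt", {"size": 4096})
assert replay(journal) == metadata     # recovery reproduces the live state
```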
Local FS on Unix-like Systems (4/4)
• Snapshot
  A snapshot is a copy of a set of files and directories at a point in time
  • read-only snapshots, read-write snapshots
  • usually done by the filesystem itself, sometimes by LVMs
  • backing up data can be done on a read-only snapshot without worrying about consistency
  Copy-on-write is a simple and fast way to create snapshots
  • current data is the snapshot
  • a request to write to a file creates a new copy, and work continues from there afterwards
  Implementations: UFS, Sun's ZFS, etc.
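The copy-on-write trick can be sketched as a toy model (file contents as Python lists stand in for disk blocks; nothing here reflects a real ZFS/UFS implementation):

```python
# Copy-on-write snapshot sketch: taking a snapshot copies no data, it just
# shares references; the first write after the snapshot copies the affected
# file so the snapshot's view stays frozen.
files = {"a.txt": ["old data"]}

def take_snapshot(files):
    # no data is copied: the snapshot shares the existing block lists
    return {name: blocks for name, blocks in files.items()}

def write(files, snapshot, name, data):
    if files[name] is snapshot.get(name):     # still shared with a snapshot?
        files[name] = list(files[name])       # copy on first write
    files[name][0] = data

snap = take_snapshot(files)
write(files, snap, "a.txt", "new data")
print(files["a.txt"], snap["a.txt"])   # ['new data'] ['old data']
```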
FILE SYSTEM
  File System Overview
  Distributed File Systems (DFS)
  Google File System (GFS)
  Hadoop Distributed File System (HDFS)
Distributed File Systems
• Allow access to files from multiple hosts, sharing via a computer network
• Must support concurrency
  Make varying guarantees about locking, who "wins" with concurrent writes, etc.
  Must gracefully handle dropped connections
• May include facilities for transparent replication and fault tolerance
• Different implementations sit in different places on the complexity/feature scale
When is a DFS Useful
• Multiple users want to share files
• The data may be much larger than the storage space of a computer
• A user wants to access his/her data from different machines at different geographic locations
• Users want a storage system
  Backup
  Management
• Note that a "user" of a DFS may actually be a "program"
Design Considerations of DFS (1/2)
• Different systems have different designs and behaviors on the following features
  Interface
  • file system, block I/O, custom made
  Security
  • various authentication/authorization schemes
  Reliability (fault-tolerance)
  • continue to function when some hardware fails (disks, nodes, power, etc.)
Design Considerations of DFS (2/2)
  Namespace (virtualization)
  • provide a logical namespace that can span across physical boundaries
  Consistency
  • all clients get the same data all the time
  • related to locking, caching, and synchronization
  Parallelism
  • multiple clients can have access to multiple disks at the same time
  Scope
  • local area network vs. wide area network
FILE SYSTEM
  File System Overview
  Distributed File Systems (DFS)
  Google File System (GFS)
  Hadoop Distributed File System (HDFS)
Google File System
How to process large data sets and easily utilize the resources of a large distributed system ...
Google File System
• Motivations
• Design Overview
• System Interactions
• Master Operations
• Fault Tolerance
Motivations
• Fault-tolerance and auto-recovery need to be built into the system
• Standard I/O assumptions (e.g. block size) have to be re-examined
• Record appends are the prevalent form of writing
• Google applications and GFS should be co-designed
DESIGN OVERVIEW
  Assumptions
  Architecture
  Metadata
  Consistency Model
Assumptions (1/2)
• High component failure rates
  Inexpensive commodity components fail all the time
  Must monitor itself and detect, tolerate, and recover from failures on a routine basis
• Modest number of large files
  Expect a few million files, each 100 MB or larger
  Multi-GB files are the common case and should be managed efficiently
• The workloads primarily consist of two kinds of reads
  large streaming reads
  small random reads
Assumptions (2/2)
• The workloads also have many large, sequential writes that append data to files
  Typical operation sizes are similar to those for reads
• Well-defined semantics for multiple clients that concurrently append to the same file
• High sustained bandwidth is more important than low latency
  Place a premium on processing data in bulk at a high rate, while few individual operations have stringent response-time requirements
Design Decisions
• Reliability through replication
• Single master to coordinate access, keep metadata
  Simple, centralized management
• No data caching
  Little benefit on client: large data sets / streaming reads
  No need on chunkserver: rely on existing file buffers
  Simplifies the system by eliminating cache coherence issues
• Familiar interface, but customize the API
  No POSIX: simplify the problem, focus on Google apps
  Add snapshot and record append operations
DESIGN OVERVIEW
  Assumptions
  Architecture
  Metadata
  Consistency Model
Architecture
[figure: GFS architecture — each chunk is identified by an immutable and globally unique 64-bit chunk handle]
Roles in GFS
• Roles: master, chunkserver, client
  Commodity Linux box, user-level server processes
  Client and chunkserver can run on the same box
• Master holds metadata
• Chunkservers hold data
• Client produces/consumes data
Single Master
• The master has global knowledge of chunks
  Easy to make decisions on placement and replication
• From distributed systems we know this is a
  Single point of failure
  Scalability bottleneck
• GFS solutions
  Shadow masters
  Minimize master involvement
  • never move data through it; use it only for metadata
  • cache metadata at clients
  • large chunk size
  • master delegates authority to primary replicas in data mutations (chunk leases)
Chunkserver Data
• Data organized in files and directories
  Manipulation through file handles
• Files stored in chunks (cf. "blocks" in disk file systems)
  A chunk is a Linux file on the local disk of a chunkserver
  Unique 64-bit chunk handles, assigned by the master at creation time
  Fixed chunk size of 64 MB
  Read/write by (chunk handle, byte range)
  Each chunk is replicated across 3+ chunkservers
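The (chunk handle, byte range) addressing can be sketched as follows (the chunk-handle table and path are invented for illustration; this is not the real GFS client code):

```python
# How a GFS-style client turns a file offset into a chunk request.
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB fixed chunk size

# per-file list of 64-bit chunk handles, assigned by the master (invented)
chunk_handles = {"/logs/web.log": [0x1A2B, 0x3C4D, 0x5E6F]}

def chunk_request(path, offset, length):
    """Translate (file, offset) into (chunk handle, byte range in chunk)."""
    index = offset // CHUNK_SIZE                  # chunk index in the file
    handle = chunk_handles[path][index]
    start = offset % CHUNK_SIZE
    return handle, (start, start + length)

# byte 70 MB of the file lives 6 MB into the second chunk
print(chunk_request("/logs/web.log", 70 * 1024 * 1024, 4096))
```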
Chunk Size
• Each chunk size is 64 MB
• A large chunk size offers important advantages when stream reading/writing
  Less communication between client and master
  Less memory space needed for metadata in master
  Less network overhead between client and chunkserver (one TCP connection for a larger amount of data)
• On the other hand, a large chunk size has its disadvantages
  Hot spots
  Fragmentation
DESIGN OVERVIEW
  Assumptions
  Architecture
  Metadata
  Consistency Model
Metadata
GFS master
- Namespace (file, chunk)
- Mapping from files to chunks
- Current locations of chunks
- Access control information
All in memory during operation
Metadata (cont.)
• Namespace and file-to-chunk mapping are kept persistent
  operation logs + checkpoints
• Operation logs: historical record of mutations
  represent the timeline of changes to metadata in concurrent operations
  stored on the master's local disk
  replicated remotely
• A mutation is not done or visible until the operation log is stored locally and remotely
  master may group operation logs for batch flush
Recovery
• Recover the file system: replay the operation logs
  the "fsck" of GFS after e.g. a master crash
• Use checkpoints to speed up
  memory-mappable, no parsing
  Recovery: read in the latest checkpoint + replay logs taken after the checkpoint
  Incomplete checkpoints are ignored
  Old checkpoints and operation logs can be deleted
• Creating a checkpoint must not delay new mutations
  1. Switch to a new log file for new operation logs; all operation logs up to now are now "frozen"
  2. Build the checkpoint in a separate thread
  3. Write locally and remotely
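The checkpoint-plus-log recovery can be sketched as a toy model (the data structures are invented; real GFS checkpoints are compact B-tree-like images, not dicts):

```python
# Recovery sketch: restore the latest checkpoint, then replay only the
# operation logs recorded after the checkpoint's position.
checkpoint = {"state": {"/a": 1, "/b": 2}, "log_position": 2}
operation_log = [("/a", 1), ("/b", 2), ("/b", 3), ("/c", 4)]  # all mutations

def recover(checkpoint, operation_log):
    state = dict(checkpoint["state"])             # 1. load the checkpoint
    for path, value in operation_log[checkpoint["log_position"]:]:
        state[path] = value                       # 2. replay newer logs only
    return state

print(recover(checkpoint, operation_log))  # {'/a': 1, '/b': 3, '/c': 4}
```

Only the two mutations after the checkpoint are replayed, which is why checkpoints make restarts fast.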
Chunk Locations
• Chunk locations are not stored on the master's disks
  The master asks chunkservers what they have during master startup or when a new chunkserver joins the cluster
  It decides chunk placements thereafter
  It monitors chunkservers with regular heartbeat messages
• Rationale
  Disks fail
  Chunkservers die, (re)appear, get renamed, etc.
  Eliminates the synchronization problem between the master and all chunkservers
DESIGN OVERVIEW
  Assumptions
  Architecture
  Metadata
  Consistency Model
Consistency Model
• GFS has a relaxed consistency model
• File namespace mutations are atomic and consistent
  handled exclusively by the master
  namespace lock guarantees atomicity and correctness
  order defined by the operation logs
• File region mutations: complicated by replicas
  "Consistent": all replicas have the same data
  "Defined": consistent + replica reflects the mutation entirely
  A relaxed consistency model: not always consistent, not always defined either
Consistency Model (cont.)
[figure omitted]
Google File System
• Motivations
• Design Overview
• System Interactions
• Master Operations
• Fault Tolerance
SYSTEM INTERACTIONS
  Read/Write
  Concurrent Write
  Atomic Record Appends
  Snapshot
While Reading a File
[sequence diagram: Application ↔ GFS Client ↔ Master ↔ Chunkserver]
Open: the application calls Open(name, read); the client returns a handle.
Read:
  1. The application calls Read(handle, offset, length, buffer)
  2. The client translates the handle and offset into a chunk_index and sends (handle, chunk_index) to the master
  3. The master replies with (chunk_handle, chunk_locations)
  4. The client caches (handle, chunk_index) → (chunk_handle, locations)
  5. The client selects a replica and sends (chunk_handle, byte_range) to that chunkserver
  6. The chunkserver returns the data; the client passes the data and a return code to the application
While Writing to a File
[sequence diagram: Application ↔ GFS Client ↔ Master ↔ Primary and Secondary Chunkservers]
  1. The application calls Write(handle, offset, length, buffer)
  2. Query: on a cache miss the client asks the master; the master grants a lease to a primary (if not granted before) and replies with (chunk_handle, primary_id, replica_locations); the client caches this and selects replicas
  3. Data push: the client pushes the data to all replicas, which buffer it and acknowledge receipt
  4. Commit: the client sends the write request (IDs) to the primary; the primary assigns the mutation order and writes to disk, forwards the order to the secondaries, collects their "complete" replies, answers "completed" to the client, and the client returns a code to the application
Lease Management
• A crucial part of concurrent write/append operation
  Designed to minimize the master's management overhead by authorizing chunkservers to make decisions
• One lease per chunk
  Granted to a chunkserver, which becomes the primary
  Granting a lease increases the version number of the chunk
  Reminder: the primary decides the mutation order
• The primary can renew the lease before it expires
  Piggybacked on the regular heartbeat message
• The master can revoke a lease (e.g., for snapshot)
• The master can grant the lease to another replica if the current lease expires (primary crashed, etc.)
Mutation
1. Client asks master for replica locations
2. Master responds
3. Client pushes data to all replicas; replicas store it in a buffer cache
4. Client sends a write request to the primary (identifying the data that had been pushed)
5. Primary forwards the request to the secondaries (identifies the order)
6. The secondaries respond to the primary
7. The primary responds to the client
Mutation (cont.)
• Mutation: write or append
  must be done for all replicas
• Goal
  minimize master involvement
• Lease mechanism for consistency
  master picks one replica as primary; gives it a "lease" for mutations
  a lease: a lock that has an expiration time
  primary defines a serial order of mutations
  all replicas follow this order
• Data flow is decoupled from control flow
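Why the serial order matters can be sketched as a toy model (classes and mutation strings are invented; this is not the real GFS protocol):

```python
# Sketch of the lease/primary idea: the primary assigns one serial order to
# concurrent mutations, and every replica applies that same order, so all
# replicas end up with identical region contents.
class Primary:
    def __init__(self):
        self.order = []                 # the serial mutation order it defines
    def commit(self, mutation):
        self.order.append(mutation)     # assign the next sequence number
        return len(self.order)

def apply_in_order(order):
    """A replica applies mutations strictly in the primary's order."""
    region = ""
    for mutation in order:
        region += mutation
    return region

primary = Primary()
for m in ["A", "B", "C"]:               # three concurrent client writes
    primary.commit(m)
replicas = [apply_in_order(primary.order) for _ in range(3)]
assert replicas[0] == replicas[1] == replicas[2]   # consistent region
```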
SYSTEM INTERACTIONS
  Read/Write
  Concurrent Write
  Atomic Record Appends
  Snapshot
Concurrent Write
• If two clients concurrently write to the same region of a file, any of the following may happen to the overlapping portion
  Eventually the overlapping region may contain data from exactly one of the two writes
  Eventually the overlapping region may contain a mixture of data from the two writes
• Furthermore, if a read is executed concurrently with a write, the read operation may see either all of the write, none of the write, or just a portion of the write
Consistency Model (remind)
[figure: three replicas of chunk C1]
  Write "x" at region j in C1: while only some replicas hold x, the region is inconsistent; once all replicas hold x, the region is consistent.
  Concurrently write "xyz" and "abc" at region j in C1: all replicas may end up with the same interleaving (e.g., "xyzabc"), so the region is consistent but undefined.
Write / Concurrent Write Tradeoffs
• Some properties
  concurrent writes leave the region consistent, but possibly undefined
  failed writes leave the region inconsistent
• Some work has moved into the applications
  e.g., self-validating, self-identifying records
Atomic Record Appends
• GFS provides an atomic append operation called "record append"
• Client specifies data, but not the offset
• GFS guarantees that the data is appended to the file atomically, at least once
  GFS picks the offset and returns the offset to the client
  works for concurrent writers
• Used heavily by Google apps
  e.g., for files that serve as multiple-producer/single-consumer queues
  Contain merged results from many different clients
How Record Append Works
• Query and Data Push are similar to the write operation
• Client sends the write request to the primary
• If appending would exceed the chunk boundary
  Primary pads the current chunk, tells other replicas to do the same, replies to the client asking to retry on the next chunk
• Else
  commit the write in all replicas
• Any replica failure: client retries
[figure: Append "abc" to chunk C1 — if a replica fails, the region is inconsistent and undefined; after the client retries, the record appears in all replicas and the region is defined, interspersed with inconsistent padding]
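The pad-and-retry behavior above can be sketched as a toy model (a 64-byte chunk stands in for 64 MB; this is a simplification, not the real protocol):

```python
# Toy model of "record append" semantics: the primary picks the offset; if
# the record would cross the chunk boundary, the chunk is padded and the
# append lands at the start of a new chunk, so data is written at least once.
CHUNK_SIZE = 64   # stand-in for GFS's 64 MB

class Chunk:
    def __init__(self):
        self.data = b""

def record_append(chunks, record):
    """Append atomically; return the (chunk index, offset) chosen for it."""
    last = chunks[-1]
    if len(last.data) + len(record) > CHUNK_SIZE:            # crosses boundary
        last.data += b"\0" * (CHUNK_SIZE - len(last.data))   # pad the chunk
        chunks.append(Chunk())                               # retry on new one
        last = chunks[-1]
    offset = len(last.data)
    last.data += record
    return len(chunks) - 1, offset

chunks = [Chunk()]
print(record_append(chunks, b"x" * 60))   # (0, 0)
print(record_append(chunks, b"y" * 10))   # (1, 0): first chunk was padded
```

The second record does not fit in the remaining 4 bytes, so the first chunk ends with padding (the "inconsistent" filler) and the record lands whole in the next chunk.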
SYSTEM INTERACTIONS
  Read/Write
  Concurrent Write
  Atomic Record Appends
  Snapshot
Snapshot
• Makes a copy of a file or a directory tree almost instantaneously
  minimizes interruptions of ongoing mutations
  copy-on-write with reference counts on chunks
• Steps
  1. a client issues a snapshot request for source files
  2. master revokes all leases of affected chunks
  3. master logs the operation to disk
  4. master duplicates metadata of source files, pointing to the same chunks, increasing the reference count of the chunks
After Snapshot (Read/Write)
[figure: after a snapshot, the file and its snapshot share the same chunk handle (reference count 2); a read of "bar" simply follows the shared handle, while a write to "bar" makes the master create a new chunk handle and the chunkservers copy the chunk data, leaving each copy with reference count 1]
Google File System
• Motivations
• Design Overview
• System Interactions
• Master Operations
• Fault Tolerance
MASTER OPERATIONS
  Namespace Management and Locking
  Replica Placement
  Creation, Rebalancing, Re-replication
  Garbage Collection
  Stale Replica Detection
Namespace Mgt and Locking
• Allows multiple operations to be active and use locks over regions of the namespace
• Logically represents the namespace as a lookup table mapping full pathnames to metadata
• Each node in the namespace tree has an associated read-write lock
• Each master operation acquires a set of locks before it runs
Namespace Mgt and Locking (cont.)
To operate on /d1/d2/.../dn/leaf, the master acquires:
  Read locks on the directory names:
    /d1
    /d1/d2
    ...
    /d1/d2/.../dn
  Either a read lock or a write lock on the full pathname:
    /d1/d2/.../dn/leaf
Namespace Mgt and Locking (cont.)
• How this locking mechanism can prevent a file /home/user/foo from being created while /home/user is being snapshotted to /save/user:

                        Read locks           Write locks
  Snapshot operation    /home, /save         /home/user, /save/user
  Creation operation    /home, /home/user    /home/user/foo

  The two operations conflict on /home/user (write lock vs. read lock), so the creation must wait for the snapshot.
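The path-lock rule can be sketched as follows (a simplified model of the scheme in the slide, with invented helper names):

```python
# Sketch of the namespace locking rule: an operation takes read locks on
# every ancestor directory and a read or write lock on its target; two
# operations conflict iff one write-locks a path the other locks at all.
def locks(path, write_target):
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    reads, writes = set(ancestors), set()
    (writes if write_target else reads).add(path)
    return reads, writes

def conflict(op1, op2):
    r1, w1 = op1
    r2, w2 = op2
    return bool(w1 & (r2 | w2)) or bool(w2 & (r1 | w1))

# snapshot /home/user -> /save/user: write-locks both directories
s1, s2 = locks("/home/user", True), locks("/save/user", True)
snapshot = (s1[0] | s2[0], s1[1] | s2[1])
create = locks("/home/user/foo", True)   # needs a read lock on /home/user
assert conflict(snapshot, create)        # so it waits for the snapshot
```

Operations in disjoint subtrees (say, under /tmp and /var) share no locked path and can run concurrently.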
MASTER OPERATIONS
  Namespace Management and Locking
  Replica Placement
  Creation, Rebalancing, Re-replication
  Garbage Collection
  Stale Replica Detection
Replica Placement
• Traffic between racks is slower than within the same rack
• A replica is created for 3 reasons
  Chunk creation
  Chunk re-replication
  Chunk rebalancing
• The master has a replica placement policy
  Maximize data reliability and availability
  Maximize network bandwidth utilization
  Must spread replicas across racks
Chunk Creation, Rebalance
• Where to put the initial replicas?
  Servers with below-average disk utilization
  But not too many recent creations on a server
  And must have servers across racks
• The master rebalances replicas periodically
  Moves chunks for better disk space balance and load balance
  Fills up a new chunkserver
• The master prefers to move chunks out of crowded chunkservers
Chunk Re-replication
• The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal
  Chunkserver dies, is removed, etc.; disk fails, is disabled, etc.
  Chunk is corrupt; goal is increased
• Factors affecting which chunk is cloned first
  How far is it from the goal?
  Live files vs. deleted files
  Blocking client
• Placement policy is similar to chunk creation
• The master limits the number of clonings per chunkserver and cluster-wide, to minimize the impact on client traffic
• Chunkservers throttle cloning reads
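The prioritization factors can be sketched as a toy scoring function (the scoring tuple and the chunk records are invented illustrations of the idea, not GFS's actual heuristic):

```python
# Re-replication ordering sketch: chunks furthest below their replication
# goal are cloned first, and chunks of live files outrank deleted ones.
chunks = [
    {"id": "c1", "replicas": 2, "goal": 3, "live": True},
    {"id": "c2", "replicas": 1, "goal": 3, "live": True},   # lost 2 replicas
    {"id": "c3", "replicas": 1, "goal": 3, "live": False},  # deleted file
]

def clone_priority(chunk):
    deficit = chunk["goal"] - chunk["replicas"]   # how far from the goal
    return (deficit, chunk["live"])               # live files break ties

queue = sorted(chunks, key=clone_priority, reverse=True)
print([c["id"] for c in queue])   # ['c2', 'c3', 'c1']
```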
MASTER OPERATIONS
  Namespace Management and Locking
  Replica Placement
  Creation, Rebalancing, Re-replication
  Garbage Collection
  Stale Replica Detection
Garbage Collection
• Chunks of deleted files are not reclaimed immediately
• Mechanism
  Client issues a request to delete a file
  Master logs the operation immediately, renames the file to a hidden name with a timestamp, and replies
  Master scans the file namespace regularly
  • Master removes metadata of hidden files older than 3 days
  Master scans the chunk namespace regularly
  • Master removes metadata of orphaned chunks
  Chunkserver sends the master a list of the chunk handles it has in the regular HeartBeat message
  • Master replies with the chunks not in the namespace
  • Chunkserver is free to delete those chunks
Garbage Collection (cont.)
[figure: a client deletes /foo — the master logs the deletion and renames the file to a hidden, timestamped name such as /foo.20101013; the metadata is removed later by the regular namespace scan]
Stale Replica Detection
• A stale replica is a replica that misses mutation(s) while the chunkserver is down
  The server reports its chunks to the master after booting: oops!
• Solution: chunk version number
  Master and chunkservers keep chunk version numbers persistently
  Master creates a new chunk version number when granting a lease to the primary and notifies all replicas; they then store the new version persistently
• The master removes stale replicas in its regular garbage collection
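The version-number check can be sketched as a toy model (chunk and chunkserver names are invented for illustration):

```python
# Stale-replica detection sketch: the master bumps a chunk's version when
# granting a lease; a replica that was down during the mutation still
# reports the old version at boot and is flagged as stale.
master_version = {"chunk-7": 4}
replica_versions = {"cs1": 4, "cs2": 4, "cs3": 4}

# cs3 goes down; the master grants a new lease, so the version is bumped
# and only the live replicas learn about it
master_version["chunk-7"] += 1
replica_versions["cs1"] = replica_versions["cs2"] = 5

def stale_replicas(chunk, reported):
    """Replicas whose reported version lags the master's are stale."""
    return [cs for cs, v in reported.items() if v < master_version[chunk]]

print(stale_replicas("chunk-7", replica_versions))   # ['cs3']
```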
Google File System
• Motivations
• Design Overview
• System Interactions
• Master Operations
• Fault Tolerance
FAULT TOLERANCE
  High Availability
  Data Integrity
  Diagnostic Tools
Fast Recovery
• Master and chunkserver can start and restore to their previous state in seconds
  Metadata is stored in binary format, no parsing
  50 MB - 100 MB of metadata per server
  Normal startup and startup after abnormal termination are the same
  Can kill the process anytime
  • do not distinguish between normal and abnormal termination
Master Replication
• The master's operation logs and checkpoints are replicated on multiple machines
  A mutation is complete only when all replicas are updated
• If the master dies, cluster monitoring software starts another master with the checkpoints and operation logs
  Clients see the new master as soon as the DNS alias is updated
• Shadow masters provide read-only access
  Read a replica operation log to update the metadata
  Typically behind by less than a second
  No interaction with the busy master except replica location updates (cloning)
FAULT TOLERANCE
  High Availability
  Data Integrity
  Diagnostic Tools
Data Integrity
• A responsibility of chunkservers, not the master
  Disk failures are the norm: the chunkserver must know
  GFS doesn't guarantee identical replicas: independent verification is necessary
• 32-bit checksum for every 64 KB block of data
  available in memory, persistent with logging, separate from user data
• Read: verify the checksum before returning data
  mismatch: return an error to the client, report to the master
  the client reads from another replica
  the master clones a replica, tells the chunkserver to delete the bad chunk
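The per-block verification can be sketched with the standard library's CRC32 (GFS uses a 32-bit checksum per 64 KB block; the slide does not name the algorithm, so CRC32 here is an assumption for illustration):

```python
# Checksum sketch: one 32-bit checksum per 64 KB block, verified on read
# before any data is returned to the caller.
import zlib

BLOCK = 64 * 1024

def checksums(data):
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def read_verified(data, sums):
    """Return data only if every 64 KB block matches its stored checksum."""
    if checksums(data) != sums:
        raise IOError("checksum mismatch: report to master, read elsewhere")
    return data

payload = b"log record\n" * 20000          # ~215 KB -> 4 blocks
sums = checksums(payload)
assert read_verified(payload, sums) == payload
corrupted = payload[:5] + b"X" + payload[6:]   # a single flipped byte
try:
    read_verified(corrupted, sums)
except IOError:
    print("corruption detected")
```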
Diagnostic Tools
• Logs on each server
  Significant events (server up, down)
  RPC requests/replies
• Combining logs on all servers to reconstruct the full interaction history, to identify the source of problems
• Logs can be used for performance analysis and load testing too
Summary of GFS
• GFS demonstrates how to support large-scale processing workloads on commodity hardware
  designed to tolerate frequent component failures
  uniform logical namespace
  optimize for huge files that are mostly appended and read
  feel free to relax and extend the FS interface as required
  relaxed consistency model
  go for simple solutions (e.g., single master, garbage collection)
• GFS has met Google's storage needs
HOW ABOUT HADOOP?
HDFS
HDFS
• Overview
• Architecture
• Implementation
• Other Issues
What's HDFS
• Hadoop Distributed File System
  Reference: the Google File System
  A scalable distributed file system for large data analysis
  Based on commodity hardware with high fault tolerance
  The primary storage used by Hadoop applications
[stack diagram, bottom to top: A Cluster of Machines → Hadoop Distributed File System (HDFS) → MapReduce / HBase → Cloud Applications]
HDFS's Features (1/2)
• Large data sets and files
  Support petabyte-scale data
• Heterogeneous
  Could be deployed on different hardware
• Streaming data access
  Batch processing rather than interactive user access
  High aggregate data bandwidth
HDFS's Features (2/2)
• Fault tolerance
  The norm rather than the exception
  Automatic recovery or failure reporting
• Coherency model
  Write-once-read-many
  This assumption simplifies coherency
• Data locality
  "Move compute to data"
HDFS
• Overview
• Architecture
• Implementation
• Other Issues
How to Manage Data
[figure: HDFS architecture]
Namenode
• Each HDFS cluster has one Namenode
• Manages the file system namespace
• Regulates access to files by clients
• Executes file system namespace operations
• Determines the rack id each Datanode belongs to
Datanode
• One per node in the cluster
• Manages storage attached to the node that it runs on
• Serves read and write requests from the file system's clients
• Performs block creation, deletion, and replication
File System Namespace
• Traditional hierarchical file organization
• Does not support hard links or soft links
• Any change to the file system namespace or its properties is recorded by the Namenode
HDFS
• Overview
• Architecture
• Implementation
• Other Issues
Data Replication
• Blocks of a file are replicated for fault tolerance
• The block size and replication factor are configurable per file
• The Namenode makes all decisions regarding replication of blocks
  Heartbeat: the Datanode is functioning properly
  Blockreport: a list of all blocks on a Datanode
Block Replication
[figure omitted]
Replica Placement
• Rack-aware replica placement policy
  data reliability
  availability
  network bandwidth utilization
• To validate it on production systems
  learn more about its behavior
  build a foundation to test and research more sophisticated policies
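The rack-aware idea can be sketched as a toy placement function, modeled on HDFS's default policy of one replica on the writer's node and two on a different rack (the cluster layout here is invented):

```python
# Rack-aware placement sketch: one local replica plus two replicas on
# another rack, so data survives the failure of an entire rack while only
# one copy crosses the inter-rack link during the write.
nodes = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}

def place_replicas(writer, nodes):
    first = writer                                        # local replica
    remote_rack = next(r for r in nodes.values() if r != nodes[writer])
    remote = [n for n, r in nodes.items() if r == remote_rack]
    return [first, remote[0], remote[1]]                  # 2 on remote rack

replicas = place_replicas("n1", nodes)
racks = {nodes[n] for n in replicas}
print(replicas, racks)   # the three replicas span exactly two racks
```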
Screenshot
[screenshot: Number of Replicas = 2]
Why is it Fault-Tolerant
• Data corruption
  Checked with CRC (Cyclical Redundancy Check)
  Replace a corrupt block with a replicated one
• Network fault / Datanode fault
  Datanode sends heartbeats to the Namenode
• Namenode fault
  FsImage: the file system mapping image
  EditLog: like a SQL transaction log
  Multiple backups of FsImage and EditLog
  Manually recover from a Namenode fault
Coherency Model and Performance
• Coherency model of files
  The Namenode handles the write, read, and delete operations
• Large data sets and performance
  The default block size is 64 MB
  A bigger block size enhances read performance
  A single file stored on HDFS might be larger than a single physical disk of a Datanode
  Fully distributed blocks increase the throughput of reading
About Data Locality
HDFS
• Overview
• Architecture
• Implementation
• Other Issues
Small File Problem
• Inefficiency of resource utilization
  Files significantly smaller than the HDFS block size (64 MB)
• Every file, directory, and block in HDFS is represented as an object in the Namenode's memory, each of which occupies 150 bytes
• HDFS is not geared up to efficiently access small files
  Designed for streaming access of large files
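The memory pressure can be illustrated with back-of-envelope arithmetic based on the 150-byte figure above (directories are ignored in this simplified count):

```python
# Each file and each block is a ~150-byte object in the Namenode's heap, so
# millions of tiny files cost gigabytes of RAM regardless of how little
# data they actually hold.
OBJECT_BYTES = 150

def namenode_heap(num_files, blocks_per_file=1):
    # one object per file plus one per block (directories ignored here)
    objects = num_files * (1 + blocks_per_file)
    return objects * OBJECT_BYTES

print(namenode_heap(10_000_000) / 2**30)   # ~2.8 GB for 10M one-block files
```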
Small File Solutions
• Hadoop Archives (HAR)
  Introduced to alleviate the problem of lots of files putting pressure on the Namenode's memory
  Builds a layered filesystem on top of HDFS
Small File Solutions
• Sequence Files
  Use the filename as the key and the file contents as the value
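The key/value packing idea can be sketched as a toy container (a simplified stand-in for illustration, not Hadoop's real SequenceFile format):

```python
# Pack many small files into one blob keyed by filename, so HDFS stores a
# few large files instead of millions of tiny ones.
small_files = {"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b"gamma"}

def pack(files):
    """Concatenate (filename, contents) records; return blob + an index."""
    blob, index, pos = b"", {}, 0
    for name, data in sorted(files.items()):
        index[name] = (pos, len(data))     # where each file's bytes live
        blob += data
        pos += len(data)
    return blob, index

def lookup(blob, index, name):
    start, length = index[name]
    return blob[start:start + length]

blob, index = pack(small_files)
print(lookup(blob, index, "b.txt"))   # b'beta'
```

One packed blob replaces three Namenode file objects with one, while each original file remains retrievable by name.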
Summary
• Scalability
  Provides scale-out storage capable of handling very large amounts of data
• Availability
  Provides fault tolerance such that data is not lost when a machine or disk fails
• Manageability
  Provides mechanisms for the system to automatically monitor itself and manage the massive data transparently for users
• Performance
  High sustained bandwidth is more important than low latency
References
• S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proc. of the 19th ACM SOSP (2003)
• Hadoop: http://hadoop.apache.org/
• NCHC Cloud Computing Research Group: http://trac.nchc.org.tw/cloud
• NTU course "Cloud Computing and Mobile Platforms": http://ntucsiecloud98.appspot.com/course_information