
CS60002

Distributed Systems
Text Book:
• “Advanced Concepts in Operating Systems” by Mukesh Singhal and Niranjan G. Shivaratri will cover about half the course.
• Xerox copies of papers, notes, etc. will cover the rest.
What is a distributed system?

A very broad definition:
• A set of autonomous processes communicating among themselves to perform a task
• Autonomous: able to act independently
• Communication: shared memory or message passing
• “Concurrent system” is probably a better term
A more restricted definition:
• A network of autonomous
computers that communicate by
message passing to perform
some task

A practical “distributed system” will probably have both:
– Computers that communicate by
messages
– Processes/threads on a
computer that communicate by
messages or shared memory
Advantages

• Resource Sharing
• Higher Performance
• Fault Tolerance
• Scalability
Why is it hard to design them?

The usual problem of concurrent systems:
– Arbitrary interleaving of actions
makes the system hard to verify
Plus
• No globally shared memory
(therefore hard to collect global
state)
• No global clock
• Unpredictable communication
delays
Models for Distributed
Algorithms

– Topology : completely
connected, ring, tree etc.
– Communication : shared
memory/message passing
(reliable? Delay? FIFO/Causal?
Broadcast/multicast?)
– Synchronous/asynchronous
– Failure models (fail stop, crash,
omission, Byzantine…)

An algorithm needs to specify the model on which it is supposed to work.
Complexity Measures

• Message complexity : no. of messages

• Communication complexity/Bit complexity : no. of bits

• Time complexity : for synchronous systems, no. of rounds; for asynchronous systems, several different definitions exist
Some Fundamental Problems

• Ordering events in the absence of a global clock
• Capturing the global state
• Mutual exclusion
• Leader election
• Clock synchronization
• Termination detection
• Constructing spanning trees
• Agreement protocols
Ordering of Events and
Logical Clocks
Ordering of Events

Lamport’s Happened Before relationship:
For two events a and b, a → b if
– a and b are events in the same process and a occurred before b
– a is a send event of a message m and b is the corresponding receive event at the destination process
– a → c and c → b for some event c
a → b implies a is a potential cause
of b
Causal ordering : potential
dependencies
“Happened Before” relationship
causally orders events
• If a → b, then a causally affects
b
• If neither a → b nor b → a, then a and b are concurrent (a || b)
Logical Clock

Each process i keeps a clock Ci.

• Each event a in i is timestamped C(a), the value of Ci when a occurred
• Ci is incremented by 1 for each
event in i
• In addition, if a is a send of
message m from process i to j,
then on receive of m,
Cj = max(Cj, C(a)+1)
Points to note:

• if a → b, then C(a) < C(b)

• → is an irreflexive partial order

• Total ordering possible by arbitrarily ordering concurrent events, e.g. by process numbers
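A minimal sketch of these rules in Python (the class name and message plumbing are mine, not from the text):

```python
class LamportClock:
    """Lamport's logical clock for one process (illustrative sketch)."""
    def __init__(self):
        self.c = 0

    def local_event(self):
        self.c += 1                      # Ci incremented by 1 for each event
        return self.c

    def send_event(self):
        return self.local_event()        # timestamp piggybacked on the message

    def receive_event(self, ts):
        self.c += 1                      # the receive is itself an event
        self.c = max(self.c, ts + 1)     # Cj = max(Cj, C(a) + 1)
        return self.c

p, q = LamportClock(), LamportClock()
t = p.send_event()                       # p sends m with timestamp t
assert q.receive_event(t) > t            # q's clock moves past the send
```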
Limitation of Lamport’s Clock

a → b implies C(a) < C(b)

BUT

C(a) < C(b) doesn’t imply a → b !!

So not a true clock !!


Solution: Vector Clocks

Ci is a vector of size n (no. of processes)
C(a) is similarly a vector of size n
Update rules:
• Ci[i]++ for every event at process
i
• if a is send of message m from i
to j with vector timestamp tm, on
receive of m:
Cj[k] = max(Cj[k], tm[k]) for all k
For events a and b with vector
timestamps ta and tb,
• ta = tb iff for all i, ta[i] = tb[i]

• ta ≠ tb iff for some i, ta[i] ≠ tb[i]

• ta ≤ tb iff for all i, ta[i] ≤ tb[i]

• ta < tb iff (ta ≤ tb and ta ≠ tb)

• ta || tb iff not (ta < tb) and not (tb < ta)


• a → b iff ta < tb

• Events a and b are causally related iff ta < tb or tb < ta; else they are concurrent
• Note that this is still not a total
order
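A small Python sketch of the update and comparison rules (names are mine; n = no. of processes):

```python
class VectorClock:
    """Vector clock for process i of n (illustrative sketch)."""
    def __init__(self, i, n):
        self.i, self.c = i, [0] * n

    def local_event(self):
        self.c[self.i] += 1                    # Ci[i]++ for every event

    def send(self):
        self.local_event()
        return list(self.c)                    # timestamp tm carried by m

    def receive(self, tm):
        # Cj[k] = max(Cj[k], tm[k]) for all k
        self.c = [max(a, b) for a, b in zip(self.c, tm)]
        self.local_event()                     # the receive is an event too

def lt(ta, tb):                                # ta < tb
    return all(x <= y for x, y in zip(ta, tb)) and ta != tb

def concurrent(ta, tb):                        # ta || tb
    return not lt(ta, tb) and not lt(tb, ta)
```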
Causal ordering of messages:
application of vector clocks

• If send(m1) → send(m2), then every recipient of both messages m1 and m2 must “deliver” m1 before m2.

“deliver” – when the message is actually given to the application for processing
Birman-Schiper-Stephenson
Protocol

• To broadcast m from process i, increment Ci[i], and timestamp m with VTm = Ci
• When j ≠ i receives m, j delays
delivery of m until
– Cj[i] = VTm[i] –1 and
– Cj[k] ≥ VTm[k] for all k ≠ i
– Delayed messages are queued at j, sorted by vector time. Concurrent messages are ordered by receive time.
• When m is delivered at j, Cj is
updated according to vector clock
rule.
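The delivery condition translates almost directly into code; a sketch (the buffering loop is my own simplification):

```python
def can_deliver(Cj, VTm, i):
    """May j deliver broadcast m from sender i? (the BSS delivery test)"""
    n = len(Cj)
    return (Cj[i] == VTm[i] - 1 and            # m is the next broadcast from i
            all(Cj[k] >= VTm[k] for k in range(n) if k != i))

def try_deliver(buffered, Cj, deliver):
    """Repeatedly deliver buffered messages that have become deliverable."""
    progress = True
    while progress:
        progress = False
        for i, VTm, m in list(buffered):
            if can_deliver(Cj, VTm, i):
                buffered.remove((i, VTm, m))
                Cj[:] = [max(a, b) for a, b in zip(Cj, VTm)]  # vector clock rule
                deliver(m)
                progress = True
```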
Problem of Vector Clock

• message size increases, since each message needs to be tagged with the vector

• size can be reduced in some cases by only sending values that have changed
Capturing Global State
Global State Collection

Applications:
– Checking “stable” properties,
checkpoint & recovery

Issues:
– Need to capture both node and
channel states
– system cannot be stopped
– no global clock
Some notations:
– LSi : local state of process i
– send(mij) : send event of
message mij from process i to
process j
– rec(mij) : similar, receive instead
of send
– time(x) : time at which state x
was recorded
– time (send(m)) : time at which
send(m) occured
send(mij) ∈ LSi iff time(send(mij)) < time(LSi)

rec(mij) ∈ LSj iff time(rec(mij)) < time(LSj)

transit(LSi, LSj) = { mij | send(mij) ∈ LSi and rec(mij) ∉ LSj }

inconsistent(LSi, LSj) = { mij | send(mij) ∉ LSi and rec(mij) ∈ LSj }
Global state: collection of local states
GS = {LS1, LS2,…, LSn}

GS is consistent iff
for all i, j, 1 ≤ i, j ≤ n,
inconsistent(LSi, LSj) = ∅

GS is transitless iff
for all i, j, 1 ≤ i, j ≤ n,
transit(LSi, LSj) = ∅

GS is strongly consistent if it is
consistent and transitless.
Chandy-Lamport’s Algorithm

• Uses special marker messages.

• One process acts as initiator, starts the state collection by following the marker sending rule below.

• Marker sending rule for process P:
– P records its state; then for each outgoing channel C from P on which a marker has not been sent already, P sends a marker along C before any further message is sent on C
• When Q receives a marker along a channel C:
– If Q has not recorded its state, then Q records the state of C as empty; Q then follows the marker sending rule
– If Q has already recorded its state, it records the state of C as the sequence of messages received along C after Q’s state was recorded and before Q received the marker along C
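A sketch of the two rules at a single process; the `send` callback and the local-state stubs are assumptions of mine, not part of the algorithm:

```python
class SnapshotProcess:
    """Chandy-Lamport marker handling at one process (illustrative)."""
    def __init__(self, in_channels, out_channels, send):
        self.inc, self.out, self.send = in_channels, out_channels, send
        self.recorded = False
        self.chan_state = {}          # incoming channel -> recorded messages
        self.recording = set()        # channels still being recorded

    def marker_sending_rule(self):
        self.recorded = True
        self.local_state = self.record_local_state()
        self.recording = set(self.inc)
        for c in self.out:            # marker before any further message on c
            self.send(c, "MARKER")

    def on_marker(self, c):
        if not self.recorded:
            self.chan_state[c] = []   # state of channel c recorded as empty
            self.marker_sending_rule()
        self.recording.discard(c)     # marker seen: stop recording channel c

    def on_message(self, c, m):
        if self.recorded and c in self.recording:
            self.chan_state.setdefault(c, []).append(m)  # in-flight message
        self.apply(m)                 # normal processing continues

    def record_local_state(self):     # stub: application-specific
        return None

    def apply(self, m):               # stub: application-specific
        pass
```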
Points to Note:

• Markers sent on a channel distinguish the messages sent on the channel before the sender recorded its state from the messages sent after
• The state collected may not be any
state that actually happened in
reality, rather a state that “could
have” happened
• Requires FIFO channels
• Network should be strongly
connected (works obviously for
connected, undirected also)
• Message complexity O(|E|), where |E| = no. of links
Lai and Young’s Algorithm

• Similar to Chandy-Lamport’s, but does not require FIFO
• Boolean value X at each node,
False indicates state is not
recorded yet, True indicates
recorded
• Value of X piggybacked with every
application message
• Value of X distinguishes pre-snapshot and post-snapshot messages, playing the same role as the marker
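A tiny sketch of the piggybacked-flag idea (method names and stubs are mine):

```python
class LYProcess:
    """Lai-Young flag piggybacking at one process (illustrative)."""
    def __init__(self):
        self.x = False                    # True once local state is recorded

    def take_snapshot(self):
        if not self.x:
            self.x = True
            self.local_state = self.record_local_state()

    def send(self, payload):
        return (self.x, payload)          # flag travels with every message

    def on_message(self, flag, payload):
        if flag and not self.x:           # post-snapshot message arriving at a
            self.take_snapshot()          # pre-snapshot node: record state first
        self.process(payload)

    def record_local_state(self):         # stub: application-specific
        return None

    def process(self, payload):           # stub: application-specific
        pass
```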
Mutual Exclusion
Mutual Exclusion

• very well-understood in shared memory systems

• Requirements:
– at most one process in critical
section (safety)
– if more than one requesting
process, someone enters
(liveness)
– a requesting process enters
within a finite time (no starvation)
– requests are granted in order
(fairness)
Classification of Distributed
Mutual Exclusion Algorithms

• Non-token based/Permission based
– Permission from all processes: e.g. Lamport, Ricart-Agrawala, Roucairol-Carvalho etc.
– Permission from a subset: e.g. Maekawa

• Token based
– ex. Suzuki-Kasami
Some Complexity Measures

• No. of messages/critical section entry
• Synchronization delay
• Response time
• Throughput
Lamport’s Algorithm

• Every node i has a request queue qi, keeps requests sorted by logical timestamps (total ordering enforced by including process id in the timestamps)

To request critical section:
– send timestamped REQUEST
(tsi, i) to all other nodes
– put (tsi, i) in its own queue

On receiving a request (tsi, i):
– send timestamped REPLY to the
requesting node i
– put request (tsi, i) in the queue
To enter critical section:
– i enters critical section if (tsi, i) is at the top of its own queue, and i has received a message (any message) with timestamp larger than (tsi, i) from ALL other nodes

To release critical section:
– i removes its request from its own queue and sends a timestamped RELEASE message to all other nodes
– On receiving a RELEASE
message from i, i’s request is
removed from the local request
queue
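A sketch of these rules at one node; the `send` callback and the bookkeeping are my own framing. Requests are (timestamp, id) pairs, so Python tuple comparison gives the required total order:

```python
import heapq

class LamportMutex:
    """Lamport's mutual exclusion at one node (illustrative sketch)."""
    def __init__(self, me, peers, send):
        self.me, self.peers, self.send = me, peers, send
        self.clock = 0
        self.q = []                                # heap of (ts, node) requests
        self.last = {p: 0 for p in peers}          # last timestamp seen per peer
        self.my_req = None

    def request(self):
        self.clock += 1
        self.my_req = (self.clock, self.me)
        heapq.heappush(self.q, self.my_req)
        for p in self.peers:
            self.send(p, ("REQUEST", self.clock, self.my_req))

    def on_message(self, sender, kind, ts, req):
        self.clock = max(self.clock, ts) + 1       # logical clock rule
        self.last[sender] = ts
        if kind == "REQUEST":
            heapq.heappush(self.q, req)
            self.clock += 1
            self.send(sender, ("REPLY", self.clock, None))
        elif kind == "RELEASE":
            self.q.remove(req)                     # drop sender's request
            heapq.heapify(self.q)

    def can_enter(self):
        # own request at top of queue, and a later message seen from everyone
        return (self.q and self.q[0] == self.my_req and
                all(t > self.my_req[0] for t in self.last.values()))

    def release(self):
        self.q.remove(self.my_req)
        heapq.heapify(self.q)
        self.clock += 1
        for p in self.peers:
            self.send(p, ("RELEASE", self.clock, self.my_req))
```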
Some points to note:

• Purpose of REPLY messages from node i to j is to ensure that j knows of all requests of i prior to sending the REPLY (and therefore, possibly any request of i with timestamp lower than j’s request)
• Requires FIFO channels.
• 3(n – 1) messages per critical section invocation
• Synchronization delay = max.
message transmission time
• requests are granted in order of
increasing timestamps
Ricart-Agrawala Algorithm

• Improvement over Lamport’s
• Main Idea:
– node j need not send a REPLY
to node i if j has a request with
timestamp lower than the
request of i (since i cannot enter
before j anyway in this case)
• Does not require FIFO
• 2(n – 1) messages per critical
section invocation
• Synchronization delay = max.
message transmission time
• requests granted in order of
increasing timestamps
To request critical section:
– send timestamped REQUEST
message (tsi, i)

On receiving request (tsi, i) at j:
– send REPLY to i if j is neither
requesting nor executing critical
section or if j is requesting and
i’s request timestamp is smaller
than j’s request timestamp.
Otherwise, defer the request.

To enter critical section:
– i enters critical section on
receiving REPLY from all nodes

To release critical section:
– send REPLY to all deferred
requests
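The same rules in a sketch (again, `send` and the state flags are my own scaffolding):

```python
class RicartAgrawala:
    """Ricart-Agrawala mutual exclusion at one node (illustrative sketch)."""
    def __init__(self, me, peers, send):
        self.me, self.peers, self.send = me, peers, send
        self.clock = 0
        self.requesting = self.in_cs = False
        self.deferred = []                 # peers whose REPLY we are withholding
        self.replies = set()

    def request(self):
        self.clock += 1
        self.my_ts = (self.clock, self.me)
        self.requesting, self.replies = True, set()
        for p in self.peers:
            self.send(p, ("REQUEST", self.my_ts))

    def on_request(self, sender, ts):
        self.clock = max(self.clock, ts[0]) + 1
        # defer iff executing, or requesting with an older (smaller) timestamp
        if self.in_cs or (self.requesting and self.my_ts < ts):
            self.deferred.append(sender)   # reply later, on release
        else:
            self.send(sender, ("REPLY", None))

    def on_reply(self, sender):
        self.replies.add(sender)
        if self.requesting and self.replies == set(self.peers):
            self.in_cs = True              # REPLY from all nodes: enter CS

    def release(self):
        self.in_cs = self.requesting = False
        for p in self.deferred:            # answer all deferred requests
            self.send(p, ("REPLY", None))
        self.deferred = []
```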
Roucairol-Carvalho Algorithm

• Improvement over Ricart-Agrawala
• Main idea
– once i has received a REPLY
from j, it does not need to send a
REQUEST to j again unless it
sends a REPLY to j (in response
to a REQUEST from j)
– no. of messages required varies
between 0 and 2(n – 1)
depending on request pattern
– worst case message complexity
still the same
Maekawa’s Algorithm

• Permission obtained from only a subset of other processes, called the Request Set (or Quorum)
• Separate Request Set Ri for each
process i
• Requirements:
– for all i, j: Ri ∩ Rj ≠ ∅
– for all i: i ∈ Ri
– for all i: |Ri| = K, for some K
– any node i is contained in exactly
D Request Sets, for some D
• K = D = sqrt(N) for Maekawa’s algorithm
A simple version

To request critical section:
– i sends REQUEST message to all processes in Ri

On receiving a REQUEST message:
– send a REPLY message if no REPLY message has been sent since the last RELEASE message was received. Update status to indicate that a REPLY has been sent. Otherwise, queue up the REQUEST

To enter critical section:
– i enters critical section after receiving REPLY from all nodes in Ri
To release critical section:
– send RELEASE message to all
nodes in Ri
– On receiving a RELEASE
message, send REPLY to next
node in queue and delete the
node from the queue. If queue is
empty, update status to indicate
no REPLY message has been
sent.
• Message Complexity: 3*sqrt(N)
• Synchronization delay =
2 *(max message transmission
time)

• Major problem: DEADLOCK possible

• Needs three more types of messages (FAILED, INQUIRE, YIELD) to handle deadlock. Message complexity can be 5*sqrt(N)

• Building the request sets? One simple grid construction is sketched below.
Token based Algorithms

• Single token circulates; enter CS when token is present
• No FIFO required
• Mutual exclusion obvious
• Algorithms differ in how to find and
get the token
• Uses sequence numbers rather
than timestamps to differentiate
between old and current requests
Suzuki Kasami Algorithm

• Broadcast a request for the token
• Process with the token sends it to the requestor if it does not need it

Issues:
– Current vs. outdated requests
– determining sites with pending requests
– deciding which site to give the token to
• The token:
– Queue (FIFO) Q of requesting
processes
– LN[1..n] : LN[j] is the sequence number of the request that node j executed most recently
• The request message:
– REQUEST(i, k): request
message from node i for its kth
critical section execution
• Other data structures
– RNi[1..n] for each node i, where
RNi[j] is the largest sequence
number received so far by i in a
REQUEST message from j.
To request critical section:
– If i does not have token,
increment RNi[i] and send
REQUEST(i, RNi[i]) to all nodes
– if i has token already, enter
critical section if the token is idle
(no pending requests), else
follow rule to release critical
section

On receiving REQUEST(i, sn) at j:
– set RNj[i] = max(RNj[i], sn)
– if j has the token and the token is
idle, send it to i if RNj[i] = LN[i] +
1. If token is not idle, follow rule
to release critical section
To enter critical section:
– enter CS if token is present

To release critical section:
– set LN[i] = RNi[i]
– For every node j which is not in
Q (in token), add node j to Q if
RNi[ j ] = LN[ j ] + 1
– If Q is non empty after the
above, delete first node from Q
and send the token to that node
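A sketch of these rules for one node (the token is a plain dict here; `send` is a runtime callback of mine):

```python
from collections import deque

class SuzukiKasami:
    """Suzuki-Kasami token-based mutual exclusion (illustrative sketch)."""
    def __init__(self, me, n, send, has_token=False):
        self.me, self.n, self.send = me, n, send
        self.RN = [0] * n                       # highest request seq. no. seen
        self.token = {"Q": deque(), "LN": [0] * n} if has_token else None
        self.in_cs = False

    def request(self):
        if self.token:                          # token already here and idle
            self.in_cs = True
            return
        self.RN[self.me] += 1
        for j in range(self.n):
            if j != self.me:
                self.send(j, ("REQUEST", self.me, self.RN[self.me]))

    def on_request(self, i, sn):
        self.RN[i] = max(self.RN[i], sn)        # ignore outdated requests
        if self.token and not self.in_cs:       # token idle: pass it on if due
            if self.RN[i] == self.token["LN"][i] + 1:
                tok, self.token = self.token, None
                self.send(i, ("TOKEN", tok))

    def on_token(self, tok):
        self.token, self.in_cs = tok, True      # enter CS on receiving token

    def release(self):
        self.in_cs = False
        tok = self.token
        tok["LN"][self.me] = self.RN[self.me]   # LN[i] = RNi[i]
        for j in range(self.n):                 # enqueue newly pending sites
            if j not in tok["Q"] and self.RN[j] == tok["LN"][j] + 1:
                tok["Q"].append(j)
        if tok["Q"]:
            nxt = tok["Q"].popleft()
            self.token = None
            self.send(nxt, ("TOKEN", tok))
```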
Points to note:

– No. of messages: 0 if node holds the token already, n otherwise

– Synchronization delay: 0 (node has the token) or max. message delay (token is elsewhere)

– No starvation
Raymond’s Algorithm

• Forms a (logical) directed tree with the token-holder as root

• Each node has a variable “Holder” that points to its parent on the path to the root. Root’s Holder variable points to itself

• Each node i has a FIFO request queue Qi
• To request critical section:
– Send REQUEST to parent on the
tree, provided i does not hold the
token currently and Qi is empty.
Then place request in Qi

• When a non-root node j receives a request from i:
– place request in Qj
– send REQUEST to parent if no
previous REQUEST sent
• When the root receives a
REQUEST:
– send the token to the requesting
node
– set Holder variable to point to
that node
• When a node receives the token:
– delete first entry from the queue
– send token to that node
– set Holder variable to point to
that node
– if queue is non-empty, send a
REQUEST message to the
parent (node pointed at by
Holder variable)
• To execute critical section:
– enter if token is received and
own entry is at the top of the
queue; delete the entry from the
queue

• To release critical section:
– if queue is non-empty, delete first
entry from the queue, send token
to that node and make Holder
variable point to that node
– If queue is still non-empty, send
a REQUEST message to the
parent (node pointed at by
Holder variable)
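A sketch of the request/token handling at one node, assuming the tree is already in place; `send` is a runtime callback of mine, and corner cases (e.g. requests crossing the token in flight) are glossed over:

```python
from collections import deque

class Raymond:
    """Raymond's tree-based token algorithm at one node (illustrative)."""
    def __init__(self, me, holder, send):
        self.me, self.holder, self.send = me, holder, send  # holder == me at root
        self.q = deque()
        self.in_cs = False

    def request_cs(self):
        if self.holder == self.me and not self.q:
            self.in_cs = True                 # token here and nobody waiting
            return
        if not self.q:                        # send REQUEST to parent only once
            self.send(self.holder, ("REQUEST", self.me))
        self.q.append(self.me)

    def on_request(self, frm):
        if not self.q and self.holder != self.me:
            self.send(self.holder, ("REQUEST", self.me))
        self.q.append(frm)
        if self.holder == self.me and not self.in_cs:
            self._pass_token()                # root with idle token: hand it out

    def on_token(self):
        self.holder = self.me                 # we become the root
        self._pass_token()                    # queue is non-empty by construction

    def _pass_token(self):
        nxt = self.q.popleft()                # first entry in the queue
        if nxt == self.me:
            self.in_cs = True
        else:
            self.holder = nxt                 # Holder points at new token holder
            self.send(nxt, ("TOKEN", None))
            if self.q:                        # still waiting: ask for it back
                self.send(self.holder, ("REQUEST", self.me))

    def release_cs(self):
        self.in_cs = False
        if self.q:
            self._pass_token()
```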
Points to note:

• Avg. message complexity O(log n)

• Sync. delay (T log n)/2, where T = max. message delay
Leader Election
Leader Election in Rings

• Models
– Synchronous or Asynchronous
– Anonymous (no unique id) or
Non-anonymous (unique ids)
– Uniform (no knowledge of ‘n’,
the number of processes) or
non-uniform (knows ‘n’)
• Known Impossibility Result:
– There is no Synchronous, non-
uniform leader election
protocol for anonymous rings
– Implications ??
Election in Asynchronous
Rings

Lelann-Chang-Robert’s
Algorithm
– send own id to node on left
– if an id received from right,
forward id to left node only if
received id greater than own id,
else ignore
– if own id received, declares itself
“leader”
• works on unidirectional rings
• worst-case message complexity = Θ(n²)
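A compact synchronous simulation of the algorithm (the round structure is only a simulation device; the algorithm itself is asynchronous):

```python
def lcr(ids):
    """Simulate Lelann-Chang-Roberts on a unidirectional ring of unique ids."""
    n = len(ids)
    in_flight = list(ids)            # message leaving node i (everyone sends own id)
    leader = None
    while leader is None:
        nxt = [None] * n
        for i in range(n):
            m = in_flight[i]
            if m is None:
                continue             # nothing in flight on this link
            j = (i + 1) % n          # message arrives at the next node
            if m == ids[j]:
                leader = m           # own id came back: j declares itself leader
            elif m > ids[j]:
                nxt[j] = m           # forward only ids greater than own
        in_flight = nxt              # ids smaller than the receiver's are dropped
    return leader

assert lcr([3, 7, 2, 9, 5]) == 9
```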
• Hirschberg-Sinclair Algorithm
– operates in phases, requires bidirectional ring
– In the kth phase, a candidate sends its own id up to 2^k hops on both sides; the message carries the id and a hop count, and is relayed hop by hop
– if an id is received, forward it if the received id is greater than own id, else ignore
– the last process in the chain sends a reply to the originator if its own id is less than the received id
– replies are always forwarded
– A process goes to the (k+1)th phase only if it receives a reply from both sides in the kth phase
– a process receiving its own id declares itself “leader”
• Message Complexity: O(n log n)
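Where the O(n log n) bound comes from: a process survives into phase k only if its id beats those of its 2^(k−1) neighbors on each side, so at most n/(2^(k−1) + 1) candidates remain in phase k; each candidate spends at most 4·2^k messages in that phase (out and back, on both sides), so every phase costs O(n) messages, and there are only O(log n) phases.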
• Lots of other algorithms exist
for rings
• Lower Bound Result:
– Any comparison-based leader election algorithm in a ring requires Ω(n log n) messages
– What if not comparison-based?
Leader Election in Arbitrary
Networks
• FloodMax
– synchronous, round-based
– at each round, each process sends
the max. id seen so far (not
necessarily its own) to all its
neighbors
– after diameter no. of rounds, if max.
id seen = own id, declares itself
leader
– Complexity = O(d·m), where d = diameter of the network, m = no. of edges
– does not extend to asynchronous
model trivially
• Variations of building different types of
spanning trees with no pre-specified
roots. Chosen root at the end is the
leader (Ex., the DFS spanning tree
algorithm we covered earlier)
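A round-by-round simulation of FloodMax (the `adj` adjacency dict and the ring example are my own test scaffolding):

```python
def floodmax(adj, diameter):
    """adj: dict node -> list of neighbors; returns the set of elected leaders."""
    best = {v: v for v in adj}                # max id seen so far (own id at start)
    for _ in range(diameter):                 # run for `diameter` rounds
        incoming = {v: [] for v in adj}
        for v in adj:                         # each node sends its best to neighbors
            for w in adj[v]:
                incoming[w].append(best[v])
        for v in adj:
            best[v] = max([best[v]] + incoming[v])
    return {v for v in adj if best[v] == v}   # leader: max id seen == own id

ring = {1: [2, 5], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 1]}
assert floodmax(ring, diameter=2) == {5}      # 5-node ring has diameter 2
```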
Clock Synchronization
Clock Synchronization

• Multiple machines with physical clocks. How can we keep them more or less synchronized?
• Internal vs. External
synchronization
• Perfect synchronization not
possible because of communication
delays
• Even synchronization within a
bound can not be guaranteed with
certainty because of
unpredictability of communication
delays.
• But still useful !! Ex. – Kerberos,
GPS
How clocks work

• Computer clocks are crystals that oscillate at a certain frequency

• Every H oscillations, the timer chip interrupts once (a clock tick). No. of interrupts per second is typically 18.2, 50, 60, or 100; can be higher, settable in some cases

• The interrupt handler increments a counter that keeps track of the no. of ticks from a reference point in the past (the epoch)

• Knowing the no. of ticks per second, we can calculate year, month, day, time of day etc.
Clock Drift
• Unfortunately, period of crystal
oscillation varies slightly
• If it oscillates faster, more ticks per
real second, so clock runs faster;
similar for slower clocks
• For machine p, when correct
reference time is t, let machine
clock show time as C = Cp(t)
• Ideally, Cp(t) = t for all p, t
• In practice,
1 – ρ ≤ dC/dt ≤ 1 + ρ
• ρ = max. clock drift rate, usually around 10^−5 for cheap oscillators
• Drift => Skew between clocks
(difference in clock values of two
machines)
Resynchronization

• Periodic resynchronization needed to offset skew

• If two clocks are drifting in opposite directions, max. skew after time t is 2ρt

• If the application requires that clock skew < δ, then the resynchronization period must satisfy r < δ/(2ρ)

• Usually ρ and δ are known
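For example, with cheap crystals (ρ = 10^−5) and a required skew bound of δ = 1 ms, the resynchronization period must satisfy r < 10^−3/(2 × 10^−5) = 50 seconds.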


Cristian’s Algorithm

• One m/c acts as the time server
• Each m/c sends a message
periodically (within resync. period r)
asking for current time
• Time server replies with its time
• Sender sets its clock to the reply
• Problems:
– message delay
– time server time is less than
sender’s current time
• Handling message delay: try to estimate how long the message carrying the time server’s time took to reach the sender
– measure round trip time and
halve it
– make multiple measurements of
round trip time, discard too high
values, take average of rest
– make multiple measurements
and take minimum
– use knowledge of processing
time at server if known
• Handling fast clocks
– do not set clock backwards; slow
it down over a period of time to
bring in tune with server’s clock
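A sketch of a Cristian-style client combining two of the ideas above (several round-trip samples, keep the minimum, halve the RTT); `ask_server` is a hypothetical RPC, not a real API:

```python
import time

def cristian_sync(ask_server, samples=5):
    """Estimate the offset to add to the local clock (illustrative sketch)."""
    best = None
    for _ in range(samples):
        t0 = time.monotonic()
        server_time = ask_server()                 # hypothetical RPC: server's clock
        rtt = time.monotonic() - t0                # measured round trip time
        local_at_reply = time.time()
        if best is None or rtt < best[0]:          # keep the minimum-RTT sample
            # assume symmetric delay: the server stamped its reply rtt/2 ago
            best = (rtt, server_time + rtt / 2 - local_at_reply)
    return best[1]       # positive offset: our clock is behind the server's
```

A fast clock would get a negative offset here; per the note above, it should be slowed down gradually rather than set backwards.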
Berkeley Algorithm

• Centralized as in Cristian’s, but the time server is active
• time server asks for time of other
m/cs at periodic intervals
• time server averages the times and
sends the new time to m/cs
• M/cs set their time (advance immediately or slow down gradually) to the new time
• Estimation of transmission delay as
before
External Synchronization

• Clocks must be synchronized with real time

• Cristian’s algorithm can be used if the time server is synchronized with real time somehow

• Berkeley algorithm cannot be used

• But what is “real time” anyway?


Measurement of time

• Astronomical
– traditionally used
– based on earth’s rotation around
its axis and around the sun
– solar day : interval between two
consecutive transits of the sun
– solar second : 1/86,400 of a
solar day
– period of earth’s rotation varies,
so solar second is not stable
– mean solar second : average length of a large no. of solar days, divided by 86,400
• Atomic
– based on the transitions of
Cesium 133 atom
– 1 sec. = time for 9,192,631,770
transitions
– about 50+ labs maintain Cesium clocks
– International Atomic Time (TAI) :
mean no. of ticks of the clocks
since Jan 1, 1958
– highly stable
– But slightly off-sync with mean
solar day (since solar day is
getting longer)
– A leap second is inserted occasionally to bring it in sync (so far 32, all positive)
– Resulting clock is called UTC –
Universal Coordinated Time
• UTC time is broadcast from
different sources around the world,
ex.
– National Institute of Standards &
Technology (NIST) – runs radio
stations, most famous being
WWV, anyone with a proper
receiver can tune in
– United States Naval Observatory
(USNO) – supplies time to all
defense sources, among others
– National Physical Laboratory in
UK
– GPS satellites
– Many others
NTP : Network Time Protocol

• Protocol for time sync. in the Internet
• Hierarchical architecture
– primary time servers (stratum 1) synchronize to national time standards via radio, satellite etc.
– secondary servers and clients (stratum 2, 3, …) synchronize to primary servers in a hierarchical manner (stratum 2 servers sync with stratum 1, stratum 3 with stratum 2, etc.)
• Reliability ensured by redundant
servers
• Communication by multicast
(usually within LAN servers),
symmetric (usually within multiple
geographically close servers), or
client server (to higher stratum
servers)
• Complex algorithms to combine
and filter times
• Sync. possible to within tens of
milliseconds for most machines
• But, just a best-effort service, no
guarantees
• RFC 1305 and
www.eecis.udel.edu/~ntp/ for more
details
Termination Detection
Termination Detection

Model
– processes can be active or idle
– only active processes send
messages
– idle process can become active on receiving a computation message
– active process can become idle
at any time
– termination: all processes are idle and no computation messages are in transit
– Can use global snapshot to
detect termination also
Huang’s Algorithm

• One controlling agent, has weight 1 initially
• All other processes are idle initially and have weight 0
• Computation starts when
controlling agent sends a
computation message to a process
• An idle process becomes active on
receiving a computation message
• B(DW) – computation message
with weight DW. Can be sent only
by the controlling agent or an active
process
• C(DW) – control message with
weight DW, sent by active
processes to controlling agent
when they are about to become idle
Let current weight at process = W

1. Send of B(DW):
• Find W1, W2 such that W1 > 0, W2 > 0, W1 + W2 = W
• Set W = W1 and send B(W2)

2. Receive of B(DW):
• W += DW
• if idle, become active

3. Send of C(DW):
• send C(W) to controlling agent (W becomes 0)
• Become idle

4. Receive of C(DW) (at the controlling agent):
• W += DW
• if W = 1, declare “termination”
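A sketch of the weight bookkeeping; using exact `Fraction`s avoids the floating-point underflow that repeated halving would cause (halving is just one valid choice of W1, W2):

```python
from fractions import Fraction

class Node:
    """Huang's weight-throwing scheme at one process (illustrative)."""
    def __init__(self, is_agent=False):
        self.w = Fraction(1) if is_agent else Fraction(0)
        self.active = is_agent

    def send_B(self):                        # split weight: keep W1, send W2
        half = self.w / 2
        self.w = half
        return half                          # carried by the B(DW) message

    def recv_B(self, dw):
        self.w += dw
        self.active = True                   # idle process becomes active

    def go_idle(self):                       # about to become idle:
        dw, self.w, self.active = self.w, Fraction(0), False
        return dw                            # send C(dw) to the controlling agent

    def recv_C(self, dw):                    # at the controlling agent
        self.w += dw
        return self.w == 1                   # True: declare termination

agent, p = Node(is_agent=True), Node()
p.recv_B(agent.send_B())                     # computation message activates p
assert agent.recv_C(p.go_idle())             # p idles: all weight back, terminate
```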
Building Spanning Trees
Building Spanning Trees

Applications:
• Broadcast
• Convergecast
• Leader election

Two variations:
• from a given root r
• root is not given a-priori
Flooding Algorithm
– starts from a given root r
– r initiates by sending message M
to all neighbours, sets its own
parent to nil
– For all other nodes, on receiving
M from i for the first time, set
parent to i and send M to all
neighbors except i. Ignore any M
received after that
– Tree built is an arbitrary
spanning tree
– Message complexity = 2m – (n – 1), where m = no. of edges
– Time complexity ??
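A simulation sketch of the flooding algorithm; with this queue discipline the run happens to build the BFS tree, but in a real asynchronous network any spanning tree can result:

```python
from collections import deque

def flood_tree(adj, root):
    """adj: dict node -> neighbors; returns parent pointers of a spanning tree."""
    parent = {root: None}                    # root sets its own parent to nil
    frontier = deque([root])                 # nodes whose M is "in flight"
    while frontier:
        i = frontier.popleft()
        for j in adj[i]:
            if j not in parent:              # first M received: adopt the sender
                parent[j] = i
                frontier.append(j)           # j forwards M to its own neighbors
    return parent                            # later copies of M are ignored

ring = {1: [2, 5], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 1]}
print(flood_tree(ring, root=1))              # e.g. {1: None, 2: 1, 5: 1, 3: 2, 4: 5}
```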
Constructing a DFS tree with
given root

• plain parallelization of the sequential algorithm by introducing synchronization
• each node i has a set unexplored,
initially contains all neighbors of i
• A node i (initiated by the root)
considers nodes in unexplored one
by one, sending a neighbor j a
message M and then waiting for a
response (parent or reject) before
considering the next node in
unexplored
• if j has already received M from
some other node, j sends a reject
to i
• else, j sets i as its parent, and
considers nodes in its unexplored
set one by one
• j will send a parent message to i
only when it has considered all
nodes in its unexplored set
• i then considers the next node in its
unexplored set
• Algorithm terminates when root has
received parent or reject message
from all its neighbours
• Worst case no. of messages = 4m
• Time complexity O(m)
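Where the 4m bound comes from: each edge carries at most one M and one response (parent or reject) in each direction, i.e. at most 4 messages per edge.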
What if no root given?

Main idea:
– Nodes have unique ids

– A node starts building the DFS
tree with itself as root (a single
node fragment) spontaneously
as in the previous case
– Fragments of the spanning tree get built in parallel; all nodes in each fragment are identified by the id of its root
– M carries the fragment id of the
sender
– when M sent from node in lower
id fragment to node in higher id
fragment, lower id fragment is
stalled by higher id fragment by
not giving a response
– When M sent from higher to
lower id fragment, node in lower
id fragment switches parent to
node in higher id tree, resets
unexplored, and starts DFS
again
– Eventually, the highest id node
becomes the root (leader
election!!)
– Message complexity O(mn) !!
– Time complexity O(m)
What about MSTs??

Gallager-Humblet-Spira Algorithm

– much more complex! but similar to Kruskal’s
– no root given, edge weights
assumed to be distinct
– MST built up in fragments
(subtree of MST)
– initially each node in its own
fragment
– fragments merge, finally just one
fragment
– outgoing edge – edge that goes
between two fragments
– known result – the min. wt. outgoing edge of a fragment is always in the MST
Issues:

1. How does a node find its min. wt. outgoing edge?
2. How does a fragment find its min. wt. outgoing edge?
3. When do two fragments merge?
4. How do two fragments merge?
Some definitions

• Each node has three states
– Sleeping – initial state
– Find – currently finding the
fragment’s min. wt. outgoing
edge
– Found – found the min. wt.
outgoing edge

• Each fragment has a level
– initially, each node is in a
fragment of level 0
Merging rule for fragments
• Suppose F is a fragment with id X,
level L, and min. wt. outgoing edge
e. Let fragment at other end of e be
F1, with id X1 and level L1. Then

– if L < L1, F merges into F1; the new fragment has id X1, level L1

– if L = L1, and e is also the min. wt. outgoing edge for F1, then F and F1 merge; the new fragment has id X2 = weight of e, and level L + 1; e is called the core edge

– otherwise, F waits until one of the above becomes true
How to find min. wt. outgoing edge
of a fragment
• nodes on core edge broadcasts initiate
message to all fragment nodes along
fragment edges; contains level and id

• on receiving initiate, a node finds its min. wt. outgoing edge (in Find state) – how?

• nodes send Report messages with their min. wt. edge up towards the core edge along fragment edges (and enter Found state)

• leaves send their min. wt. outgoing edge; intermediate nodes send the min. of their own min. wt. outgoing edge and the min. edges sent by their children in the fragment; path info to the best edge is kept

• when the Reports reach the nodes on the core edge, the min. wt. outgoing edge of the fragment is known
What then???

• nodes on the core edge send a Change_core message to the node i with the min. wt. outgoing edge
• node i then sends a Connect message to
node j at other end with its level
• If j’s fragment level is greater than i’s, an initiate message is sent from j to i. This updates the level and id of all nodes in i’s old fragment; if j has not sent a Report message yet, nodes in i’s old fragment start finding their min. wt. outgoing edge, else not.
• if j’s fragment level is less, no response is sent and i just waits till j’s fragment level becomes higher
• if the fragment levels match and j sends Connect to i also, merge into a level L+1 fragment with a new core edge and id, and send initiate messages
• some more details skipped, read
paper

• Algo. terminates when no outgoing edge is found for a fragment

• Worst case message complexity = O(n log n + m)
Fault Tolerance
and
Recovery
Fault Tolerance & Recovery
Classification of faults:
– based on component that failed
• program/process
• processor/machine
• link
• storage
• clock
– based on behavior of faulty
component
• Crash – just halts
• Failstop – crash with
additional conditions
• Omission – fails to perform
some steps
• Byzantine – behaves
arbitrarily
• Timing – violates timing
constraints
Types of tolerance:
– Masking – system always
behaves as per specifications
even in presence of faults
– Non-masking – system may
violate specifications in presence
of faults. Should at least behave
in a well-defined manner

A fault tolerant system should specify:
– Class of faults tolerated
– what tolerance is given from
each class
Some building blocks
(assumptions/primitives that help in
designing fault tolerant systems):
– Agreement (multiple processes
agree on some value)
– Clock synchronization
– Stable storage (data accessible
after crash)
– Reliable communication (point-
to-point, broadcast, multicast)
– Atomic actions
Agreement Problems

Model:
– total n processes, at most m of
which can be faulty
– reliable communication medium
– fully connected
– receiver always knows the
identity of the sender of a
message
– byzantine faults
– synchronous system. In each
round, a process receives
messages, performs
computation, and sends
messages.
Different problem variations

• Byzantine agreement (or Byzantine Generals problem)
– one process x broadcasts a
value v
• all nonfaulty processes must
agree on a common value
(Agreement condition).
• The agreed upon value must
be v if x is nonfaulty (Validity
condition)
• Consensus
– Each process broadcasts its
initial value
• satisfy agreement condition
• If initial value of all nonfaulty
processes is v, then the
agreed upon value must be v
• Interactive Consistency
– Each process i broadcasts its
own value vi
• all nonfaulty processes agree
on a common vector (v1, v2,
…,vn)
• If the ith process is nonfaulty,
then the ith value in the vector
agreed upon by nonfaulty
processes must be vi

A solution to the Byzantine agreement problem implies a solution to the other two
Byzantine Agreement Problem

• no solution possible if
– asynchronous system, or
– n < (3m + 1)
• needs at least (m+1) rounds of
message exchange (lower bound
result)
• “Oral” messages – messages can
be forged/changed in any manner,
but the receiver always knows the
sender
Lamport-Shostak-Pease
Algorithm
Recursively defined:

OM(m), m > 0
1. Source x broadcasts its value to all processes
2. Let vi = value received by process i from the source (0 if no value received). Process i acts as a new source and initiates OM(m − 1), sending vi to the remaining (n − 2) processes
3. For each i, j, i ≠ j, let vj = value received by process i from process j in step 2 using OM(m − 1). Process i uses the value majority(v1, v2, …, vn−1)

OM(0)
1. Source x broadcasts its value to all processes
2. Each process uses the value; if no value received, 0 is used

Time complexity = m + 1 rounds
Message complexity = O(n^m) (exponential in m)

Message complexity can be reduced to polynomial by increasing time (more rounds)
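A small simulation of OM(m); the `faulty` set and the particular corruption rule are my own test scaffolding (a real Byzantine process can behave arbitrarily), and ties in `majority` fall back to an arbitrary value:

```python
from collections import Counter

def majority(vals):
    return Counter(vals).most_common(1)[0][0]   # ties: arbitrary default

def om(m, general, lieutenants, order, faulty):
    """Simulate OM(m); returns {lieutenant: decided value} for values in {0,1}."""
    def send(frm, val, to):
        # a faulty sender may send anything; here: a receiver-dependent value
        return (val + to) % 2 if frm in faulty else val
    received = {l: send(general, order, l) for l in lieutenants}
    if m == 0:
        return received                         # OM(0): just use what was sent
    # step 2: each lieutenant relays what it received, as source of OM(m-1)
    relay = {o: om(m - 1, o, [l for l in lieutenants if l != o],
                   received[o], faulty) for o in lieutenants}
    # step 3: each lieutenant takes the majority of direct + relayed values
    return {l: majority([received[l]] +
                        [relay[o][l] for o in lieutenants if o != l])
            for l in lieutenants}

# n = 4, m = 1 faulty: nonfaulty lieutenants agree on the general's order
res = om(1, general=0, lieutenants=[1, 2, 3], order=1, faulty={3})
assert res[1] == res[2] == 1
```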
Atomic Actions and Commit
Protocols

An action may have multiple subactions executed by different processes at different nodes of a distributed system

Atomic action : either all subactions are done or none are done (all-or-nothing property / global atomicity property) as far as the system state is concerned

Commit protocols – protocols for enforcing the global atomicity property
Two-Phase Commit
• Assumes the presence of write-
ahead log at each process to
recover from local crashes
• One process acts as coordinator

Phase 1:
• coordinator sends
COMMIT_REQUEST to all
processes
• waits for replies from all processes
• on receiving a
COMMIT_REQUEST, a process, if
the local transaction is successful,
writes Undo/redo logs in stable
storage, and sends an AGREED
message to the coordinator.
Otherwise, sends an ABORT
Phase 2:
• If all processes reply AGREED,
coordinator writes COMMIT record
into the log, then sends COMMIT to
all processes. If at least one
process has replied ABORT,
coordinator sends ABORT to all.
Coordinator then waits for ACK
from all processes. If ACK is not
received within timeout period,
resend. If all ACKs are received,
coordinator writes COMPLETE to
log
• On receiving a COMMIT, a process
releases all resources/locks, and
sends an ACK to coordinator
• On receiving an ABORT, a process
undoes the transaction using Undo
log, releases all resources/locks,
and sends an ACK
• Ensures global atomicity; either all processes commit or all of them abort
• Resilient to crash failures (see text
for different scenarios of failure)
• Blocking protocol – crash of
coordinator can block all processes
• Non-blocking protocols possible;
ex., Three-Phase Commit protocol;
we will not discuss in this class
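The message flow condenses to a few lines; this sketch takes the coordinator’s view, omits timeouts/retries and crash handling, and `log` and the participant methods are hypothetical stand-ins for the real write-ahead-log machinery:

```python
def two_phase_commit(participants, log):
    """Coordinator side of 2PC (illustrative sketch)."""
    # Phase 1: ask everyone to prepare; a participant writes its undo/redo
    # logs to stable storage before voting "AGREED"
    votes = [p.prepare() for p in participants]        # "AGREED" or "ABORT"
    decision = "COMMIT" if all(v == "AGREED" for v in votes) else "ABORT"
    log.write(decision)                                # decision point: log first
    # Phase 2: broadcast the decision; participants release resources/locks
    # (undoing the transaction on ABORT) and ACK; resend on timeout in reality
    acks = [p.finish(decision) for p in participants]
    if all(a == "ACK" for a in acks):
        log.write("COMPLETE")
    return decision
```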
Checkpointing & Rollback
Recovery
Error recovery:
• Forward error recovery – assess
damage due to faults exactly and
repair the erroneous part of the
system state
– less overhead but hard to assess
effect of faults exactly in general

• Backward error recovery – on a fault, restore the system state to a previous error-free state and restart from there
– costlier, but more general,
application-independent
technique
Checkpoint and Rollback Recovery –
a form of backward error recovery

Checkpoint :
– local checkpoint – local state of a
process saved in stable storage
for possible rollback on a fault
– global checkpoint – collection of
local checkpoints, one from each
process

Consistent and Strongly Consistent Global Checkpoint – similar to consistent and strongly consistent global state respectively (also called a “recovery line”)
Orphan message – a message
whose receive is recorded in some
local checkpoint of a global
checkpoint but send is not recorded
in any local checkpoint in that
global checkpoint ( Note : A
consistent global checkpoint cannot
have an orphan message)
Lost message – a message whose
send is recorded but receive is not
in a global checkpoint
Are lost messages a problem?
– not if unreliable channels
assumed (since messages can
be lost anyway)
– if reliable channels assumed,
need to handle this properly!
Cannot lose messages !
– We will assume unreliable
channels for simplicity
Performance measures for a
checkpointing and recovery
algorithm
• during fault-free operation
– checkpointing time
– space for storing checkpoints
and messages (if needed)
• in case of a fault
– recovery time (time to establish
recovery line)
– extent of rollback (how far in the
past did we roll back? how much
computation is lost?)
– is output commit problem
handled? (if an output was sent
out before the fault, say cash
dispensed at a teller m/c, it
should not be resent after
restarting after the fault)
Some parameters that affect
performance

• Checkpoint interval (time between two successive checkpoints)
• Number of processes
• Communication pattern of the
application
• Fault frequency
• Nature of stable storage
Classification of Checkpoint &
Recovery Algorithms
• Asynchronous/Uncoordinated

– every process takes local checkpoints independently
– to recover from a fault in one process,
all processes coordinate to find a
consistent global checkpoint from their
local checkpoints
– very low fault-free overhead, recovery
overhead is high
– Domino effect possible (no consistent global checkpoint exists, so all processes have to restart from scratch)
– higher space requirements, as all local
checkpoints need to be kept
– Good for systems where fault is rare
and inter-process communication is
not too high (less chance of domino
effect)
• Synchronous/Coordinated

– all processes coordinate to take a consistent global checkpoint
– during recovery, every process
just rolls back to its last local
checkpoint independently
– low recovery overhead, but high
checkpointing overhead
– no domino effect possible
– low space requirement, since
only last checkpoint needs to be
stored at each process
• Communication Induced

– Synchronize checkpointing with communication, since message send/receive is the fundamental cause of inconsistency in a global checkpoint
– Ex. : take local checkpoint right
after every send! Last local
checkpoint at each process is
always consistent. But too costly
– Many more efficient variations exist; we will not discuss them in this class
• Message logging

– Take coordinated or
uncoordinated checkpoint, and
then log (in stable storage) all
messages received since the last
checkpoint
– On recovery, only the recovering
process goes back to its last
checkpoint, and then replays
messages from the log
appropriately until it reaches the
state right before the fault
– Only class that can handle
output commit problem!
– Details too complex to discuss in
this class
Some Checkpointing
Algorithms

Asynchronous/Uncoordinated
– See Juang-Venkatesan’s
algorithm in text, quite well-
explained

Synchronous/Coordinated
– Chandy-Lamport’s global state
collection algorithm can be
modified to handle recovery from
faults
– See Koo-Toueg’s algorithm in
text, quite well-explained
