Distributed Systems
Textbook:
• “Advanced Concepts in Operating
Systems” by Mukesh Singhal and
Niranjan G. Shivaratri
will cover about half the course,
supplemented by copies of papers
Why build distributed systems?
• Resource Sharing
• Higher Performance
• Fault Tolerance
• Scalability
Why is it hard to design them?
– Topology : completely
connected, ring, tree etc.
– Communication : shared
memory/message passing
(reliable? Delay? FIFO/Causal?
Broadcast/multicast?)
– Synchronous/asynchronous
– Failure models (fail stop, crash,
omission, Byzantine…)
• Communication complexity/bit
complexity : no. of bits exchanged
Global State Recording
Applications:
– Checking “stable” properties,
checkpoint & recovery
Issues:
– Need to capture both node and
channel states
– system cannot be stopped
– no global clock
Some notations:
– LSi : local state of process i
– send(mij) : send event of
message mij from process i to
process j
– rec(mij) : similar, receive instead
of send
– time(x) : time at which state x
was recorded
– time(send(m)) : time at which
send(m) occurred
send(mij) ∈ LSi iff
time(send(mij)) < time(LSi)
inconsistent(LSi, LSj) = { mij :
rec(mij) ∈ LSj and send(mij) ∉ LSi }
transit(LSi, LSj) = { mij :
send(mij) ∈ LSi and rec(mij) ∉ LSj }
GS is consistent iff
for all i, j, 1 ≤ i, j ≤ n,
inconsistent(LSi, LSj) = ∅
GS is transitless iff
for all i, j, 1 ≤ i, j ≤ n,
transit(LSi, LSj) = ∅
GS is strongly consistent if it is
consistent and transitless.
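The definitions above can be written as executable predicates. The sketch below assumes a made-up representation: each local state LSi is a pair of sets (sent, received), and a message is a (src, dst, payload) tuple.

```python
# Sketch (assumed representation): each local state is (sent, received),
# two sets of recorded messages; a message is a (src, dst, payload) tuple.

def inconsistent(i, ls_i, j, ls_j):
    """Messages from i to j recorded as received in LSj but not sent in LSi."""
    sent_i, _ = ls_i
    _, recd_j = ls_j
    return {m for m in recd_j if m[0] == i and m[1] == j and m not in sent_i}

def transit(i, ls_i, j, ls_j):
    """Messages from i to j recorded as sent in LSi but not yet received in LSj."""
    sent_i, _ = ls_i
    _, recd_j = ls_j
    return {m for m in sent_i if m[0] == i and m[1] == j and m not in recd_j}

def is_consistent(gs):          # gs maps process id -> (sent, received)
    return all(not inconsistent(i, gs[i], j, gs[j]) for i in gs for j in gs)

def is_transitless(gs):
    return all(not transit(i, gs[i], j, gs[j]) for i in gs for j in gs)

def is_strongly_consistent(gs):
    return is_consistent(gs) and is_transitless(gs)
```

For example, a global state in which process 2 has recorded the receipt of a message that process 1 never recorded sending is inconsistent (an "orphan" message), while a recorded-but-undelivered message only violates the transitless condition.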
Chandy-Lamport’s Algorithm
Distributed Mutual Exclusion
• Requirements:
– at most one process in critical
section (safety)
– if more than one requesting
process, someone enters
(liveness)
– a requesting process enters
within a finite time (no starvation)
– requests are granted in order
(fairness)
Classification of Distributed
Mutual Exclusion Algorithms
• Token based
– ex. Suzuki-Kasami
Some Complexity Measures
Issues:
– No starvation
Raymond’s Algorithm
• Models
– Synchronous or Asynchronous
– Anonymous (no unique id) or
Non-anonymous (unique ids)
– Uniform (no knowledge of ‘n’,
the number of processes) or
non-uniform (knows ‘n’)
• Known Impossibility Result:
– There is no Synchronous, non-
uniform leader election
protocol for anonymous rings
– Implications ??
Election in Asynchronous
Rings
LeLann-Chang-Roberts (LCR)
Algorithm
– send own id to node on left
– if an id received from right,
forward id to left node only if
received id greater than own id,
else ignore
– if own id received, declares itself
“leader”
• works on unidirectional rings
• message complexity: worst case Θ(n²), average O(n log n)
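The three rules above can be simulated in a few lines. This sketch delivers messages in synchronized batches purely for convenience (the algorithm itself needs no rounds), and arbitrarily takes process k's "left" neighbour to be (k + 1) mod n.

```python
def lcr_leader(ids):
    """Elect the max id on a unidirectional ring; returns (leader, #messages).

    ids are listed in ring order; process k's "left" neighbour is taken
    to be (k + 1) % n (an arbitrary choice of direction).
    """
    n = len(ids)
    msgs = [(k, ids[k]) for k in range(n)]   # each process sends its own id
    count = 0
    leader = None
    while msgs:
        nxt = []
        for src, uid in msgs:
            count += 1                        # one message delivered
            dst = (src + 1) % n
            if uid == ids[dst]:
                leader = uid                  # own id came back: leader
            elif uid > ids[dst]:
                nxt.append((dst, uid))        # forward only larger ids
            # smaller ids are swallowed
        msgs = nxt
    return leader, count
```

With ids decreasing in the send direction, e.g. [4, 3, 2, 1], the id k travels k hops before being swallowed, giving the Θ(n²) worst case: n(n+1)/2 messages.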
• Hirschberg-Sinclair Algorithm
– operates in phases, requires
bidirectional ring
– In the kth phase, a process sends
its own id up to distance 2^k on
both sides of itself (the message
travels hop by hop, carrying the
id and a hop count)
– if an id is received, forward it if the
received id is greater than own id,
else ignore
– the last process in the chain sends a
reply to the originator if its id is less
than the received id
– replies are always forwarded
– a process goes to the (k+1)th phase
only if it receives a reply from both
sides in the kth phase
– a process receiving its own id
declares itself “leader”
• Message Complexity: O(n lg n)
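A process gets replies from both sides in phase k exactly when no process within distance 2^k has a larger id, so the phase structure can be sketched by simulating phase *outcomes* rather than individual messages (this sketch elects the leader correctly but does not reproduce the message count):

```python
def hs_leader(ids):
    """Simulate who survives each Hirschberg-Sinclair phase.

    ids are listed in ring order and assumed unique. A process survives
    phase k iff its id is the largest within distance 2^k on both sides --
    exactly the condition for receiving replies from both directions.
    """
    n = len(ids)
    candidates = set(range(n))
    k = 0
    while len(candidates) > 1:
        d = 1 << k
        survivors = set()
        for i in candidates:
            # ids within distance d on either side (wrapping around the ring)
            window = [ids[(i + s) % n] for s in range(-d, d + 1)]
            if ids[i] == max(window):
                survivors.add(i)
        candidates = survivors
        k += 1
    return ids[candidates.pop()]
```

At most n / 2^(k-1) processes can survive phase k (two survivors cannot be within 2^k of each other), and each phase costs O(n) messages per surviving process, which is where the O(n lg n) total comes from.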
• Lots of other algorithms exist
for rings
• Lower Bound Result:
– Any comparison-based leader
election algorithm in a ring
requires Ω(n lg n) messages
– What if not comparison-based?
Leader Election in Arbitrary
Networks
• FloodMax
– synchronous, round-based
– at each round, each process sends
the max. id seen so far (not
necessarily its own) to all its
neighbors
– after a number of rounds equal to
the diameter, if max. id seen =
own id, declares itself leader
– Complexity = O(d·m), where d =
diameter of the network, m = no. of
edges
– does not extend to asynchronous
model trivially
• Variations of building different types of
spanning trees with no pre-specified
roots. Chosen root at the end is the
leader (Ex., the DFS spanning tree
algorithm we covered earlier)
Clock Synchronization
• Astronomical
– traditionally used
– based on earth’s rotation around
its axis and around the sun
– solar day : interval between two
consecutive transits of the sun
– solar second : 1/86,400 of a
solar day
– period of earth’s rotation varies,
so solar second is not stable
– mean solar second : average the
length of a large number of solar
days, then divide by 86,400
• Atomic
– based on the transitions of
Cesium 133 atom
– 1 sec. = time for 9,192,631,770
transitions
– about 50+ labs maintain Cesium
clock
– International Atomic Time (TAI) :
mean no. of ticks of the clocks
since Jan 1, 1958
– highly stable
– But slightly off-sync with mean
solar day (since solar day is
getting longer)
– A leap second is inserted
occasionally to bring it back in
sync (so far 32, all positive)
– Resulting clock is called UTC –
Universal Coordinated Time
• UTC time is broadcast from
different sources around the world,
ex.
– National Institute of Standards &
Technology (NIST) – runs radio
stations, most famous being
WWV, anyone with a proper
receiver can tune in
– United States Naval Observatory
(USNO) – supplies time to all
defense sources, among others
– National Physical Laboratory in
UK
– GPS satellites
– Many others
NTP : Network Time Protocol
Termination Detection
Model
– processes can be active or idle
– only active processes send
messages
– an idle process can become
active on receiving a
computation message
– active process can become idle
at any time
– termination: all processes are
idle and no computation
messages are in transit
– Can use global snapshot to
detect termination also
Huang’s Algorithm
3. Send of B(DW):
• Find W1, W2 such that W1 > 0, W2
> 0, W1 + W2 = W
• Set W = W1 and send B(W2)
4. Receive of B(DW):
• W += DW;
• if idle, become active
5. Send of C(DW):
• send C(W) to controlling agent
• Become idle
6. Receive of C(DW):
• W += DW
• if W = 1, declare “termination”
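Rules 3-6 above can be sketched directly. In this illustration message delivery is immediate, and splitting the weight in half is just one valid choice of W1, W2; the class and method names are made up.

```python
from fractions import Fraction   # exact arithmetic, so W can return to 1

class Controller:
    def __init__(self):
        self.weight = Fraction(1)     # controlling agent starts with W = 1
        self.terminated = False

    def send_B(self, proc):           # rule 3: keep W1, send B(W2)
        w2 = self.weight / 2
        self.weight -= w2
        proc.receive_B(w2)

    def receive_C(self, dw):          # rule 6: W += DW; W = 1 => termination
        self.weight += dw
        if self.weight == 1:
            self.terminated = True

class Process:
    def __init__(self):
        self.weight = Fraction(0)
        self.active = False

    def receive_B(self, dw):          # rule 4: W += DW; if idle, become active
        self.weight += dw
        self.active = True

    def send_B(self, proc):           # rule 3: split weight with the message
        w2 = self.weight / 2
        self.weight -= w2
        proc.receive_B(w2)

    def go_idle(self, controller):    # rule 5: send C(W), become idle
        controller.receive_C(self.weight)
        self.weight = Fraction(0)
        self.active = False
```

The invariant is that the weights held by the controller, the active processes, and the in-transit messages always sum to 1, so W = 1 at the controller can only mean no process is active and no message is in transit.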
Building Spanning Trees
Applications:
• Broadcast
• Convergecast
• Leader election
Two variations:
• from a given root r
• root is not given a-priori
Flooding Algorithm
– starts from a given root r
– r initiates by sending message M
to all neighbours, sets its own
parent to nil
– For all other nodes, on receiving
M from i for the first time, set
parent to i and send M to all
neighbors except i. Ignore any M
received after that
– Tree built is an arbitrary
spanning tree
– Message complexity
= 2m − (n − 1), where m = no. of
edges, n = no. of nodes
– Time complexity ??
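A sketch of the flooding construction: here FIFO (BFS-order) delivery stands in for an arbitrary message arrival order, so the tree produced happens to be the BFS tree, one of the possible arbitrary spanning trees.

```python
from collections import deque

def flood_spanning_tree(adj, root):
    """Flooding from a given root; returns (parent map, #messages).

    adj: {node: [neighbours]}. The root's parent is None (nil).
    """
    parent = {root: None}
    msgs = deque((root, u) for u in adj[root])   # root sends M to all neighbours
    count = 0
    while msgs:
        i, v = msgs.popleft()                    # v receives M from i
        count += 1
        if v not in parent:                      # first M: adopt i as parent,
            parent[v] = i                        # relay M to everyone except i
            msgs.extend((v, u) for u in adj[v] if u != i)
        # later copies of M are ignored (but still count as messages)
    return parent, count
```

On a 4-node ring (m = 4, n = 4) the simulation delivers exactly 2m − (n − 1) = 5 messages: every edge carries M in both directions except the n − 1 tree edges, which carry it only parent-to-child.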
Constructing a DFS tree with
given root
Main idea:
Gallager-Humblet-Spira Algorithm
Byzantine Agreement
Model:
– total n processes, at most m of
which can be faulty
– reliable communication medium
– fully connected
– receiver always knows the
identity of the sender of a
message
– byzantine faults
– synchronous system. In each
round, a process receives
messages, performs
computation, and sends
messages.
Different problem variations
• no solution possible if
– asynchronous system, or
– n < (3m + 1)
• needs at least (m+1) rounds of
message exchange (lower bound
result)
• “Oral” messages – messages can
be forged/changed in any manner,
but the receiver always knows the
sender
Lamport-Shostak-Pease
Algorithm
Recursively defined:
OM(m), m > 0
1. Source x broadcasts its value to
all processes
2. Let vi = value received by process
i from the source (0 if no value
received). Process i acts as a
new source and initiates OM(m
− 1), sending vi to the remaining
(n − 2) processes
3. For each i, j, i ≠ j, let vj = value
received by process i from
process j in step 2 using OM(m − 1).
Process i uses the value
majority(v1, v2, …, v(n−1))
OM(0)
1. Source x broadcasts its value to
all processes
2. Each process uses the value
received; if no value is received,
0 is used
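The recursion can be sketched compactly. The faulty behaviour below (a traitorous source sends p mod 2 to process p) is one made-up example; a real Byzantine process may act arbitrarily, and the algorithm must tolerate all behaviours.

```python
from collections import Counter

def majority(vals):
    """Majority value (ties broken arbitrarily)."""
    [(v, _)] = Counter(vals).most_common(1)
    return v

def om(m, source, processes, value, traitors):
    """Value decided by each process in `processes` after OM(m).

    Assumed traitor behaviour: a traitorous sender sends p % 2 to
    process p instead of its real value.
    """
    direct = {p: (p % 2 if source in traitors else value) for p in processes}
    if m == 0:
        return direct                        # OM(0): use the value received
    decided = {}
    for p in processes:
        others = [q for q in processes if q != p]
        # p hears direct[p] from the source, plus -- via OM(m-1) -- the
        # value each other process q claims to have received
        relayed = [om(m - 1, q, [r for r in processes if r != q],
                      direct[q], traitors)[p]
                   for q in others]
        decided[p] = majority([direct[p]] + relayed)
    return decided
```

With n = 4 and m = 1 (the smallest case satisfying n ≥ 3m + 1): if one lieutenant is a traitor, the loyal lieutenants still decide on the commander's value; if the commander is the traitor, the loyal lieutenants still agree with each other.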
Two-Phase Commit (2PC)
Phase 1:
• coordinator sends
COMMIT_REQUEST to all
processes
• waits for replies from all processes
• on receiving a
COMMIT_REQUEST, a process, if
the local transaction is successful,
writes Undo/redo logs in stable
storage, and sends an AGREED
message to the coordinator.
Otherwise, sends an ABORT
Phase 2:
• If all processes reply AGREED,
coordinator writes COMMIT record
into the log, then sends COMMIT to
all processes. If at least one
process has replied ABORT,
coordinator sends ABORT to all.
Coordinator then waits for ACK
from all processes. If ACK is not
received within timeout period,
resend. If all ACKs are received,
coordinator writes COMPLETE to
log
• On receiving a COMMIT, a process
releases all resources/locks, and
sends an ACK to coordinator
• On receiving an ABORT, a process
undoes the transaction using Undo
log, releases all resources/locks,
and sends an ACK
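The two phases above can be condensed into a sketch. Participants are modelled as objects whose local transaction outcome is fixed up front, delivery is immediate, and crashes and timeouts are ignored for brevity; all names here are made up.

```python
class Participant:
    def __init__(self, local_ok):
        self.local_ok = local_ok      # does the local transaction succeed?
        self.log = []                 # stands in for stable storage
        self.state = None

    def on_commit_request(self):      # Phase 1 at a participant
        if self.local_ok:
            self.log.append("undo/redo")      # write logs, then vote
            return "AGREED"
        return "ABORT"

    def on_decision(self, decision):  # Phase 2: COMMIT or ABORT arrives
        self.state = decision         # commit, or undo using the log
        return "ACK"                  # then release resources/locks

def two_phase_commit(participants):
    # Phase 1: COMMIT_REQUEST to all, gather votes
    votes = [p.on_commit_request() for p in participants]
    # Phase 2: commit only if everyone agreed, then collect ACKs
    decision = "COMMIT" if all(v == "AGREED" for v in votes) else "ABORT"
    acks = [p.on_decision(decision) for p in participants]
    assert all(a == "ACK" for a in acks)      # coordinator writes COMPLETE
    return decision
```

A single ABORT vote forces every participant to abort, which is the global atomicity property discussed next.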
• Ensures global atomicity; either all
processes commit or all of them
abort
• Resilient to crash failures (see text
for different scenarios of failure)
• Blocking protocol – crash of
coordinator can block all processes
• Non-blocking protocols possible;
ex., Three-Phase Commit protocol;
we will not discuss in this class
Checkpointing & Rollback
Recovery
Error recovery:
• Forward error recovery – assess
damage due to faults exactly and
repair the erroneous part of the
system state
– less overhead but hard to assess
effect of faults exactly in general
Checkpoint :
– local checkpoint – local state of a
process saved in stable storage
for possible rollback on a fault
– global checkpoint – collection of
local checkpoints, one from each
process
Message logging :
– Take a coordinated or
uncoordinated checkpoint, and
then log (in stable storage) all
messages received since the last
checkpoint
– On recovery, only the recovering
process goes back to its last
checkpoint, and then replays
messages from the log
appropriately until it reaches the
state right before the fault
– Only class that can handle
output commit problem!
– Details too complex to discuss in
this class
Some Checkpointing
Algorithms
Asynchronous/Uncoordinated
– See Juang-Venkatesan’s
algorithm in text, quite well-
explained
Synchronous/Coordinated
– Chandy-Lamport’s global state
collection algorithm can be
modified to handle recovery from
faults
– See Koo-Toueg’s algorithm in
text, quite well-explained