Wan Fokkink
Distributed Algorithms: An Intuitive Approach
MIT Press, 2013
Algorithms
Distributed systems
A distributed system is an interconnected collection of
autonomous processes.
Motivation:
- multicore programming
Message passing
The two main paradigms to capture communication in
a distributed system are message passing and shared memory.
We will only consider message passing.
(The course Concurrency & Multithreading treats shared memory.)
Asynchronous communication means that sending and receiving
of a message are independent events.
In case of synchronous communication, sending and receiving
of a message are coordinated to form a single event; a message
is only allowed to be sent if its destination is ready to receive it.
We will mainly consider asynchronous communication.
Communication protocols
In a computer network, messages are transported through a medium,
which may lose, duplicate or garble these messages.
A communication protocol detects and corrects such flaws during
message passing.
Example: Sliding window protocols.
Assumptions
Unless stated otherwise, we assume:
- asynchronous communication
Formal framework
Transition systems
The (global) state of a distributed system is called a configuration.
The configuration evolves in discrete steps, called transitions.
A transition system consists of:
- a set C of configurations;
- a binary transition relation → on C; and
- a set I ⊆ C of initial configurations.
A configuration γ ∈ C is terminal if γ → δ for no δ ∈ C.
Executions
An execution is a sequence γ0 γ1 γ2 · · · of configurations that either is infinite or ends in a terminal configuration, such that γ0 ∈ I and γi → γi+1 for all i ≥ 0.
Causal order
In each configuration of an asynchronous system,
applicable events at different processes are independent.
The causal order ≺ on occurrences of events in an execution is the smallest transitive relation such that:
- if a and b are events at the same process and a occurs before b, then a ≺ b;
- if a is a send event and b the corresponding receive event, then a ≺ b.
Computations
Lamport's clock
A logical clock C maps occurrences of events in a computation to a partially ordered set such that a ≺ b ⇒ C(a) < C(b).
Lamport's clock LC assigns to each event a the length k of a longest causality chain a1 ≺ · · · ≺ ak = a.
LC can be computed at run-time:
Let a be an event, and k the clock value of the previous event at the same process (k = 0 if there is no such previous event).
If a is an internal or send event, then LC(a) = k + 1.
If a is a receive event, and b the send event corresponding to a, then LC(a) = max{k, LC(b)} + 1.
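The run-time rules above can be sketched as follows; the class and method names are illustrative, and message passing is modeled by handing the sender's clock value to the receiver.

```python
class LamportClock:
    """Minimal sketch of Lamport's logical clock for one process."""

    def __init__(self):
        self.value = 0  # k = 0: no previous event at this process

    def internal_or_send(self):
        # LC(a) = k + 1 for internal and send events
        self.value += 1
        return self.value

    def receive(self, sender_value):
        # LC(a) = max{k, LC(b)} + 1 for a receive event a,
        # where b is the corresponding send event
        self.value = max(self.value, sender_value) + 1
        return self.value
```

For instance, if p sends a message at clock value 1 and q (still at 0) receives it, q's receive event gets clock value max{0, 1} + 1 = 2.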
Question
Consider the following sequences of events at processes p0 , p1 , p2 :
(Diagram omitted: send events s1, s2, s3, receive events r1, r2, r3, and internal events d, e, distributed over p0, p1, p2; the exercise is to compute the Lamport clock values of these events.)
Vector clock
Given processes p0, . . . , pN−1.
We consider ℕ^N with a partial order ≤ defined by:
(k0, . . . , kN−1) ≤ (ℓ0, . . . , ℓN−1) ⇔ ki ≤ ℓi for all i = 0, . . . , N−1.
The vector clock VC maps occurrences of events in a computation to ℕ^N such that a ≺ b ⇔ VC(a) < VC(b).
VC(a) = (k0, . . . , kN−1) where each ki is the length of a longest causality chain a1 ≺ · · · ≺ aki of events at process pi with aki ⪯ a.
VC can also be computed at run-time.
Question: Why does VC(a) < VC(b) imply a ≺ b ?
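The run-time computation of VC can be sketched as follows (names illustrative): every event increments the process's own entry, and a receive event first takes the componentwise maximum with the vector piggybacked on the message.

```python
class VectorClock:
    """Sketch of a vector clock for process `pid` among n processes."""

    def __init__(self, pid, n):
        self.pid = pid
        self.v = [0] * n

    def internal_or_send(self):
        self.v[self.pid] += 1          # the event extends chains at pid
        return list(self.v)

    def receive(self, sender_vector):
        # componentwise max with the sender's attached vector,
        # then count the receive event itself at this process
        self.v = [max(a, b) for a, b in zip(self.v, sender_vector)]
        self.v[self.pid] += 1
        return list(self.v)

def strictly_less(u, v):
    """u < v in the partial order on N^N: u <= v componentwise, u != v."""
    return all(a <= b for a, b in zip(u, v)) and u != v
```

Concurrent events get incomparable vectors, which is exactly why VC, unlike LC, characterizes the causal order in both directions.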
Question
Consider the following sequences of events at processes p0 , p1 , p2 :
(Diagram omitted: the event sequences from the previous question, now annotated with vector clock values ranging from (1 0 0) to (4 3 3) and (2 0 4).)
Complexity measures
Resource consumption of a computation of a distributed algorithm
can be considered in several ways.
Message complexity: Total number of messages exchanged.
Bit complexity: Total number of bits exchanged.
(Only interesting when messages can be very long.)
Time complexity: Amount of time consumed.
(We assume: (1) event processing takes no time, and
(2) a message is received at most one time unit after it is sent.)
Space complexity: Amount of space needed for the processes.
Different computations may give rise to different consumption
of resources. We consider worst- and average-case complexity
(the latter with a probability distribution over all computations).
Big O notation
Assertions
Invariants
Snapshots
A snapshot of an execution of a distributed algorithm should return
a configuration of an execution in the same computation.
Snapshots can be used for:
- debugging
We distinguish basic messages of the underlying distributed algorithm
and control messages of the snapshot algorithm.
A snapshot of a basic computation consists of:
- a local snapshot of the state of each process, and
- the channel state of each channel: the basic messages in transit.
Chandy-Lamport algorithm
Consider a directed network with FIFO channels.
Initiators take a local snapshot of their state, and send a control message ⟨marker⟩ to their neighbors.
When a process that hasn't yet taken a snapshot receives ⟨marker⟩, it takes a local snapshot of its state, and sends ⟨marker⟩ to its neighbors.
A process q computes as channel state of a channel pq the messages it receives via pq after taking its local snapshot and before receiving ⟨marker⟩ from p.
Since channels are FIFO, this produces a meaningful snapshot.
Message complexity: Θ(E)
Time complexity: O(D)
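A minimal single-threaded sketch of the idea, assuming two hypothetical processes p and q whose state is a balance (total 200) and one FIFO channel pq modeled as a deque: q snapshots first, so the basic message still in transit ends up in the recorded channel state, and the snapshot conserves the total.

```python
from collections import deque

class Process:
    def __init__(self, name):
        self.name = name
        self.balance = 100       # illustrative local state
        self.snapshot = None     # recorded local state
        self.channel_state = []  # basic messages recorded for channel pq

    def take_snapshot(self):
        self.snapshot = self.balance

def run():
    chan_pq = deque()            # FIFO channel p -> q
    p, q = Process("p"), Process("q")

    # q initiates the snapshot.
    q.take_snapshot()

    # Meanwhile p transfers 10 to q, then joins the snapshot and
    # sends its marker into pq.
    p.balance -= 10
    chan_pq.append(("basic", 10))
    p.take_snapshot()
    chan_pq.append(("marker", None))

    # q drains pq: basic messages arriving after q's local snapshot
    # and before p's marker form the channel state of pq.
    while chan_pq:
        kind, amount = chan_pq.popleft()
        if kind == "marker":
            break
        q.balance += amount
        q.channel_state.append(amount)

    return p.snapshot, q.snapshot, q.channel_state
```

Here the recorded configuration (90, 100, one message of 10 in transit) is a configuration of an execution in the same computation, even though it was never the actual global state.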
(Example omitted: a run of the Chandy-Lamport algorithm with markers ⟨mkr⟩ and basic messages m1, m2 in transit; the snapshot records channel state {m2}.)
Lai-Yang algorithm
Suppose channels are non-FIFO. We use piggybacking.
Initiators take a local snapshot of their state.
When a process has taken its local snapshot, it appends the tag true to each outgoing basic message.
When a process that hasn't yet taken a snapshot receives a message tagged true, or a control message (see next slide), for the first time, it takes a local snapshot of its state before reception of this message.
A process q computes as channel state of a channel pq the basic messages without the tag true that it receives via pq after its local snapshot.
Wave algorithms
A wave algorithm satisfies the following properties, for each computation (also called a wave):
- termination: it is finite;
- decision: it contains one or more decide events;
- dependence: for each decide event e and each process p, an event at p causally precedes e.
Traversal algorithms
A traversal algorithm is a centralized wave algorithm;
i.e., there is one initiator, which sends around a token.
- In each computation, the token first visits all processes.
- Finally, a decide event happens at the initiator, which at that time holds the token.
(Example omitted: a spanning tree computed by Tarry's algorithm from initiator p, with channels numbered in the order the token traversed them.)
Tree edges, which are part of the spanning tree, are solid.
Frond edges, which aren't part of the spanning tree, are dashed.
Question: Could this spanning tree have been produced
by a depth-first search starting at p ?
Depth-first search
Depth-first search is obtained by adding to Tarry's algorithm:
R3 When a process receives the token, it immediately sends it
back through the same channel if this is allowed by R1,2.
(Example omitted: a depth-first search spanning tree from initiator p, with channels numbered in traversal order.)
Awerbuch's algorithm
A process holding the token for the first time informs all neighbors
except its parent and the process to which it forwards the token.
Cidon's algorithm
Abolishes the acknowledgements from Awerbuch's algorithm.
Tree algorithm
The tree algorithm is a decentralized wave algorithm
for undirected, acyclic networks.
The local algorithm at a process p:
- p waits until it has received messages from all its neighbors except one; that neighbor becomes its parent;
- then p sends a message to its parent;
- if p receives a message from its parent, it decides.
The two end points of one edge both decide.
Echo algorithm
The echo algorithm is a centralized wave algorithm for undirected networks.
- The initiator sends a message to all its neighbors.
- When a non-initiator receives a message for the first time, the sender becomes its parent, and it sends a message to all its neighbors except its parent.
- When a non-initiator has received a message from all its neighbors, it sends a message to its parent.
- When the initiator has received a message from all its neighbors, it decides.
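These rules can be sketched as a single-threaded simulation with a FIFO message queue (names illustrative); the parent pointers that come out form a spanning tree rooted at the initiator.

```python
from collections import deque

def echo(adjacency, initiator):
    """Sketch of the echo algorithm on an undirected graph.
    Returns (parent pointers, whether the initiator decided)."""
    parent = {initiator: None}
    # neighbors each process still has to hear from:
    pending = {v: set(neigh) for v, neigh in adjacency.items()}
    queue = deque(("wave", initiator, w) for w in adjacency[initiator])
    decided = False
    while queue:
        kind, sender, v = queue.popleft()    # v receives a message from sender
        pending[v].discard(sender)
        if v not in parent:                  # first message: adopt sender as parent
            parent[v] = sender
            for w in adjacency[v]:
                if w != sender:
                    queue.append(("wave", v, w))
        if not pending[v]:                   # heard from all neighbors
            if parent[v] is None:
                decided = True               # the initiator decides
            else:
                queue.append(("echo", v, parent[v]))
    return parent, decided
```

On a triangle with initiator 1, both non-initiators adopt 1 as parent and echo back, after which the initiator decides.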
Wait-for graph
A (non-blocked) process can issue a request to M other processes,
and becomes blocked until N of these requests have been granted.
Then it informs the remaining M N processes that the request
can be purged.
Only non-blocked processes can grant a request.
A directed wait-for graph captures dependencies between processes.
There is an edge from node p to node q if p sent a request to q
that wasn't yet purged by p or granted by q.
When no more grants are possible, nodes that remain blocked in the
wait-for graph are deadlocked in the snapshot of the basic algorithm.
(Example omitted: a wait-for graph with an OR request, i.e. a 1-out-of-3 request.)
(Examples omitted: wait-for graphs in which deadlock is, and is not, present.)
Notify_u:
  notified_u ← true
  for all w ∈ Out_u send NOTIFY to w
  if requests_u = 0 then Grant_u
  for all w ∈ Out_u await DONE from w

Grant_u:
  free_u ← true
  for all w ∈ In_u send GRANT to w
  for all w ∈ In_u await ACK from w
(Example omitted: a run of the deadlock detection algorithm with initiator u, in which NOTIFY messages propagate over outgoing edges, GRANT and ACK messages propagate over incoming edges, and DONE messages report completion.)
Termination detection
The basic algorithm is terminated if (1) each process is passive, and
(2) no basic messages are in transit.
(State diagram omitted: an active process can perform send, receive and internal events and can become passive; a passive process becomes active upon a receive event.)
Dijkstra-Scholten algorithm
Requires a centralized basic algorithm, and an undirected network.
A tree T is maintained, in which the initiator p0 is the root,
and all active processes are processes of T . Initially, T consists of p0 .
cc_p estimates (from above) the number of children of process p in T.
Shavit-Francez algorithm
Allows a decentralized basic algorithm; requires an undirected network.
A forest F of (disjoint) trees is maintained, rooted in initiators.
Initially, each initiator of the basic algorithm constitutes a tree in F .
Question
Why is the following termination detection algorithm not correct ?
Rana's algorithm
Allows a decentralized basic algorithm; requires an undirected network.
A logical clock provides (basic and control) events with a time stamp.
The time stamp of a process is the highest time stamp of its events
so far (initially it is 0).
Each basic message is acknowledged.
If at time t a process becomes quiet, meaning that (1) it is passive,
and (2) all basic messages it sent have been acknowledged,
it starts a wave (of control messages), tagged with t (and its id).
Only processes that have been quiet from a time ≤ t onward take part in the wave.
If a wave completes, its initiator calls Announce.
Safra's algorithm
Allows a decentralized basic algorithm and a directed network.
Each process maintains a counter of type Z; initially it is 0.
At each outgoing/incoming basic message, the counter is
increased/decreased.
At any time, the sum of all counters in the network is ≥ 0, and it is 0 if and only if no basic messages are in transit.
At each round trip, the token carries the sum of the counters
of the processes it has traversed.
The token may compute a negative sum for a round trip, when
a visited passive process receives a basic message, becomes active,
and sends basic messages that are received by an unvisited process.
Processes are colored white or black. Initially they are white, and a process that receives a basic message becomes black.
- When a black process forwards the token, the token becomes black and the process becomes white.
- When the token returns to the initiator p0:
  - If the token and p0 are white and the sum of all counters is zero, p0 calls Announce.
  - Else, p0 becomes white and sends a white token again.
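One token round trip can be sketched as a pure function over per-process counters and colors (a single-threaded abstraction; the token visits p1, . . . , pN−1 and then returns to p0):

```python
def safra_round(counters, colors):
    """One round trip of Safra's token. Returns True iff the round
    allows p0 to call Announce. `counters` are the per-process message
    counters, `colors` the strings 'white'/'black'; index 0 is p0."""
    token_sum, token_color = 0, "white"
    n = len(counters)
    for p in range(1, n):             # token travels p1 .. p_{N-1}
        token_sum += counters[p]
        if colors[p] == "black":      # a black process blackens the token
            token_color = "black"
            colors[p] = "white"       # and becomes white itself
    token_sum += counters[0]          # back at the initiator p0
    return (token_color == "white" and colors[0] == "white"
            and token_sum == 0)
```

A nonzero sum or a black token forces p0 to start another round rather than falsely announcing termination.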
Garbage collection
Processes are provided with memory.
Objects carry pointers to local objects and references to remote objects.
A root object can be created in memory; an object is always accessed
by navigating from a root object.
Aim of garbage collection: To reclaim inaccessible objects.
Three operations by processes to build or delete a reference:
- creation: the process holding the object sends a reference to another process;
- duplication: a process holding a reference passes a copy to another process;
- deletion: a process deletes its reference.
Reference counting
Drawbacks:
- cyclic garbage isn't reclaimed;
- duplication and deletion messages can race with each other.
A tree is maintained for each object, with the object at the root,
and the references to this object as the other nodes in the tree.
Mark-scan
Mark-scan garbage collection consists of two phases:
- a marking phase, in which all objects reachable from a root are marked;
- a scan phase, in which all unmarked objects are reclaimed.
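The two phases can be sketched on a single heap (a uniprocessor sketch; the distributed versions add coordination on top of this core):

```python
def mark_scan(heap, roots):
    """Mark-scan garbage collection sketch.  `heap` maps object ids
    to the ids they point to; returns the set of reclaimed objects."""
    marked = set()
    stack = list(roots)
    while stack:                      # marking phase: traverse from roots
        obj = stack.pop()
        if obj not in marked:
            marked.add(obj)
            stack.extend(heap[obj])
    return set(heap) - marked         # scan phase: unmarked objects are garbage
```

Note that a cycle unreachable from any root (objects 3 and 4 below) is reclaimed, which reference counting alone could not achieve.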
Routing
Routing means guiding a packet in a network to its destination.
A routing table at node u stores for each v ≠ u a neighbor w of u:
Each packet with destination v that arrives at u is passed on to w .
Criteria for good routing algorithms:
- use of optimal paths;
- few messages needed to compute the routing tables;
- robustness under topology changes.
Chandy-Misra algorithm
Consider an undirected, weighted network, with edge weights weight(vw) > 0.
A centralized algorithm to compute all shortest paths to initiator u0.
Initially, dist_u0(u0) = 0, dist_v(u0) = ∞ if v ≠ u0, and parent_v(u0) = ⊥.
u0 sends the message ⟨0⟩ to its neighbors.
When a node v receives ⟨d⟩ from a neighbor w, and d + weight(vw) < dist_v(u0), then:
- dist_v(u0) ← d + weight(vw) and parent_v(u0) ← w;
- v sends ⟨dist_v(u0)⟩ to its neighbors.
(Example omitted: a run of the Chandy-Misra algorithm on a four-node weighted network, showing the dist and parent values at v, w and x converging toward u0.)
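The update rule can be sketched as a single-threaded simulation with a message queue (adjacency maps and names are illustrative):

```python
from collections import deque
import math

def chandy_misra(adjacency, u0):
    """Sketch of the Chandy-Misra shortest-path computation toward u0.
    `adjacency[v]` maps each neighbor w of v to weight(vw) > 0.
    Returns (dist, parent)."""
    dist = {v: math.inf for v in adjacency}
    parent = {v: None for v in adjacency}
    dist[u0] = 0
    queue = deque((u0, w, 0) for w in adjacency[u0])   # u0 sends <0>
    while queue:
        w, v, d = queue.popleft()                      # v receives <d> from w
        if d + adjacency[v][w] < dist[v]:
            dist[v] = d + adjacency[v][w]
            parent[v] = w
            for x in adjacency[v]:
                queue.append((v, x, dist[v]))          # v sends <dist_v>
    return dist, parent
```

A node may improve its estimate several times (node 2 below first gets distance 4 directly from 0, later 2 via 1), which is exactly why the worst-case message complexity of the algorithm is high.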
Merlin-Segall algorithm
A centralized algorithm to compute all shortest paths to initiator u0 .
Initially, dist_u0(u0) = 0, dist_v(u0) = ∞ if v ≠ u0, and the parent_v(u0) values form a sink tree with root u0.
Each round, u0 sends ⟨0⟩ to its neighbors.
1. Let a node v get ⟨d⟩ from a neighbor w.
   If d + weight(vw) < dist_v(u0), then dist_v(u0) ← d + weight(vw) (and v stores w as future value for parent_v(u0)).
   If w = parent_v(u0), then v sends ⟨dist_v(u0)⟩ to its neighbors except parent_v(u0).
2. When a node v ≠ u0 has received a message from all its neighbors, it sends ⟨dist_v(u0)⟩ to parent_v(u0), and updates parent_v(u0).
u0 starts a new round after receiving a message from all its neighbors.
Message complexity: Θ(N²E)
(For each root, the algorithm requires Θ(NE) messages.)
(Example omitted: three rounds of the Merlin-Segall algorithm on a small weighted network, with dist values decreasing toward their shortest-path values while the sink tree toward u0 stays loop-free.)
Toueg's algorithm
Initially, dist_u(v) = 0 if u = v, dist_u(v) = weight(uv) if uv ∈ E, and dist_u(v) = ∞ if u ≠ v and uv ∉ E.
Floyd-Warshall algorithm
We first discuss a uniprocessor algorithm.
Initially, S = ∅; the initialization above defines d.
While S doesn't contain all nodes, a pivot w ∉ S is selected:
- for all nodes u, v: if d(u, w) + d(w, v) < d(u, v), then d(u, v) ← d(u, w) + d(w, v);
- w is added to S.
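This uniprocessor loop is the classical Floyd-Warshall algorithm; a minimal sketch:

```python
import math

def floyd_warshall(nodes, edges):
    """All-pairs shortest distances.  `edges` maps pairs (u, v) of
    distinct nodes to weights (undirected)."""
    d = {(u, v): 0 if u == v else math.inf for u in nodes for v in nodes}
    for (u, v), weight in edges.items():
        d[u, v] = d[v, u] = weight
    for w in nodes:                        # select each node once as pivot
        for u in nodes:
            for v in nodes:
                if d[u, w] + d[w, v] < d[u, v]:
                    d[u, v] = d[u, w] + d[w, v]
    return d
```

After the w-pivot round, d(u, v) is the length of a shortest path from u to v whose intermediate nodes all lie in S.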
Question
Toueg's algorithm
Assumption: Each node knows from the start the ids of all nodes (because pivots must be picked uniformly at all nodes).
Initially, at each node u: S_u = ∅.
At the w-pivot round, w broadcasts its values dist_w(v), for all nodes v.
If parent_u(w) = ⊥ for a node u ≠ w at the w-pivot round, then dist_u(w) = ∞, so dist_u(w) + dist_w(v) ≥ dist_u(v) for all nodes v.
Hence the sink tree of w can be used to broadcast dist_w.
Each node u sends ⟨request, w⟩ to parent_u(w), if it isn't ⊥, to let it pass on dist_w to u.
If u isn't in the sink tree of w, it proceeds to the next pivot round, with S_u ← S_u ∪ {w}.
Suppose u is in the sink tree of w.
If u ≠ w, it waits for the values dist_w(v) from parent_u(w).
u forwards the values dist_w(v) to neighbors that sent ⟨request, w⟩.
If u ≠ w, it checks for each node v whether dist_u(w) + dist_w(v) < dist_u(v).
If so, dist_u(v) ← dist_u(w) + dist_w(v) and parent_u(v) ← parent_u(w).
Finally, u proceeds to the next pivot round, with S_u ← S_u ∪ {w}.
(Example omitted: three pivot rounds of Toueg's algorithm on a four-node network u, v, w, x, showing dist and parent values being updated.)
Drawbacks:
- each node must know all node ids in advance;
- in each pivot round, the dist_w values are broadcast through the whole sink tree of w, so messages are large.
Upon reception of dist w , u can therefore first update dist u and parent u ,
and only forward values dist w (v ) for which dist u (v ) has changed.
Link-state routing
The failure of a node is eventually detected by its neighbors.
Each node periodically sends out a link-state packet, containing:
- its id and a sequence number;
- the ids of its neighbors and the weights of those channels.
Breadth-first search
(Diagram omitted: in a round, forward and explore messages extend the breadth-first tree from depth n to depth n + 1, and reverse messages report back to the initiator.)
Frederickson's algorithm
Computes ℓ ≥ 1 levels per round.
Initially, the initiator is at distance 0, all other nodes at distance ∞.
After round n, the tree has been constructed up to depth ℓn.
In round n + 1, explore messages extend the tree from depth ℓn to depth ℓ(n + 1).
Frederickson's algorithm
Let a node v receive ⟨explore, k⟩. We distinguish two cases.
- k < dist_v:
  dist_v ← k, and the sender becomes v's parent.
  If k is not divisible by ℓ, v sends ⟨explore, k+1⟩ to its other neighbors.
  If k is divisible by ℓ, v sends back ⟨reverse, k, true⟩.
- k ≥ dist_v:
  If k is not divisible by ℓ, v doesn't send a reply (because v sent ⟨explore, dist_v + 1⟩ into this edge).
  If k is divisible by ℓ, v sends back ⟨reverse, k, false⟩.
Store-and-forward deadlocks
Destination controller
Hops-so-far controller
Let vw be an edge in Gi.
Each ith buffer slot of v is linked to the ith buffer slot of w.
Moreover, if i < n−1, the ith buffer slot of w is linked to the (i+1)th buffer slot of v.
146 / 1
148 / 1
Election algorithms
In an election algorithm, each computation should terminate
in a configuration where one process is the leader.
Assumptions:
- all processes have unique ids, from a totally ordered set;
- all processes run the same local algorithm;
- the algorithm is decentralized: the initiators can be any nonempty set of processes.
Chang-Roberts algorithm
Consider a directed ring.
- Initially, each initiator sends a message carrying its id to its next neighbor.
- When a process receives an id smaller than its own, it purges the message.
- When it receives a larger id, it becomes passive and passes the message on.
- When it receives its own id, it becomes the leader.
Worst case: with the ids 0, 1, . . . , N−1 placed anti-clockwise in the ring, the algorithm takes N + (N−1) + · · · + 1 = N(N+1)/2 messages.
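A round-by-round simulation of these rules (names illustrative; every process is an initiator, and passive processes still forward messages):

```python
def chang_roberts(ids):
    """Chang-Roberts election sketch on a directed ring.
    `ids` lists the process ids in ring order.
    Returns (leader id, total number of messages sent)."""
    n = len(ids)
    active = [True] * n
    messages = 0
    tokens = [(i, ids[i]) for i in range(n)]        # (position, carried id)
    leader = None
    while leader is None:
        new_tokens = []
        for position, token in tokens:
            target = (position + 1) % n
            messages += 1
            if token == ids[target]:
                leader = token                      # own id came back: leader
            elif not active[target] or token > ids[target]:
                if token > ids[target]:
                    active[target] = False          # larger id makes target passive
                new_tokens.append((target, token))  # message passed on
            # else: an active process purges a smaller id
        tokens = new_tokens
    return leader, messages
```

The second assertion below exercises the quadratic worst case: ids decreasing in the direction of travel give 3·4/2 = 6 messages on a ring of three.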
Franklin's algorithm
Consider an undirected ring.
Each active process p repeatedly compares its own id
with the ids of its nearest active neighbors on both sides.
If such a neighbor has a larger id, then p becomes passive.
- If both neighbors have a smaller id, then p enters the next election round.
- If p receives its own id, then it becomes the leader.
Dolev-Klawe-Rodeh algorithm
Consider a directed ring.
The comparison of ids of an active process p and
its nearest active neighbors q and r is performed at r .
(Example omitted: a run of the Dolev-Klawe-Rodeh algorithm; the process that last carries the maximum id becomes the leader.)
p waits until it has received ids from all neighbors except one,
which becomes its parent.
(Example omitted: a run of the tree election algorithm on ids 1, 4, 5, 6; the largest id, 6, wins the election.)
Fragments
Lemma: Let F be a fragment
(i.e., a connected subgraph of the minimum spanning tree M).
Let e be the lowest-weight outgoing edge of F
(i.e., e has exactly one endpoint in F ).
Then e is in M.
Proof: Suppose not.
Then M {e} has a cycle,
containing e and another outgoing edge f of F .
Replacing f by e in M gives a spanning tree
with a smaller sum of weights of edges.
Kruskal's algorithm
This algorithm also works when edges can have the same weight.
Then the minimum spanning tree may not be unique.
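The fragment lemma above is exactly what makes Kruskal's greedy strategy safe: each accepted edge is the lowest-weight outgoing edge of some fragment. A union-find sketch:

```python
def kruskal(n, edges):
    """Kruskal's algorithm with union-find.  `edges` is a list of
    (weight, u, v) triples on nodes 0..n-1; returns the edges of a
    minimum spanning tree (equal weights broken by sort order)."""
    parent = list(range(n))

    def find(x):                          # root of x's fragment
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:                      # an outgoing edge joins two fragments
            parent[ru] = rv
            tree.append((weight, u, v))
    return tree
```

With distinct weights the resulting tree is the unique minimum spanning tree; with equal weights it is one of possibly several.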
Gallager-Humblet-Spira algorithm
Consider an undirected, weighted network,
in which different edges have different weights.
Distributed computation of a minimum spanning tree:
- Initially, each single process forms a fragment.
- Fragments are repeatedly joined via their lowest-weight outgoing edges, until a single fragment, the minimum spanning tree, remains.
A fragment carries a name fn and a level ℓ. Let fragment F = (fn, ℓ) have lowest-weight outgoing edge eF, leading to fragment F′ = (fn′, ℓ′).
- If ℓ < ℓ′, then F is absorbed: F ∪ F′ = (fn′, ℓ′).
- If ℓ = ℓ′ and eF = eF′, then the fragments merge: F ∪ F′ = (weight(eF), ℓ + 1).
The core edge of a fragment is the last edge that connected two sub-fragments at the same level; its end points are the core nodes.
Parameters of a process
Its state: sleep, find or found.
At reception of ⟨initiate, fn, ℓ, find/found⟩, a process stores fn and ℓ, sets its state to find or found accordingly, and adopts the sender as its parent.
When the process p that reported its lowest-weight outgoing basic edge receives ⟨changeroot⟩, p sets this channel to branch, and sends ⟨connect, level_p⟩ into it.
Questions
In case 2, if level_q = level_p, why does q postpone processing the incoming connect message from p ?
Answer: The fragment of q might be in the process of joining another fragment at level level_q, in which case the fragment of p should subsume the name and level of that joint fragment, instead of joining the fragment of q at an equal level.
Why does this postponement not give rise to a deadlock ? (I.e., why can't there be a cycle of fragments waiting for a reply to a postponed connect message ?)
Answer: Because different channels have different weights.
Back to election
By two extra messages at the very end,
the core node with the largest id becomes the leader.
So Gallager-Humblet-Spira is an election algorithm
for undirected networks (with a total order on the channels).
Lower bounds for the average-case message complexity of election algorithms based on comparison of ids:
- Rings: Ω(N log N)
- General networks: Ω(E + N log N)
(Example omitted: a run of the Gallager-Humblet-Spira algorithm, showing ⟨connect⟩, ⟨initiate⟩, ⟨test⟩, ⟨accept⟩/⟨reject⟩ and ⟨report⟩ messages as fragments merge and are absorbed.)
Fairness
Probabilistic algorithms
Each message sent upwards in the constructed tree reports the size
of its subtree.
All other messages report 0.
If a process's wave completes and it has computed the network size, it becomes the leader. Else, it selects a new id, and initiates a new wave, in the next round.
(Example omitted: competing waves tagged ⟨0, i, ·⟩, ⟨1, ℓ, ·⟩ and ⟨1, m, ·⟩ with reported subtree sizes; the process at the left computes size 6, and becomes the leader.)
Fault tolerance
Consensus
Consensus - Assumptions
(Example omitted: a run of a consensus algorithm with weighted votes in which some processes crash and the remaining correct processes decide 0.)
Echo mechanism
Complication: A Byzantine process may send different votes
to different processes.
Example: Let N = 4 and k = 1. Each round, a correct process
waits for three votes, and needs three b-votes to decide b.
(Examples omitted: without echoes, a Byzantine process can send different votes to different correct processes and drive them to different decisions; with the echo mechanism, a correct process decides b only after receiving more than (N+k)/2 b-votes, supported by ⟨decide, b⟩ messages. Only relevant vote messages were depicted, without their round numbers.)
Failure detection
Aim: To detect crashed processes.
T is the time domain, with a total order.
F(τ) is the set of crashed processes at time τ.
τ1 ≤ τ2 ⇒ F(τ1) ⊆ F(τ2) (i.e., no restart).
Such an F is called a failure pattern.
(Example omitted: a run for N = 3 and k = 1 with ⟨value⟩, ⟨ack⟩, ⟨vote⟩ and ⟨decide⟩ messages, in which the correct processes decide 0. Messages and acks that a process sends to itself, and irrelevant messages, were not shown.)
Clock synchronization
At certain time intervals, the processes synchronize clocks: they read each other's clock values, and adjust their local clocks.
The aim is to achieve, for some δ > 0 and all times τ:
|Cp(τ) − Cq(τ)| ≤ δ
Due to drift, this precision may degrade over time, necessitating repeated clock synchronizations.
We assume a known bound dmax on network latency.
For simplicity, let dmax be much smaller than δ, so that this latency can be ignored in the clock synchronization.
Clock synchronization
Suppose that after each synchronization, at say real time τ, for all processes p, q:
|Cp(τ) − Cq(τ)| ≤ δ0
for some δ0 < δ.
Due to ρ-bounded drift of local clocks, at real time τ + R:
|Cp(τ + R) − Cq(τ + R)| ≤ δ0 + (ρ − 1/ρ)R
So synchronizing every (δ − δ0)/(ρ − 1/ρ) time units suffices to maintain precision δ.
From here on, we assume at most k < N/3 Byzantine processes.
Mahaney-Schneider synchronizer
Consider a complete network of N processes, in which at most k < N/3 processes are Byzantine.
Each correct process in a synchronization round:
1. Collects the clock values of all processes (waiting for 2·dmax).
2. Discards those reported values τ for which fewer than N − k processes report a value in the interval [τ − δ, τ + δ] (they are from Byzantine processes).
3. Replaces all discarded/non-received values by an accepted value.
4. Takes the average of these N values as its new clock value.
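One synchronization step at a single correct process can be sketched as follows (a hypothetical helper; `reported` maps process ids to received clock values, with None for values that never arrived):

```python
def mahaney_schneider_round(reported, n, k, delta):
    """Sketch of one Mahaney-Schneider step at one correct process,
    assuming k < n/3 Byzantine processes.  Returns the new clock value."""
    values = [v for v in reported.values() if v is not None]

    def accepted(tau):
        # a value survives if at least n - k processes report
        # a value within delta of it
        return sum(1 for v in values if abs(v - tau) <= delta) >= n - k

    kept = [v for v in values if accepted(v)]
    fallback = kept[0]                     # any accepted value will do
    fixed = [v if v is not None and accepted(v) else fallback
             for v in reported.values()]
    return sum(fixed) / n                  # step 4: average of N values
```

In the test below, the outlier 100.0 is supported by fewer than N − k = 3 reports, so it is discarded as Byzantine and replaced before averaging.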
(Calculation omitted.) So we can take δ0 = (2/3)δ, and there should be a synchronization often enough to keep the precision within δ.
Synchronous networks
Let's again forget about Byzantine processes for a moment.
A synchronous network proceeds in pulses. In one pulse, each process:
1. sends messages
2. receives messages
3. performs internal events
(Derivation omitted: bounds relating the inverted local clocks Cp⁻¹ and Cq⁻¹ guarantee that each process has received all messages of pulse i before it starts pulse i + 1.)
Byzantine broadcast
Consider a synchronous network of N processes, in which at most k < N/3 processes are Byzantine.
One process g, called the general, is given an input xg ∈ {0, 1}.
The other processes, called lieutenants, know who is the general.
Requirements for k-Byzantine broadcast:
- termination: every correct process decides 0 or 1;
- dependence: if the general is correct, every correct process decides xg;
- agreement: all correct processes decide the same value.
(Impossibility argument omitted: for N = 3k, three scenarios over sets S, T, U of processes are compared; in one the processes in S and T decide 0, in another those in S and U decide 1, and in a third those in T decide 0 while those in U decide 1, violating agreement.)
(Example omitted: a recursive Broadcast run with a Byzantine lieutenant.)
Since k − 1 < (N − 1)/3, by induction, all correct lieutenants p build, in the recursive call Broadcast_p(6, 1), the same list M = {1, 1, 0, 0, 0, b}, for some b ∈ {0, 1}.
So in Broadcast_g(7, 2), they all decide major(M).
Then the five subcalls Broadcast_q(5, 0), for the correct lieutenants q, would at each correct lieutenant lead to the list {1, 0, 0, 1, 1}.
So in that case b = major({1, 0, 0, 1, 1}) = 1.
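The recursive Broadcast can be sketched as the classical "oral messages" recursion (a toy model: a Byzantine general here lies with a fixed per-receiver pattern, which is just one possible adversary):

```python
def major(values):
    """Majority value of a list of bits; ties resolve to 0."""
    return 1 if sum(values) > len(values) / 2 else 0

def om(send_value, general, lieutenants, k, byzantine):
    """Recursive Broadcast sketch with at most k Byzantine processes.
    Returns {p: value lieutenant p decides for `general`}."""
    received = {}
    for i, p in enumerate(sorted(lieutenants)):
        # a Byzantine general may send different values to different
        # lieutenants; here it alternates 0/1 as an example adversary
        received[p] = (i % 2) if general in byzantine else send_value
    if k == 0:
        return received
    # each lieutenant relays what it received in a sub-broadcast
    relayed = {p: om(received[p], p,
                     [q for q in lieutenants if q != p],
                     k - 1, byzantine)
               for p in lieutenants}
    return {q: major([received[q]] +
                     [relayed[p][q] for p in lieutenants if p != q])
            for q in lieutenants}
```

With N = 4 and k = 1, correct lieutenants both follow a correct general's value and still agree with each other when the general itself is Byzantine.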
Partial synchrony
A synchronous system can be obtained if local clocks have known
bounded drift, and there is a known upper bound on network latency.
In a partially synchronous system, either:
- such bounds exist, but they are unknown; or
- these bounds are known, but only valid from some unknown point in time.
Public-key cryptosystems
Given a large message domain M.
A public-key cryptosystem consists of functions Sq, Pq : M → M, for each process q, with
Sq(Pq(m)) = Pq(Sq(m)) = m
for all m ∈ M. Sq is kept secret by q, while Pq is made public.
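A toy RSA instance illustrates the property Sq(Pq(m)) = Pq(Sq(m)) = m; the tiny primes below are purely for demonstration, real systems use far larger parameters (and padding).

```python
def toy_rsa_keys():
    """Toy RSA key pair: returns (P, S) with S(P(m)) == P(S(m)) == m
    for messages m < n.  Requires Python 3.8+ for pow(e, -1, phi)."""
    p, q = 61, 53
    n = p * q                       # public modulus
    phi = (p - 1) * (q - 1)
    e = 17                          # public exponent, coprime with phi
    d = pow(e, -1, phi)             # secret exponent: e*d = 1 (mod phi)
    P = lambda m: pow(m, e, n)      # public function P_q
    S = lambda m: pow(m, d, n)      # secret function S_q
    return P, S

P, S = toy_rsa_keys()
```

Applying P then S decrypts, while applying S then P verifies a signature, which is exactly why the two compositions must both be the identity.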
Mutual exclusion
Ricart-Agrawala algorithm
When a process pi wants to access its critical section, it sends
request(tsi , i) to all other processes, with tsi a logical time stamp.
When pj receives this request, it sends permission to pi as soon as:
- pj isn't privileged, and
- pj doesn't have a pending request with a time stamp smaller than (tsi, i) (time stamps are compared lexicographically).
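The grant rule at pj can be sketched as a small predicate (a hypothetical helper; `holder` is a pair: whether pj is currently privileged, and pj's own pending request or None):

```python
def grants_immediately(requester, holder):
    """Ricart-Agrawala grant rule sketch.  Requests are (timestamp, pid)
    pairs; lexicographically smaller means higher priority."""
    in_cs, own_request = holder
    if in_cs:
        return False                  # defer until pj exits its critical section
    if own_request is not None and own_request < requester:
        return False                  # pj's own pending request has priority
    return True                       # otherwise send permission now
```

Because the (timestamp, pid) pairs are totally ordered, two concurrent requesters can never both defer to each other.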
Raymond's algorithm
Given an undirected network, with a sink tree.
At any time, the root, holding a token, is privileged.
Each process maintains a FIFO queue, which can contain
ids of its children, and its own id. Initially, this queue is empty.
Queue maintenance:
- when a process wants to enter its critical section, it adds its own id to its queue;
- when a process receives a request from a child, it adds that child's id to its queue;
- in both cases, if its queue was empty before, it sends a request to its parent.
When the root exits its critical section (and its queue is non-empty), it sends the token to the process q at the head of its queue, removes q from its queue, makes q its parent, and, if its queue is still non-empty, sends a request to q.
Let p get the token from its parent, with q at the head of its queue:
- if q is p's own id, then p becomes the root and enters its critical section;
- otherwise p forwards the token to q, makes q its parent, and, if its queue is non-empty, sends a request to q.
(Example omitted: a run of Raymond's algorithm on a tree of five processes, showing how the queues evolve and the token travels along the sink tree.)
(Figure omitted: a coterie on five processes 1-5.)
Question: What are the quorums if 1,2 crashed? And if 1,2,3 crashed? And if 1,2,4 crashed?
Self-stabilization
Advantages:
- fault tolerance
- straightforward initialization
p0 is privileged if x0 = xN−1; every other pi is privileged if xi ≠ xi−1.
(Example omitted: configurations of a ring p0, . . . , p3 with values xi.)
It isn't hard to see that this self-stabilizes.
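Dijkstra's token ring can be sketched concretely; the update rules below (p0 increments modulo K when privileged, every other pi copies its predecessor) are the standard ones, which the slide only hints at, and the scheduler here is a simple central daemon firing the first privileged process.

```python
def dijkstra_ring_step(x, K):
    """Fire the first privileged process in Dijkstra's token ring.
    x[i] is the value of p_i; returns the index that fired."""
    n = len(x)
    if x[0] == x[n - 1]:              # p0 privileged
        x[0] = (x[0] + 1) % K
        return 0
    for i in range(1, n):
        if x[i] != x[i - 1]:          # p_i (i > 0) privileged
            x[i] = x[i - 1]
            return i

def stabilizes(x, K, steps=100):
    """Run for a while, then check the configuration is legitimate:
    exactly one process holds the privilege."""
    for _ in range(steps):
        dijkstra_ring_step(x, K)
    privileged = (1 if x[0] == x[-1] else 0) + sum(
        1 for i in range(1, len(x)) if x[i] != x[i - 1])
    return privileged == 1
```

Starting from an arbitrary (even corrupt) configuration, the ring converges to configurations with a single circulating privilege, which is the self-stabilization property.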
(Details omitted: each process p maintains variables root_p, parent_p and dist_p; on detecting an inconsistency, p resets root_p ← p, parent_p ← ⊥, dist_p ← 0, and otherwise it adopts root_q and dist_q + 1 from a suitable neighbor q with root_p ≤ root_q.)
Et cetera