
Distributed Systems

Consistency & Replication (II)

Client-centric Consistency Models


Guarantees for a single client: how to hide inconsistencies from a client?
Assumes a data store where concurrent conflicting updates are rare
and relatively easy to resolve

Examples:
DNS
Single naming authority per zone; lazy propagation of updates

WWW
No write-write conflicts; usually acceptable to serve slightly out-of-date pages from a cache

Bayou (Terry et al 1994)


2

Eventual Consistency
The principle of a mobile user accessing different replicas of a distributed database.

If no updates take place for some time, all replicas gradually converge to a consistent state
3

Alternative client-centric models


xi[t]: version of object x at local copy Li at time t
the result of the series of writes performed at Li since system initialization
WS(xi[t]): that series of writes
WS(xi[t1]; xj[t2]): the writes of WS(xi[t1]) that have also been performed at copy Lj by the later time t2

Assume an owner for each data item


avoid write-write conflicts

Monotonic reads
Monotonic writes
Read-your-writes
Writes-follow-reads


4

Monotonic Reads
WS(x1) is part of WS(x2): if a process has seen a value of x at time t, it will never see an older value at a later time.
Example: replicated mailboxes with on-demand propagation of updates

The read operations performed by a single process P at two different local copies of the same data store.
a) A monotonic-read consistent data store.
b) A data store that does not provide monotonic reads.
5

Monotonic Writes
If an update is made to a copy, all preceding updates by the same process must have been completed first.
A write may affect only part of the state of a data item

FIFO propagation of updates by each process
Example: updating a software library

No guarantee that x at L2 has the same value as x at L1 at the time W(x1) completed

The write operations performed by a single process P at two different local copies of the same data store.
a) A monotonic-write consistent data store.
b) A data store that does not provide monotonic-write consistency.
6

Read Your Writes


A write is completed before a successive read, no matter where the read takes place

Negative examples:
- updates of Web pages
- changes of passwords


The effects of the previous write at L1 have not yet been propagated !

a) A data store that provides read-your-writes consistency.
b) A data store that does not.
7

Writes Follow Reads


Any successive write will be performed on a copy that is up-to-date with the value most recently read by the process.

Example: - updates of a newsgroup:


Responses are visible only after the original posting has been received

a) A writes-follow-reads consistent data store.
b) A data store that does not provide writes-follow-reads consistency.
8

Implementing client-centric models (I)


Globally unique ID per write operation
Assigned by the initiating server

Per-client state:
Read set
Write IDs relevant to the client's read operations

Write set
IDs of writes performed by client

Major performance issue:


Size of read/write sets ?
9

Implementing client-centric models (II)


Monotonic read:
When a client issues a read, the server is given the client's read set to check whether all the identified writes have taken place locally
If not, the server contacts other servers to ensure that it is brought up-to-date

After the read, the client's read set is updated with the server's relevant writes

Monotonic write:
When a client issues a write, the server is given the client's write set
to ensure that all specified writes have been applied (in order)

The write operation's ID is appended to the client's write set
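These two checks can be sketched in a few lines of Python. This is only an illustration of the read-set/write-set bookkeeping described above, not Bayou's actual protocol; the Server, Client and _catch_up names are assumptions, and the ordering of writes is ignored for brevity.

    class Client:
        def __init__(self):
            self.read_set = set()    # IDs of writes relevant to this client's reads
            self.write_set = set()   # IDs of writes issued by this client

    class Server:
        def __init__(self, peers=()):
            self.peers = list(peers)   # the other replicas of the data store
            self.applied = {}          # write_id -> (item, value), writes applied here
            self.store = {}            # item -> current value at this replica

        def _catch_up(self, needed_ids):
            # Pull any writes in needed_ids that this replica has not applied yet.
            for peer in self.peers:
                for wid in needed_ids - self.applied.keys():
                    if wid in peer.applied:
                        item, value = peer.applied[wid]
                        self.applied[wid] = (item, value)
                        self.store[item] = value

        def read(self, client, item):
            # Monotonic reads: all writes whose results the client has already seen
            # must be applied here before the read is served.
            self._catch_up(client.read_set)
            value = self.store.get(item)
            # Update the read set with this server's writes relevant to the item.
            client.read_set |= {w for w, (it, _) in self.applied.items() if it == item}
            return value

        def write(self, client, item, value, wid):
            # Monotonic writes: the client's earlier writes must be applied here first.
            self._catch_up(client.write_set)
            self.applied[wid] = (item, value)
            self.store[item] = value
            client.write_set.add(wid)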


10

Implementing client-centric models (III)


Read-your-writes:
Before serving a read request, the server fetches (from other servers) all writes in the client's write set

Writes-follow-reads:
The server is first brought up-to-date with the writes in the client's read set
After the write, the new ID is added to the client's write set, along with the IDs in the read set,
as these have become relevant for the write just performed
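Continuing the illustrative sketch from the previous slide (same assumed Server and Client classes), these two guarantees only change which per-client set the server synchronises on or extends:

    class SessionServer(Server):
        def read(self, client, item):
            # Read-your-writes: fetch all writes in the client's write set first.
            self._catch_up(client.write_set)
            return super().read(client, item)

        def write(self, client, item, value, wid):
            # Writes-follow-reads: bring this copy up to date with the writes in
            # the client's read set before performing the write ...
            self._catch_up(client.read_set)
            super().write(client, item, value, wid)
            # ... and those reads become relevant to the write just performed.
            client.write_set |= client.read_set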
11

Implementing client-centric models (IV)


Grouping a client's read and write operations into sessions
A session is typically associated with an application
but may also be associated with an application that can be temporarily shut down (e.g. an email agent)

What if the client never closes a session ?

How to represent the read & write sets ?


List of IDs for write operations
Not all of these are actually needed !!
12

Implementing client-centric models (V)


Using vector timestamps for improving efficiency:
When server Si accepts a write operation, it assigns to it a globally unique WID and a timestamp ts(WID)
Each server Si maintains a vector RCVD(i):
RCVD(i)[j] := timestamp of the latest write initiated at server Sj that has been received & processed at Si
A server returns its current vector timestamp with its responses to read/write requests
The client adjusts the timestamp for its own read/write set

13

Implementing client-centric models (VI)


Efficient representation of read/write set A:
VT(A): vector timestamp
VT(A)[i] := max. timestamp of all operations in A that were initiated at server Si

Union of 2 sets of write IDs:


VT(A+B)[i] := max{ VT(A)[i], VT(B)[i] }

Efficient way to check if A is contained in B:


VT(A)[i] <= VT(B)[i] for all i
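A small Python illustration of this representation (function names are assumptions): a read/write set is kept as a map from server ID to the highest timestamp seen, union is an element-wise maximum, and containment is an element-wise comparison.

    def vt_union(vt_a, vt_b):
        # VT(A+B)[i] = max{ VT(A)[i], VT(B)[i] }, element-wise over all servers.
        servers = vt_a.keys() | vt_b.keys()
        return {s: max(vt_a.get(s, 0), vt_b.get(s, 0)) for s in servers}

    def vt_contains(vt_a, vt_b):
        # A is contained in B iff VT(A)[i] <= VT(B)[i] for every server i.
        return all(ts <= vt_b.get(s, 0) for s, ts in vt_a.items())

    # Example: after a read, the client folds the server's RCVD vector into its
    # read set, then checks whether some other server already covers it.
    read_set = {"S1": 4, "S2": 7}
    rcvd_from_server = {"S1": 6, "S2": 7, "S3": 2}
    read_set = vt_union(read_set, rcvd_from_server)              # {'S1': 6, 'S2': 7, 'S3': 2}
    print(vt_contains(read_set, {"S1": 6, "S2": 9, "S3": 2}))    # True: server is up to date
    print(vt_contains(read_set, {"S1": 6, "S2": 5, "S3": 2}))    # False: S2's writes are missing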

14

Replica Placement (I)

The logical organization of different kinds of copies of a data store into three concentric rings.
15

Replica Placement (II)


Permanent copies
Basis of distributed data store
Example from the Web:
Anycasting & round-robin clusters
Mirror sites

Server-initiated
Push caches
Dynamic replication to handle bursts
Read-only

Content Distribution Network (CDN)

Client-initiated
Improve access time to data
Danger of stale data

Private vs Shared caches


16

Server-Initiated Replicas
Counting access requests from different clients.
P := closest server for both C1 & C2

cntQ(P, F): count kept at server Q of accesses to file F issued by clients for which P is the closest server

At each server:
count of accesses for each file and the originating clients
a routing DB to determine the closest server for a client C
Deletion threshold: del(S, F)
Replication threshold: rep(S, F)

Dynamic decisions to delete/migrate/replicate file F to server S


Extra care to ensure that at least one copy remains !
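A rough sketch of the resulting decision rule (the names, thresholds and the migration heuristic are illustrative assumptions, not the exact algorithm):

    def placement_decision(total_count, count_from_P, del_threshold, rep_threshold,
                           other_copies_exist):
        # total_count:  accesses for file F counted at this server
        # count_from_P: accesses for F from clients whose closest server is P
        if total_count < del_threshold:
            # Demand is too low: drop the copy, but never delete the last one.
            return "delete local copy" if other_copies_exist else "keep (last copy)"
        if count_from_P > rep_threshold:
            # Enough demand near P to justify giving P its own replica.
            return "replicate F to server P"
        if count_from_P > total_count / 2:
            # In-between case: consider migrating the copy towards P instead.
            return "consider migrating F to server P"
        return "keep"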
17

Update propagation
State vs Operations
Notification of an update (invalidation protocols)
Best for a low read-to-write ratio

Transfer data from one copy to another


Transfer of the actual data or a log of changes; batching possible
Best for a relatively high read-to-write ratio

Propagate the update to other copies


Active replication

Pull vs Push
Push: replicas maintain a high degree of consistency
Updates are expected to be of use to multiple readers

Pull: best for a low read-to-write ratio
Hybrid scheme based on the lease model

Unicast vs Multicast
Push: multicast group
Pull: a single server or a client requests an update
18

Leases
A promise by a server that it will push updates for a specified time period
After expiration, client has to pull for updates

Alternatives:
Age-based leases
Depending on the last time an item was modified
Long-lasting leases for items that are expected to remain unmodified

Renewal frequency-based leases


Short-term leases for clients that only occasionally ask for a specific item

Leases based on state-space overhead at the server:


Lower expiration times as the server approaches overload
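A possible lease-duration policy combining the three criteria could look like the following sketch (all constants and the load model are illustrative assumptions):

    def lease_duration(seconds_since_last_update, renewals_per_hour, server_load,
                       base=60, max_lease=3600):
        # Age-based: items that have not been modified for a long time get long leases.
        duration = min(max_lease, base + seconds_since_last_update / 10)
        # Renewal-frequency-based: clients that rarely ask for the item get short leases.
        if renewals_per_hour < 1:
            duration = min(duration, base)
        # State-space overhead: shorten leases as the server approaches overload.
        duration *= max(0.1, 1.0 - server_load)   # server_load assumed in [0, 1]
        return duration

    print(lease_duration(seconds_since_last_update=86400, renewals_per_hour=10,
                         server_load=0.2))   # a stable, frequently renewed item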
19

Pull versus Push Protocols


Stateful server: keeps track of all caches
Issue                     Push-based                                  Pull-based
State of server           List of client replicas and caches          None
Messages sent             Update (and possibly fetch update later)    Poll and update
Response time at client   Immediate (or fetch-update time)            Fetch-update time

Comparison between push-based & pull-based protocols in the case of multiple client, single server systems.
20

Remote-Write Protocols (I)

Primary-based remote-write protocol with a fixed server to which all read & write operations are forwarded.
21

Remote-Write Protocols (II)

The principle of primary-backup protocol.


22

Primary-backup protocols
Blocking updates
straightforward implementation of sequential consistency
The primary orders all updates
Processes see the effects of their most recent write

Non-blocking updates
reduce blocking delay for the process that initiated the update
The process only waits until the primary's ACK

Fault tolerance ?
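The difference between the two variants can be sketched as follows (assumed names; agreement and error handling omitted). The non-blocking variant is faster but raises the fault-tolerance question above: an acknowledged update can be lost if the primary crashes before the backups are updated.

    import threading

    def blocking_update(primary, backups, item, value):
        # The primary applies the update, forwards it to every backup and waits for
        # their acknowledgements before acknowledging the client (sequential consistency).
        primary.apply(item, value)
        for backup in backups:
            backup.apply(item, value)       # wait for each backup in turn
        return "ACK"

    def non_blocking_update(primary, backups, item, value):
        # The primary acknowledges as soon as its own copy is updated; backups are
        # brought up to date in the background.
        primary.apply(item, value)
        threading.Thread(target=lambda: [b.apply(item, value) for b in backups],
                         daemon=True).start()
        return "ACK"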
23

Local-Write Protocols (I)

Keeping track of each data item's current location?
Primary-based local-write protocol in which a single copy is migrated between processes.

24

Local-Write Protocols (II)

Suitable for disconnected operation.
Primary-backup protocol in which the primary migrates to the process wanting to perform an update.
25

Active Replication (I)

The problem of replicated invocations.


26

Active Replication (II)

(a) Forwarding an invocation request from a replicated object. (b) Returning a reply to a replicated object.
27

Gifford's quorum scheme (I)


Version numbers or timestamps per copy
A number of votes is assigned to each physical copy
weight related to the demand for a particular copy
totV(g): number of votes held by a set g of RMs; totV: total number of votes over all copies

Obtain quorum before read/write:


R votes before a read, W votes before a write
W > totV/2: no write-write conflicts
R + W > totV: no read-write conflicts

Any quorum pair must contain common copies


In case of a partition, it is not possible to perform conflicting operations on the same copy
28
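A minimal check of these two constraints (an illustrative Python sketch, not part of Gifford's paper):

    def valid_configuration(votes, R, W):
        totV = sum(votes)                      # total votes held by the group of RMs
        no_write_write_conflicts = W > totV / 2
        no_read_write_conflicts = R + W > totV
        return no_write_write_conflicts and no_read_write_conflicts

    print(valid_configuration([2, 1, 1], R=2, W=3))   # True  (Example 2 a few slides below)
    print(valid_configuration([1, 1, 1], R=1, W=2))   # False (a read quorum could miss the write)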

Gifford's quorum scheme (II)


Read:
Version number inquiries to find set (g) of RMs
totV(g) >= R

Not all copies need to be up-to-date


Every read quorum contains at least one current copy

Write:
Version number inquiries to find set (g) of RMs
totV(g) >= W, counting only up-to-date copies
If there are insufficient up-to-date copies, bring a non-current copy up to date with a copy of a current one

Groups of RMs can be configured to provide different performance/reliability characteristics


Decrease W to improve writes
Decrease R to improve reads
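A sketch of the read path under these rules (function and data-structure names are assumptions):

    def read_with_quorum(replicas, R):
        # replicas: iterable of (votes, version, value) for the reachable copies.
        quorum, collected = [], 0
        for votes, version, value in replicas:     # version-number inquiries
            quorum.append((version, value))
            collected += votes
            if collected >= R:
                break
        else:
            raise RuntimeError("read quorum not obtainable (request blocks)")
        # R + W > totV guarantees the quorum contains at least one current copy,
        # so the highest version number in the quorum identifies an up-to-date value.
        return max(quorum, key=lambda pair: pair[0])[1]

    print(read_with_quorum([(1, 7, "new"), (1, 6, "old"), (1, 7, "new")], R=2))   # "new"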
29

Gifford's quorum scheme (III)


Performance penalty for reads
Due to the need for collecting a read quorum

Support for copies on local disks of clients


Assigned zero votes - weak representatives
These copies cannot be included in a quorum

After obtaining a read quorum, a read may be carried out on the local copy if it is up-to-date

Blocking probability:
In some cases, a quorum cannot be obtained

30

Gifford's quorum scheme (IV)


                          Example 1   Example 2   Example 3
Latency (ms)   Replica 1      75          75          75
               Replica 2      65         100         750
               Replica 3      65         750         750
Voting         Replica 1       1           2           1
configuration  Replica 2       0           1           1
               Replica 3       0           1           1
Quorum sizes   R               1           2           1
               W               1           3           3

Ex1: file with a high read-to-write ratio
Ex2: file with a moderate read-to-write ratio
  Reads can be satisfied by the local RM, but writes must also access one remote RM
Ex3: file with a very high read-to-write ratio

Derived performance of the file suite:

                  Read                        Write
            Latency   Blocking prob.    Latency   Blocking prob.
Example 1      65        0.01              75        0.01
Example 2      75        0.0002           100        0.0101
Example 3      75        0.000001         750        0.03

Examples assume 99% availability for each RM
31
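The blocking probabilities in the table can be reproduced by enumerating which replicas are up, assuming each replica is independently available 99% of the time (a small Python check, not from the original slides):

    from itertools import product

    def blocking_probability(votes, quorum, availability=0.99):
        blocked = 0.0
        for up in product([True, False], repeat=len(votes)):   # every up/down pattern
            p = 1.0
            for is_up in up:
                p *= availability if is_up else (1 - availability)
            if sum(v for v, is_up in zip(votes, up) if is_up) < quorum:
                blocked += p                                    # quorum not reachable
        return blocked

    # Example 2: votes (2, 1, 1), R = 2, W = 3
    print(round(blocking_probability([2, 1, 1], 2), 4))   # 0.0002  (read)
    print(round(blocking_probability([2, 1, 1], 3), 4))   # 0.0101  (write)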

Quorum-Based Protocols

Three examples of the voting algorithm:
a) A correct choice of read & write set
b) A choice that may lead to write-write conflicts
c) A correct choice, known as ROWA (read one, write all)
32

Transactions with Replicated Data


Better performance
Concurrent service
Reduced latency

Higher availability
Fault tolerance


What if a replica fails or becomes isolated ?
Upon rejoining, it must catch up

Replicated transaction service


Data replicated at a set of replica managers

Replication transparency
One-copy serializability; read one, write all
Failures must be observed to have happened before any active transactions at other servers
33

Network Partitions
Separate but viable groups of servers
Optimistic schemes: validate on recovery
Available copies with validation

Pessimistic schemes: limit availability until recovery


Figure: a network partition separates the replica managers holding copies of B; withdraw(B) is performed on a copy in one partition while deposit(B) is performed on a copy in the other.
34

Fault Tolerance
Design to recover after a failure with no loss of (committed) data.
Designs for fault tolerance:
Single server that fails and recovers
Primary server with trailing backups
Replicated service

35

Fault Tolerance = ?
Define correctness criteria. When two replicas are separated by a network partition:
Both are deemed incorrect & stop serving.
One (the master) continues & the other ceases service.
One (the master) continues to accept updates & both continue to supply reads (of possibly stale data).
Both continue service & subsequently synchronise.

36

Passive Replication (I)


At any time, the system has a single primary RM and one or more secondary (backup) RMs
Front ends communicate with the primary; the primary executes requests and sends the response/update to all backups
If the primary fails, one backup is promoted to primary
The new primary starts from the coordination phase for each new request
What happens if the primary crashes before/during/after the agreement phase?

37

Passive Replication (II)


Figure: a client C issues requests through front ends (FE) to the primary RM, which propagates updates to the backup RMs.

38

Passive replication (III)


Satisfies linearizability
Front end: looks up the new primary when the current primary does not respond
The primary RM is a performance bottleneck
Can tolerate F failures with F+1 RMs
A variation: clients can access backup RMs (linearizability is lost, but clients get sequential consistency)
Sun NIS (yellow pages) uses passive replication: clients can contact the primary or backup servers for reads, but only the primary server for updates
39

Active replication (I)


RMs are state machines with equivalent roles
Front ends communicate the client requests to the RM group using totally ordered reliable multicast
RMs process requests independently & reply to the front end (correct RMs process each request identically)
The front end can synthesize the final response to the client (tolerating Byzantine failures)
Active replication provides sequential consistency if the multicast is reliable & ordered
Byzantine failures (F out of 2F+1): the front end waits until it gets F+1 identical responses
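The front end's vote-collection step for the Byzantine case can be sketched as follows (assumed names; the totally ordered multicast itself is not shown):

    from collections import Counter

    def collect_response(replies, F):
        # replies: responses from the RMs in arrival order; with 2F+1 replicas,
        # F+1 identical replies cannot all come from faulty RMs.
        counts = Counter()
        for reply in replies:
            counts[reply] += 1
            if counts[reply] >= F + 1:
                return reply
        raise RuntimeError("fewer than F+1 matching replies")

    # Three RMs (F = 1), one of which returns a corrupted balance:
    print(collect_response([100, 42, 100], F=1))   # 100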
40

Active replication (II)


Figure: front ends (FE) multicast each client request to all RMs and collect their replies.

41

Replication Architectures
How many replicas are required? All or a majority?
Forward all updates as soon as they are received; two-phase commit protocol.
The contacted replica acts as coordinator
What if one of the replicas isn't available?

Figure: transaction T performs getBalance(A) and deposit(B) against the replica managers holding copies of A and B.
42

Primary copy replication

Available Copies Replication


Not all copies will always be available. Failures:
Timeout at a failed replica
Rejected by a recovering, unsynchronised replica

Figure: transaction T performs getBalance(A) and deposit(B), and transaction U performs getBalance(B) and deposit(A), against replica managers X and Y (holding copies of A) and M, N and P (holding copies of B).
43

Local Validation
Failure & recovery events do not occur during a Tx. Example:
T reads A before server X's failure, therefore: T -> fail(X)
T observes server N's failure when it writes B, therefore: fail(N) -> T
So for T:  fail(N) -> T.getBalance(A), T.deposit(B) -> fail(X)
Similarly for U:  fail(X) -> U.getBalance(B), U.deposit(A) -> fail(N)
Server X fails, followed by transaction U, followed by server N's failure, followed by transaction T, followed by server X's failure: this ordering is cyclic, hence inconsistent, so the transactions cannot both be allowed to commit.
Failure and recovery must be serialised just like a Tx: They occur before or after a Tx, but not during.
44
