
Distributed Systems

Consistency & Replication (II)

Client-centric Consistency Models


Guarantees for a single client: how to hide inconsistencies from a client?
Assumes a data store where concurrent conflicting updates are rare
and relatively easy to resolve

Examples:
DNS
Single naming authority per zone; lazy propagation of updates

WWW
No write-write conflicts; usually acceptable to serve slightly out-of-date pages from a cache

Bayou (Terry et al 1994)


2

Eventual Consistency
The principle of a mobile user accessing different replicas of a distributed database.

If no updates take place for some time, all replicas gradually converge to a consistent state
3

Alternative client-centric models


xi[t]: version of object x at local copy Li at time t
the result of the series of writes performed at Li since system initialization
WS(xi[t]): that series of writes
WS(xi[t1]; xj[t2]): the writes of WS(xi[t1]) that have also been performed at copy Lj by the later time t2

Assume an owner for each data item


avoid write-write conflicts

Monotonic reads
Monotonic writes
Read-your-writes
Writes-follow-reads


4

Monotonic Reads
WS(x1) is part of WS(x2): if a process has seen a value of x at time t, it will never see an older value at a later time.
Example: replicated mailboxes with on-demand propagation of updates

The read operations performed by a single process P at two different local copies of the same data store.
a) A monotonic-read consistent data store.
b) A data store that does not provide monotonic reads.
5

Monotonic Writes
If an update is made to a copy, all preceding updates by the same process must have been completed first.
A write may affect only part of the state of a data item

FIFO propagation of updates by each process
Example: updating a software library

No guarantee that x at L2 has the same value as x at L1 at the time W(x1) completed

The write operations performed by a single process P at two different local copies of the same data store.
a) A monotonic-write consistent data store.
b) A data store that does not provide monotonic-write consistency.
6

Read Your Writes


A write is completed before a successive read, no matter where the read takes place

Negative examples:
- updates of Web pages
- changes of passwords


The effects of the previous write at L1 have not yet been propagated !

a) A data store that provides read-your-writes consistency.
b) A data store that does not.
7

Writes Follow Reads


Any successive write will be performed on a copy that is up-to-date with the value most recently read by the process.

Example: - updates of a newsgroup:


Responses are visible only after the original posting has been received

a) A writes-follow-reads consistent data store.
b) A data store that does not provide writes-follow-reads consistency.
8

Implementing client-centric models (I)


Globally unique ID per write operation
Assigned by the initiating server

Per-client state:
Read set
Write IDs relevant to the client's read operations

Write set
IDs of writes performed by client

Major performance issue:


Size of read/write sets ?
9

Implementing client-centric models (II)


Monotonic read:
When a client issues a read, the server is given the client's read set to check whether all the identified writes have taken place locally
If not, the server contacts other servers to ensure that it is brought up-to-date

After the read, the client's read set is updated with the server's relevant writes

Monotonic write:
When a client issues a write, the server is given the client's write set
to ensure that all specified writes have been applied (in order)

The write operation's ID is appended to the client's write set
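These two checks can be sketched in a few lines of Python. This is only an illustration of the read-set/write-set bookkeeping described above, not Bayou's actual protocol; the Server, Client and _catch_up names are assumptions, and the ordering of writes is ignored for brevity.

    class Client:
        def __init__(self):
            self.read_set = set()    # IDs of writes relevant to this client's reads
            self.write_set = set()   # IDs of writes issued by this client

    class Server:
        def __init__(self, peers=()):
            self.peers = list(peers)   # the other replicas of the data store
            self.applied = {}          # write_id -> (item, value), writes applied here
            self.store = {}            # item -> current value at this replica

        def _catch_up(self, needed_ids):
            # Pull any writes in needed_ids that this replica has not applied yet.
            for peer in self.peers:
                for wid in needed_ids - self.applied.keys():
                    if wid in peer.applied:
                        item, value = peer.applied[wid]
                        self.applied[wid] = (item, value)
                        self.store[item] = value

        def read(self, client, item):
            # Monotonic reads: all writes whose results the client has already seen
            # must be applied here before the read is served.
            self._catch_up(client.read_set)
            value = self.store.get(item)
            # Update the read set with this server's writes relevant to the item.
            client.read_set |= {w for w, (it, _) in self.applied.items() if it == item}
            return value

        def write(self, client, item, value, wid):
            # Monotonic writes: the client's earlier writes must be applied here first.
            self._catch_up(client.write_set)
            self.applied[wid] = (item, value)
            self.store[item] = value
            client.write_set.add(wid)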


10

Implementing client-centric models (III)


Read-your-writes:
Before serving a read request, the server fetches (from other servers) all writes in the client's write set

Writes-follow-reads:
The server is first brought up-to-date with the writes in the client's read set
After the write, the new ID is added to the client's write set, along with the IDs in the read set,
as these have become relevant for the write just performed
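Continuing the illustrative sketch from the previous slide (same assumed Server and Client classes), these two guarantees only change which per-client set the server synchronises on or extends:

    class SessionServer(Server):
        def read(self, client, item):
            # Read-your-writes: fetch all writes in the client's write set first.
            self._catch_up(client.write_set)
            return super().read(client, item)

        def write(self, client, item, value, wid):
            # Writes-follow-reads: bring this copy up to date with the writes in
            # the client's read set before performing the write ...
            self._catch_up(client.read_set)
            super().write(client, item, value, wid)
            # ... and those reads become relevant to the write just performed.
            client.write_set |= client.read_set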
11

Implementing client-centric models (IV)


Grouping a client's read and write operations into sessions
A session is typically associated with an application
but may also be associated with an application that can be temporarily shut down (e.g. an email agent)

What if the client never closes a session ?

How to represent the read & write sets ?


List of IDs for write operations
Not all of these are actually needed !!
12

Implementing client-centric models (V)


Using vector timestamps for improving efficiency:
When server Si accepts a write operation, it assigns to it a globally unique WID and a timestamp ts(WID)
Each server Si maintains a vector RCVD(i):
RCVD(i)[j] := timestamp of the latest write initiated at server Sj that has been received & processed at Si
A server returns its current vector timestamp with its responses to read/write requests
The client adjusts the timestamp for its own read/write set

13

Implementing client-centric models (VI)


Efficient representation of read/write set A:
VT(A): vector timestamp
VT(A)[i] := max. timestamp of all operations in A that were initiated at server Si

Union of 2 sets of write IDs:


VT(A+B)[i] := max{ VT(A)[i], VT(B)[i] }

Efficient way to check if A is contained in B:


VT(A)[i] <= VT(B)[i] for all i
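A small Python illustration of this representation (function names are assumptions): a read/write set is kept as a map from server ID to the highest timestamp seen, union is an element-wise maximum, and containment is an element-wise comparison.

    def vt_union(vt_a, vt_b):
        # VT(A+B)[i] = max{ VT(A)[i], VT(B)[i] }, element-wise over all servers.
        servers = vt_a.keys() | vt_b.keys()
        return {s: max(vt_a.get(s, 0), vt_b.get(s, 0)) for s in servers}

    def vt_contains(vt_a, vt_b):
        # A is contained in B iff VT(A)[i] <= VT(B)[i] for every server i.
        return all(ts <= vt_b.get(s, 0) for s, ts in vt_a.items())

    # Example: after a read, the client folds the server's RCVD vector into its
    # read set, then checks whether some other server already covers it.
    read_set = {"S1": 4, "S2": 7}
    rcvd_from_server = {"S1": 6, "S2": 7, "S3": 2}
    read_set = vt_union(read_set, rcvd_from_server)              # {'S1': 6, 'S2': 7, 'S3': 2}
    print(vt_contains(read_set, {"S1": 6, "S2": 9, "S3": 2}))    # True: server is up to date
    print(vt_contains(read_set, {"S1": 6, "S2": 5, "S3": 2}))    # False: S2's writes are missing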

14

Replica Placement (I)

The logical organization of different kinds of copies of a data store into three concentric rings.
15

Replica Placement (II)


Permanent copies
Basis of distributed data store
Example from the Web:
Anycasting & round-robin clusters
Mirror sites

Server-initiated
Push caches
Dynamic replication to handle bursts
Read-only

Content Distribution Network (CDN)

Client-initiated
Improve access time to data
Danger of stale data

Private vs Shared caches


16

Server-Initiated Replicas
Counting access requests from different clients.
P := closest server for both C1 & C2

cntQ(P, F): count kept at server Q of accesses to file F issued by clients for which P is the closest server

At each server:
count of accesses for each file and the originating clients
a routing DB to determine the closest server for a client C
Deletion threshold: del(S, F)
Replication threshold: rep(S, F)

Dynamic decisions to delete/migrate/replicate file F to server S


Extra care to ensure that at least one copy remains !
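A rough sketch of the resulting decision rule (the names, thresholds and the migration heuristic are illustrative assumptions, not the exact algorithm):

    def placement_decision(total_count, count_from_P, del_threshold, rep_threshold,
                           other_copies_exist):
        # total_count:  accesses for file F counted at this server
        # count_from_P: accesses for F from clients whose closest server is P
        if total_count < del_threshold:
            # Demand is too low: drop the copy, but never delete the last one.
            return "delete local copy" if other_copies_exist else "keep (last copy)"
        if count_from_P > rep_threshold:
            # Enough demand near P to justify giving P its own replica.
            return "replicate F to server P"
        if count_from_P > total_count / 2:
            # In-between case: consider migrating the copy towards P instead.
            return "consider migrating F to server P"
        return "keep"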
17

Update propagation
State vs Operations
Notification of an update (invalidation protocols)
Best for a low read-to-write ratio

Transfer data from one copy to another


Transfer of the actual data or a log of changes; batching possible
Best for a relatively high read-to-write ratio

Propagate the update to other copies


Active replication

Pull vs Push
Push: replicas maintain a high degree of consistency
Updates are expected to be of use to multiple readers

Pull: best for a low read-to-write ratio
Hybrid scheme based on the lease model

Unicast vs Multicast
Push: multicast group
Pull: a single server or a client requests an update
18

Leases
A promise by a server that it will push updates for a specified time period
After expiration, client has to pull for updates

Alternatives:
Age-based leases
Depending on the last time an item was modified
Long-lasting leases for items that are expected to remain unmodified

Renewal frequency-based leases


Short-term leases for clients that only occasionally ask for a specific item

Leases based on state-space overhead at the server:


Lower expiration times as the server approaches overload
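A possible lease-duration policy combining the three criteria could look like the following sketch (all constants and the load model are illustrative assumptions):

    def lease_duration(seconds_since_last_update, renewals_per_hour, server_load,
                       base=60, max_lease=3600):
        # Age-based: items that have not been modified for a long time get long leases.
        duration = min(max_lease, base + seconds_since_last_update / 10)
        # Renewal-frequency-based: clients that rarely ask for the item get short leases.
        if renewals_per_hour < 1:
            duration = min(duration, base)
        # State-space overhead: shorten leases as the server approaches overload.
        duration *= max(0.1, 1.0 - server_load)   # server_load assumed in [0, 1]
        return duration

    print(lease_duration(seconds_since_last_update=86400, renewals_per_hour=10,
                         server_load=0.2))   # a stable, frequently renewed item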
19

Pull versus Push Protocols


Stateful server: keeps track of all caches
Issue                     Push-based                                  Pull-based
State of server           List of client replicas and caches          None
Messages sent             Update (and possibly fetch update later)    Poll and update
Response time at client   Immediate (or fetch-update time)            Fetch-update time

Comparison between push-based & pull-based protocols in the case of multiple client, single server systems.
20

Remote-Write Protocols (I)

Primary-based remote-write protocol with a fixed server to which all read & write operations are forwarded.
21

Remote-Write Protocols (II)

The principle of primary-backup protocol.


22

Primary-backup protocols
Blocking updates
straightforward implementation of sequential consistency
The primary orders all updates
Processes see the effects of their most recent write

Non-blocking updates
reduce blocking delay for the process that initiated the update
The process only waits until the primary's ACK

Fault tolerance ?
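The difference between the two variants can be sketched as follows (assumed names; agreement and error handling omitted). The non-blocking variant is faster but raises the fault-tolerance question above: an acknowledged update can be lost if the primary crashes before the backups are updated.

    import threading

    def blocking_update(primary, backups, item, value):
        # The primary applies the update, forwards it to every backup and waits for
        # their acknowledgements before acknowledging the client (sequential consistency).
        primary.apply(item, value)
        for backup in backups:
            backup.apply(item, value)       # wait for each backup in turn
        return "ACK"

    def non_blocking_update(primary, backups, item, value):
        # The primary acknowledges as soon as its own copy is updated; backups are
        # brought up to date in the background.
        primary.apply(item, value)
        threading.Thread(target=lambda: [b.apply(item, value) for b in backups],
                         daemon=True).start()
        return "ACK"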
23

Local-Write Protocols (I)

Keeping track of each data item's current location?
Primary-based local-write protocol in which a single copy is migrated between processes.

24

Local-Write Protocols (II)

Suitable for disconnected operation.
Primary-backup protocol in which the primary migrates to the process wanting to perform an update.
25

Active Replication (I)

The problem of replicated invocations.


26

Active Replication (II)

(a) Forwarding an invocation request from a replicated object. (b) Returning a reply to a replicated object.
27

Gifford's quorum scheme (I)


Version numbers or timestamps per copy
A number of votes is assigned to each physical copy
weight related to the demand for a particular copy
totV(g): number of votes held by a set g of RMs; totV: total number of votes over all copies

Obtain quorum before read/write:


R votes before a read, W votes before a write
W > totV/2: no write-write conflicts
R + W > totV: no read-write conflicts

Any quorum pair must contain common copies


In case of a partition, it is not possible to perform conflicting operations on the same copy
28
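A minimal check of these two constraints (an illustrative Python sketch, not part of Gifford's paper):

    def valid_configuration(votes, R, W):
        totV = sum(votes)                      # total votes held by the group of RMs
        no_write_write_conflicts = W > totV / 2
        no_read_write_conflicts = R + W > totV
        return no_write_write_conflicts and no_read_write_conflicts

    print(valid_configuration([2, 1, 1], R=2, W=3))   # True  (Example 2 a few slides below)
    print(valid_configuration([1, 1, 1], R=1, W=2))   # False (a read quorum could miss the write)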

Gifford's quorum scheme (II)


Read:
Version number inquiries to find set (g) of RMs
totV(g) >= R

Not all copies need to be up-to-date


Every read quorum contains at least one current copy

Write:
Version number inquiries to find set (g) of RMs
totV(g) >= W, counting only up-to-date copies
If there are insufficient up-to-date copies, bring a non-current copy up to date with a copy of a current one

Groups of RMs can be configured to provide different performance/reliability characteristics


Decrease W to improve writes
Decrease R to improve reads
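A sketch of the read path under these rules (function and data-structure names are assumptions):

    def read_with_quorum(replicas, R):
        # replicas: iterable of (votes, version, value) for the reachable copies.
        quorum, collected = [], 0
        for votes, version, value in replicas:     # version-number inquiries
            quorum.append((version, value))
            collected += votes
            if collected >= R:
                break
        else:
            raise RuntimeError("read quorum not obtainable (request blocks)")
        # R + W > totV guarantees the quorum contains at least one current copy,
        # so the highest version number in the quorum identifies an up-to-date value.
        return max(quorum, key=lambda pair: pair[0])[1]

    print(read_with_quorum([(1, 7, "new"), (1, 6, "old"), (1, 7, "new")], R=2))   # "new"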
29

Gifford's quorum scheme (III)


Performance penalty for reads
Due to the need for collecting a read quorum

Support for copies on local disks of clients


Assigned zero votes - weak representatives
These copies cannot be included in a quorum

After obtaining a read quorum, a read may be carried out on the local copy if it is up-to-date

Blocking probability:
In some cases, a quorum cannot be obtained

30

Gifford's quorum scheme (IV)


                          Example 1   Example 2   Example 3
Latency (ms)   Replica 1      75          75          75
               Replica 2      65         100         750
               Replica 3      65         750         750
Voting         Replica 1       1           2           1
configuration  Replica 2       0           1           1
               Replica 3       0           1           1
Quorum sizes   R               1           2           1
               W               1           3           3

Ex1: file with a high read-to-write ratio
Ex2: file with a moderate read-to-write ratio
  Reads can be satisfied by the local RM, but writes must also access one remote RM
Ex3: file with a very high read-to-write ratio

Derived performance of the file suite:

                  Read                        Write
            Latency   Blocking prob.    Latency   Blocking prob.
Example 1      65        0.01              75        0.01
Example 2      75        0.0002           100        0.0101
Example 3      75        0.000001         750        0.03

Examples assume 99% availability for each RM
31
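The blocking probabilities in the table can be reproduced by enumerating which replicas are up, assuming each replica is independently available 99% of the time (a small Python check, not from the original slides):

    from itertools import product

    def blocking_probability(votes, quorum, availability=0.99):
        blocked = 0.0
        for up in product([True, False], repeat=len(votes)):   # every up/down pattern
            p = 1.0
            for is_up in up:
                p *= availability if is_up else (1 - availability)
            if sum(v for v, is_up in zip(votes, up) if is_up) < quorum:
                blocked += p                                    # quorum not reachable
        return blocked

    # Example 2: votes (2, 1, 1), R = 2, W = 3
    print(round(blocking_probability([2, 1, 1], 2), 4))   # 0.0002  (read)
    print(round(blocking_probability([2, 1, 1], 3), 4))   # 0.0101  (write)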

Quorum-Based Protocols

Three examples of the voting algorithm:
a) A correct choice of read & write set
b) A choice that may lead to write-write conflicts
c) A correct choice, known as ROWA (read one, write all)
32

Transactions with Replicated Data


Better performance
Concurrent service
Reduced latency

Higher availability
Fault tolerance


What if a replica fails or becomes isolated ?
Upon rejoining, it must catch up

Replicated transaction service


Data replicated at a set of replica managers

Replication transparency
One-copy serializability; read one, write all
Failures must be observed to have happened before any active transactions at other servers
33

Network Partitions
Separate but viable groups of servers
Optimistic schemes: validate on recovery
Available copies with validation

Pessimistic schemes: limit availability until recovery


Figure: a network partition separates the replica managers holding copies of B; withdraw(B) is performed on a copy in one partition while deposit(B) is performed on a copy in the other.
34

Fault Tolerance
Design to recover after a failure with no loss of (committed) data.
Designs for fault tolerance:
Single server that fails and recovers
Primary server with trailing backups
Replicated service

35

Fault Tolerance = ?
Define correctness criteria. When two replicas are separated by a network partition:
Both are deemed incorrect & stop serving.
One (the master) continues & the other ceases service.
One (the master) continues to accept updates & both continue to supply reads (of possibly stale data).
Both continue service & subsequently synchronise.

36

Passive Replication (I)


At any time, the system has a single primary RM and one or more secondary (backup) RMs
Front ends communicate with the primary; the primary executes requests and sends the response/update to all backups
If the primary fails, one backup is promoted to primary
The new primary starts from the coordination phase for each new request
What happens if the primary crashes before/during/after the agreement phase?

37

Passive Replication (II)


Figure: a client C issues requests through front ends (FE) to the primary RM, which propagates updates to the backup RMs.

38

Passive replication (III)


Satisfies linearizability
Front end: looks up the new primary when the current primary does not respond
The primary RM is a performance bottleneck
Can tolerate F failures with F+1 RMs
A variation: clients can access backup RMs (linearizability is lost, but clients get sequential consistency)
Sun NIS (yellow pages) uses passive replication: clients can contact the primary or backup servers for reads, but only the primary server for updates
39

Active replication (I)


RMs are state machines with equivalent roles
Front ends communicate the client requests to the RM group using totally ordered reliable multicast
RMs process requests independently & reply to the front end (correct RMs process each request identically)
The front end can synthesize the final response to the client (tolerating Byzantine failures)
Active replication provides sequential consistency if the multicast is reliable & ordered
Byzantine failures (F out of 2F+1): the front end waits until it gets F+1 identical responses
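The front end's vote-collection step for the Byzantine case can be sketched as follows (assumed names; the totally ordered multicast itself is not shown):

    from collections import Counter

    def collect_response(replies, F):
        # replies: responses from the RMs in arrival order; with 2F+1 replicas,
        # F+1 identical replies cannot all come from faulty RMs.
        counts = Counter()
        for reply in replies:
            counts[reply] += 1
            if counts[reply] >= F + 1:
                return reply
        raise RuntimeError("fewer than F+1 matching replies")

    # Three RMs (F = 1), one of which returns a corrupted balance:
    print(collect_response([100, 42, 100], F=1))   # 100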
40

Active replication (II)


Figure: front ends (FE) multicast each client request to all RMs and collect their replies.

41

Replication Architectures
How many replicas are required? All or a majority?
Forward all updates as soon as they are received; two-phase commit protocol.
The contacted replica acts as coordinator
What if one of the replicas isn't available?

Figure: transaction T performs getBalance(A) and deposit(B) against the replica managers holding copies of A and B.
42

Primary copy replication

Available Copies Replication


Not all copies will always be available. Failures:
Timeout at a failed replica
Rejected by a recovering, unsynchronised replica

Figure: transaction T performs getBalance(A) and deposit(B), and transaction U performs getBalance(B) and deposit(A), against replica managers X and Y (holding copies of A) and M, N and P (holding copies of B).
43

Local Validation
Failure & recovery events do not occur during a Tx. Example:
T reads A before server X's failure, therefore: T -> fail(X)
T observes server N's failure when it writes B, therefore: fail(N) -> T
So for T:  fail(N) -> T.getBalance(A), T.deposit(B) -> fail(X)
Similarly for U:  fail(X) -> U.getBalance(B), U.deposit(A) -> fail(N)
Server X fails, followed by transaction U, followed by server N's failure, followed by transaction T, followed by server X's failure: this ordering is cyclic, hence inconsistent, so the transactions cannot both be allowed to commit.
Failure and recovery must be serialised just like a Tx: They occur before or after a Tx, but not during.
44
