Topic 9
Cache Coherence
The goal is to ensure that READ(X) returns the most recent value of the shared variable X, i.e. all valid copies of a shared variable are identical.
Origin of the problem:
- Sharing of writable data
- Process migration (processes move from one processor to another during their execution)
- I/O activity (reading from and writing to MM)
Sharing of Writable Data (2 processors/cores)
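Before any update, all 3 copies of X (in the two caches and in MM) are the same.
Write Through (cache and MM are the same at all times): suppose P1 writes X; its cache and MM are updated, but P2 still reads the old X from its local cache, so 2 different values exist.
Write Back: P1 writes X only in its own cache, so all 3 copies may be different.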
Process Migration:
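Write Back: a process updates X in P1's cache and migrates to P2 before the block is written back, so it reads the stale X from MM.
Write Through: a process running on P1 updates X and moves to P2, where a stale copy of the corresponding cache block already exists.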
I/O activities:
Write Through: on a read from an I/O device, the MM copy changes but the cache copies remain the same (stale).
Write Back: on a write to an I/O device, the old value from MM is written to the device.
One solution may be to share a local cache between a processor and an IOP (I/O Processor). This simplifies the problem, so that inconsistency can exist only between the MM and cache copies, but performance degrades because the I/O traffic hampers the locality of reference (LOR) of the processors.
Solutions:
Software based or Hardware based
S/W based:
The compiler tags data as cacheable or non-cacheable. Only read-only data is considered cacheable and put in a private cache. All other data is non-cacheable, and can be put in a global cache, if available.
H/W based:
Snooping/snoopy protocol:
- MM and caches are connected by a BUS (broadcast medium)
- Whatever communication takes place, all others can listen (each cache controller constantly monitors the bus)
- 2 main categories: Cache Block Invalidation and Cache Block Update
Cache block Update & Cache block Invalidation protocols:
- In Update, updates are sent over the inter-core/inter-processor BUS (broadcast medium) and all caches holding a matching block perform the update. Subsequent updates to the same variable generate extra traffic.
- In Invalidation, an invalidation signal for the updated cache block is sent over the bus and all caches holding a matching block invalidate it. Subsequent updates to the same variable generate no traffic.
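Write Invalidate: (a) Write-Through policy
W(i), W(j): write to block by Pi, Pj
R(i), R(j): read block by Pi, Pj
Z(i), Z(j): replace/invalidate block in Pi, Pj
State transitions of the block copy in cache i:
Valid (i)   -> Valid (i)   : R(i), W(i), R(j), Z(j)
Valid (i)   -> Invalid (i) : W(j), Z(i)
Invalid (i) -> Valid (i)   : R(i), W(i)
Invalid (i) -> Invalid (i) : R(j), W(j), Z(i), Z(j)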
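The two-state write-through diagram can be sketched as a small transition function. This is only an illustrative sketch: the function name `next_state` and the string encoding of events are assumptions made here, not part of the slides.

```python
# Two-state write-invalidate protocol (write-through), seen from cache i.
# Event names follow the slide: R/W/Z by the local processor (i) or a remote one (j).
VALID, INVALID = "Valid", "Invalid"

def next_state(state, event):
    """Return the next state of the block copy in cache i (hypothetical helper)."""
    if event in ("R(i)", "W(i)"):
        return VALID      # a local access (re)loads a valid copy
    if event in ("W(j)", "Z(i)"):
        return INVALID    # remote write or local replacement removes the copy
    return state          # R(j) and Z(j) do not affect the local copy

state = INVALID
trace = []
for ev in ["R(i)", "R(j)", "W(j)", "Z(j)", "W(i)", "Z(i)"]:
    state = next_state(state, ev)
    trace.append(state)
print(trace)
```

The trace shows the copy being invalidated by the remote write W(j) and revalidated by the next local write W(i).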
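Write Invalidate: (b) Write-Back policy (also called MSI Protocol)
State transitions of the block copy in cache i (RW = read-write, RO = read-only):
RW      -> RW      : R(i), W(i), Z(j)
RW      -> RO      : R(j)
RW      -> Invalid : W(j), Z(i)
RO      -> RW      : W(i)
RO      -> Invalid : W(j), Z(i)
Invalid -> RO      : R(i)
Invalid -> RW      : W(i)
Invalid -> Invalid : R(j), Z(j), Z(i), W(j)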
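As a rough sketch, the three-state write-back diagram can be written as a lookup table. The names `TRANSITIONS` and `run` are assumptions for illustration. The RO self-loop on R(i), R(j), Z(j) is not legible in the diagram text and is filled in here as the usual MSI behaviour (reads leave a read-only copy read-only).

```python
# Write-back write-invalidate (MSI-style) protocol, seen from cache i.
RW, RO, INV = "RW", "RO", "INV"

TRANSITIONS = {
    RW:  {"R(i)": RW, "W(i)": RW, "Z(j)": RW, "R(j)": RO, "W(j)": INV, "Z(i)": INV},
    # The three RO self-loops below are an assumption (standard MSI behaviour).
    RO:  {"R(i)": RO, "R(j)": RO, "Z(j)": RO, "W(i)": RW, "W(j)": INV, "Z(i)": INV},
    INV: {"R(j)": INV, "W(j)": INV, "Z(i)": INV, "Z(j)": INV, "R(i)": RO, "W(i)": RW},
}

def run(events, state=INV):
    """Apply a sequence of events and return the list of visited states."""
    visited = []
    for ev in events:
        state = TRANSITIONS[state][ev]
        visited.append(state)
    return visited

print(run(["R(i)", "W(i)", "R(j)", "W(j)"]))
```

Starting from Invalid, a local read gives RO, a local write upgrades to RW, a remote read downgrades to RO and a remote write invalidates the copy.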
MSI Protocol:
For any given pair of local caches, the permitted states are:
      M    S    I
M     no   no   yes
S     no   yes  yes
I     yes  yes  yes
This protocol was mainly used in the SGI 4D.
Variants: most modern shared-memory multiprocessors use a variant to reduce the amount of traffic, e.g. the Exclusive state in MESI or the Owned state in MOSI.
MESI Protocol (Papamarcos & Patel 1984):
It is a version of the snooping cache protocol. Each cache block can be in one of four states
INVALID: Not valid.
SHARED: Multiple caches may hold valid copies.
EXCLUSIVE: No other cache has this block, and the M-block (the copy in main memory) is valid. Therefore, writing a block in E state need not send a Cache Invalidate signal to others (simply change the state to M).
MODIFIED: Present only in the current cache, where it is a valid block, but the copy in the M-block is not valid. Writing it back to MM simply changes its state to E.
Event        Local                 Remote
Read Hit     Use local copy        No action
Read Miss    I to S, or I to E     (S,E,M) to S
Write Hit    (S,E,M) to M          (S,I) to I
Write Miss   I to M                (S,E,M) to I
When a cache block changes its status from M, it first updates the main memory.
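The local/remote event table can be sketched as a function over the states of all caches for one block. The function name `mesi_step` and the list encoding are assumptions for illustration, and the main-memory update that precedes any transition out of M is not modeled.

```python
def mesi_step(states, who, event):
    """states: list of 'M', 'E', 'S', 'I', one entry per cache (same block).
    who: index of the cache whose processor issues the access.
    event: 'read' or 'write'. Returns the new list of states."""
    new = list(states)
    others = [i for i in range(len(states)) if i != who]
    if event == "read":
        if states[who] == "I":                          # read miss
            holders = [i for i in others if states[i] != "I"]
            new[who] = "S" if holders else "E"          # I to S, or I to E
            for i in holders:
                new[i] = "S"                            # remote (S,E,M) to S
        # read hit: use the local copy, no remote action
    else:                                               # write hit or write miss
        new[who] = "M"
        for i in others:
            new[i] = "I"                                # remote copies invalidated
    return new

s = ["I", "I"]
history = []
for who, ev in [(0, "read"), (0, "write"), (1, "read"), (1, "write")]:
    s = mesi_step(s, who, ev)
    history.append(s)
print(history)
```

With two caches, the sequence read/write by cache 0 followed by read/write by cache 1 walks through E, M, S and I for cache 0.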
Following the read miss, the holder of the modified copy signals the initiator to try again. Meanwhile, it seizes the bus and writes the updated copy into the main memory.
For any given pair of local caches, the permitted states are:
      M    E    S    I
M     no   no   no   yes
E     no   no   no   yes
S     no   no   yes  yes
I     yes  yes  yes  yes
MOSI Protocol:
It is a version of the snooping cache protocol. Each cache block can be in one of four states
INVALID: Not valid.
SHARED: Multiple caches may hold valid copies.
OWNED: The current cache owns this block. Therefore, it serves (by snooping the bus) the requests of other processors that hold this block in I state. The other caches receive the value from the owning cache much faster than they could retrieve it from MM.
MODIFIED: Present only in the current cache, where it is a valid block, but the copy in the M-block is not valid.
For any given pair of local caches, the permitted states are:
      M    O    S    I
M     no   no   no   yes
O     no   no   yes  yes
S     no   yes  yes  yes
I     yes  yes  yes  yes
MOESI Protocol:
It is a full cache coherency protocol encompassing all possible states commonly used in other protocols. Each cache block can be in one of five states
INVALID: Not valid (the valid copy may be either in MM or in another cache).
SHARED: Multiple caches may hold valid copies. The copy may be dirty w.r.t. MM (if some cache holds the block in Owned state) or clean.
EXCLUSIVE: Contains the most recent correct copy of the data; no other cache has this block and the M-block is valid. Therefore, writing a block in E state need not send a Cache Invalidate signal to others (simply change the state to M or O). May respond to a snoop request with data.
MODIFIED: Present only in the current cache, where it is a valid block, but the copy in the M-block is not valid. Writing it back to MM simply changes its state to E. Must respond to a snoop request with data.
OWNED: Holds a valid block (most recent) shared with other caches, as in Shared state, but the copy in the M-block can be stale. Only one cache can be in O state. Writing it back to MM simply changes its state to S. Can change to M state after invalidating all shared copies. Must respond to a snoop request with data.
For any given pair of local caches, the permitted states are:
      M    O    E    S    I
M     no   no   no   no   yes
O     no   no   no   yes  yes
E     no   no   no   no   yes
S     no   yes  no   yes  yes
I     yes  yes  yes  yes  yes
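The pairwise rule for MOESI follows from the state definitions (M and E are sole copies, only one cache may be in O). A minimal sketch of a checker is given below; the names `ALLOWED` and `coherent` are assumptions for illustration.

```python
from itertools import combinations

# For each state, the states another cache may hold for the same block.
ALLOWED = {
    "M": {"I"},
    "O": {"S", "I"},
    "E": {"I"},
    "S": {"O", "S", "I"},
    "I": {"M", "O", "E", "S", "I"},
}

def coherent(states):
    """True if every pair of per-cache states for one block is permitted."""
    return all(b in ALLOWED[a] for a, b in combinations(states, 2))

print(coherent(["O", "S", "S", "I"]))   # one owner plus sharers
print(coherent(["M", "S"]))             # modified copy cannot coexist with a sharer
```

The same function covers MSI, MESI and MOSI if the unused states simply never appear in the input.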
MESIF Protocol:
It is a full cache coherency protocol developed by Intel for ccNUMA (Cache-coherent Non Uniform Memory Architecture) processors. Each cache block can be in one of five states M, E, S, I and F. Where M, E, S and I are same as used in MESI protocol.
F (Forward): A cache holding a block in F state is the designated responder for any request for that block. Among all the caches that share a block (S state), at most one cache can hold the block in F state.
For any given pair of local caches, the permitted states are:
      M    E    S    I    F
M     no   no   no   yes  no
E     no   no   no   yes  no
S     no   no   yes  yes  yes
I     yes  yes  yes  yes  yes
F     no   no   yes  yes  no
H/W based:
Directory based protocols
- Snooping cannot be used over point-to-point links (which are reliable and efficient)
- MM and processors/caches are connected by a switching network
- The directory records which block is present in which cache (processor), plus one dirty/clean bit
Directory based protocols may be
- Full-Map directory (maintains pointers for all caches)
- Limited directory (only a limited number of pointers, with replacement); the average number of pointers required can be computed statistically
- Chained directory (the chain starts from shared memory; the last entry holds a Chain Terminator, CT)
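A full-map directory entry can be sketched as one presence bit per cache plus a dirty bit. The class below is a minimal illustration under that assumption; the class and method names are hypothetical, and the write-back of a dirty copy before a read is only noted in a comment, not modeled.

```python
class FullMapDirectory:
    """Toy full-map directory: per block, one presence bit per cache and a dirty flag."""

    def __init__(self, n_caches):
        self.n = n_caches
        self.entries = {}   # block -> [presence bits, dirty flag]

    def _entry(self, block):
        return self.entries.setdefault(block, [[0] * self.n, False])

    def read(self, block, cache):
        entry = self._entry(block)
        entry[1] = False            # a dirty owner would write back first (not modeled)
        entry[0][cache] = 1         # record the new sharer

    def write(self, block, cache):
        entry = self._entry(block)
        # Every other cache with its presence bit set must be invalidated.
        invalidated = [c for c in range(self.n) if entry[0][c] and c != cache]
        entry[0] = [1 if c == cache else 0 for c in range(self.n)]
        entry[1] = True
        return invalidated

d = FullMapDirectory(4)
d.read("X", 0)
d.read("X", 2)
print(d.write("X", 1))      # caches that receive an invalidation
print(d.entries["X"])
```

A limited directory would replace the bit vector by a short list of pointers, evicting one sharer when the list is full.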
[Figure: processors P0 and P1 and Memory]
Process migration
Can occur even if independent jobs are executing
I/O
Often fixed via O/S cache flushes
Informally, with coherent caches: accesses to a memory location appear to occur simultaneously in all copies of the memory location (the copies being the caches)
Memory
line A: V line B: I line C: V
Cache 0
line A line B
Cache 1
line A
Cache 2
Shared (S): <1,X,X,...,X,1> -- local cache has a valid copy, main memory has a valid copy, other caches may or may not.
Modified (M): <1,0,0,...,0,0> -- local cache has the only valid copy.
Exclusive (E): <1,0,0,...,0,1> -- local cache has a valid copy, no other caches do, main memory has a valid copy.
Owned (O): <1,X,X,...,X,X> -- local cache has a valid copy, all other caches and memory may have a valid copy.
Only one cache can be in O state. <1,X,1,X,0> is included in O, but not included in any of the others.
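The state patterns can be checked mechanically. The sketch below assumes the vector layout <local cache, other caches..., memory> with 1 meaning "holds a valid copy"; the function name `matches` is hypothetical.

```python
def matches(vec):
    """Return the set of state patterns (S, M, E, O) that a global vector satisfies.
    vec = (local, other caches..., memory); an empty set means the local copy is Invalid."""
    local, others, mem = vec[0], vec[1:-1], vec[-1]
    if local == 0:
        return set()
    found = {"O"}                    # <1,X,...,X>: matches whenever the local copy is valid
    if mem == 1:
        found.add("S")               # <1,X,...,X,1>
    if not any(others):
        found.add("E" if mem == 1 else "M")   # no other cache holds a copy
    return found

print(matches((1, 0, 1, 0, 0)))      # memory stale, another cache shares the line
print(matches((1, 0, 0, 0, 1)))
```

The first call illustrates the case singled out on the slide: a vector with another sharer and stale memory fits only the Owned pattern.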
Example
Memory: line A: V, line B: I, line C: V
Cache 0: line A, line B
Cache 1: line A
Cache 2:
Bus-Based Protocols
Protocol consists of states and actions (state transitions) Actions can be invoked from processor or bus
[Figure: the Cache Controller (State, Tags, Cache Data) sits between the Bus and the Processor; Bus Actions come from the bus side and Processor Actions from the processor side]
MSI Protocol
Action and Next State
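Current state I:
- Processor Read: Cache Read; Acquire Copy; next state S
- Processor Write: Cache Read&M; Acquire Copy; next state M
- Cache Read, Cache Read&M, Cache Upgrade (seen on bus): No Action; next state I
Current state S:
- Processor Read: No Action; next state S
- Processor Write: Cache Upgrade; next state M
- Eviction: No Action; next state I
- Cache Read&M or Cache Upgrade (seen on bus): Invalidate Frame; next state I
Current state M:
- Processor Read: No Action; next state M
- Processor Write: No Action; next state M
- Eviction: Cache Write back; next state I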
MSI Example
Thread Event     Local States: C0   C1   C2
0. Initially                   I    I    I
1. T0 read                     S    I    I
2. T0 write                    M    I    I
3. T2 read                     S    I    S
4. T1 write                    I    M    I
If the line is in no cache, a read-modify-write requires 2 bus transactions.
Optimization: add Exclusive state
State transitions (MSI):
- Local read: I->S, fetch shared
- Local write: I->M, fetch modified; S->M, invalidate other copies
- Remote read: M->S, supply data
- Remote write: M->I, supply data; S->I, invalidate local copy
State transitions (with Exclusive state added):
- Local read: I->E if only copy, I->S if other copies exist
- Local write: E->M silently; S->M, invalidate other copies
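The benefit of the Exclusive state for a read-modify-write on an uncached line can be illustrated by counting bus transactions. This is a simplified sketch; the function name and the `other_sharers` parameter are assumptions for illustration.

```python
def bus_transactions(protocol, other_sharers=False):
    """Count bus transactions for a read followed by a write to a line
    that starts in state I in the local cache. protocol is 'MSI' or 'MESI'."""
    count = 1                                   # read miss: fetch the line
    if protocol == "MESI" and not other_sharers:
        state = "E"                             # only copy: I->E
    else:
        state = "S"                             # I->S
    if state == "S":
        count += 1                              # S->M needs an invalidate/upgrade on the bus
    # E->M happens silently, with no bus transaction
    return count

print(bus_transactions("MSI"), bus_transactions("MESI"))
```

When other caches already share the line, MESI behaves like MSI and still needs the second transaction.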
MESI Protocol
Variation used in Pentium Pro/2/3
(cache to cache transfer) 4-State Protocol
Modified: <1,0,0,...,0>
Exclusive: <1,0,0,...,1>
Shared: <1,X,X,...,1>
Invalid: <0,X,X,...,X>
Bus/Processor Actions
Same as MSI
MESI Protocol
Action and Next State
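Current state I:
- Processor Read: Cache Read; if no sharers: next state E; if sharers: next state S
- Processor Write: Cache Read&M; next state M
- Cache Read, Cache Read&M, Cache Upgrade (seen on bus): No Action; next state I
Current state S:
- Processor Read: No Action; next state S
- Eviction: No Action; next state I
Current state E:
- Processor Read: No Action; next state E
Current state M:
- Processor Read: No Action; next state M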
MESI Example
- Optimization: cache-to-cache transfer for shared data (but only one sharer can respond)
- If modified in some cache and another cache reads, then write back to memory (w/ snarf)
- Optimization: add Owned state (and perform cache-to-cache transfer)
Thread Event     Local States: C0   C1   C2
0. Initially                   I    I    I
1. T0 read                     E    I    I
2. T0 write                    M    I    I
MOESI Optimization
Observation: shared ownership complicates/delays sourcing data
- Owner is responsible for sourcing data to any requestor
- Add O (owner) state to protocol: MOSI/MOESI
- Last requestor becomes the owner
- Ownership can be on a per-node basis in a hierarchically structured system
- Avoids writeback (to memory) of dirty data
- Also called shared-dirty state, since memory is stale
MOESI Protocol
Used in AMD Opteron 5-State Protocol
Modified: <1,0,0,...,0>
Exclusive: <1,0,0,...,1>
Shared: <1,X,X,...,X>
Invalid: <0,X,X,...,X>
Owned: <1,X,X,...,X>; only one owner
MOESI Protocol
Action and Next State
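Current state I:
- Processor Read: Cache Read; if no sharers: next state E; if sharers: next state S
- Processor Write: Cache Read&M; next state M
- Cache Read, Cache Read&M, Cache Upgrade (seen on bus): No Action; next state I
Current state S:
- Processor Read: No Action; next state S
- Eviction: No Action; next state I
- Cache Read: Respond shared; next state S
- Cache Read&M: No Action; next state I
- Cache Upgrade: No Action; next state I
Current state E:
- Processor Read: No Action; next state E
- Eviction: No Action; next state I
- Cache Read: Respond shared; Supply data; next state S
- Cache Read&M: Respond shared; Supply data; next state I
Current state O:
- Processor Read: No Action; next state O
- Cache Read: Respond shared; Supply data; next state O
- Cache Read&M: Respond shared; Supply data; next state I
Current state M:
- Processor Read: No Action; next state M
- Cache Read: Respond shared; Supply data; next state O
- Cache Read&M: Respond shared; Supply data; next state I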
MOESI Example
Thread Event     Local States: C0   C1   C2
0. Initially                   I    I    I
1. T0 read                     E    I    I
2. T0 write                    M    I    I
3. T2 read                     O    I    S
4. T1 write                    I    M    I
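The local-state columns of this MOESI example can be reproduced with a small simulation. This is a simplified sketch: `moesi_step` is a hypothetical helper, the event sequence (T0 read, T0 write, T2 read, T1 write) is assumed to be the same one used in the other examples of these slides, and bus actions and data sourcing are not modeled.

```python
def moesi_step(states, who, op):
    """states: per-cache MOESI state for one line; who: index of the accessing cache;
    op: 'read' or 'write'. Returns the new list of states."""
    new = list(states)
    others = [i for i in range(len(states)) if i != who]
    if op == "read":
        if states[who] == "I":                        # read miss
            holders = [i for i in others if states[i] != "I"]
            new[who] = "S" if holders else "E"
            for i in holders:
                # A dirty holder (M or O) keeps ownership; a clean one drops to S.
                new[i] = "O" if states[i] in ("M", "O") else "S"
    else:                                             # write: invalidate all other copies
        new[who] = "M"
        for i in others:
            new[i] = "I"
    return new

states = ["I", "I", "I"]
rows = [tuple(states)]
for who, op in [(0, "read"), (0, "write"), (2, "read"), (1, "write")]:
    states = moesi_step(states, who, op)
    rows.append(tuple(states))
for row in rows:
    print(row)
```

The key difference from MESI shows up at the T2 read: C0 moves from M to O and keeps the dirty data instead of writing it back and dropping to S.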
Update Protocols
Basic idea:
All writes (updates) are made visible to all caches:
- (address, value) tuples sent everywhere
- Similar to write-through protocol for uniprocessor caches
- Obviously not scalable beyond a few processors
- No one actually builds machines this way
Simple optimization
- Send updates to memory/directory
- Directory propagates updates to all known copies: less bandwidth
- Write-combining of adjacent updates (if consistency model allows)
- Send write-combined data
- Delay sending write-combined data until requested
- Writes are combined into larger units, updates are delayed until needed
- Effectively the same as invalidate protocol
- Includes Cache Update action
- Includes Cache Writeback action
- Bus includes Shared flag
- Appears to also require memory inhibit signal: distinguishes the shared case where a cache (not memory) supplies data
Sc
Respond Shared; Sc
No Action E
No Action I
Respond shared; Supply data Sc Respond shared; Supply data; Sm Respond shared; Update copy; Sc
No Action M
Cache Write-back I
Example
Thread Event     Bus Action    Data From   Global State   Local States: C0   C1   C2
0. Initially                               <0,0,0,1>                    I    I    I
1. T0 read       CR            Memory      <1,0,0,1>                    E    I    I
2. T0 write      none                      <1,0,0,0>                    M    I    I
3. T2 read       CR            C0          <1,0,1,0>                    Sm   I    Sc
4. T1 write      CR,CU         C0          <1,1,1,0>                    Sc   Sm   Sc
5. T0 read       none (hit)    C0          <1,1,1,0>                    Sc   Sm   Sc
Sc
Respond Shared; Sc
Respond Shared Sc
Ec
No Action Ec
No Action I
Respond shared; Sc
Respond Shared Sc
Em
No Action Em
Cache Write-back I
Update vs Invalidate
[Weber & Gupta, ASPLOS3]
Consider sharing patterns
No Sharing
- Independent threads
- Coherence due to thread migration
- Update protocol performs many wasteful updates
Read-Only
No significant coherence issues; most protocols work well
Migratory Objects
- Manipulated by one processor at a time
- Often protected by a lock
- Usually a write causes only a single invalidation
- E state useful for Read-modify-Write patterns
- Update protocol could proliferate copies
Many Writers/Readers
- Invalidate probably works better
- Update will proliferate copies
Nasty Realities
- State diagram is for an (ideal) protocol assuming instantaneous actions
- In reality the controller implements more complex diagrams
- A protocol state transition may be started by the controller when bus activity changes the local state
- Example: an upgrade is pending (waiting for the bus) when an invalidate for the same line arrives
IR
IW
IRR
IWR
SW
Respond Shared:
No Action IW
No Action IW
Further Optimizations
Observation: Shared blocks should only be fetched from memory once
If I find a shared block on chip, forward the block Problem: multiple shared blocks possible, who forwards?
Everyone? Power/bandwidth wasted
Very old idea (IBM machines have done this for a long time), but recent Intel patent issued anyway [Hum/Goodman]
Further Optimizations
Observation: migratory data often flies by
- Add T (transition) state to protocol
- Tag is still valid, data isn't
- Data can be snarfed as it flies by
- Only works with certain kinds of interconnect networks
- Replacement policy issues
Inbound snoop rate: $s_i = n \times s_o$
Snoop Bandwidth
Snoop filtering of various kinds is possible Filter snoops at sink: Jetty filter [Moshovos et al., HPCA 2001]
Check small filter cache that summarizes contents of local cache Avoid power-hungry lookups in each tag array
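The idea of filtering snoops at the sink can be sketched with a small counting filter that summarizes which lines the local cache may hold. This is NOT the actual Jetty design from the cited paper; it is a simplified illustration with hypothetical names, using address modulo table size as an assumed hash.

```python
class SnoopFilter:
    """Toy counting filter: answers 'definitely not cached' or 'maybe cached'."""

    def __init__(self, slots=64):
        self.counts = [0] * slots

    def _slot(self, addr):
        return addr % len(self.counts)       # assumed hash: low-order address bits

    def on_fill(self, addr):
        self.counts[self._slot(addr)] += 1   # line brought into the local cache

    def on_evict(self, addr):
        self.counts[self._slot(addr)] -= 1   # line removed from the local cache

    def may_hold(self, addr):
        # False means the tag array lookup can be skipped for this snoop.
        return self.counts[self._slot(addr)] > 0

f = SnoopFilter(slots=8)
f.on_fill(3)
print(f.may_hold(3), f.may_hold(11), f.may_hold(4))
```

The filter can report false positives (address 11 shares a slot with 3) but never false negatives, so skipping the tag lookup on a negative answer is safe.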
Snoop Latency
Snoop latency:
- Must reach all nodes, return and combine responses
- Probably the bigger scalability issue in the future
- Topology matters: ring, mesh, torus, hypercube
- No obvious solutions
[Figure: transaction timeline with phases LDir, XSnp, RDir, XRsp, CRsp, XRd and data phases RDat, XDat, UDat]
Powerful filtering effect: only observe commands that you need to observe Meanwhile, bandwidth at directory scales by adding memory controllers as you increase size of the system
Leads to very scalable designs (100s to 1000s of CPUs)
Directory shortcomings
- Indirection through directory has latency penalty
- If shared line is dirty in another CPU's cache, directory must forward request, adding latency
- This can severely impact performance of applications with heavy sharing (e.g. relational databases)
- Overlap directory read with data read
- Best possible latency given NUMA arrangement
[Figure: Processors, each with a Cache, connected through an Interconnection Network to Memory Modules, each with a Directory]
Memory Controller becomes an active participant
Sharing info held in memory directory
Directory may be distributed
Example
Local cache suffers load miss Line in remote cache in M state
It is the owner
[Figure: processor read; Processors with Caches and their Local, Owner and Remote Controllers connected through the Interconnect to Memory Controllers, each with a Directory and Memory Banks]
I'
I''
S'
Memory Data; Add Requestor to Sharers; M Memory Invalidate All Sharers; M' Memory Read&M; to Owner M' Memory Upgrade All Sharers; M'' No Action I
Make Sharers Empty; U Memory Data to Requestor; Write memory; Add Requestor to Sharers; S When all ACKS Memory Data; M When all ACKS then M Memory Data to Requestor; M
S'
M'
M''
Another Example
Local write (miss) to shared line Requires invalidations and acks
[Figure: processor write; Local Controller and Remote Controllers connected through the Interconnect to Memory Controllers with Directories and Memory Banks; message labels: cache Read&M, cache ack]
Example Sequence
Similar to earlier sequences
Thread Event     Controller Actions   Data From   Global State   Local States: C0   C1   C2
0. Initially                                      <0,0,0,1>
1. T0 read                            Memory      <1,0,0,1>                    S    I    I
2. T0 write                                       <1,0,0,0>                    M    I    I
3. T2 read       CR,MR,MD             C0          <1,0,1,1>                    S    I    S
4. T1 write      CRM,MI,CA,MD         Memory      <0,1,0,0>                    I    M    I
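The global-state column of this sequence can be reproduced with a very small model of an invalidation-based directory protocol. The sketch assumes the vector layout <C0, C1, C2, memory> and ignores the individual controller messages; `directory_step` is a hypothetical helper.

```python
def directory_step(vec, who, op):
    """vec = (C0, C1, C2, memory) validity bits; who = index of the accessing cache;
    op = 'read' or 'write'. Returns the new global state vector."""
    *caches, mem = vec
    if op == "read":
        caches[who] = 1
        mem = 1                        # a dirty owner, if any, supplies data and memory is updated
    else:
        caches = [0] * len(caches)     # all other sharers are invalidated (and acknowledge)
        caches[who] = 1
        mem = 0                        # memory becomes stale
    return (*caches, mem)

vec = (0, 0, 0, 1)
seq = [vec]
for who, op in [(0, "read"), (0, "write"), (2, "read"), (1, "write")]:
    vec = directory_step(vec, who, op)
    seq.append(vec)
print(seq)
```

Each tuple corresponds to one row of the example: T0 read, T0 write, T2 read, T1 write.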
[Figure: two message flows, a) and b), among the Local Controller, Owner Controller and Memory Controller, with messages cache read, memory read, memory data, owner data and owner ack]
Predict sharers
Protocol Races
Atomic bus
- Only stable states in protocol (e.g. M, S, I)
- All state transitions are atomic (I->M)
- No conflicting requests can interfere since the bus is held till the transaction completes
Distinguish coherence transaction from data transfer
- Data transfer can still occur much later; easier to handle this case
Read after write (similar to prev. WAW)
Writeback race: P0 evicts dirty block, P1 reads
- Dirty block is in the network (no copy at P0 or at dir)
- NAK P1, or force P0 to keep copy till dir ACKs WB
Summary
- Coherence States
- Snoopy bus-based Invalidate Protocols
- Invalidate protocol optimizations
- Update Protocols (Dragon/Firefly)
- Directory protocols
- Implementation issues