
Cache Coherency in Multiprocessors (MPs)/ Multicores

Topic 9

Cache Coherency Problem


Multiple processors/cores, each with a local cache, sharing main memory (MM) — the necessary condition for the problem to arise.

The copies of a block in the cache memories and in MM may not be the same.

Cache Coherence
The goal is to ensure that READ(X) returns the most recent value of the shared variable X, i.e. that all valid copies of a shared variable are identical. The problem originates from:
- Sharing of writable data
- Process migration (a process moves from one processor to another during its execution)
- I/O activity (reading from and writing to MM)

Cache Coherence
Sharing of writable data (2 processors/cores):
Before any update, all 3 copies of X (P1's cache, P2's cache, MM) are the same.
Write-through (cache and MM always agree): if P1 writes X through to MM while P2 keeps reading X from its local cache, two different copies exist.
Write-back: all 3 copies may differ.

Process migration:
Write-back: a process updates X on P1 and then migrates to P2; the up-to-date value is still only in P1's cache.
Write-through: a process running on P1 updates X and then moves to P2, where a stale copy of the corresponding cache block may still exist.

Cache Coherence
I/O activities:
Write-through: when input from an I/O device updates MM, the MM copy changes but the cached copies remain stale.
Write-back: on output to an I/O device, the old value in MM (rather than the dirty cache copy) is written to the device.
One solution is to share a local cache between a processor and an IOP (I/O processor); this simplifies matters because incoherence then exists only between MM and the cache copies. The price is performance degradation, since the processor's locality of reference (LOR) in the shared cache is hampered.

Cache Coherence
Solutions:
Software based or Hardware based

S/W based:
The compiler tags data as cacheable or non-cacheable. Only read-only data is considered cacheable and may be placed in a private cache. All other data are non-cacheable and can be placed in a global cache, if one is available.

Cache Coherence
H/W based:
Snooping (snoopy) protocols:
- MM and the caches are connected by a bus (a broadcast medium).
- Whatever communication takes place, all others can listen in (each cache controller constantly monitors the bus).
- Two main categories: cache-block invalidation and cache-block update.

Cache Coherence
Cache-block update vs. cache-block invalidation protocols:
- In an update protocol, updates are sent over the inter-core/inter-processor bus (broadcast medium) and every cache holding a matching block applies the update. Subsequent updates to the same variable generate extra traffic.
- In an invalidation protocol, an invalidation signal is sent over the bus for the updated cache block and every cache holding a matching block invalidates its copy. Subsequent updates to the same variable generate no further traffic.

Cache Coherence
Write Invalidate: (a) Write-Through policy
Notation: W(i), W(j) = write to the block by Pi / Pj; R(i), R(j) = read of the block by Pi / Pj; Z(i), Z(j) = replace/invalidate the block in Pi / Pj.
State diagram for the copy in cache i (two states):
Valid(i): stays Valid on R(i), W(i), R(j), Z(j); goes to Invalid on W(j) or Z(i).
Invalid(i): stays Invalid on R(j), W(j), Z(i), Z(j); becomes Valid on R(i) or W(i).

Cache Coherence
Write Invalidate: (b) Write-Back policy (also called the MSI protocol)
State diagram for the copy in cache i (three states: RW ≈ Modified, RO ≈ Shared, Invalid):
RW(i): stays RW on R(i), W(i), Z(j); goes to RO on R(j); goes to Invalid on W(j) or Z(i).
RO(i): stays RO on R(i), R(j), Z(j); goes to RW on W(i); goes to Invalid on W(j) or Z(i).
Invalid(i): stays Invalid on R(j), W(j), Z(i), Z(j); becomes RO on R(i); becomes RW on W(i).

MSI Protocol: the basic cache coherence protocol used in MPs. Each cache block can be in one of three states:
MODIFIED: the block has been modified and is inconsistent with the main-memory copy, so a block in M state must be written back to MM when it is evicted.
INVALID: the copy is not valid; the block must be fetched from MM or from another cache.
SHARED: multiple caches may hold valid copies (identical to the copy in MM).
These states are maintained through communication between the caches and MM.
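As an illustration, here is a minimal sketch (not from the slides; state names and bus actions BusRd, BusRdX and BusUpgr are the generic ones) of the processor-side MSI transitions. A real controller would also handle snooped bus events and write-backs.

```cpp
#include <cstdio>

enum class MsiState { Modified, Shared, Invalid };
enum class BusAction { None, BusRd, BusRdX, BusUpgr };

// Processor-side transition for one cache block: returns the bus action
// that must be issued and updates the local state.
BusAction on_processor_access(MsiState& s, bool is_write) {
    switch (s) {
    case MsiState::Invalid:
        s = is_write ? MsiState::Modified : MsiState::Shared;
        return is_write ? BusAction::BusRdX : BusAction::BusRd;
    case MsiState::Shared:
        if (is_write) { s = MsiState::Modified; return BusAction::BusUpgr; }
        return BusAction::None;                  // read hit
    case MsiState::Modified:
        return BusAction::None;                  // read or write hit
    }
    return BusAction::None;
}

int main() {
    MsiState s = MsiState::Invalid;
    on_processor_access(s, false);  // read miss: I -> S via BusRd
    on_processor_access(s, true);   // write hit in S: S -> M via BusUpgr
    std::printf("final state is M? %d\n", s == MsiState::Modified);
}
```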

Cache Coherence

MSI Protocol:

For any given pair of local caches, the permitted state combinations are:
        M    S    I
   M    no   no   yes
   S    no   yes  yes
   I    yes  yes  yes
(M in one cache is compatible only with I in the other; S is compatible with S and I.)

This protocol was mainly used in the SGI 4D machines. Variants: most modern shared-memory MPs use a variant to reduce the amount of traffic, e.g. the Exclusive state in MESI or the Owned state in MOSI.

Cache Coherence
MESI Protocol (Papamarcos & Patel 1984):
It is a version of the snooping cache protocol. Each cache block can be in one of four states:
INVALID: not valid.
SHARED: multiple caches may hold valid copies.
EXCLUSIVE: no other cache has this block and the MM copy is valid. Therefore a write to a block in E state need not send a cache-invalidate signal to the others (it simply changes the state to M).
MODIFIED: present only in the current cache as a valid block, but the copy in MM is not valid. Writing it back to MM simply changes its state to E.
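A small sketch (my own illustration, not from the slides) of how the E state removes bus traffic on a private write: a read miss consults the shared signal to choose E or S, and a later write hit in E upgrades to M silently, whereas a write hit in S must broadcast an upgrade/invalidate.

```cpp
enum class MesiState { Modified, Exclusive, Shared, Invalid };

// Read miss: the snooped "shared" signal says whether any other cache holds
// the block; if not, install it in E so a later write can be silent.
MesiState on_read_miss(bool other_caches_have_copy) {
    return other_caches_have_copy ? MesiState::Shared : MesiState::Exclusive;
}

// Write hit: returns true if a bus invalidate/upgrade must be broadcast.
bool on_write_hit(MesiState& s) {
    switch (s) {
    case MesiState::Exclusive: s = MesiState::Modified; return false; // silent upgrade
    case MesiState::Modified:  return false;                          // already owned
    case MesiState::Shared:    s = MesiState::Modified; return true;  // invalidate others
    case MesiState::Invalid:   return true;  // not a hit; handled as a write miss
    }
    return true;
}
```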

Cache Coherence
MESI Protocol (Papamarcos & Patel 1984):
Event         Local cache                                     Remote caches
Read Hit      use local copy                                  no action
Read Miss     I -> S (if sharers exist) or I -> E (if none)   (S,E,M) -> S
Write Hit     (S,E,M) -> M                                    (S,I) -> I
Write Miss    I -> M                                          (S,E,M) -> I

When a cache block leaves the M state, it first updates main memory.

Cache Coherence
MESI Protocol (Papamarcos & Patel 1984):
On a read miss to a block held Modified elsewhere, the holder of the modified copy signals the initiator to try again. Meanwhile, it seizes the bus and writes the updated copy back to main memory.

Cache Coherence

MESI Protocol:

For any given pair of local caches, the permitted state combinations are:
        M    E    S    I
   M    no   no   no   yes
   E    no   no   no   yes
   S    no   no   yes  yes
   I    yes  yes  yes  yes
(M or E in one cache is compatible only with I in the other; S is compatible with S and I.)

This protocol was mainly used in Intel's Pentium processors.

Cache Coherence
MOSI Protocol:
It is a version of the snooping cache protocol. Each cache block can be in one of four states:
INVALID: not valid.
SHARED: multiple caches may hold valid copies.
OWNED: the current cache owns this block; by snooping the bus it serves requests for the block from other processors whose copy is Invalid. The other caches receive the value from the owning cache much faster than they could retrieve it from MM.
MODIFIED: present only in the current cache as a valid block, but the copy in MM is not valid.

Cache Coherence

MOSI Protocol:

For any given pair of local caches, the permitted state combinations are:
        M    O    S    I
   M    no   no   no   yes
   O    no   no   yes  yes
   S    no   yes  yes  yes
   I    yes  yes  yes  yes
(M is compatible only with I; only one cache may hold the block in O, with the other copies in S or I.)

Cache Coherence
MOESI Protocol:
It is a full cache coherency protocol encompassing all of the states commonly used in other protocols. Each cache block can be in one of five states:
INVALID: not valid (a valid copy may exist in MM or in another cache).
SHARED: multiple caches may hold valid copies. The shared copies may be dirty with respect to MM (if some cache holds the block in the Owned state) or clean.

Cache Coherence
MOESI Protocol:
EXCLUSIVE: holds the most recent, correct copy of the data; no other cache has the block and the MM copy is valid. A write to a block in E state therefore need not send a cache-invalidate signal to the others (it simply changes the state to M or O). May respond to a snoop request with data.
MODIFIED: present only in the current cache as a valid block, but the copy in MM is not valid. Writing it back to MM simply changes its state to E. Must respond to a snoop request with data.
OWNED: holds a valid (most recent) copy of the block, possibly alongside other caches in the Shared state, while the copy in MM may be stale. Only one cache can hold a block in the O state. Writing the block back to MM changes its state to S; it can change to M after invalidating all shared copies. Must respond to a snoop request with data.
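To make the Owned state concrete, here is a minimal snoop-side sketch (my own illustration, with assumed names) showing which states must supply data to a snooped read and where they transition; in MOESI the M holder becomes O and keeps sourcing the line without writing it back to memory.

```cpp
enum class MoesiState { Modified, Owned, Exclusive, Shared, Invalid };

struct SnoopReply {
    bool supply_data;   // this cache sources the line (cache-to-cache transfer)
    MoesiState next;    // state after the snooped read
};

// Reaction of one cache to a snooped read (another core's read miss).
SnoopReply on_snooped_read(MoesiState s) {
    switch (s) {
    case MoesiState::Modified:  return {true,  MoesiState::Owned};   // keep dirty data, become owner
    case MoesiState::Owned:     return {true,  MoesiState::Owned};   // owner keeps sourcing the line
    case MoesiState::Exclusive: return {true,  MoesiState::Shared};  // clean copy, may still supply it
    case MoesiState::Shared:    return {false, MoesiState::Shared};  // memory (or the owner) responds
    case MoesiState::Invalid:   return {false, MoesiState::Invalid};
    }
    return {false, s};
}
```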

Cache Coherence
MOESI Protocol:
For any given pair of local caches, the permitted state combinations are:
        M    O    E    S    I
   M    no   no   no   no   yes
   O    no   no   no   yes  yes
   E    no   no   no   no   yes
   S    no   yes  no   yes  yes
   I    yes  yes  yes  yes  yes

Cache Coherence
MESIF Protocol:
It is a full cache coherency protocol developed by Intel for ccNUMA (cache-coherent non-uniform memory architecture) processors. Each cache block can be in one of five states: M, E, S, I and F, where M, E, S and I are the same as in the MESI protocol.
F (Forward): a cache block in F state means the designated cache is the responder for any request for that block. Among all caches holding a block in S state, exactly one holds it in F state.
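A brief sketch (illustrative only, with assumed structures) of the F-state hand-off rule described above: the cache in F supplies the data on a read, and the most recent requestor takes over the F role while the previous forwarder drops to plain S.

```cpp
#include <vector>

enum class MesifState { Modified, Exclusive, Shared, Invalid, Forward };

// One entry per cache for a single block; at most one entry may be Forward.
// On a read miss by 'requestor', the forwarder (if any) supplies the data,
// demotes itself to Shared, and the requestor becomes the new forwarder.
void mesif_read_miss(std::vector<MesifState>& caches, int requestor) {
    int forwarder = -1;
    for (int i = 0; i < static_cast<int>(caches.size()); ++i)
        if (caches[i] == MesifState::Forward) forwarder = i;

    if (forwarder >= 0) {
        caches[forwarder] = MesifState::Shared;   // old forwarder keeps a plain copy
        caches[requestor] = MesifState::Forward;  // last requestor becomes the responder
    } else {
        // No forwarder on chip: data comes from memory; install as F so the next
        // sharer can be served cache-to-cache (a real controller would use E when
        // this is the only cached copy).
        caches[requestor] = MesifState::Forward;
    }
}
```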

Cache Coherence
MESIF Protocol:
For any given pair of local caches, the permitted state combinations are:
        M    E    S    I    F
   M    no   no   no   yes  no
   E    no   no   no   yes  no
   S    no   no   yes  yes  yes
   I    yes  yes  yes  yes  yes
   F    no   no   yes  yes  no
(as in MESI, with the addition that at most one cache holds a block in F, alongside copies in S or I)

Cache Coherence
H/W based:
Directory-based protocols:
- Snooping cannot be used over point-to-point links (which are reliable and efficient); MM and the processors/caches are connected by a switching network.
- The directory records which block is present in which cache (processor), plus one dirty/clean bit.

Cache Coherence
H/W based:
Directory-based protocols may use a
- Full-map directory (maintains pointers for all caches),
- Limited directory (only a limited number of pointers, with replacement; the average number of pointers required can be estimated statistically), or
- Chained directory (the chain starts from shared memory and the last entry holds a chain terminator, CT).
A sketch of the full-map organization follows.
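Minimal sketch (assumed layout, not from the slides) of a full-map directory entry: one presence bit per processor plus a dirty bit, which is exactly the per-block state a full-map directory must keep.

```cpp
#include <bitset>
#include <cstddef>

constexpr std::size_t kNumProcessors = 8;  // assumed system size

// Full-map directory entry for one memory block.
struct DirEntry {
    std::bitset<kNumProcessors> present;  // which caches hold a copy
    bool dirty = false;                   // true: exactly one cache holds a modified copy
};

// Record a read miss serviced for processor p: add it to the sharer set.
void record_read(DirEntry& e, std::size_t p) {
    e.present.set(p);
}

// Record a write by processor p: all other copies are invalidated.
void record_write(DirEntry& e, std::size_t p) {
    e.present.reset();
    e.present.set(p);
    e.dirty = true;
}
```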

Cache Coherence Problem


[Figure: P0 and P1, each with a private cache, share Memory, which initially holds A = 0. P0 and P1 both load A, so each cache holds A = 0. P0 then stores A <= 1: P0's copy becomes 1 while P1's cached copy (and possibly memory) still holds the stale value 0, so a subsequent load of A on P1 returns 0.]

Possible Causes of Incoherence


Sharing of writeable data
Cause most commonly considered

Process migration
Can occur even if independent jobs are executing

I/O
Often fixed via O/S cache flushes


Cache Coherence

Informally, with coherent caches, accesses to a memory location appear to occur simultaneously in all copies of that memory location ("copies" includes the copies held in caches).

Cache coherence suggests an absolute time scale -- this is not necessary


What is required is the "appearance" of coherence, not absolute coherence. For example, temporary incoherence between memory and a write-back cache may be acceptable.

Update vs. Invalidation Protocols


Coherent Shared Memory
All processors see the effects of others' writes

How/when writes are propagated

Determined by the coherence protocol

Global Coherence States


A memory line can be present (valid) in any of the caches and/or in memory. Represent the global state with an (N+1)-element vector:
First N components => cache states (valid/invalid); (N+1)st component => memory state (valid/invalid)

Example: Line A: <1,1,0,1>   Line B: <1,0,0,0>   Line C: <0,0,0,1>

[Figure: Memory holds line A: V, line B: I, line C: V. Cache 0 holds lines A and B; Cache 1 holds line A; Cache 2 holds nothing.]

Local Coherence States


Individual caches can maintain a summary of the state of memory lines from a local perspective.
This reduces the storage needed to maintain state, and may carry only partial information.
Invalid (I): <0,X,...,X> -- the local cache does not have a valid copy (a cache miss). Don't confuse the Invalid state with an empty frame.
Shared (S): <1,X,...,X,1> -- the local cache has a valid copy, main memory has a valid copy, other caches unknown.
Modified (M): <1,0,...,0,0> -- the local cache has the only valid copy.
Exclusive (E): <1,0,...,0,1> -- the local cache has a valid copy, no other cache does, and main memory has a valid copy.
Owned (O): <1,X,...,X,X> -- the local cache has a valid copy; other caches and memory may also have valid copies. Only one cache can be in the O state. <1,X,1,X,...,0> is included in O but not in any of the other states.

Example

[Figure: Memory holds line A: V, line B: I, line C: V. Cache 0 holds lines A and B; Cache 1 holds line A; Cache 2 holds nothing.]

Local states:
Cache 0:  line A: S   line B: M   line C: I
Cache 1:  line A: S   line B: I   line C: I
Cache 2:  line A: I   line B: I   line C: I
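As a small illustration (not from the slides), the global state vectors from this example can be encoded directly as bit vectors, with one bit per cache plus one for memory:

```cpp
#include <bitset>
#include <iostream>

// Global state for one line in a 3-cache system:
// bits 0..2 = valid in Cache 0..2, bit 3 = valid in Memory.
using GlobalState = std::bitset<4>;

GlobalState make_state(bool c0, bool c1, bool c2, bool mem) {
    GlobalState g;
    g[0] = c0; g[1] = c1; g[2] = c2; g[3] = mem;
    return g;
}

int main() {
    // Matches the example above.
    GlobalState lineA = make_state(1, 1, 0, 1);  // <1,1,0,1>
    GlobalState lineB = make_state(1, 0, 0, 0);  // <1,0,0,0>
    GlobalState lineC = make_state(0, 0, 0, 1);  // <0,0,0,1>

    std::cout << "line A valid in memory? " << lineA[3] << "\n";
    std::cout << "line B has any copy besides Cache 0? "
              << (lineB[1] || lineB[2] || lineB[3]) << "\n";
    std::cout << "line C cached anywhere? "
              << (lineC[0] || lineC[1] || lineC[2]) << "\n";
}
```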

Snoopy Cache Coherence


All requests are broadcast on the bus; all processors and memory snoop and respond. A cache block is writeable at one processor or read-only at several.
Single-writer protocol

Snoops that hit dirty lines?

Flush the modified data out of the cache: either write back to memory and then satisfy the remote miss from memory, or provide the dirty data directly to the requestor. These dirty/coherence/sharing misses are a big problem in MP systems.

Bus-Based Protocols
A protocol consists of states and actions (state transitions). Actions can be invoked from the processor side or from the bus side.

[Figure: the cache controller sits between the processor and the bus; it holds the state, tags and cache data, and reacts to processor actions from below and to bus actions snooped from above.]

Minimal Coherence Protocol FSM


[Source: Patterson/Hennessy, Comp. Org. & Design]


MSI Protocol
Action and Next State

State I:
  Processor Read:   Cache Read; acquire copy; -> S
  Processor Write:  Cache Read&M; acquire copy; -> M
  Eviction / snooped Cache Read, Cache Read&M, Cache Upgrade:  no action; stay I

State S:
  Processor Read:   no action; S
  Processor Write:  Cache Upgrade; -> M
  Eviction:         no action; -> I
  Snooped Cache Read:     no action; S
  Snooped Cache Read&M:   invalidate frame; -> I
  Snooped Cache Upgrade:  invalidate frame; -> I

State M:
  Processor Read / Write:  no action; M
  Eviction:                Cache Write-back; -> I
  Snooped Cache Read:      memory inhibit; supply data; -> S
  Snooped Cache Read&M:    invalidate frame; memory inhibit; supply data; -> I

MSI Example
Step  Thread Event   Bus Action  Data From  Global State  C0  C1  C2
0.    Initially       --          --         <0,0,0,1>     I   I   I
1.    T0 read         CR          Memory     <1,0,0,1>     S   I   I
2.    T0 write        CU          --         <1,0,0,0>     M   I   I
3.    T2 read         CR          C0         <1,0,1,1>     S   I   S
4.    T1 write        CRM         Memory     <0,1,0,0>     I   M   I

If the line is in no cache, a read-modify-write requires 2 bus transactions. Optimization: add an Exclusive state.

Invalidate Protocol Optimizations


Observation: data can be read shared
Add S (shared) state to protocol: MSI

State transitions:
Local read: I->S, fetch shared
Local write: I->M, fetch modified; S->M, invalidate other copies
Remote read: M->S, supply data
Remote write: M->I, supply data; S->I, invalidate local copy

Observation: data can be write-private (e.g. stack frame)


Avoid invalidate messages in that case: add the E (exclusive) state to the protocol: MESI

State transitions:
Local read: I->E if only copy, I->S if other copies exist
Local write: E->M silently; S->M, invalidate other copies

MESI Protocol
Variation used in Pentium Pro/2/3
(cache-to-cache transfer) 4-state protocol:
Modified: <1,0,...,0,0>   Exclusive: <1,0,...,0,1>   Shared: <1,X,...,X,1>   Invalid: <0,X,...,X>

Bus/Processor Actions
Same as MSI

Adds shared signal to indicate if other caches have a copy


MESI Protocol
Action and Next State

State I:
  Processor Read:   Cache Read; if no sharers -> E, if sharers -> S
  Processor Write:  Cache Read&M; -> M
  Eviction / snooped Cache Read, Cache Read&M, Cache Upgrade:  no action; stay I

State S:
  Processor Read:   no action; S
  Processor Write:  Cache Upgrade; -> M
  Eviction:         no action; -> I
  Snooped Cache Read:     respond shared; S
  Snooped Cache Read&M:   no action; -> I
  Snooped Cache Upgrade:  no action; -> I

State E:
  Processor Read:   no action; E
  Processor Write:  no action (silent); -> M
  Eviction:         no action; -> I
  Snooped Cache Read:     respond shared; -> S
  Snooped Cache Read&M:   no action; -> I

State M:
  Processor Read / Write:  no action; M
  Eviction:                Cache Write-back; -> I
  Snooped Cache Read:      respond dirty; write back data; -> S
  Snooped Cache Read&M:    respond dirty; write back data; -> I

MESI Example
Optimization: cache-to-cache transfer for shared data (but only one sharer may respond). If the block is modified in some cache and another cache reads it, it must be written back to memory (with memory snarfing the data off the bus).
Optimization: add an Owned state (and perform cache-to-cache transfer).

Step  Thread Event   Bus Action  Data From  Global State  C0  C1  C2
0.    Initially       --          --         <0,0,0,1>     I   I   I
1.    T0 read         CR          Memory     <1,0,0,1>     E   I   I
2.    T0 write        none        --         <1,0,0,0>     M   I   I

MOESI Optimization
Observation: shared ownership complicates/delays sourcing data
The owner is responsible for sourcing data to any requestor. Add the O (owner) state to the protocol: MOSI/MOESI. The last requestor becomes the owner. Ownership can be on a per-node basis in a hierarchically structured system. This avoids the write-back (to memory) of dirty data. O is also called the shared-dirty state, since memory is stale.

MOESI Protocol
Used in the AMD Opteron. 5-state protocol:
Modified: <1,0,...,0,0>   Exclusive: <1,0,...,0,1>   Shared: <1,X,...,X>   Invalid: <0,X,...,X>   Owned: <1,X,...,X> (only one owner)

Owner can supply data, so memory does not have to


Avoids lengthy memory access


MOESI Protocol
Action and Next State

State I:
  Processor Read:   Cache Read; if no sharers -> E, if sharers -> S
  Processor Write:  Cache Read&M; -> M
  Eviction / snooped Cache Read, Cache Read&M, Cache Upgrade:  no action; stay I

State S:
  Processor Read:   no action; S
  Processor Write:  Cache Upgrade; -> M
  Eviction:         no action; -> I
  Snooped Cache Read:     respond shared; S
  Snooped Cache Read&M:   no action; -> I
  Snooped Cache Upgrade:  no action; -> I

State E:
  Processor Read:   no action; E
  Processor Write:  no action (silent); -> M
  Eviction:         no action; -> I
  Snooped Cache Read:     respond shared; supply data; -> S
  Snooped Cache Read&M:   respond shared; supply data; -> I

State O:
  Processor Read:   no action; O
  Processor Write:  Cache Upgrade; -> M
  Eviction:         Cache Write-back; -> I
  Snooped Cache Read:     respond shared; supply data; O
  Snooped Cache Read&M:   respond shared; supply data; -> I
  Snooped Cache Upgrade:  no action; -> I

State M:
  Processor Read / Write:  no action; M
  Eviction:                Cache Write-back; -> I
  Snooped Cache Read:      respond shared; supply data; -> O
  Snooped Cache Read&M:    respond shared; supply data; -> I

MOESI Example
Step  Thread Event   Bus Action  Data From  Global State  C0  C1  C2
0.    Initially       --          --         <0,0,0,1>     I   I   I
1.    T0 read         CR          Memory     <1,0,0,1>     E   I   I
2.    T0 write        none        --         <1,0,0,0>     M   I   I
3.    T2 read         CR          C0         <1,0,1,0>     O   I   S
4.    T1 write        CRM         C0         <0,1,0,0>     I   M   I

Update Protocols
Basic idea:
All writes (updates) are made visible to all caches: (address, value) tuples are sent everywhere. This is similar to a write-through policy in uniprocessor caches.
Obviously not scalable beyond a few processors; no one actually builds machines this way.

Simple optimization:
Send updates to the memory/directory; the directory propagates updates to all known copies: less bandwidth.

Further optimizations: combine & delay
Write-combine adjacent updates (if the consistency model allows), send the write-combined data, and delay sending it until it is requested.

Logical end result:
Writes are combined into larger units and updates are delayed until needed -- effectively the same as an invalidate protocol. A small write-combining sketch follows.
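An illustrative sketch (assumed design, not from the slides) of write-combining for an update protocol: adjacent byte updates to the same line are merged in a small buffer and broadcast as one combined update when the line is flushed.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// A toy write-combining buffer for one 64-byte line in an update protocol.
struct WriteCombineBuffer {
    uint64_t line_addr = 0;
    uint8_t  data[64] = {};
    uint64_t valid_bytes = 0;   // bitmask: which bytes have been written

    // Merge a store; the caller guarantees the store falls within this line.
    void merge(unsigned offset, const void* src, unsigned len) {
        std::memcpy(data + offset, src, len);
        for (unsigned i = 0; i < len; ++i) valid_bytes |= (1ULL << (offset + i));
    }

    // Flush: a real controller would broadcast one combined update message
    // carrying only the valid bytes, instead of one message per store.
    std::vector<uint8_t> flush() {
        std::vector<uint8_t> payload;
        for (unsigned i = 0; i < 64; ++i)
            if (valid_bytes & (1ULL << i)) payload.push_back(data[i]);
        valid_bytes = 0;
        return payload;
    }
};
```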

Update Protocol: Dragon


Dragon (developed at Xerox PARC): a 5-state protocol.
Invalid: <0,X,...,X> (some descriptions omit the Invalid state, because of the confusion between an empty frame and an invalid line)
Exclusive: <1,0,...,0,1>
Shared-Clean (Sc): <1,X,...,X> -- memory may not be up-to-date
Shared-Modified (Sm): <1,X,...,X,0> -- memory not up-to-date; only one copy may be in Sm
Modified: <1,0,...,0,0>

Includes a Cache Update action and a Cache Write-back action; the bus includes a Shared flag. It appears to also require a memory-inhibit signal, to distinguish the shared case where a cache (not memory) supplies the data.

Dragon State Diagram


Action and Next State

State I:
  Processor Read:   Cache Read; if no sharers -> E, if sharers -> Sc
  Processor Write:  Cache Read; if no sharers -> M, if sharers -> Cache Update, -> Sm
  Eviction / bus Cache Read / bus Cache Update:  no action; stay I

State E:
  Processor Read:   no action; E
  Processor Write:  no action; -> M
  Eviction:         no action; -> I
  Bus Cache Read:   respond shared; -> Sc

State Sc:
  Processor Read:   no action; Sc
  Processor Write:  Cache Update; if no sharers -> M, if sharers -> Sm
  Eviction:         no action; -> I
  Bus Cache Read:   respond shared; Sc
  Bus Cache Update: respond shared; update copy; Sc

State Sm:
  Processor Read:   no action; Sm
  Processor Write:  Cache Update; if no sharers -> M, if sharers -> Sm
  Eviction:         Cache Write-back; -> I
  Bus Cache Read:   respond shared; supply data; Sm
  Bus Cache Update: respond shared; update copy; -> Sc

State M:
  Processor Read / Write:  no action; M
  Eviction:                Cache Write-back; -> I
  Bus Cache Read:          respond shared; supply data; -> Sm

Example
Step  Thread Event   Bus Action   Data From  Global State  C0  C1  C2
0.    Initially       --           --         <0,0,0,1>     I   I   I
1.    T0 read         CR           Memory     <1,0,0,1>     E   I   I
2.    T0 write        none         --         <1,0,0,0>     M   I   I
3.    T2 read         CR           C0         <1,0,1,0>     Sm  I   Sc
4.    T1 write        CR, CU       C0         <1,1,1,0>     Sc  Sm  Sc
5.    T0 read         none (hit)   C0         <1,1,1,0>     Sc  Sm  Sc

Appears to require atomic bus cycles: CR followed by CU on a write to an invalid line.

Update Protocol: Firefly


Developed at DEC by ex-Xerox people. A 5-state protocol, similar to Dragon but with different state naming based on shared/exclusive and clean/dirty:
Invalid: <0,X,...,X>
EC (exclusive-clean): <1,0,...,0,1>
SC (shared-clean): <1,X,...,X> -- memory may not be up-to-date
EM (exclusive-modified): <1,0,...,0,0>
SM (shared-modified): <1,X,...,X,0> -- memory not up-to-date; only one copy may be in SM

Performs write-through updates (different from Dragon).

Firefly State Diagram


Action and Next State (the table was flattened in extraction; summary of the recoverable behavior)

State I: a processor read issues a Cache Read and installs the line as Ec (no sharers) or Sc (sharers); a processor write fetches the line and installs it as Em (no sharers) or, with sharers, performs a Cache Update and enters a shared state; snooped transactions require no action.
States Sc / Sm: a processor read hits locally; a processor write is written through as a Cache Update -- if no other sharer responds the line becomes exclusive, otherwise it stays shared; a snooped Cache Read is answered with "respond shared" (Sm also supplies the data); eviction from Sm requires a Cache Write-back, eviction from Sc does not.
States Ec / Em: reads and writes hit locally; a snooped Cache Read demotes the line to the corresponding shared state (Em responds shared and supplies the data); eviction from Em requires a Cache Write-back.

Update vs Invalidate
[Weber & Gupta, ASPLOS-III]
Consider sharing patterns:

No Sharing
Independent threads; coherence traffic arises only from thread migration. An update protocol performs many wasteful updates.

Read-Only
No significant coherence issues; most protocols work well.

Migratory Objects
Manipulated by one processor at a time, often protected by a lock. Usually a write causes only a single invalidation; the E state is useful for read-modify-write patterns. An update protocol could proliferate copies.

Update vs Invalidate, contd.


Synchronization Objects
Locks. An update protocol could reduce the invalidation traffic caused by spinning; Test&Test&Set with an invalidate protocol would also work well.

Many Readers, One Writer


Update protocol may work well, but writes are relatively rare

Many Writers/Readers
Invalidate probably works better Update will proliferate copies

What is used today?


Invalidate is dominant. CMPs may change this assessment, since there is more on-chip bandwidth.

Nasty Realities
The state diagram describes the (ideal) protocol assuming instantaneous actions. In reality the controller implements a more complex diagram:
a protocol state transition may already have been started by the controller when bus activity changes the local state. Example: an upgrade is pending (waiting for the bus) when an invalidate for the same line arrives.

Partial Implementation State Table


Action and Next State (the full table adds transient states IR, IW, IRR, IWR and SW to M, S, I; summary of the recoverable behavior)

From I, a processor read first requests the bus and moves to IR; when the bus is granted, the controller issues a Cache Read and moves to IRR; when the data response arrives, the line is loaded and the state becomes E (no sharers) or S (sharers). A processor write follows the analogous path I -> IW -> IWR -> M via Cache Read&M, and a write hit in S requests the bus and waits in SW before issuing the Cache Upgrade. While a request is pending in one of these transient states, snooped Cache Read / Cache Read&M / Cache Upgrade transactions must still be answered (respond shared, or respond dirty and write back data, or take no action and invalidate, as in the stable-state table).

Further Optimizations
Observation: Shared blocks should only be fetched from memory once
If a shared copy of the block is found on chip, forward the block. Problem: multiple shared copies are possible -- who forwards?
Everyone? Power/bandwidth wasted

Single forwarder, but who?


The last cache to receive the block holds it in the F state: I->F for the requestor, F->S for the previous forwarder.

What if F block is evicted?


Favor F blocks in replacement? Don't allow silent eviction (force some other node to become F). Fall back on the memory copy if no F copy can be found.

Very old idea (IBM machines have done this for a long time), but recent Intel patent issued anyway [Hum/Goodman]


Further Optimizations
Observation: migratory data often flies by
Add a T (transition) state to the protocol: the tag is still valid but the data isn't, and the data can be snarfed as it flies by. This only works with certain kinds of interconnect networks, and it raises replacement-policy issues.

Many other optimizations are possible


The literature extends 25 years back; there are many unpublished (but implemented) techniques as well.

Implementing Cache Coherence


Snooping implementation
Origins in shared-memory-bus systems: all CPUs could observe all other CPUs' requests on the bus, hence "snooping".
Bus Read, Bus Write, Bus Upgrade: invalidate shared copies; provide up-to-date copies of dirty lines,
either by flushing (writing back) to memory or by direct intervention (modified intervention on a dirty miss).

React appropriately to snooped commands

Snooping suffers from:


Scalability: shared busses are not practical at large scale, and ordering requests without a shared bus is hard. There is lots of recent and ongoing work on scaling snoop-based systems.

Snooping Cache Coherence


Basic idea: broadcast a snoop to all caches to find the owner. Not scalable?
Address traffic is roughly proportional to the square of the number of processors. Current implementations scale to 64/128-way (Sun/IBM) using multiple address-interleaved broadcast networks.

Inbound snoop bandwidth: big problem


OutboundSnoopRate = s_o = CacheMissRate + BusUpgradeRate

InboundSnoopRate = s_i = n × s_o      (n = number of processors)
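A quick worked example (my own numbers, purely illustrative) of how inbound snoop bandwidth scales with these formulas:

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    // Assumed, illustrative per-processor rates (transactions per second).
    double cache_miss_rate  = 2.0e6;
    double bus_upgrade_rate = 0.5e6;
    double s_o = cache_miss_rate + bus_upgrade_rate;   // outbound snoop rate

    // Every cache must absorb the snoops generated by all n processors, so the
    // inbound rate grows linearly in n (and total snoop traffic as n squared).
    for (int n : {4, 16, 64}) {
        double s_i = n * s_o;
        std::printf("n = %2d: inbound snoop rate = %.1f M snoops/s per cache\n",
                    n, s_i / 1e6);
    }
}
```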

Snoop Bandwidth

Snoop filtering of various kinds is possible.
Filter snoops at the sink: Jetty filter [Moshovos et al., HPCA 2001] -- check a small filter cache that summarizes the contents of the local cache, avoiding power-hungry lookups in each tag array.

Filter snoops at source: Multicast snooping [Bilir et al., ISCA 1999]


Predict likely sharing set, snoop only predicted sharers Double-check at directory to make sure

Filter snoops at source: Region coherence


Concurrent work: [Cantin/Smith/Lipasti, ISCA 2005; Moshovos, ISCA 2005]. Track a larger region of memory on every snoop and remember when it has no sharers; then snoop only on the first reference to a region, or when the region is shared. This eliminates 60%+ of all snoops.

Snoop Latency

Snoop latency:
The snoop must reach all nodes, and the responses must return and be combined. This is probably the bigger scalability issue in the future. Topology matters (ring, mesh, torus, hypercube); there are no obvious solutions.

Parallelism: fundamental advantage of snooping

Broadcast exposes parallelism, enables speculative latency reduction

[Timing diagram: a request passes through LDir, XSnp, RDir, XRsp, CRsp and XRd stages, after which the RDat/XDat/UDat data returns from the several sources proceed in parallel.]

Scalable Cache Coherence


Eschew physical bus but still snoop
Point-to-point tree structure (indirect) or ring; the root of the tree or the ring provides the ordering point. Use some scalable network for the data (ordering is less important there).

Or, use level of indirection through directory


Directory at memory remembers:
which processor is the single writer, and which processors are shared readers.

Level of indirection has a price


Dirty misses require 3 hops instead of 2:
Snoop: Requestor -> Owner.   Directory: Requestor -> Directory -> Owner.

Implementing Cache Coherence


Directory implementation
Extra bits stored in memory (the directory) record the state of each line; the memory controller maintains coherence based on the current state. Other CPUs' commands are not snooped; instead:
Directory forwards relevant commands

This has a powerful filtering effect: a cache only observes the commands it needs to observe. Meanwhile, the bandwidth at the directory scales by adding memory controllers as the system grows.
Leads to very scalable designs (100s to 1000s of CPUs)

Directory shortcomings
Indirection through the directory has a latency penalty: if a shared line is dirty in another CPU's cache, the directory must forward the request, adding latency. This can severely impact the performance of applications with heavy sharing (e.g. relational databases).

Directory Protocol Implementation


Basic idea: a centralized directory keeps track of the data location(s). Scalable:
- Address traffic is roughly proportional to the number of processors.
- The directory and its traffic can be distributed (interleaved) with the memory banks.
- Directory cost (SRAM) or latency (DRAM) can be prohibitive.

Presence bits track sharers:
- Full map (N processors, N bits): cost/scalability issues.
- Limited map (limits the number of sharers).
- Coarse map (identifies a board/node/cluster; must use broadcast within it).

Vectors track sharers:
- Point to the shared copies: a fixed number of pointers, or linked lists (SCI) chaining the caches together.
- Latency vs. cost vs. scalability trade-offs.
A sketch of the storage-overhead calculation follows.
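A quick illustrative calculation (assumed parameters, not from the slides) of why a full-map directory's cost grows with the processor count: the overhead is N presence bits (plus a dirty bit) per memory block.

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    const double block_bytes = 64.0;         // assumed cache-line size
    for (int n : {16, 64, 256, 1024}) {      // number of processors
        // Full map: n presence bits + 1 dirty bit per memory block.
        double dir_bits_per_block = n + 1.0;
        double overhead = dir_bits_per_block / (block_bytes * 8.0);
        std::printf("N = %4d processors: directory overhead = %.1f%% of memory\n",
                    n, overhead * 100.0);
    }
}
```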

Directory Protocol Latency

[Timing diagram: LDir, XSnp, RDir and XRd stages followed by the RDat, XDat, UDat data return.]

Access to non-shared data:
Overlap the directory read with the data read -- the best possible latency given the NUMA arrangement.

Access to shared data (dirty miss, modified intervention; shared intervention?):
No inherent parallelism; the indirection adds latency (minimum 3 hops, often 4 hops). With a DRAM directory there is no gain; with a directory cache there is a possible gain (use the F state).

Directory-based Cache Coherence


An alternative for large, scalable MPs. It can be based on any of the protocols discussed thus far; we will use MSI.

[Figure: each processor with its cache connects through an interconnection network to a set of memory modules, each with its own directory.]

The memory controller becomes an active participant: the sharing information is held in the memory directory (which may be distributed). Point-to-point messages are used; the network is not totally ordered.

Example: Simple Directory Protocol


Local cache controller states
M, S, I as before

Local directory states


Shared: <1,X,X,1> -- one or more processors have a copy; memory is up-to-date.
Modified: <0,1,0,0> (exactly one 1 among the caches) -- one processor has a copy; memory does not have a valid copy.
Uncached: <0,0,0,1> -- none of the processors has a valid copy.

Directory also keeps track of sharers


The full global state vector can be kept, e.g. via a bit vector of sharers.

Example
A local cache suffers a load miss; the line is held in a remote cache in the M state (that cache is the owner).

Four messages are sent over the network:
1. Cache read, from the local controller to the home memory controller.
2. Memory read, from the memory controller to the remote (owner) cache controller.
3. Owner data, back to the memory controller; the owner changes its state to S.
4. Memory data, back to the local cache, which installs the line in state S.
A sketch of this exchange follows.

[Figure: the local, owner and other remote cache controllers sit above the interconnect; the cache read, memory read, owner data response and memory data response messages flow between the local controller, the home memory controller/directory and the owner controller.]
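A compact sketch (illustrative, with assumed message names) of the four-message exchange above, written as a small enum-driven trace; a real controller would overlap these steps with the state updates at each hop.

```cpp
#include <cstdio>

enum class Msg { CacheRead, MemoryRead, OwnerData, MemoryData };

const char* describe(Msg m) {
    switch (m) {
    case Msg::CacheRead:  return "local cache -> home directory: cache read";
    case Msg::MemoryRead: return "home directory -> owner cache: memory read";
    case Msg::OwnerData:  return "owner cache -> home directory: owner data (owner M -> S)";
    case Msg::MemoryData: return "home directory -> local cache: memory data (local I -> S)";
    }
    return "";
}

int main() {
    // Load miss to a line that is Modified in a remote cache (4-hop protocol).
    Msg sequence[] = {Msg::CacheRead, Msg::MemoryRead, Msg::OwnerData, Msg::MemoryData};
    for (Msg m : sequence) std::printf("%s\n", describe(m));
}
```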

Cache Controller State Diagram


Intermediate states are added for clarity: I' (read miss outstanding), I'' (write miss outstanding) and S' (upgrade outstanding).

Cache controller actions and next states, from the processor side:
  I: Processor Read -> send Cache Read; go to I'.   Processor Write -> send Cache Read&M; go to I''.
  S: Processor Read -> no action; S.   Processor Write -> send Cache Upgrade; go to S'.   Eviction -> no action; I.
  M: Processor Read / Write -> no action; M.   Eviction -> Cache Write-back; -> I.

From the memory side:
  Memory Invalidate: in I -> no action; in S or M -> invalidate frame; send Cache ACK; -> I.
  Memory Read (in M): send Owner Data; -> S.   Memory Read&M (in M): send Owner Data; -> I.
  Memory Data: in I' -> fill cache; -> S; in I'' -> fill cache; -> M.
  Memory Upgrade (in S'): -> M.

Memory Controller State Diagram


Memory Controller Actions and Next States (directory states U, S, M plus transient S', M', M''):

Commands from the local cache controller:
  U: Cache Read -> send Memory Data; add requestor to sharers; -> S.   Cache Read&M -> send Memory Data; add requestor to sharers; -> M.
  S: Cache Read -> send Memory Data; add requestor to sharers; S.   Cache Read&M -> send Memory Invalidate to all sharers; -> M'.   Cache Upgrade -> send Memory Upgrade/Invalidate to all other sharers; -> M''.
  M: Cache Read -> send Memory Read to the owner; -> S'.   Cache Read&M -> send Memory Read&M to the owner; -> M'.

Responses from remote cache controllers:
  M: Data Write-back (owner eviction) -> write memory; make the sharer set empty; -> U.
  S': Owner Data -> send Memory Data to the requestor; write memory; add requestor to sharers; -> S.
  M': when all Cache ACKs (or the Owner Data) have arrived -> send Memory Data to the requestor; -> M.
  M'': when all Cache ACKs have arrived -> M.

Another Example
A local write (miss) to a shared line requires invalidations and acknowledgements.

[Figure: the local controller sends a cache Read&M to the home memory controller; the home controller sends memory invalidate messages to the remote controllers holding the line, collects their cache ACKs, and then returns the memory data response to the local controller.]

Example Sequence
Similar to earlier sequences
Step  Thread Event   Controller Actions   Data From  Global State  C0  C1  C2
0.    Initially       --                   --         <0,0,0,1>     I   I   I
1.    T0 read         CR, MD               Memory     <1,0,0,1>     S   I   I
2.    T0 write        CU, MU*, MD          --         <1,0,0,0>     M   I   I
3.    T2 read         CR, MR, MD           C0         <1,0,1,1>     S   I   S
4.    T1 write        CRM, MI, CA, MD      Memory     <0,1,0,0>     I   M   I

Variation: Three Hop Protocol


Have the owner send the data directly to the local controller, with the owner's ACK going to the memory controller in parallel.

[Figure: (a) four-hop version -- cache read to the memory controller, memory read to the owner, owner data back to the memory controller, memory data to the local controller; (b) three-hop version -- cache read to the memory controller, memory read to the owner, then owner data sent directly to the local controller while an owner ACK goes to the memory controller.]

Directory Protocol Optimizations


Remove dead blocks from the cache (to eliminate the 3- or 4-hop latency):
Dynamic Self-Invalidation [Lebeck/Wood, ISCA 1995]; Last-touch prediction [Lai/Falsafi, ISCA 2000]; Dead-block prediction [Lai/Fide/Falsafi, ISCA 2001].

Predict sharers:
Prediction in coherence protocols [Mukherjee/Hill, ISCA 1998]; Instruction-based prediction [Kaxiras/Goodman, ISCA 1999]; Sharing prediction [Lai/Falsafi, ISCA 1999].

Hybrid snooping/directory protocols (improve latency by snooping, conserve bandwidth with the directory):
Multicast snooping [Bilir et al., ISCA 1999; Martin et al., ISCA 2003]; Bandwidth-adaptive hybrid [Martin et al., HPCA 2002]; Token Coherence [Martin et al., ISCA 2003]; VCT multicasting [Enright Jerger thesis, 2008].

Protocol Races
Atomic bus
Only stable states exist in the protocol (e.g. M, S, I) and all state transitions are atomic (I->M); no conflicting requests can interfere, since the bus is held until the transaction completes.
Distinguish the coherence transaction from the data transfer: the data transfer can still occur much later, and that case is easier to handle.

Atomic buses don't scale


At minimum, separate bus request/response

Large systems have broadly variable delays


Requests and responses are separated by dozens of cycles, conflicting requests can and do get issued, and messages may get reordered in the interconnect.

How do we resolve them?

Resolving Protocol Races


Req/resp decoupling introduces transient states
E.g. the transition I->S becomes I -> ItoX -> ItoS_nodata -> S. A small sketch follows.
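A minimal sketch (my own illustration; the transient-state names follow the example above) of how a controller tracks a read miss through transient states once requests and responses are decoupled:

```cpp
#include <stdexcept>

// Stable states plus the transient states from the example above.
enum class LineState { I, ItoX, ItoS_nodata, S, M };

// Processor read miss on an Invalid line: issue the request and wait.
LineState start_read_miss(LineState s) {
    if (s != LineState::I) throw std::logic_error("read miss only from I");
    return LineState::ItoX;          // request sent, nothing received yet
}

// The coherence response (ordering/ack) may arrive before the data message.
LineState on_coherence_response(LineState s) {
    return (s == LineState::ItoX) ? LineState::ItoS_nodata : s;
}

// Only when the data itself arrives does the line become stable Shared.
LineState on_data_response(LineState s) {
    return (s == LineState::ItoS_nodata) ? LineState::S : s;
}
```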

Conflicting requests to blocks in transient states


NAKs are ugly (livelock and starvation potential); the alternative is to keep adding more transient states.

Directory protocol makes this a bit easier


Requests can be ordered at the directory, which has the full state information; even so, messages may get reordered.

Common Protocol Races


Read strings: P0 read, P1 read, P2 read
Easy, since reads are nondestructive; rely on the F state to reduce DRAM accesses by forwarding reads to the previous requestor (F).

Write strings: P0 write, P1 write, P2 write


Forward P1's write request to P0 (which holds M); P0 completes its write and then forwards the M block to P1, building a string of writes (write-string forwarding).

Read after write: similar to the write-after-write case above.
Write-back race: P0 evicts a dirty block while P1 reads it -- the dirty block is in the network (no copy at P0 or at the directory). Either NAK P1, or force P0 to keep its copy until the directory ACKs the write-back.

Many other races crop up, especially with optimizations.

Summary
Coherence states; snoopy bus-based invalidate protocols; invalidate protocol optimizations; update protocols (Dragon/Firefly); directory protocols; implementation issues.

