Making it real.
Cache miss.
Cache Deployment

Problem: Memory access times (IF, MEM stages) limit the Beta clock speed.
Solution: Use a cache for both instruction fetches and data accesses:

[Diagram: pipeline IF -> RF -> ALU -> MEM -> WB, with the IF and MEM stages accessing the Cache, which sits in front of MAIN MEMORY]
- Assume the HIT time for the pipeline clock period.
- STALL the pipe on misses.
Variants:
- Separate I & D caches
- Level 1 cache on the CPU chip, L2 cache off-chip
6.004 Fall 97... L36 11/6 2
ON REFERENCE TO <X>: Look for X among tags...

HIT: X = TAG(i), for some cache line i

MISS: X not found in the TAG of any cache line
  REPLACEMENT SELECTION: select some line k to hold <X>
  READ: Read Mem(X); set TAG(k)=X, DATA(k)=<X>
  WRITE: Start Write to Mem(X); set TAG(k)=X, DATA(k)=new <X>
SAW 8/7/00 10:06
Write Strategy
Sure, we update the CACHE on write... do we really need to update main MEMORY?
Line Size
BLOCKS of data words, sharing a single TAG (like increasing word size)
[Figure: tiny loop with instruction fetches (I) at addresses 400, 404, 408 and data accesses (D) at 4000, 4004, ...]
GOAL: Given some cache design, simulate (by hand or machine) execution well enough to determine the hit ratio.

Simplifications:
1. Observe that the sequence of memory locations referenced is 400, 4000, 404, 408, 400, 4004, ... We can use this simpler reference string, rather than the program, to simulate cache behavior.
2. We can make our life easier in many cases by converting to word addresses: 100, 1000, 101, 102, 100, 1001, ...
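Step 2's conversion can be sketched in a few lines of Python (my own illustration; the addresses come from the slide, and the 4-byte word size is the Beta's):

```python
# Convert the byte-address reference string to word addresses by
# dividing out the word size.
WORD_SIZE = 4  # bytes per word on the Beta

byte_refs = [400, 4000, 404, 408, 400, 4004]
word_refs = [addr // WORD_SIZE for addr in byte_refs]
print(word_refs)  # [100, 1000, 101, 102, 100, 1001]
```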
Cache Simulation...
the hard way
Consider our tiny loop with a 4-word Direct Mapped cache:
Adr    100 1000  101  102  100 1001  101  102  100 1002  101  102  100 1003  101  102  ...
Line#    0    0    1    2    0    1    1    2    0    2    1    2    0    3    1    2
Miss?    M    M    M    M    M    M    M              M         M         M

1st iteration: all misses (filling the cache!). After that, contention between <PC> (instructions at 100-102) and low array elements (1000-1003) mapping to the same lines causes the remaining misses: 10 misses in 16 references.
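The hand simulation can be checked by machine. Here is a minimal simulator sketch (my own code, not from the lecture; `simulate_direct_mapped` is an assumed name), modeling the slide's 4-word direct-mapped cache:

```python
# Simulate a direct-mapped cache on a word-address reference string.
def simulate_direct_mapped(refs, n_lines):
    tags = [None] * n_lines          # one TAG per cache line; None = empty
    misses = []
    for addr in refs:
        line = addr % n_lines        # low address bits select the line
        hit = (tags[line] == addr)
        if not hit:
            tags[line] = addr        # replace the line on a miss
        misses.append(not hit)
    return misses

# Four iterations of the loop's reference string, as on the slide:
refs = [100, 1000, 101, 102, 100, 1001, 101, 102,
        100, 1002, 101, 102, 100, 1003, 101, 102]
misses = simulate_direct_mapped(refs, n_lines=4)
print(sum(misses), "misses out of", len(refs))  # 10 misses out of 16
```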
CPU
A B
<A> <B>
MAIN MEMORY
ON REFERENCE TO <X>: Look for X among tags... HIT: X = TAG(i) , for some cache line i
REPLACEMENT SELECTION:
Least Recently Used, First-In-FirstOut (FIFO), Random BAD ALGORITHMS: Stack, Most Recently Used
FIFO (LRR) Replacement

A cheap alternative to LRU: maintain a counter for each associative set, cycling through the N alternative lines. Replace lines in order (the next to replace is the Least Recently Replaced).

PRO: O(log N) hardware cost!
CON: occasionally replaces a recently used line.

Empirical result: for sizable caches, works nearly as well as LRU!

Variant (even cheaper): a single counter for ALL sets!
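The per-set counter can be sketched as follows (hypothetical class name, my own illustration; each counter needs only log2(N) bits, which is the O(log N) hardware cost):

```python
# LRR victim selection: one small counter per associative set,
# cycling through the N ways in order.
class LRRSelector:
    def __init__(self, n_sets, n_ways):
        self.n_ways = n_ways
        self.counter = [0] * n_sets   # next line to replace in each set

    def victim(self, set_index):
        way = self.counter[set_index]
        # Advance the counter: the next victim is the Least Recently Replaced.
        self.counter[set_index] = (way + 1) % self.n_ways
        return way

sel = LRRSelector(n_sets=4, n_ways=2)
print([sel.victim(0) for _ in range(4)])  # [0, 1, 0, 1]
```

The even-cheaper variant replaces the `counter` list with a single shared counter for all sets.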
LRU vs LRR
Consider our tiny loop with an 8-word (total), 2-way set-associative cache:
LRU

Adr    100 1000  101  102  100 1001  101  102  100 1002  101  102  100 1003  101  102  100 1004  101  102  100
Set#     0    0    1    2    0    1    1    2    0    2    1    2    0    3    1    2    0    0    1    2    0
Miss?    M    M    M    M         M              M              M              M

(8 misses: on the miss at 1004, LRU evicts <1000>, the least recently used line in set 0, so the final reference to 100 still HITs.)

LRR

Adr    100 1000  101  102  100 1001  101  102  100 1002  101  102  100 1003  101  102  100 1004  101  102  100
Set#     0    0    1    2    0    1    1    2    0    2    1    2    0    3    1    2    0    0    1    2    0
Miss?    M    M    M    M         M              M              M              M              M

(9 misses: on the miss at 1004, LRR's set-0 counter selects the line holding <100>, so the final reference to 100 MISSes and evicts <1000>.)
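A small simulation reproduces the LRU-vs-LRR comparison (my own sketch, not lecture code; the cache model follows the slides' 8-word, 2-way, 4-set configuration):

```python
# Compare LRU and LRR replacement in a 2-way set-associative cache.
def simulate_2way(refs, n_sets, policy):
    ways = [[None, None] for _ in range(n_sets)]   # tags per set
    lru  = [[0, 1] for _ in range(n_sets)]          # way usage order; index 0 = LRU
    lrr  = [0] * n_sets                             # per-set LRR counters
    miss_count = 0
    for addr in refs:
        s = addr % n_sets
        if addr in ways[s]:
            w = ways[s].index(addr)                 # HIT
        else:
            miss_count += 1                         # MISS: pick a victim way
            if policy == "LRU":
                w = lru[s][0]
            else:                                   # "LRR"
                w = lrr[s]
                lrr[s] = (w + 1) % 2
            ways[s][w] = addr
        lru[s].remove(w); lru[s].append(w)          # mark way w most recently used
    return miss_count

# The loop's 21-reference string from the slide:
refs = [100, 1000, 101, 102] + \
       [a for k in range(1, 5) for a in (100, 1000 + k, 101, 102)] + [100]
print(simulate_2way(refs, 4, "LRU"), simulate_2way(refs, 4, "LRR"))  # 8 9
```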
Random Replacement
Observations:
1. Reasonable replacement algorithms perform about the same for sizable caches; many are pseudo-LRU schemes (like LRR).
2. For every cache design, there are pathologically bad cases... Depending on your tendencies, you may worry about:
   - Paranoid: those designed by your enemies
   - Pessimist: Murphy's-law accidents of address assignment

Bizarre proposal: select a line for replacement RANDOMLY!

This mitigates:
- Paranoia, by frustrating any attempt to systematically design a worst-case scenario;
- Chronic pessimism, by smoothing out the performance curve: at least you won't CONSISTENTLY get worst-case performance with your favorite program!
VALID bits
[Diagram: cache with V (valid), TAG, and DATA columns between the CPU and MAIN MEMORY; only lines with V=1 (e.g., tags A and B holding <A> and <B>) can HIT]
Problem: ignoring cache lines that don't contain anything of value... e.g., on
- start-up
- back-door changes to memory (e.g., loading a program from disk)

Solution: extend each TAG with a VALID bit. The valid bit must be set for a cache line to HIT.
- At power-up / reset: clear all valid bits.
- Set a line's valid bit when it is first filled (replaced).

Cache control feature: flush the cache by clearing all valid bits, under program/external control.
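The valid-bit rules can be sketched as follows (hypothetical `Cache` class, my own illustration of the rules above):

```python
# Direct-mapped lookup gated by a VALID bit, plus a flush operation.
class Cache:
    def __init__(self, n_lines):
        self.valid = [False] * n_lines   # cleared at power-up / reset
        self.tag = [0] * n_lines         # garbage until the line is filled

    def hit(self, addr):
        i = addr % len(self.tag)
        return self.valid[i] and self.tag[i] == addr

    def fill(self, addr):
        i = addr % len(self.tag)
        self.tag[i] = addr
        self.valid[i] = True             # line becomes valid on first fill

    def flush(self):                     # e.g., after a back-door disk load
        self.valid = [False] * len(self.valid)

c = Cache(4)
assert not c.hit(0)   # without the VALID bit, the garbage tag 0 would falsely HIT
c.fill(100)
assert c.hit(100)
c.flush()
assert not c.hit(100)
```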
Handling of WRITES

Observation: Most (90+%) of memory accesses are READs. How should we handle writes?

Issues:
- Write-behind: writes to main memory may be buffered, perhaps pipelined.
- Write-through: CPU writes are cached, but also written to main memory. Memory always holds the truth.
- Write-back: CPU writes are cached, but not immediately written to main memory. Memory contents can be stale.

Our cache thus far uses _______________. Can we improve write performance?
WRITE BACK
[Diagram: write-back cache with V, TAG, and DATA columns between the CPU and MAIN MEMORY (tags A and B holding <A> and <B>)]
ON REFERENCE TO <X>: Look for X among tags...

HIT: X = TAG(i), for some cache line i
  READ: return DATA(i)
  WRITE: change DATA(i); Start Write to Mem(X)

MISS: X not found in the TAG of any cache line
  REPLACEMENT SELECTION: select some line k to hold <X>
  Write Back: Write DATA(k) to Mem(TAG(k))
  READ: Read Mem(X); set TAG(k)=X, DATA(k)=<X>
  WRITE: Start Write to Mem(X); set TAG(k)=X, DATA(k)=new <X>

Is write-back worth the trouble? It depends on (1) the cost of writes and (2) consistency issues.
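On point (1), the cost of writes: a toy comparison (my own sketch, not from the lecture) of how many main-memory writes each policy performs when the CPU writes the same cached word repeatedly:

```python
# Main-memory write traffic for N CPU writes to one cached word.
def memory_writes(n_cpu_writes, policy):
    if policy == "write-through":
        return n_cpu_writes   # every CPU write also goes to memory
    elif policy == "write-back":
        return 1              # a single write-back when the line is evicted
    raise ValueError(policy)

print(memory_writes(1000, "write-through"))  # 1000
print(memory_writes(1000, "write-back"))     # 1
```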
Write Back, with a DIRTY bit

[Diagram: write-back cache with V, D (dirty), TAG, and DATA columns between the CPU and MAIN MEMORY (tags A and B holding <A> and <B>)]
ON REFERENCE TO <X>: Look for X among tags...

HIT: X = TAG(i), for some cache line i
  READ: return DATA(i)
  WRITE: change DATA(i); set D(i)=1

MISS: X not found in the TAG of any cache line
  REPLACEMENT SELECTION: select some line k to hold <X>
  IF D(k)=1 (Write Back): Write DATA(k) to Mem(TAG(k))
  READ: Read Mem(X); set TAG(k)=X, DATA(k)=<X>, D(k)=0
  WRITE: Set TAG(k)=X, DATA(k)=new <X>, D(k)=1
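The dirty-bit bookkeeping can be sketched as a toy direct-mapped model (class and method names are my own, not from the lecture): memory is only written when a dirty line is evicted.

```python
# Direct-mapped write-back cache with a DIRTY bit per line.
class WriteBackCache:
    def __init__(self, n_lines, memory):
        self.n = n_lines
        self.tag = [None] * n_lines
        self.data = [None] * n_lines
        self.dirty = [False] * n_lines
        self.mem = memory                          # dict: address -> value

    def _evict_and_claim(self, addr, k):
        if self.dirty[k]:                          # IF D(k)=1: Write Back
            self.mem[self.tag[k]] = self.data[k]
        self.tag[k] = addr
        self.dirty[k] = False

    def read(self, addr):
        k = addr % self.n
        if self.tag[k] != addr:                    # MISS
            self._evict_and_claim(addr, k)
            self.data[k] = self.mem[addr]          # Read Mem(X); D(k)=0
        return self.data[k]

    def write(self, addr, value):
        k = addr % self.n
        if self.tag[k] != addr:                    # MISS
            self._evict_and_claim(addr, k)
        self.data[k] = value                       # change DATA(k)
        self.dirty[k] = True                       # D(k)=1; memory now stale

mem = {100: 7, 104: 11}        # word addresses 100 and 104 share line 0
cache = WriteBackCache(4, mem)
cache.write(100, 42)           # cached only; no memory traffic yet
print(mem[100])                # 7  (memory is stale)
cache.read(104)                # conflict miss: dirty line 0 is written back
print(mem[100])                # 42
```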
Multiple data per cache line

IDEA: TAG storage is expensive, so share each tag among several consecutive data words...

[Diagram: cache with V, D, TAG, and two DATA words per line (e.g., tag A holding <A> and <A+1>, tag B holding <B> and <B+1>), between the CPU and MAIN MEMORY; the ADDRESS from the CPU is split into tag, line, and word-offset fields]
Blocked Caches

- BLOCKS of 2 words, on 2-word boundaries.
- ALWAYS read/write 2-word blocks from/to memory.
- LOCALITY: if you access one word in a block, you are likely to access the others.
- COST: some redundant fetches.
- BIG WIN if there is a wide path to memory (e.g., interleaved).
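With 2-word blocks, the word address splits into tag, line, and block-offset fields. A sketch with assumed field widths (2-word blocks and 4 lines; my own illustration, not lecture code):

```python
# Split a word address into (tag, line, offset) for a blocked cache.
BLOCK_BITS = 1   # 2 words per block
LINE_BITS  = 2   # 4 lines

def split(addr):
    offset = addr & ((1 << BLOCK_BITS) - 1)             # word within block
    line   = (addr >> BLOCK_BITS) & ((1 << LINE_BITS) - 1)
    tag    = addr >> (BLOCK_BITS + LINE_BITS)
    return tag, line, offset

# Consecutive words 100 and 101 share a block: same tag and line,
# different offsets, so one TAG covers both.
print(split(100), split(101))  # (12, 2, 0) (12, 2, 1)
```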
Replacement Strategy:
- BIG caches: any sane approach works well.
- REAL randomness assuages paranoia!

Performance analysis:
- Tedious hand synthesis may build intuition from simple examples, BUT
- computer simulation of cache behavior on REAL programs (or using REAL trace data) is the basis for most real-world cache design decisions.

YOU'LL DO THIS IN THE LAB!