
Cache Design: Making it Real

Why the book? Lecture notes are faster!
... Cache miss.


Cache Deployment

Problem: Memory access times (IF, MEM) limit the Beta clock speed.

Solution: Use a cache for both instruction fetches and data accesses:

  [diagram: IF - RF - ALU - MEM - WB pipeline, with a cache between the
   pipeline and MAIN MEMORY]

- Assume the HIT time for the pipeline clock period;
- STALL the pipe on misses.

Variants:
- Separate I & D caches
- Level 1 cache on the CPU chip, L2 cache off-chip.

Basic Cache Operation


  [diagram: CPU <-> cache (TAG | DATA lines, e.g. A -> <A>, B -> <B>)
   <-> MAIN MEMORY]

ON REFERENCE TO <X>: look for X among the tags...

HIT: X = TAG(i), for some cache line i
  READ:  return DATA(i)
  WRITE: change DATA(i); start write to Mem(X)

MISS: X not found in the TAG of any cache line

REPLACEMENT SELECTION:
  Select some line k to hold <X>
  READ:  read Mem(X); set TAG(k)=X, DATA(k)=<X>
  WRITE: start write to Mem(X); set TAG(k)=X, DATA(k)= new <X>
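
The same flow in a minimal Python sketch (illustrative, not the actual hardware): a fully associative, write-through cache modeled as a tag-to-data map, with `mem` a dict standing in for main memory. The victim choice is a placeholder, since replacement strategy is covered later.

```python
# Sketch of the reference flow above: fully associative, write-through.
class Cache:
    def __init__(self, nlines):
        self.nlines = nlines
        self.lines = {}                      # TAG -> DATA

    def access(self, mem, x, write=False, value=None):
        if x in self.lines:                  # HIT: X == TAG(i)
            if write:
                self.lines[x] = value        # change DATA(i)
                mem[x] = value               # start write to Mem(X)
            return self.lines[x]
        # MISS: select some line k to hold <X>
        if len(self.lines) == self.nlines:
            victim = next(iter(self.lines))  # placeholder replacement
            del self.lines[victim]
        if write:
            mem[x] = value                   # start write to Mem(X)
            self.lines[x] = value            # TAG(k)=X, DATA(k)=new <X>
        else:
            self.lines[x] = mem[x]           # read Mem(X) into line k
        return self.lines[x]
```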

Review: Associativity

N-line fully associative:
  [diagram] Compare adr with each tag simultaneously.

Non-associative (direct mapped):
  [diagram] Compare adr with only 1 tag.

N-way set associative:
  [diagram] Compare adr with N tags simultaneously.

(This is called a ____________!)
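
To make the three organizations concrete, here is a small sketch of which lines a given word address may occupy under each one; NLINES and N are illustrative parameters, not values from the slide.

```python
# Which cache lines may hold address `adr` under each organization?
NLINES, N = 8, 2                     # 8 lines total, 2 ways (illustrative)

def fully_associative(adr):          # compare adr with every tag
    return list(range(NLINES))

def direct_mapped(adr):              # exactly one candidate line
    return [adr % NLINES]

def n_way_set_associative(adr):      # N candidates within one set
    nsets = NLINES // N              # here: 4 sets of 2 lines
    s = adr % nsets                  # set index from low address bits
    return [s * N + i for i in range(N)]
```

Note that direct mapped is the degenerate 1-way case, and fully associative is the NLINES-way case.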



Additional Cache Design Issues

Replacement strategy:
  On a MISS, how do we choose the cache line whose contents will be
  replaced? Only applies to _________________ caches!

Write strategy:
  Sure, we update the CACHE on a write... but do we really need to update
  main MEMORY?

Validity of cache contents:
  Does a cache line ALWAYS contain valid data?

Line size:
  BLOCKS of data words sharing a single TAG (like increasing the word
  size).

Cache Benchmarking

Suppose this loop is entered with <R3>=4000 and <R2>=0:

  ADR   Instruction       I-refs   D-refs
  400:  LD(R3,0,R0)       400      4000+...
  404:  ADDC(R3,4,R3)     404
  408:  BNE(R0,400)       408

GOAL: Given some cache design, simulate (by hand or machine) execution
well enough to determine the hit ratio.

Simplifications:
1. The sequence of memory locations referenced is 400, 4000, 404, 408,
   400, 4004, ... We can use this simpler reference string, rather than
   the program itself, to simulate cache behavior.
2. We can make our life easier in many cases by converting to word
   addresses: 100, 1000, 101, 102, 100, 1001, ...
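
For machine simulation, the word-address reference string can be generated directly. A sketch (the iteration count is an arbitrary parameter):

```python
# Generate the word-address reference string for the loop above:
# I-fetch 100 (LD), data 1000+i, I-fetch 101 (ADDC), I-fetch 102 (BNE).
def reference_string(iterations):
    refs = []
    for i in range(iterations):
        refs += [100, 1000 + i, 101, 102]
    return refs

print(reference_string(2))   # [100, 1000, 101, 102, 100, 1001, 101, 102]
```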

Cache Simulation... the Hard Way

Consider our tiny loop with a 4-word direct-mapped cache:

  Adr:    100 1000 101 102 | 100 1001 101 102 | 100 1002 101 102 | 100 1003 101 102 | ...
  Line#:    0    0   1   2 |   0    1   1   2 |   0    2   1   2 |   0    3   1   2 |
  Miss?:    M    M   M   M |   M    M   M   . |   .    M   .   M |   .    M   .   . |

1st iteration: all misses (filling the cache!), plus contention between
<PC> and the low array elements...

Equilibrium: all data fetches miss, plus 1/4 of the instruction fetches
(due to conflict).

Long-loop hit ratio: about 9/16.
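
The same bookkeeping is easy to hand to a machine. A sketch of the 4-word direct-mapped simulation (line index = word address mod 4), which converges on the 9/16 figure:

```python
# Simulate a 4-word direct-mapped cache on the loop's reference string.
def simulate_direct_mapped(refs, nlines=4):
    tags, hits = [None] * nlines, 0
    for adr in refs:
        line = adr % nlines          # direct mapped: one candidate line
        if tags[line] == adr:
            hits += 1
        else:
            tags[line] = adr         # MISS: overwrite the line's tag
    return hits / len(refs)

refs = [a for i in range(1000) for a in (100, 1000 + i, 101, 102)]
print(simulate_direct_mapped(refs))  # prints ~0.562, i.e. about 9/16
```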



ISSUE: Replacement Strategy

  [diagram: CPU <-> cache (TAG | DATA) <-> MAIN MEMORY, as before]

ON REFERENCE TO <X>: look for X among the tags...

HIT: X = TAG(i), for some cache line i
  READ:  return DATA(i)
  WRITE: change DATA(i); start write to Mem(X)

MISS: X not found in the TAG of any cache line

REPLACEMENT SELECTION:
  Select some line k to hold <X>...

  GOOD ALGORITHMS: Least Recently Used (LRU), First-In-First-Out (FIFO),
                   Random
  BAD ALGORITHMS:  Stack, Most Recently Used

LRU Replacement

Intuitive LOCALITY argument:
  GOAL: Keep the N most recently used locations in the cache; they're
  likely to be used again soon.

Hence the replacement policy: replace the least recently used cached
location.

Empirical performance: GOOD (locality at work!)

LRU implementation: Consider an N-line fully associative (or N-way set
associative) cache with LRU replacement. LRU replacement REQUIRES that we
remember the order in which all N lines were last accessed: there are N!
orderings, so we need O(log N!) = O(N log N) LRU bits, plus complex logic.

Is there a cheaper LRU approximation?
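
For intuition (software, not hardware), a sketch of the bookkeeping LRU demands for one fully associative set; the Python list plays the role of the O(N log N) ordering state, most recently used first.

```python
# LRU bookkeeping for one N-line set: `order` holds tags from most to
# least recently used; the list IS the ordering the hardware must store.
def lru_access(order, tag, nlines):
    if tag in order:            # HIT: promote to most-recently-used
        order.remove(tag)
        order.insert(0, tag)
        return True
    if len(order) == nlines:    # MISS on a full set: evict the LRU tag
        order.pop()
    order.insert(0, tag)
    return False

order = []
for t in (100, 1000, 100, 2000):   # 2-way set: 2000 evicts LRU tag 1000
    lru_access(order, t, 2)
print(order)                        # [2000, 100]
```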

FIFO (LRR) Replacement

Cheap alternative to LRU: Maintain a counter for each associative set,
cycling through its N lines. Replace lines in order: the next line to be
replaced is the Least Recently Replaced.

PRO: O(log N) hardware cost!
CON: Occasionally replaces a recently used line.

Empirical result: For sizable caches, FIFO works nearly as well as LRU!

Variant (even cheaper): a single counter for ALL sets!

LRU vs LRR

Consider our tiny loop with an 8-word (total), 2-way set-associative
cache: 4 sets of 2 lines, set index = word address mod 4.

  Adr:   100 1000 101 102 | 100 1001 101 102 | 100 1002 101 102 | 100 1003 101 102 | 100 1004 101 102 | 100 ...
  Set#:    0    0   1   2 |   0    1   1   2 |   0    2   1   2 |   0    3   1   2 |   0    0   1   2 |   0
  LRU:     M    M   M   M |   .    M   .   . |   .    M   .   . |   .    M   .   . |   .    M   .   . |   .
  LRR:     M    M   M   M |   .    M   .   . |   .    M   .   . |   .    M   .   . |   .    M   .   . |   M

(Under LRR, 1004 replaces the least recently REPLACED word in set 0,
which is the instruction word 100, so the next fetch of 100 misses;
under LRU, 1004 replaces 1000 and 100 stays resident.)

LRU gets _____%, LRR gets about 70%.
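
A machine check of this comparison; a sketch in which one Python list per set stands in for both policies' state (oldest entry first), so evicting the head is the LRU tag under LRU and the least recently replaced tag under LRR.

```python
# Compare LRU and LRR on the loop: 8 words total = 4 sets x 2 ways.
def simulate(refs, policy, nsets=4, nways=2):
    sets = [[] for _ in range(nsets)]   # per-set tag list, oldest first
    hits = 0
    for adr in refs:
        s = adr % nsets
        if adr in sets[s]:
            hits += 1
            if policy == "LRU":         # LRU refreshes recency on a hit;
                sets[s].remove(adr)     # LRR (replacement order) does not
                sets[s].append(adr)
        else:
            if len(sets[s]) == nways:   # MISS on a full set: evict head
                sets[s].pop(0)
            sets[s].append(adr)
    return hits / len(refs)

refs = [a for i in range(1000) for a in (100, 1000 + i, 101, 102)]
print(simulate(refs, "LRU"), simulate(refs, "LRR"))
# LRU never evicts the instruction words here (~75%); LRR occasionally
# does, so it scores lower. Exact figures depend on details such as
# counter sharing; the slide quotes about 70% for LRR.
```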



Devil's Advocate Game

Your company uses the cheaper LRR cache; the competition uses LRU. Can
you devise a benchmark to make your cache look better?

Reference string (one 2-way associative set): 100, 1000, 100, 2000,
1000, 100, 2000, 1000, 100, ...

  LRU: M, M, hit, M (line now holds <2000>,<100>), then M, M, M, ...
       after the warm-up, EVERY reference misses, because each one asks
       for the word just replaced.
  LRR: M, M, HIT, M (line now holds <2000>,<1000>), then HIT!, M, HIT!,
       M, ... every other reference hits. HeeHee!

Devil's advocacy: synthesize a worst-case reference pattern by ALWAYS
referencing the word just replaced!

Challenge: synthesize a program that matches the reference string!
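
The devil's procedure can itself be automated; a sketch (`devils_refs` and its parameters are illustrative) that tracks the competitor's 2-way LRU set and always asks next for whatever it just threw out:

```python
# Devil's advocacy, automated: model a 2-way LRU set and always reference
# the word LRU just evicted. `words` is an arbitrary trio of addresses.
def devils_refs(n, words=(100, 1000, 2000)):
    a, b, c = words
    refs = [a, b, a]              # warm up the set; `a` is now MRU
    lru = [b, a]                  # model of the 2-way set, LRU first
    target = c                    # first conflicting miss
    while len(refs) < n:
        refs.append(target)
        if target in lru:         # (only possible during warm-up)
            lru.remove(target)
            lru.append(target)
        else:                     # miss: the LRU victim is evicted...
            victim = lru.pop(0)
            lru.append(target)
            target = victim       # ...and is the very next reference!
    return refs

print(devils_refs(9))  # [100, 1000, 100, 2000, 1000, 100, 2000, 1000, 100]
```

Feeding `devils_refs(1000)` to the simulate() sketch from the LRU-vs-LRR page (all three addresses fall in set 0) shows LRU hitting almost never while LRR hits roughly every other reference.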



Random Replacement
Observations:
1. Reasonable replacement algorithms perform about the same for sizable
   caches; many are pseudo-LRU schemes (like LRR).
2. For every cache design there are pathologically bad cases... depending
   on your tendencies, you may worry about:
   - Paranoid: patterns designed by your enemies;
   - Pessimist: Murphy's-law accidents of address assignment.

Bizarre proposal: select a line for replacement at RANDOM!

This mitigates:
- Paranoia, by frustrating any attempt to systematically design a
  worst-case scenario;
- Chronic pessimism, by smoothing out the performance curve: at least you
  won't CONSISTENTLY get worst-case performance with your favorite
  program!

VALID bits
  [diagram: cache with a V (valid) bit per line alongside TAG and DATA,
   between the CPU and MAIN MEMORY]

Problem: We must ignore cache lines that don't contain anything of
value... e.g., on
- start-up;
- "back door" changes to memory (e.g., loading a program from disk).

Solution: Extend each TAG with a VALID bit. The valid bit must be set for
a cache line to HIT.
- At power-up / reset: clear all valid bits.
- Set a line's valid bit when the line is first replaced (filled).

Cache control feature: flush the cache by clearing all valid bits, under
program or external control.
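
A sketch of how the valid bit folds into the lookup (class and parameter names are illustrative):

```python
# Valid bits: a HIT now requires a tag match AND V=1; a flush (or reset)
# just clears every V bit, instantly invalidating the whole cache.
class TagStore:
    def __init__(self, nlines):
        self.valid = [0] * nlines
        self.tag = [None] * nlines

    def hit(self, line, x):
        return self.valid[line] == 1 and self.tag[line] == x

    def fill(self, line, x):         # set V when the line is first filled
        self.tag[line] = x
        self.valid[line] = 1

    def flush(self):                 # power-up/reset or program control
        self.valid = [0] * len(self.valid)
```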

Handling of WRITES

Observation: Most (90+%) of memory accesses are READs. How should we
handle writes? Issues:

- Write-behind: writes to main memory may be buffered, perhaps pipelined.
- Write-through: CPU writes are cached, but also written to main memory.
  Memory always holds "the truth".
- Write-back: CPU writes are cached, but not immediately written to main
  memory. Memory contents can become stale.

Our cache thus far uses _______________. Can we improve write
performance?

WRITE BACK
  [diagram: cache with V, TAG, and DATA per line, between the CPU and
   MAIN MEMORY]

ON REFERENCE TO <X>: look for X among the tags...

HIT: X = TAG(i), for some cache line i
  READ:  return DATA(i)
  WRITE: change DATA(i); start write to Mem(X)

MISS: X not found in the TAG of any cache line

REPLACEMENT SELECTION:
  Select some line k to hold <X>
  Write Back: write DATA(k) to Mem(TAG(k))
  READ:  read Mem(X); set TAG(k)=X, DATA(k)=<X>
  WRITE: start write to Mem(X); set TAG(k)=X, DATA(k)= new <X>

Is write-back worth the trouble? It depends on (1) the cost of a write,
and (2) consistency issues.

For WRITE-BACK caches: "DIRTY" bits


  [diagram: cache with a D (dirty) bit and V (valid) bit per line
   alongside TAG and DATA, between the CPU and MAIN MEMORY]

ON REFERENCE TO <X>: look for X among the tags...

HIT: X = TAG(i), for some cache line i
  READ:  return DATA(i)
  WRITE: change DATA(i); set D(i)=1 (the memory write is deferred)

MISS: X not found in the TAG of any cache line

REPLACEMENT SELECTION:
  Select some line k to hold <X>
  IF D(k) = 1 (Write Back): write DATA(k) to Mem(TAG(k))
  READ:  read Mem(X); set TAG(k)=X, DATA(k)=<X>, D(k)=0
  WRITE: set TAG(k)=X, DATA(k)= new <X>, D(k)=1
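
The dirty-bit protocol in code form; a sketch assuming a direct-mapped organization for brevity, with `mem` a dict standing in for main memory (it should already hold any address that is read).

```python
# Write-back cache with dirty bits, direct mapped for brevity.
def access(lines, mem, x, write=False, value=None, nlines=8):
    k = x % nlines
    line = lines.setdefault(k, {"tag": None, "data": None, "D": 0})
    if line["tag"] == x:                      # HIT
        if write:
            line["data"] = value              # change DATA(k)...
            line["D"] = 1                     # ...and only mark it dirty
        return line["data"]
    if line["D"]:                             # MISS: write back if dirty
        mem[line["tag"]] = line["data"]       # Mem(TAG(k)) <- DATA(k)
    if write:
        line.update(tag=x, data=value, D=1)   # new <X>, memory deferred
    else:
        line.update(tag=x, data=mem[x], D=0)  # clean copy of Mem(X)
    return line["data"]
```

Repeated writes to the same line now cost no memory traffic until the line is evicted, which is exactly the win over write-through.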

Blocked Caches: Multiple Data Words per Cache Line

IDEA: TAG storage is expensive, so share each tag among several
consecutive data words.

  [diagram: each line holds D, V, a TAG, and a BLOCK of data words,
   e.g. A -> <A>,<A+1> and B -> <B>,<B+1>. The ADDRESS from the CPU
   splits into high bits that MATCH against the TAG and low B bits that
   SELECT a word from the BLOCK.]

- BLOCKS of 2 words, on 2-word boundaries.
- ALWAYS read/write 2-word blocks from/to memory.
- LOCALITY: Access one word in a block, and the other words are likely
  to be accessed soon.
- COST: some redundant fetches.
- BIG WIN if there is a wide path to memory (e.g., interleaved).
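
A sketch of the address split implied by the diagram; the field widths are illustrative choices (here 2-word blocks and 4 lines), not fixed by the slide.

```python
# Split a word address into TAG / line index / block offset.
B_BITS, L_BITS = 1, 2      # 2-word blocks, 4 lines (illustrative sizes)

def split_address(adr):
    offset = adr & ((1 << B_BITS) - 1)             # SELECT from BLOCK
    line = (adr >> B_BITS) & ((1 << L_BITS) - 1)   # which cache line
    tag = adr >> (B_BITS + L_BITS)                 # MATCH with TAG
    return tag, line, offset

print(split_address(0b110101))   # (0b110, 0b10, 0b1)
```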

Trends & Observations


Associativity:
- Less important as cache size increases.
- 2-way or 4-way is usually plenty for typical program clustering; BUT
  additional associativity
  - smooths the performance curve,
  - reduces the number of select bits (we'll see shortly how this helps).
- TREND: Invest in RAM, not comparators.

Replacement strategy:
- BIG caches: any sane approach works well.
- REAL randomness assuages paranoia!

Performance analysis:
- Tedious hand simulation may build intuition from simple examples, BUT
- computer simulation of cache behavior on REAL programs (or using REAL
  trace data) is the basis for most real-world cache design decisions.
- YOU'LL DO THIS IN THE LAB!
