
Introduction to Caches

Introduction
In the simplest model of a computer, there is a CPU, and there is physical memor
y (RAM) and there is a bus to connect the two. Over the years, computers have be
come more sophisticated, and implementation of memory systems has also become mo
re sophisticated.
In particular, RISC machines became popular. Even though Intel processors dominate the market, they use a lot of RISC ideas. Recall that RISC machines have the following features:
Many general-purpose registers (32 or more)
All instructions have the same number of bits, i.e., equal-width instructions (32 bits is typical)
One clock cycle per instruction (if possible)
Fewer kinds of instructions than CISC
Faster clock speeds than CISC
Load-store architecture (all instructions other than loads and stores use registers as operands; loads and stores are the only instructions that access memory)
Basically, in RISC, some complex instructions were eliminated in favor of running several simpler instructions that did the same task. Initially, RISC designers used as few instructions as they could get away with. Some academic ISAs had only about a dozen instructions.
Over time, people realized this was not a reasonable way to develop an ISA. The decision about which instructions to keep and which to get rid of was based on benchmark programs. If simulations showed that including an instruction in the ISA would improve performance (speed) in many typical programs (i.e., benchmarks), then the instruction was kept in the ISA. Otherwise, it was discarded.
As a result, a typical RISC program might contain many more instructions than it
s equivalent CISC program, since it may take more RISC instructions to run the e
quivalent CISC code. (Admittedly, it was expected that this would happen infrequ
ently).
How can RISC programs possibly be faster? How can running more RISC instructions
be faster than running fewer CISC instructions? Several ways. First, if you des
ign a CPU that has a small set of instructions, then it's easier to design and o
ptimize the hardware. You can then run the CPU at a faster clock rate.
The CPU may run more instructions, but each instruction will run quicker because
of the higher clock rate. The hope is that the slowdown from executing more ins
tructions is more than offset by running each instruction more quickly.
Second, some CISC instructions run in many clock cycles. For example, it may take many clock cycles to run an integer divide. CISC instructions may not always run in one clock cycle, while most RISC instructions aim to run in one clock cycle (they don't always succeed either, but they succeed more often than CISC instructions do).
Third, if RISC is to be successful, then memory has to be cheap and quick. A bigger program uses up more memory, and you need to be able to access instructions much faster. It's not surprising that RISC became a viable way to design a CPU when memory costs started to decline. When memory was expensive, the key consideration was to keep total program length as short as possible, even if it meant the instructions took longer to execute on the CPU.
Even cheap memory is not enough. Memory (RAM) is often the bottleneck when executing code. This is an important fact to remember. You may have, say, a 2 GHz machine, but typical RAM cannot deliver instructions at that speed.
A solution is to use small, but fast memories. For example, accessing registers
is very quick, while accessing RAM is comparatively slow. This suggests that if
we want speed, we need more registers, and hope that most data stays in the regi
sters, and that we rarely need to access RAM.
This is not a reasonable assumption. Registers simply don't hold enough data. However, we can use a small, fast memory, called a cache, which is faster than RAM but slower than registers, and keep the most commonly used data (or instructions) in it. There's going to be more data in the cache (typically 512 KB or so) than in the registers (typically 128 bytes).
There's a basic principle about memory. If you want lots of memory, it will be cheap (per byte) but slow to access. If you want very fast memory, it will be small (i.e., hold little data) and expensive (per byte). Historically, there used to be two levels in the memory hierarchy: registers (very fast, but very few of them) and RAM (slow, but lots of it).
A Range of Speed and Cost
Normally, one thinks of memory in three levels: registers, RAM (physical memory)
and disk. This forms part of the memory hierarchy.
In the last 20 years or so (since about 1990), cache has become increasingly imp
ortant. It lies between registers and physical memory in terms of size and speed
. From slowest to fastest: disk, RAM, cache, registers. Each one is faster, yet
smaller.
Registers give you the fastest access time of any level in the hierarchy.
Intuitive Example of Caching
Consider the following analogy of caching. You are planning to write a very long
paper on the history of mythology of many different cultures. You start with an
cient Babylonian myths, and work your way to Greek, Roman, Norse, Native America
n, Asian, etc. myths. You've divided your paper so that each chapter covers one
culture's mythology. Thus, at any one time, you are studying one mythology.
You have a computer on your desk, and get your books online. You can download only so many books because your computer is limited in memory. The advantage of downloading the books onto your own computer is that you can access the material quickly. Downloading the books is quite slow, so you hope not to do it very often.
If it weren't for the fact that you collect related books and write on those boo
ks for each chapter (say, 10 books on Greek mythology, then 10 books on Norse my
thology), you might be forever downloading books. Even though the majority of bo
oks remain at the central library site, the books you need are on your computer,
which is quick to access.
What happens if you have already downloaded 10 books (assuming they all take the
same amount of space), and you want an 11th book? You will need to delete one o
f your books off your computer, to make room for the next book.
Which one should you pick? You could pick the one that you haven't read in the longest time. That policy is called "least recently used". Or you could pick the book that has been on your computer the longest. That policy is called "first in, first out". Or perhaps the one you've looked at the least. That's called "least frequently used".
In any case, you use some policy to decide which book to get rid of to make spac
e for the new book.
An Improved Analogy
While the previous analogy works pretty well, it's not very accurate. The real p
roblem with the analogy is that it doesn't accurately portray the penalty involv
ed when the data is NOT found in the cache. Let's modify the analogy to make it
more accurate.
Let's assume your computer can still store 10 books. There are two sites from which you can access books: the "Virtual Central Library" (VCL), which has any book that you might want, and a "Local Library" (LL).
Whenever you want to access a book, you will look in the Local Library (LL) firs
t. If it's at the LL, you will copy the book to your own computer. If it's not t
here, the LL will contact the Virtual Central Library (VCL). The LL will copy th
e book to its own library (possibly removing a book from the LL to make space),
and then you will copy the book from the LL to your own computer.
The LL is much smaller than the VCL, but can store more books than you can on yo
ur computer.
Thus, the LL serves as a "middle man". If the book you want is in the LL, you ac
cess it much faster than if you have to download it from the Virtual Central Lib
rary (VCL). For example, let's say it takes 1 second to access the LL, and 100 s
econds to access the VCL.
However, when you are unable to find the book in the LL, you spend 100 seconds g
etting a copy from the VCL to the LL, and one more second to get it from the LL
to your local computer. Thus, it's actually slower to use the LL (not by much) i
f the book you're looking for is not there. This is where we get a more accurate
representation for cache.
Cache works very similarly. Basically, you want some data (or an instruction) at
some address. This data either appears in the cache (which is like the LL) or i
t isn't. If it's not there, you need to access the data from RAM (which is like
the VCL). This data is then copied to the cache, and then from the cache to the
registers. Thus, the registers act like your local computer.
In fact, caching has become so useful that most modern CPUs have two levels of cache. They are called the L1 and L2 caches, for level 1 and level 2 cache.
Here's how to visualize an L2 cache. Imagine a "State Library" (SL), which is bi
gger than the LL, but smaller than the VCL.
Every book in the LL is in the State Library (SL). However, the SL has more books than the LL. Every book in the SL is also in the VCL. Thus, the LL is a subset of the SL, which is a subset of the VCL.
To access a book, you go to the LL. If it's not there, you check the SL. If it's
not there, you go to the VCL. That data gets copied to the SL, then back to the
LL, and finally to your computer.
In principle, you can carry this idea of caching to many, many levels. Each level is a subset of the next level. The higher the level, the more memory it holds, and the slower it gets. For example, we could label the levels of memory as L1, L2, L3, etc., where all the data in level Li is a subset of the data in level Li+1.
When you look for data at a particular address, you search for it at the smalles
t level, and if you can't find it there, you look at the next level. Once you fi
nd the data you are looking for, you copy it down to the lower levels, possibly
getting rid of a cache line in the process.
Definition A cache line is the "unit" of data you transfer to a cache. Typically, its size is a power of 2 (say, at least 16 bytes).
Other Forms of Caching
The concept of caching occurs everywhere in the computer world. Anytime you have
to access data from a location that is slow, you may want to cache a copy at a
location which allows you to access the data faster.
For example, suppose you visit a particular webpage at a website. Initially, that webpage may be slow to download. But when you visit another webpage and come back, the old webpage seems much faster to "download". Why is that?
A local copy of the webpage is kept on your computer. Thus, your computer's hard drive (or possibly memory) is being used as a cache for that webpage. This can sometimes create problems, especially if the content of the webpage is constantly being updated (sports scores for a live game, for example).
These days, hard disks also have memory that serves as a cache. Recall that hard disks are used to store files, and that hard disks are slow. So one idea is to use RAM as a cache for the disk. As files are accessed, they are placed in a special region of RAM reserved for file data. When you want to edit a file, you check this file cache first, and only go to the disk if the file isn't there. This is done automatically for you by the operating system.
As you can see, the principle of caching is simple. Use faster memory to store f
requently accessed data.
Properties Needed for Effective Caching
We're going to use the same idea for the memory hierarchy. Storing a small subse
t of the data from RAM into cache will allow the CPU to access certain data very
quickly, as long as it resides in the cache.
The memory hierarchy wouldn't be very effective if two facts about programs were
n't true. Programs exhibit spatial locality and temporal locality.
spatial locality Spatial locality says that if you access memory address x, you are likely to access memory address x plus or minus Delta, where Delta is a small number. That is, you are likely to access addresses very near x in the near future.
Do programs exhibit spatial locality? Yes, they do. For example, when you access data from arrays, you are accessing memory locations that are contiguous. Also, instructions usually run sequentially (with the occasional branch or jump). Since instructions are contiguous in memory, they exhibit spatial locality. When you run one instruction, you're likely to run the next one too. That's the nature of running programs.
temporal locality Temporal locality says that if you access memory address x at time t, you are likely to access memory address x again at some time t + Delta, where Delta is small. That is, you are likely to access the same address sometime soon.
This happens when you run code in a loop. You access the same instructions over and over. If you process an array in a loop (say, bubble sorting), then you access array elements over and over again too.
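As a small illustration (a minimal C sketch of my own, not part of the original notes), both kinds of locality show up in an ordinary array loop: the array elements sit at consecutive addresses (spatial locality), and the loop instructions and the variable sum are reused on every iteration (temporal locality).

    #include <stdio.h>

    int main(void) {
        int a[1024];

        /* Spatial locality: a[0], a[1], a[2], ... sit at consecutive
           addresses, so each cache line fetched for one element also
           brings in its neighbors. */
        for (int i = 0; i < 1024; i++)
            a[i] = i;

        /* Temporal locality: the loop instructions and the variable sum
           are touched on every iteration, so they stay hot in the cache. */
        int sum = 0;
        for (int i = 0; i < 1024; i++)
            sum += a[i];

        printf("%d\n", sum);
        return 0;
    }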
How Locality Helps
Why is it important to have these properties? In particular, why is temporal loc
ality important? Let's consider our library example. Suppose you wish to access
random books from the VCL.
Each time you download a book, you would look in the Local Library. Assuming the choice of book is random, the probability that the book is in the LL would be low (since it stores only a small subset of the VCL). So you would have to go to the Virtual Central Library to find the book. Since accessing the VCL is very slow, you pay a fairly large cost (in time) for accessing books.
Furthermore, you have to copy that book from the VCL to the LL too. We copy the
book to the LL, so it can be cached, and accessed again in the near future. Howe
ver, since you're accessing books randomly, the books in the LL won't be used in
the near future, and you will always have to access the VCL.
In effect, the LL is useless to you.
You benefit from using the Local Library only if the books you want are usually found in the LL. When you search for the same books over and over, they stay in the LL, and accessing the LL takes much less time than accessing the VCL.
To apply the analogy to hardware: if you want to access data and it's found in the cache, you save access time, since accessing data (or instructions) in the cache is much quicker than accessing main memory. On the other hand, if the data you access is rarely in the cache, then the cache is not useful, and may in fact cause a small delay compared to not having a cache at all.
Taking Advantage of Spatial and Temporal Locality
How do we apply the principles of spatial and temporal locality to cache design? Temporal locality is simple: place data that's just been accessed into the cache, because it's likely to be accessed again in the near future. Since cache access is quicker than physical memory (RAM) access, placing data you just used into the cache should make it much quicker to access the next time.
What about spatial locality? This says that if you access address X, you are likely to access addresses near X too.
Spatial locality suggests the following strategy when accessing data. Load the data from many contiguous addresses near X from RAM into the cache. That is, copy a block of data from memory to cache. This block should be larger than 4 bytes, which is the typical amount you would move between RAM and a register.
This idea wouldn't be so great if it weren't quicker to access the entire block at once, as opposed to accessing the bytes one at a time. This makes some sense if you think about it. Suppose you were ordering books. If ordering 10 books at once took just as much time as ordering each book individually, then there's no need to order 10 books at a time; just order them whenever you want them. (There's the matter of price: you might save money by ordering 10 books at once, but there's no real analogy to price in hardware.)
To take advantage of temporal locality, place data that's accessed into the cach
e, which has quicker access time than physical memory (RAM).
To take advantage of spatial locality, copy a block of data from RAM that includes address X, the desired address. This block is called the cache line. This block should contain more than four bytes (typically 2^k bytes, where k > 2).
Accessing a Block of Data
How do we access a block of data? One way to do it is to access all data from ad
dress X - Delta to X + Delta. That is, pick addresses before and after X.
However, this turns out to be inconvenient. It's easier to access the data in th
e following way.
Suppose we want to copy a data block of 32 bytes from memory (instead of just 1 or 4 bytes). We need 5 bits to specify each of the 32 bytes. Thus, 00000 is byte 0, and 11111 is byte 31.
Suppose you want to access address A31-0. Instead of accessing only that single
address, generate the following 32 addresses:
A31-5 00000
A31-5 00001
A31-5 00010
A31-5 00011
....
....
A31-5 11111
That is, take the upper 27 bits of the address and concatenate them with all possible 5-bit bitstrings (there are 32 such bitstrings).
This creates 32 consecutive addresses. Copy the data in those addresses from RAM
to the cache. That data is called the cache line.
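Here is a small C sketch (my own illustration, not part of the original notes) that generates those 32 addresses by clearing the low 5 bits of the requested address and then stepping through every offset:

    #include <stdint.h>
    #include <stdio.h>

    /* For a 32-byte cache line, the low 5 bits of an address are the
       offset within the line, and the upper 27 bits select the line. */
    #define LINE_SIZE 32u   /* bytes per cache line */

    int main(void) {
        uint32_t addr = 0x1000ABCDu;             /* some address A31-0 */
        uint32_t base = addr & ~(LINE_SIZE - 1); /* A31-5 followed by 00000 */

        /* The 32 consecutive addresses that make up the cache line. */
        for (uint32_t offset = 0; offset < LINE_SIZE; offset++)
            printf("0x%08X\n", (unsigned)(base + offset));

        return 0;
    }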
To see if you understand what's going on: how would you create the addresses if you wished to copy 64 bytes into a cache line?
Instead of using the upper 27 bits, you now use the upper 26 bits. You then crea
te all possible 6-bit bitstrings (there are 64 of those) and concatenate them to
the 26 bits. This creates 64 addresses in memory. Copy the 64 bytes from main m
emory to the cache.
We will give the upper 27 bits a name. We will call it the tag. Thus, when you h
ave 32 bytes, you know exactly where they come from in memory, if you know the t
ag.
When we copy the 32 bytes, we place them in the cache in the same order they app
ear in memory. We can imagine that data being stored in an array with indexes 00
000 up to 11111. (There are 32 elements in this array, which should not be so su
rprising, since that's how much we copied).
If we want to access address A31-5 00011, then we look at index 00011 of the "ar
ray", to get the data.
00011 is considered the offset into the array.
Given the tag and the offset, we know the address of the data at that offset. For example, the data at offset 00011 is the same as the data at the address given by the tag concatenated with 00011, i.e., address A31-5 00011.
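In code, splitting an address into a tag and an offset (and putting them back together) is just shifting and masking. A minimal sketch, assuming 32-bit addresses and a 32-byte cache line:

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 5u   /* 32-byte line => 5 offset bits */

    int main(void) {
        uint32_t addr   = 0x1000ABC3u;
        uint32_t tag    = addr >> OFFSET_BITS;               /* upper 27 bits */
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);  /* lower 5 bits  */

        /* The tag and offset together recover the original address. */
        uint32_t rebuilt = (tag << OFFSET_BITS) | offset;

        printf("tag=0x%07X offset=%u rebuilt=0x%08X\n",
               (unsigned)tag, (unsigned)offset, (unsigned)rebuilt);
        return 0;
    }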
Slots
Let's get into more specific details about the cache. In particular, we'll descr
ibe an individual slot.
A slot consists of the following:
V the valid bit, indicating whether the slot holds valid data. If V = 1, then th
e data is valid. If V = 0, the data is not valid. Initially, it's invalid. Once
data is placed into the slot, it's valid.
D the dirty bit. This bit only has meaning if V = 1. It indicates whether the data in the slot (to be discussed momentarily) has been modified (written to) or not. If D = 1, the data has been modified since being placed in the cache. If D = 0, then the data is the same as it was when it first entered the cache.
Tag The tag represents the upper bits of the address. The size of the tag is 32 - lg N bits, where N is the number of bytes in the data part of the slot.
Cache Line This is the actual data itself. There are N bytes, where N is a power
of 2. We will also call this the data block.
The total amount of memory used in a slot that stores 32 bytes in a cache line can be easily computed. There are 2 bits (one each for V and D), plus the tag (27 bits), plus the data in the cache line (32 x 1 byte = 32 bytes = 256 bits). That's a total of 285 bits.
Here's a diagram of one slot:

| V | D | Tag (27 bits) | Cache line (32 bytes) |
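The same slot can be written as a C struct, purely as an illustration (real hardware packs these fields into raw bits, and the names here are my own):

    #include <stdint.h>

    #define LINE_SIZE 32   /* bytes per cache line (N) */

    /* One slot of the cache described above: 285 bits for a 32-byte line. */
    struct cache_slot {
        unsigned valid : 1;        /* V: does the slot hold valid data?       */
        unsigned dirty : 1;        /* D: modified since it entered the cache? */
        uint32_t tag;              /* holds the upper 32 - lg N = 27 bits     */
        uint8_t  line[LINE_SIZE];  /* the cache line (data block) itself      */
    };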
There are variations on the slot, which we'll discuss when we talk about direct-mapped caches and set-associative caches.
Cache Hits and Cache Misses
When you look for data at a given address, and find it stored in cache, then you
have a cache hit. If you don't find the data, then it's a cache miss.
You want to maximize cache hits, because then you have much improved performance
(speed).
What Happens in a Cache Miss
As you run your program, it needs data at some address. This can either be the address of actual data being loaded or stored, or it can be the address of an instruction.
You look for the address to see if it's one of the addresses in cache. If the ad
dress (and its contents) are NOT in the cache, then you must access it from main
memory. This means that you must copy the data from memory to the cache.
Since the cache is a small subset of main memory, there may be data that you nee
d to remove from the cache (just as you had to remove books from your computer,
if you needed other books to replace it).
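Putting the pieces together, here is a small C sketch of a lookup with a miss path. It is a toy model of my own, assuming a tiny fully associative cache and a simulated RAM; how the victim slot is chosen is discussed later under slot replacement.

    #include <stdint.h>
    #include <string.h>

    #define LINE_SIZE 32
    #define NUM_SLOTS 8

    struct cache_slot {               /* same slot layout as sketched earlier */
        unsigned valid : 1;
        unsigned dirty : 1;
        uint32_t tag;
        uint8_t  line[LINE_SIZE];
    };

    static struct cache_slot cache[NUM_SLOTS];
    static uint8_t ram[1 << 20];      /* toy simulated main memory */

    /* Read one byte at 'addr': check every slot for a matching tag (a hit);
       otherwise copy the whole line in from RAM first (a miss). */
    static uint8_t cache_read_byte(uint32_t addr) {
        uint32_t tag    = addr >> 5;     /* upper 27 bits */
        uint32_t offset = addr & 0x1Fu;  /* lower 5 bits  */

        for (int i = 0; i < NUM_SLOTS; i++)
            if (cache[i].valid && cache[i].tag == tag)
                return cache[i].line[offset];          /* cache hit */

        /* Cache miss: bring the line in from main memory.  Picking the
           victim slot (slot 0 here) is the replacement policy's job. */
        struct cache_slot *slot = &cache[0];
        memcpy(slot->line, &ram[(tag * LINE_SIZE) % sizeof ram], LINE_SIZE);
        slot->valid = 1;
        slot->dirty = 0;
        slot->tag   = tag;
        return slot->line[offset];
    }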
Instruction and Data Cache
Instructions and data have different access patterns, and access different regio
ns of memory. Thus, having the same cache for both instructions and data may not
always work out.
Thus, it's rather common to have two caches: an instruction cache that only stor
es instructions, and a data cache that only stores data.
You see some indication of this in the CPU design where there's "instruction mem
ory" and "data memory".
Write Back vs. Write Through
Suppose you need to modify data at some address located in the cache. How do you
do this?
There are two ways to do this:
write-back This says that you update main memory only when the cache line is evi
cted. Otherwise, only update the cache.
This approach creates an inconsistency between cache and memory. Cache has one v
iew of memory (which is accurate), while main memory does not have the latest up
dates.
Normally, this is not a problem. After all, there's only one CPU, so who else would need to access memory? However, if you have more than one CPU, each with its own cache, this problem of memory consistency becomes a much bigger issue. A CPU accessing memory may not see the latest update, because another CPU has modified a memory location, but the update is only in that CPU's cache.
Write-back is convenient because you don't have to access main memory, which can
be slow.
write-through This says that you update main memory at the same time you update
cache. Does that sound like a bad idea? The key is to avoid waiting for the writ
e to memory to complete. Essentially, once the write to cache is done, the progr
am can proceed, while the write to memory is done "in the background".
The disadvantage occurs when there are many writes to memory. This can cause a b
acklog waiting for the writes to occur.
The advantage of write-through is that you don't have to worry about memory cons
istency, since cache and memory should have the same values (even though it may
take memory a little longer to get the correct value).
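As a rough sketch (again a toy model of my own, with a single slot and a small simulated RAM, and assuming the address maps to that slot), the two policies differ only in what happens at store time:

    #include <stdint.h>

    #define LINE_SIZE 32

    static uint8_t ram[1 << 20];             /* toy simulated main memory */

    static struct {
        unsigned valid : 1, dirty : 1;
        uint32_t tag;
        uint8_t  line[LINE_SIZE];
    } slot;                                  /* pretend 'addr' maps to this slot */

    /* Write-through: update the cached copy and main memory together
       (in hardware, the RAM write can complete in the background). */
    static void store_write_through(uint32_t addr, uint8_t value) {
        slot.line[addr & (LINE_SIZE - 1)] = value;
        ram[addr % sizeof ram] = value;
    }

    /* Write-back: update only the cache and set the dirty bit; RAM is
       brought up to date later, when this line is evicted. */
    static void store_write_back(uint32_t addr, uint8_t value) {
        slot.line[addr & (LINE_SIZE - 1)] = value;
        slot.dirty = 1;
    }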
Categories of Cache Misses
We can subdivide cache misses into one of three categories:
a compulsory miss (or cold miss) occurs when there's little to no data in the cache. Initially, you're going to have a cache miss no matter how big the cache is and no matter how many bytes the cache line contains. Once a program has been running for a while, the cache becomes more fully utilized, and such misses become rare.
a conflict miss usually occurs in direct-mapped caches. Two cache lines may map to the same cache slot, even when there are empty slots elsewhere that could have stored them. Such a conflict forces an unnecessary eviction of a cache line, but this is the price you pay for a direct-mapped cache. Set-associative caches have this problem too, to a lesser degree. Fully associative caches avoid this kind of problem.
a capacity miss can be one of two sorts. One is a miss that would not have occurred if the cache had more slots. The other is due to the number of bytes in a cache line. Suppose you are processing a large int array. If the cache line holds 16 bytes, you will get a cache miss after every 4 elements of the array (assuming 4-byte ints). Had the cache line been larger, you wouldn't have had this miss.
These misses also occur in virtual memory.
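To make the int-array example concrete, here is a tiny C program (an illustration of mine, assuming 4-byte ints and 16-byte cache lines) that prints which array elements start a new cache line and would therefore miss on a sequential pass through a cold cache:

    #include <stdio.h>

    int main(void) {
        const int line_bytes    = 16;                            /* line size in the example  */
        const int ints_per_line = line_bytes / (int)sizeof(int); /* 16 / 4 = 4 (4-byte ints)  */

        for (int i = 0; i < 16; i++)
            printf("a[%2d] is in line %d%s\n",
                   i, i / ints_per_line,
                   (i % ints_per_line == 0) ? "   <- new line: miss" : "");
        return 0;
    }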
Slot Replacement Policy
Suppose you have a cache miss, and all the slots in the cache are being used. Yo
u will now have to pick a slot to get rid of. You are evicting a cache line.
Which slot should you pick? There are many policies. Any will do. Here's a list:
LRU Least Recently Used. Pick the slot that hasn't been used in the longest time
.
LFU Least Frequently Used. Pick the slot that has been used the least often.
FIFO First In First Out. Pick the slot that has been in the cache the longest (w
hich is NOT the same as LRU).
These policies often require additional hardware to indicate time of use or freq
uency of use. There have also been other fancier schemes to decide how to get ri
d of a slot.
When you choose a slot to be evicted, you look at the dirty bit (we assume a wri
te-back policy). If D = 1, i.e., the slot is dirty, then copy the cache line bac
k into memory. Otherwise, you can overwrite the cache line with the new data, ta
g, etc.
When the new cache line enters:
V is set to 1 (it's valid)
D is set to 0 (no longer dirty)
The tag bits are copied in
The cache line is copied in from main memory
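Those steps can be sketched in C as follows (a toy model of mine, using a small simulated RAM and assuming a write-back policy; how the victim slot was chosen, whether by LRU, LFU, or FIFO, is a separate decision):

    #include <stdint.h>
    #include <string.h>

    #define LINE_SIZE 32

    struct cache_slot {
        unsigned valid : 1, dirty : 1;
        uint32_t tag;
        uint8_t  line[LINE_SIZE];
    };

    static uint8_t ram[1 << 20];  /* toy simulated main memory */

    /* Evict the chosen victim slot and install the line identified by new_tag. */
    static void replace_line(struct cache_slot *victim, uint32_t new_tag) {
        /* If the old line was modified, copy it back to main memory first. */
        if (victim->valid && victim->dirty)
            memcpy(&ram[(victim->tag * LINE_SIZE) % sizeof ram], victim->line, LINE_SIZE);

        /* Bring in the new cache line and reset the bookkeeping bits. */
        memcpy(victim->line, &ram[(new_tag * LINE_SIZE) % sizeof ram], LINE_SIZE);
        victim->valid = 1;   /* it's valid       */
        victim->dirty = 0;   /* no longer dirty  */
        victim->tag   = new_tag;
    }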
Assembly Programmers?
When you are programming in assembly, you aren't even aware of the cache. Much of the actual interaction with the cache is handled by the hardware and the operating system.
This is convenient, because the cache scheme can be changed without having to mo
dify any programs. Often, there's a question as to how much the actual implement
ation (called the ISP, the instruction set processor) should be visible to the I
SA programmer. The more that's visible, the more that you restrict the implement
ation, and the fewer chances for optimization.
Thus, you only want to reveal just enough of the real hardware for programmers t
o get the job done, and no more. In effect, the ISA is a kind of specification f
or the actual CPU, which can be implemented in one of many ways.
Word Alignment
One motivation for having words at word-aligned addresses is that all four bytes will always fall within a single cache line. A cache line is, for lack of a better phrase, super word-aligned.
Super word-aligned means that if you have 2^k consecutive bytes, the first byte must reside at an address divisible by 2^k. Thus, if the cache line has 16 bytes, it must start at an address divisible by 16, which means it's automatically word-aligned (since word-aligned means divisible by 4).
You'll never have half a word in one cache line, and the other half in another c
ache line (provided the cache line contains at least 4 bytes) as long as the wor
d is word-aligned.
To convince yourself of this, try to have a word that is word-aligned, and see i
f you can create a cache line where only part of the word is within it.
If you can do it, more than likely, you have misunderstood how cache lines work.
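If you'd rather check it by brute force, this short C program (mine, assuming 4-byte words and 16-byte cache lines) tries every word-aligned address and counts how many 4-byte words would straddle a line boundary; the count comes out to zero:

    #include <stdio.h>

    int main(void) {
        const unsigned line = 16;   /* cache line size for this check */

        /* For every word-aligned address (divisible by 4), check whether the
           first and last byte of the 4-byte word land in different lines. */
        int straddles = 0;
        for (unsigned addr = 0; addr < 4096; addr += 4)
            if (addr / line != (addr + 3) / line)
                straddles++;

        printf("word-aligned words that straddle a line: %d\n", straddles); /* prints 0 */
        return 0;
    }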
Summary
Cache is a mechanism used to store frequently accessed data in fast memory. Cach
e is a subset of the data found in RAM, which means any data you find in cache,
you can also find in RAM.
When data at an address is needed, the CPU first searches in the cache. If it fi
nds it there, then a load or store is performed to or from a register.
If a cache miss occurs, then a cache line needs to be copied from main memory to
a slot in a cache, which may require the eviction of a slot.
Ideally, you want the percentage of cache hits to be high, to give the illusion
that all the useful data is in the cache, and being accessed at the speed of the
cache, as opposed to the speed of main memory. How effective this is depends on
how much spatial and temporal locality the particular program being run has.
