HW4

COMPUTER ARCHITECTURE
HOMEWORK 4: CACHES
Exercise 1
Given the series of address references:
1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17
show the hits and misses and final cache contents for a two-way set-associative cache
with one-word blocks and a total size of 16 words. Assume an LRU replacement
strategy.
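A trace like this can be checked mechanically. The following is a minimal sketch, not an official solution: the function name `count_hits_lru` and its parameters are hypothetical, the references are treated as word addresses, and the set is chosen as the block number modulo the number of sets (16 words / 2 ways = 8 sets here).

```c
#include <stddef.h>
#include <stdlib.h>

/* Counts hits for a word-address trace on a set-associative cache with LRU
   replacement. Assumes num_sets, ways and block_words are all non-zero.
   Returns -1 if the bookkeeping arrays cannot be allocated. */
long count_hits_lru(const unsigned *trace, size_t n,
                    unsigned num_sets, unsigned ways, unsigned block_words)
{
    size_t lines = (size_t)num_sets * ways;
    unsigned *tag = malloc(lines * sizeof *tag);          /* block number held */
    unsigned long *stamp = calloc(lines, sizeof *stamp);  /* 0 = empty line    */
    long hits = 0;

    if (tag == NULL || stamp == NULL) {
        free(tag);
        free(stamp);
        return -1;
    }

    for (size_t t = 0; t < n; t++) {
        unsigned block = trace[t] / block_words;
        size_t base = (size_t)(block % num_sets) * ways;
        unsigned *set_tag = tag + base;
        unsigned long *set_stamp = stamp + base;
        unsigned victim = 0;   /* way to (re)use: the hit way or the LRU way */
        int hit = 0;

        for (unsigned w = 0; w < ways; w++) {
            if (set_stamp[w] != 0 && set_tag[w] == block) {
                victim = w;
                hit = 1;
                break;
            }
            /* Empty lines have stamp 0, so they are filled before eviction. */
            if (set_stamp[w] < set_stamp[victim])
                victim = w;
        }
        if (hit)
            hits++;
        else
            set_tag[victim] = block;
        set_stamp[victim] = (unsigned long)t + 1;  /* mark as most recent */
    }

    free(tag);
    free(stamp);
    return hits;
}
```

Under these assumptions, `count_hits_lru(trace, 16, 8, 2, 1)` on the sixteen references above reports 4 hits. To inspect the final cache contents, the `tag` array would have to be printed before it is freed.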
Exercise 2
Consider the machines with different cache configurations:
• Cache1: Direct-mapped with one-word blocks
• Cache2: Direct-mapped with four-word blocks
• Cache3: Two-way set associative with four-word blocks
The following miss rate measurements have been made:

• Cache1: instruction miss rate is 4%; data miss rate is 8%
For these machines, one-half of the instructions contain a data reference. Assume that
the cache miss penalty is 6 + block size in words.
a) Determine which machine spends most cycles on cache misses.
b) If the cycle time of the first and the second machine is 2ns and that of the third
machine is 2.4ns, determine which machine is the fastest and which is the slowest.
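The miss-cycle comparison in part (a) comes down to one formula per machine. The sketch below is an illustration under stated assumptions: the function names are hypothetical, every instruction is assumed to make exactly one instruction fetch, and `data_refs_per_instr` is 0.5 because one-half of the instructions contain a data reference.

```c
/* Miss penalty in cycles, as given in the exercise: 6 + block size in words. */
unsigned miss_penalty(unsigned block_words)
{
    return 6 + block_words;
}

/* Average cycles per instruction spent on cache misses:
   (instruction misses + data misses per instruction) * miss penalty. */
double miss_cycles_per_instr(double instr_miss_rate, double data_miss_rate,
                             double data_refs_per_instr, unsigned block_words)
{
    return (instr_miss_rate + data_refs_per_instr * data_miss_rate)
           * miss_penalty(block_words);
}
```

For Cache1 (4% instruction miss rate, 8% data miss rate, one-word blocks) the call `miss_cycles_per_instr(0.04, 0.08, 0.5, 1)` gives (0.04 + 0.5 × 0.08) × 7 = 0.56 cycles per instruction. For part (b), multiplying the total CPI of each machine by its cycle time gives a time per instruction that can be compared directly.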
Exercise 3
The following C program is run (with no optimizations) on a machine with a cache that
has four-word (16-byte) blocks and holds 256 bytes of data:
int i, j, c, stride, array[256];

...
for (i=0; i<10000; i++)
    for (j=0; j<256; j=j+stride)
        c = array[j]+5;
If we consider only the cache activity generated by references to the array and we
assume that integers are words, what is the expected miss rate when the cache is
direct-mapped and stride=132? How about stride=131?
Would either of these change if the cache were two-way set associative?
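Whether two array elements conflict depends only on the cache line (or set) each one maps to. The sketch below is a hypothetical helper, not part of the exercise: it assumes that `array` starts on a block boundary and that the 256-byte cache holds 256 / 16 = 16 blocks, organized as 8 sets in the two-way case.

```c
enum {
    WORDS_PER_BLOCK = 4,         /* four-word (16-byte) blocks      */
    NUM_BLOCKS      = 256 / 16   /* 256 bytes of data, 16 per block */
};

/* Line that array[j] occupies in the direct-mapped cache. */
unsigned dm_line_of(unsigned j)
{
    return (j / WORDS_PER_BLOCK) % NUM_BLOCKS;
}

/* Set that array[j] occupies in the two-way set-associative cache. */
unsigned set_of_2way(unsigned j)
{
    return (j / WORDS_PER_BLOCK) % (NUM_BLOCKS / 2);
}
```

With a stride of 132 the inner loop touches `array[0]` and `array[132]`, which fall on lines 0 and 1. With a stride of 131 it touches `array[0]` and `array[131]`, which both fall on line 0 of the direct-mapped cache, and on set 0 of the two-way cache, where there is room for both.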
Exercise 4
You are building a system around a processor with in-order execution that runs at 1.1
GHz and has a CPI of 0.7 excluding memory accesses. The only instructions that read or
write data from memory are loads (20% of all instructions) and stores (5% of all
instructions). The memory system for this computer is composed of a split L1 cache
that imposes no penalty on hits. Both the I-cache and D-cache are direct mapped and
hold 32 KB each. The I-cache has a 2% miss rate and 32-byte blocks, and the D-cache
is write-through with a 5% miss rate and 16-byte blocks. There is a write-buffer on
the D-cache that eliminates stalls for 95% of all writes. The 512 KB write-back,
unified L2 cache has 64-byte blocks and an access time of 15 ns. It is connected
to the L1 cache by a 128-bit data bus that runs at 266 MHz and can transfer one
128-bit word per bus cycle. Of all memory references sent to the L2 cache in this system,
80% are satisfied without going to main memory. Also, 50% of all blocks replaced are
dirty. The 128-bit-wide main memory has an access latency of 60 ns, after which any
number of bus words may be transferred at the rate of one per cycle on the
128-bit-wide 133 MHz main memory bus.
a) What is the average memory access time for instruction accesses?
b) What is the average memory access time for data reads?
c) What is the average memory access time for data writes?
d) What is the overall CPI, including memory accesses?
e) You are considering replacing the 1.1 GHz CPU with one that runs at 2.1 GHz, but
otherwise identical. How much faster does the system run with a faster processor?
Assume the L1 cache still has no hit penalty, and that the speed of the L2 cache, main
memory, and buses remain the same in absolute terms (e.g., the L2 cache still has a
15 ns access time and 266 MHz bus connecting it to the CPU and L1 cache).
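Most of the arithmetic in this exercise is unit conversion between bus cycles, nanoseconds and processor cycles. The helpers below are a sketch with hypothetical names. They assume that a bus running at f MHz has a cycle time of 1000 / f ns and that a block needs one bus cycle per 16 bytes (128 bits), rounded up.

```c
/* Time in ns to move block_bytes across a bus that carries bus_bytes per
   bus cycle and runs at bus_mhz. */
double transfer_ns(unsigned block_bytes, unsigned bus_bytes, double bus_mhz)
{
    unsigned cycles = (block_bytes + bus_bytes - 1) / bus_bytes;  /* round up */
    return cycles * 1000.0 / bus_mhz;
}

/* Converts a latency in ns to (fractional) processor clock cycles. */
double ns_to_cpu_cycles(double ns, double cpu_ghz)
{
    return ns * cpu_ghz;
}
```

For example, a 32-byte I-cache block needs two cycles on the 266 MHz bus, about 7.52 ns. The 15 ns L2 access time corresponds to 16.5 processor cycles at 1.1 GHz but 31.5 cycles at 2.1 GHz, which is why the memory part of the CPI grows in part (e) even though the memory system itself is unchanged.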
Exercise 5
Consider a processor in which the ratio of load:store instructions is 3:1. The memory
system consists of split L1 caches and a unified L2 cache. The caches are blocking.
• L1 I-cache: 64B block, 3% miss rate, 0.5ns hit time.
• L1 D-cache: 64B block, 4% miss rate, 0.5ns hit time, write-through. Assume
that the write buffer never fills up. No write allocate. If a store instruction
misses, it takes 0.5ns to determine that the instruction is a miss, and another
0.5ns to write the store to the write buffer.
• No critical word first, no early restart.
• L2 cache: 64B block, 30% local miss rate, write-back, no write buffer. 40% of
the blocks are dirty.
• L2 access time: 20ns, not including bus transfer time. The L2 access time must
be taken into account for every L2 access, whether it is a hit or a miss.
• Main memory access time: 80ns, not including bus transfer time.
• L1-L2 bus: 16B wide. It takes 3.5ns to transfer 16 bytes on this bus.
• L2-main memory bus: 8B wide. It takes 7ns to transfer 8 bytes on this bus.
a) What is the average memory access time for instructions?
b) What is the average memory access time for data (read and write)?
c) Repeat item (a) when the system uses critical word first and early restart.
Assume that if there is an L2 miss, and to make place for the new block a dirty
block must be discarded from L2, the dirty block will be first entirely written
to the main memory and only after that the missing block will be loaded.
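The instruction-side calculation can be organized as a two-level average memory access time. The sketch below uses hypothetical function names and shows only one possible reading of part (a). It assumes that an L1 miss pays the L2 access time plus a full block transfer, that an L2 miss pays the main memory access time plus a full block transfer, and that writing back a dirty victim costs the same again.

```c
/* Time in ns to move one block over a bus, given the time per bus transfer. */
double block_transfer_ns(unsigned block_bytes, unsigned bus_bytes, double beat_ns)
{
    return (block_bytes / bus_bytes) * beat_ns;
}

/* Two-level AMAT in ns: every L1 miss pays l1_miss_ns, and a fraction
   l2_miss_rate of those misses additionally pays l2_miss_ns. */
double amat_two_level(double l1_hit_ns, double l1_miss_rate, double l1_miss_ns,
                      double l2_miss_rate, double l2_miss_ns)
{
    return l1_hit_ns + l1_miss_rate * (l1_miss_ns + l2_miss_rate * l2_miss_ns);
}
```

Under those assumptions a 64B block takes 4 × 3.5 = 14 ns on the L1-L2 bus and 8 × 7 = 56 ns on the memory bus, so `l1_miss_ns` is 20 + 14 = 34 and `l2_miss_ns` is (80 + 56) × (1 + 0.4) = 190.4. The call `amat_two_level(0.5, 0.03, 34.0, 0.3, 190.4)` then evaluates to about 3.23 ns. A different reading of the write-back cost would change `l2_miss_ns` but not the structure of the formula.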
Exercise 6
For a small instruction cache, a direct-mapped cache could consistently outperform a
fully associative cache using LRU replacement.
a) Explain how this is possible. Show an example.
b) Explain where replacement policy fits into the three C’s model, and explain
why this means that misses caused by a replacement policy are “ignored” – or,
more precisely, cannot in general be definitively classified – by the three C’s
model.
c) Are there any replacement policies for the fully associative cache that would
outperform the direct-mapped cache? Ignore the policy of “do what a
direct-mapped cache would do.”
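One way to explore part (a) is to run the same looping instruction trace through both organizations and count misses. The sketch below is an illustrative experiment, not the expected answer: the function names, the `MAX_LINES` bound and the choice of trace (blocks 0 to `loop_len - 1` fetched in order, `reps` times) are all assumptions.

```c
#include <assert.h>

#define MAX_LINES 64   /* illustrative upper bound on cache size in blocks */

/* Misses when blocks 0..loop_len-1 are fetched in order, reps times,
   through a direct-mapped cache of `lines` blocks. */
unsigned dm_misses(unsigned lines, unsigned loop_len, unsigned reps)
{
    long held[MAX_LINES];
    unsigned misses = 0;

    assert(lines > 0 && lines <= MAX_LINES);
    for (unsigned i = 0; i < lines; i++)
        held[i] = -1;                          /* -1 marks an empty line */

    for (unsigned r = 0; r < reps; r++)
        for (unsigned b = 0; b < loop_len; b++)
            if (held[b % lines] != (long)b) {
                held[b % lines] = (long)b;
                misses++;
            }
    return misses;
}

/* Same trace through a fully associative cache with LRU replacement. */
unsigned lru_misses(unsigned lines, unsigned loop_len, unsigned reps)
{
    long held[MAX_LINES];
    unsigned long last_use[MAX_LINES];
    unsigned long now = 0;
    unsigned misses = 0;

    assert(lines > 0 && lines <= MAX_LINES);
    for (unsigned i = 0; i < lines; i++) {
        held[i] = -1;
        last_use[i] = 0;                       /* 0 = never used */
    }

    for (unsigned r = 0; r < reps; r++)
        for (unsigned b = 0; b < loop_len; b++) {
            unsigned slot = 0;                 /* hit line or LRU line */
            int hit = 0;
            for (unsigned i = 0; i < lines; i++) {
                if (held[i] == (long)b) { slot = i; hit = 1; break; }
                if (last_use[i] < last_use[slot]) slot = i;
            }
            if (!hit) {
                held[slot] = (long)b;
                misses++;
            }
            last_use[slot] = ++now;
        }
    return misses;
}
```

With a 4-block cache and a 5-block loop repeated 10 times, `lru_misses(4, 5, 10)` returns 50 (every fetch misses, because LRU always evicts the block that is needed next), while `dm_misses(4, 5, 10)` returns 23, since only the two blocks that share a line keep evicting each other.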
Exercise 7
The best memory hierarchy performance could occur when some instructions are
prevented from entering the cache. That is, the cache does not always read-allocate.
Explain why this could be true.
