
Solution of CSE 240A Assignment 3

Problem 1: 1. Before doing this problem, we need to know how a given address is partitioned by the cache. According to the cache configuration given in the problem, we can calculate the total number of sets as:

Number of sets = Cache Size / (Block Size × Associativity) = 64 KB / (64 Bytes × 2) = 512 sets

Therefore, 9 bits are required for the index field, and 6 bits are required for the block offset.
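To make the field split concrete, the following short Python sketch decodes the base addresses of a, b, and c used in Table 1 below (0x10000, 0x20000, and 0x30000; the 4-byte element size is implied by the 0x4 stride between consecutive accesses):

# Address breakdown for the 64 KB, 2-way, 64-byte-block cache: 6 offset bits,
# 9 index bits, and the remaining upper bits form the tag.
OFFSET_BITS = 6
INDEX_BITS = 9

def decode(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

for base in (0x10000, 0x20000, 0x30000):
    tag, index, offset = decode(base)
    print(hex(base), "-> tag", hex(tag), "index", index, "offset", offset)
# All three arrays decode to index 0 (tags 0x2, 0x4, 0x6), so every iteration
# makes three blocks compete for the same 2-way set.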

As illustrated in Table 1, all the cache accesses are misses for the first four iterations (miss rate 100%). Since the block size is 64 bytes and each array element is 4 bytes, each block contains 64 / 4 = 16 array elements. Therefore, for every 16 iterations (48 accesses), there are 3 compulsory misses (6.25%) and 45 conflict misses (93.75%).

address    element   tag   index   type of miss
0x10000    a[0][0]   0x2   0       compulsory
0x20000    b[0][0]   0x4   0       compulsory
0x30000    c[0][0]   0x6   0       compulsory (evict 0x2)
0x10004    a[0][1]   0x2   0       conflict (evict 0x4)
0x20004    b[0][1]   0x4   0       conflict (evict 0x6)
0x30004    c[0][1]   0x6   0       conflict (evict 0x2)
0x10008    a[0][2]   0x2   0       conflict (evict 0x4)
0x20008    b[0][2]   0x4   0       conflict (evict 0x6)
0x30008    c[0][2]   0x6   0       conflict (evict 0x2)
0x1000C    a[0][3]   0x2   0       conflict (evict 0x4)
0x2000C    b[0][3]   0x4   0       conflict (evict 0x6)
0x3000C    c[0][3]   0x6   0       conflict (evict 0x2)

Table 1: cache accesses for the first four iterations

2. Average memory access time = hit time + miss rate × miss penalty = 1 cycle + 100% × 20 cycles = 21 cycles

3. The 8-entry victim cache helps to capture all the blocks evicted by conflict misses. Therefore, the average memory access time becomes:

Average memory access time = 1 cycle + 6.25% × 20 cycles + 93.75% × 1 cycle = 3.1875 cycles

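Both averages can be re-evaluated in a couple of lines (the hit time, penalties, and rates are the ones used above):

# AMAT without and with the 8-entry victim cache (values in cycles).
hit_time, miss_penalty, victim_hit = 1, 20, 1
amat_base   = hit_time + 1.00 * miss_penalty                          # every access misses
amat_victim = hit_time + 0.0625 * miss_penalty + 0.9375 * victim_hit  # conflicts hit the victim cache
print(amat_base, amat_victim)   # 21 and 3.1875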
4. Combining the write buffer and the victim cache allows a more efficient use of space, since an evicted dirty block may be reused soon. Besides, recently written data might also be re-used by a following read in the near future, and the combined design lets such a read miss be serviced faster, since we do not need to fetch the whole block from the lower-level memory. However, for some applications, such a design may cause contention between written data and frequently read data, degrading performance through a rising read miss rate. In addition, merging the two units increases the hardware design complexity, and the access time of the combined victim cache grows because of its larger number of entries.

Problem 2: 1. For this problem, we assume that both the L1 caches and the L2 cache are write-allocate. No unit of the memory hierarchy is multiported, that is, no two memory accesses can be overlapped. Besides, a read/write request is satisfied only after the whole block has arrived at the requesting level. Under these constraints, the miss penalties for each level of the memory hierarchy are as follows.

For the L2 miss penalty: because the data bus is only 128 bits wide, a whole 64-byte block requires 64 × 8 / 128 = 4 bus cycles (cycle time 1 / (133 × 10^6) sec) to be sent to L2. Besides fetching the requested block from DRAM, evicting a dirty block also incurs a write back to DRAM, so we multiply the miss penalty by 1.5, since 50% of replaced blocks are dirty on average. Therefore,

L2 miss penalty = (DRAM access time + Data transfer time (DRAM to L2)) × DRAM accesses per miss
                = (60 ns + (64 × 8 / 128) × (1 / (133 × 10^6)) sec) × 1.5
                = (60 ns + 30 ns) × 1.5
                = 135 ns

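As a quick numeric check of the 135 ns figure (the bus parameters and the 1.5 dirty-write-back factor are exactly those stated above):

# L2 miss penalty: DRAM access + block transfer over the 128-bit, 133 MHz bus,
# scaled by 1.5 because half of the replaced blocks are dirty and must be
# written back to DRAM.
dram_access_ns = 60.0
bus_cycle_ns = 1e9 / 133e6                      # ~7.5 ns per bus cycle
transfer_ns = (64 * 8 / 128) * bus_cycle_ns     # 4 bus cycles for a 64-byte block
l2_miss_penalty = (dram_access_ns + transfer_ns) * 1.5
print(round(l2_miss_penalty, 1))                # ~135 ns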
Then, we calculate the miss penalty of the L1 instruction cache. Similar to the L2 miss penalty, the data bus between L1 and L2 is only 128 bits wide, so a whole 32-byte block requires 32 × 8 / 128 = 2 bus cycles (cycle time 1 / (200 × 10^6) sec) to be sent to L1:

IL1 miss penalty = L2 hit time + L2 miss rate × L2 miss penalty + Data transfer time (L2 to L1)
                 = 10 ns + 0.2 × 135 ns + (32 × 8 / 128) × (1 / (200 × 10^6)) sec
                 = 10 ns + 27 ns + 10 ns
                 = 47 ns

The calculation of the L1 data cache miss penalty is almost the same as that of the IL1 miss penalty, except that the DL1 cache uses 16-byte blocks:

DL1 miss penalty = L2 hit time + L2 miss rate × L2 miss penalty + Data transfer time (L2 to L1)
                 = 10 ns + 0.2 × 135 ns + (16 × 8 / 128) × (1 / (200 × 10^6)) sec
                 = 10 ns + 27 ns + 5 ns
                 = 42 ns

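The two L1 penalties can be checked the same way, reusing the 135 ns L2 miss penalty; only the block size changes between the instruction and the data cache:

# IL1 (32-byte blocks) and DL1 (16-byte blocks) miss penalties, in ns.
l2_hit_ns, l2_miss_rate, l2_miss_penalty_ns = 10.0, 0.2, 135.0
l1_bus_cycle_ns = 1e9 / 200e6                    # 5 ns per 128-bit bus cycle

def l1_miss_penalty(block_bytes):
    transfer = (block_bytes * 8 / 128) * l1_bus_cycle_ns
    return l2_hit_ns + l2_miss_rate * l2_miss_penalty_ns + transfer

print(l1_miss_penalty(32), l1_miss_penalty(16))  # 47.0 and 42.0 ns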
With the above miss penalties, we can calculate the average memory stall latency of the target system. Notice that the L1 data cache is write-through, write-allocate, and write-buffered. Therefore, besides the miss penalty for fetching a block from L2 on a write miss (DCache writes per instruction × DL1 miss rate × DL1 miss penalty), every write-through also goes to the write buffer. Since about 5% of writes cannot be absorbed by the write buffer, and a read/write request is satisfied only after the whole block has arrived, an additional delay for waiting for an evicted block in the write buffer to be written to L2 is required (DCache writes per instruction × write stall rate × DL1 miss penalty).

Memory stall latency (average)
  = ICache accesses per instruction × IL1 miss rate × IL1 miss penalty
    + DCache reads per instruction × DL1 miss rate × DL1 miss penalty
    + DCache writes per instruction × DL1 miss rate × DL1 miss penalty
    + DCache writes per instruction × DL1 write stall rate × DL1 miss penalty
  = 1 × 0.02 × 47 ns + 0.2 × 0.05 × 42 ns + 0.05 × 0.05 × 42 ns + 0.05 × 0.05 × 42 ns
  = 0.94 ns + 0.42 ns + 0.105 ns + 0.105 ns
  = 1.57 ns = 1.57 cycles

Finally, we have the overall CPI of the 1 GHz system:

CPI overall (1 GHz) = CPI without memory stalls + memory stall cycles = 0.7 + 1.57 = 2.27 cycles
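The same accounting in a few lines of Python (the access frequencies and miss rates are the ones given in the problem; the penalties come from the derivations above):

# Average memory stall time per instruction at 1 GHz (1 ns per clock cycle).
il1_penalty, dl1_penalty = 47.0, 42.0            # ns, from above
stall_ns = (1.00 * 0.02 * il1_penalty            # instruction fetches
            + 0.20 * 0.05 * dl1_penalty          # data reads
            + 0.05 * 0.05 * dl1_penalty          # data writes (write-allocate fill)
            + 0.05 * 0.05 * dl1_penalty)         # writes stalled on the write buffer
cpi_1ghz = 0.7 + stall_ns / 1.0                  # cycle time is 1 ns at 1 GHz
print(round(stall_ns, 2), round(cpi_1ghz, 2))    # 1.57 and 2.27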

2. We then consider the same memory hierarchy with the 2 GHz processor. Since the hierarchy is unchanged, the miss penalties in nanoseconds stay the same; we simply convert the average memory stall latency into clock cycles of the 2 GHz processor:

Memory stall latency (average, 2 GHz) = 1.57 ns = 3.14 cycles

The overall CPI of the 2 GHz system is:

CPI overall (2 GHz) = CPI without memory stalls + memory stall cycles = 0.7 + 3.14 = 3.84 cycles

The 2 GHz system is faster than the 1 GHz system by:

Speedup (2 GHz over 1 GHz) = (CPI overall (1 GHz) × cycle time (1 GHz)) / (CPI overall (2 GHz) × cycle time (2 GHz))
                           = (2.27 × 1 / (1 × 10^9)) / (3.84 × 1 / (2 × 10^9))
                           ≈ 1.18

You may also use Amdahl's Law to solve the problem: the fraction 0.7/2.27 of the execution time is spent executing rather than stalling, and only that fraction is sped up by the factor-of-2 clock increase, so the speedup of the 2 GHz system is:

Speedup (2 GHz over 1 GHz) = 1 / ((1 - 0.7/2.27) + (0.7/2.27) / 2) ≈ 1.18
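A short cross-check of the 2 GHz conversion and of both speedup formulations:

# CPI at 2 GHz and speedup over the 1 GHz system (cycle times in ns).
stall_ns = 1.57
cpi_1ghz, cycle_1 = 0.7 + stall_ns / 1.0, 1.0    # 1 GHz
cpi_2ghz, cycle_2 = 0.7 + stall_ns / 0.5, 0.5    # 2 GHz: 3.14 stall cycles
speedup = (cpi_1ghz * cycle_1) / (cpi_2ghz * cycle_2)
frac = 0.7 / cpi_1ghz                            # fraction sped up by the faster clock
amdahl = 1 / ((1 - frac) + frac / 2)
print(round(cpi_2ghz, 2), round(speedup, 2), round(amdahl, 2))   # 3.84 1.18 1.18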

Problem 3: 1. Suppose that a program repeatedly traverses N blocks sequentially in a loop, and the cache can hold only M blocks, where M < N < 2M. If the cache is direct-mapped, N - M blocks are mapped to sets that are shared with one other data block; each such pair of blocks evicts the other once per iteration, so the miss rate for this loop is about 2(N - M)/N. However, for a fully associative LRU cache that can also hold M blocks, whenever a miss happens, the requested block was already evicted by an earlier miss, because it was the least recently used block at that time, and the hit rate becomes 0. For example, consider a cache which can hold 4 blocks while the loop traverses five blocks in the order A, B, C, D, E. A requested block always misses, because it was always the least recently used block at the previous cache miss.

2. Let us try to explain the above phenomenon according to the definitions of the 3Cs. We first examine compulsory misses. A compulsory miss is a miss due to the very first access to a block. Under the conditions stated in the previous subproblem, both the direct-mapped and the fully associative cache hold the same number of blocks, so the number of compulsory misses is the same for the two caches. Regarding capacity misses, a capacity miss is a miss that would occur even when a fully associative cache is used; capacity cannot be the reason here, since we are indeed already using a fully associative cache. Finally, we examine conflict misses. According to the definition, conflict misses only happen in direct-mapped or set-associative caches, so conflict misses cannot be the reason either.

3. Most-recently-used (MRU) replacement can do better. For the case stated in subproblem 1, MRU tends to evict the block that was just accessed, so there are roughly N - M misses in every N cache accesses, and the miss rate can be reduced to about (N - M)/N.

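The contrast in part 1, and the improvement claimed in part 3, can be illustrated with a tiny simulation sketch of a fully associative cache holding M = 4 blocks while a loop traverses N = 5 blocks (the replacement-policy modeling here is our own illustration):

# Fully associative cache, cyclic access pattern: LRU vs MRU replacement.
def miss_rate(policy, capacity=4, blocks=5, iterations=1000):
    order = []            # least recently used at the front, most recent at the back
    misses = accesses = 0
    for _ in range(iterations):
        for blk in range(blocks):
            accesses += 1
            if blk in order:
                order.remove(blk)         # hit: just refresh the recency position
            else:
                misses += 1
                if len(order) == capacity:
                    victim = 0 if policy == "LRU" else -1
                    order.pop(victim)     # evict the LRU (front) or MRU (back) block
            order.append(blk)             # the accessed block becomes most recent
    return misses / accesses

print("LRU:", miss_rate("LRU"))   # -> 1.0, every access misses, as argued above
print("MRU:", miss_rate("MRU"))   # -> well below 1.0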
Problem 4: 1. With way prediction, the processor first fetches the tag of the predicted way from the L2 cache into the on-chip comparison circuitry and checks whether it is a hit. If the predicted way does not hit, the processor then fetches the tag of the other way. Under such a scheme, only enough pins to transmit one way's tag are required, since we compare only one tag at a time. A processor with a full complement of pins to the L2 cache allows the tags of both ways to be fetched into the on-chip comparison circuitry at the same time. Comparing the performance of way prediction against the full-complement package: when the way prediction is correct, the hit time is the same as with the full complement of pins, since we do not need to fetch the tag from the other way. However, if the predicted way does not contain the requested data, an extra delay for fetching the tag from the other way is required.

2. Without the help of way prediction, the on-chip logic can only compare the tags from L2 in a fixed order. Therefore, the probability that the processor needs to fetch the second tag for a hit becomes larger, and this results in a higher average L2 access time.

3. The number of entries required is equal to the number of sets:

Number of sets = Cache Size / (Block Size × Associativity) = 512 KB / (64 Bytes × 2) = 4K sets

Therefore, only 4K entries are required in this case. Since the MIPS R10K provides 8K entries for the way prediction table, this cache configuration is fully supported.

4. The number of entries required is again equal to the number of sets:

Number of sets = Cache Size / (Block Size × Associativity) = 4096 KB / (128 Bytes × 2) = 16K sets

The MIPS R10K provides only 8K entries for the way prediction table, half of the entries required by the traditional way prediction scheme. One option for way prediction under this space constraint is to borrow the idea of branch prediction based on global history: we add a register that keeps track of the last 13 way-access outcomes, and use this history as the index into the way prediction table, in which each entry stores the last outcome observed for that history pattern.

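One possible organization of the history-indexed predictor described above is sketched here; the 13-bit history register and the 8K-entry table follow the text, while the update policy of remembering the most recent way per history pattern is an illustrative assumption:

# Global-history way prediction: a 13-bit register of recent way outcomes
# indexes an 8K-entry table (2^13 = 8K, matching the R10K budget); each
# entry stores the way that was used the last time this pattern occurred.
HISTORY_BITS = 13
TABLE_SIZE = 1 << HISTORY_BITS          # 8192 entries

table = [0] * TABLE_SIZE                # predicted way (0 or 1) per history pattern
history = 0                             # last 13 way-access outcomes

def predict_way():
    return table[history]

def update(actual_way):
    """Record the outcome and shift it into the global history."""
    global history
    table[history] = actual_way
    history = ((history << 1) | actual_way) & (TABLE_SIZE - 1)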
Problem 5: 1. For a physical cache, every virtual address needs to be translated by the TLB before the cache can be accessed:

hit time = 5 ns + max((8 ns + 4 ns), 10 ns) + 2 ns = 19 ns

2. For a virtual cache, translation to a physical address is required only when the cache misses, so the TLB access is not on the critical path of a cache hit:

hit time = max((8 ns + 4 ns), 10 ns) + 2 ns = 14 ns

3. For a virtually indexed, physically tagged cache, the address translation can be performed in parallel with the tag and data lookup of the cache. Once both the physical page number and the tag are available, the cache compares them immediately to determine whether the access is a hit or a miss:

hit time = max((max(5 ns, 8 ns) + 4 ns), 10 ns) + 2 ns = 14 ns
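The three hit-time expressions can be evaluated directly (constants used exactly as they appear in the formulas above):

# Hit-time expressions from Problem 5, evaluated directly (values in ns).
physical = 5 + max(8 + 4, 10) + 2          # TLB serialized before the cache access
virtual  = max(8 + 4, 10) + 2              # no TLB on the hit path
vipt     = max(max(5, 8) + 4, 10) + 2      # translation overlapped with the lookup
print(physical, virtual, vipt)             # 19 14 14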
