
Dynamically Variable Line-Size Caches Exploiting High On-Chip Memory Bandwidth of Merged DRAM/Logic LSIs

Koji Inoue, Koji Kai, and Kazuaki Murakami
Mar. 1998

PPRAM-TR-33

Merged DRAM/logic LSIs could provide high on-chip memory bandwidth by interconnecting logic portions and DRAM with wider on-chip buses. For merged DRAM/logic LSIs with a memory hierarchy including cache memory, we can exploit such high on-chip memory bandwidth by replacing a whole cache line (or cache block) at a time on cache misses. This approach tends to increase the cache-line size if we attempt to improve the attainable memory bandwidth. Larger cache lines, however, might worsen the system performance if programs running on the LSIs do not have enough spatial locality of references and cache misses frequently take place. This paper proposes a novel cache architecture suitable for merged DRAM/logic LSIs, called the "dynamically variable line-size cache (D-VLS cache)", and evaluates the cost/performance improvements attainable by the D-VLS cache. The D-VLS cache can make good use of the high on-chip memory bandwidth by means of larger cache lines and, at the same time, alleviate the negative effects of larger cache-line size by partitioning the large cache line into multiple small cache lines and by adjusting the number of sublines to be involved in cache replacements. The line-size determinator for the D-VLS cache selects adequate line-sizes based on recently observed data reference behavior, and it can be implemented with only a small additional hardware cost. In our evaluation, it is observed that the performance improvement achieved by a direct-mapped D-VLS cache is more than 22% while it increases the hardware cost by only 17%, compared to a conventional direct-mapped cache with fixed 32-byte lines.
1 Introduction

Recent remarkable advances in VLSI technology have been increasing processor speed and DRAM size dramatically. However, these advances have also introduced a large and growing performance gap between processor and DRAM, a problem referred to as the "Memory Wall" [1][14], resulting in poor total system performance in spite of higher processor performance. Integrating processors and DRAM on the same chip, or merged DRAM/logic LSI, is a good approach to resolving the "Memory Wall" problem. Merged DRAM/logic LSIs provide high on-chip memory bandwidth by interconnecting processors and DRAM with wider on-chip buses. For merged DRAM/logic LSIs with a memory hierarchy including cache memory, we can exploit the high on-chip memory bandwidth by replacing a whole cache line (or cache block) at a time on cache misses [9][10][12]. This approach tends to increase the cache-line size if we attempt to improve the attainable memory bandwidth. In general, large cache lines can benefit application programs with rich spatial locality of references, because they provide the effect of prefetching. Larger cache lines, however, might worsen the system performance if programs do not have enough spatial locality and
Department of Computer Science and Communication Engineering, Graduate School of Information Science and Electrical Engineering, Kyushu University. †: Institute of Systems and Information Technologies/Kyushu. 6-1 Kasuga-Koen, Kasuga, Fukuoka 816, JAPAN. E-mail: ppram@c.csce.kyushu-u.ac.jp URL: http://kasuga.csce.kyushu-u.ac.jp/~ppram/


cache misses frequently take place. This kind of cache miss (i.e., conflict misses) could be reduced, assuming a constant cache size, by increasing the cache associativity. But this approach usually makes the cache access time longer.

There has been a proposal of the concept of the "variable line-size cache (VLS cache)" [9], which was originally devised for use in the reference PPRAM (PPRAM-R) but is applicable to any merged DRAM/logic LSIs. The VLS cache can make good use of the high on-chip memory bandwidth by means of larger cache lines and, at the same time, alleviate the negative effects of larger cache-line size by partitioning the large cache line into multiple small cache lines (sublines) and by adjusting the number of sublines to be involved in cache replacements according to the memory reference characteristics of programs. In [9], however, it was not discussed how the cache line-size, or the number of sublines, should be determined.

The performance of the VLS cache depends largely on whether or not cache replacements can be performed with adequate line-sizes. That is, inadequate line-sizes could produce large performance degradation. The prefetching effect, which could be given by larger cache lines, might be diminished if smaller cache lines were selected improperly. On the other hand, cache conflict misses, which could be reduced by using small cache lines, might increase if larger cache lines were chosen inadequately. There are at least two approaches to optimizing the cache-line sizes: one is static determination based on compiler analysis; the other is dynamic determination using special hardware support. It may be possible to adopt the former approach when target programs have regular access patterns within well-structured loops. However, a number of programs have non-regular access patterns. In addition, when a number of programs run concurrently, the amount of spatial locality will vary both within and across programs.

This paper proposes one of the latter approaches, the "dynamically variable line-size cache (D-VLS cache)" architecture, and evaluates the cost/performance improvements attainable by the D-VLS cache. The D-VLS cache changes its cache-line size at runtime according to the characteristics of target application programs. The line-size determinator for the D-VLS cache selects adequate line-sizes based on recently observed data reference behavior, and it can be implemented with only a small additional hardware cost. Since this scheme does not need any modification of instruction set architectures, full compatibility with existing object codes is maintained. In our evaluation, it is observed that the performance improvement achieved by a direct-mapped D-VLS cache is more than 22% while it increases the hardware cost by only 17%, compared to a conventional direct-mapped cache with fixed 32-byte lines.

The rest of this paper is organized as follows. Section 2 describes related work. Section 3 defines several terms and explains the principle of the VLS cache operations. Section 4 presents the organization of the D-VLS cache; in addition, an algorithm for detecting spatial locality and determining the adequate line-size is described. Section 5 presents some simulation results and discusses the cache access time, hardware cost, miss rate, and performance; moreover, we investigate the effects of cache associativity and cache capacity on the D-VLS cache performance. Section 6 concludes this paper.
2 Related Work

For the memory hierarchy on merged DRAM/logic LSIs, several studies have evaluated the effectiveness of data caches with large cache lines [10][12]. These caches increase the cache associativity in order to reduce conflict misses caused by frequent evictions of the large cache lines.1 Although this is a good approach to avoiding cache conflicts, increasing the cache associativity makes the cache access time longer [5][13]. On the other hand, the VLS cache attempts to avoid cache conflicts by changing its cache-line size, and this approach does not make the cache access time longer.

Several studies have proposed coherent caches in order to improve the performance of shared-memory multiprocessor systems [3][2]. The cache proposed in [3] can adjust the amount of data stored in a cache line, and aims to produce fewer invalidations of shared data and reduce bus or network transactions. On the other hand, the VLS cache aims at improving the system performance of merged DRAM/logic LSIs by partitioning a large cache line into multiple independent small cache sublines, and adjusting the number of sublines to be enrolled in cache replacements. The fixed and adaptive sequential prefetching proposed in [2] allows fetching more than one consecutive cache line. The cache presented in [2] has a counter indicating the number of lines to be fetched. Regardless of the value of the reference address, this counter is consulted for fetching on read misses. On the other hand, the D-VLS cache has several flags indicating the cache-line size. Which flag should be consulted depends on the value

1 The data cache presented in [10] also attempts to compensate for a disadvantage of large lines using a victim cache.


of the current reference address. In other words, the D-VLS cache can change the cache-line size not only along the advance of program execution but also across different data.

Excellent cache architectures exploiting both spatial and temporal locality have been proposed in [7] and [4]. The cache presented in [7] needs a table, called the MAT, for recording the memory access history of not only cached data but also data evicted from the cache. Similarly, the cache presented in [4] has a table for recording the situations of past load/store operations. In addition, the detection of spatial locality in [4] relies on the memory access behavior of constant-stride vectors. On the other hand, the D-VLS cache determines a suitable cache-line size based only on the state of the line which is currently being accessed by the processor. Consequently, the D-VLS cache has no large tables for recording the memory access history. (A single bit is added to each tag field for recording the memory access history.) Furthermore, the D-VLS cache attempts to make good use of the high on-chip memory bandwidth available on merged DRAM/logic LSIs. Since the high on-chip memory bandwidth allows transferring any amount of data (up to the width of the on-chip memory bus) at a time, the D-VLS cache can always complete a cache replacement in a constant time regardless of the cache-line size selected.
3 Variable Line-Size Cache

3.1 Terminology

In the VLS cache, an SRAM cell array for caching and a DRAM cell array for main memory are divided into several subarrays. The number of SRAM subarrays equals that of DRAM subarrays. Data transfer for cache replacement is performed between corresponding SRAM and DRAM subarrays. Figure 1 summarizes the definitions of terms. An address-block, or subline, is a block of data associated with a single tag in the cache. A transfer-block, or line, is a block of data transferred at once between the cache and main memory. The address-blocks from every SRAM subarray which have the same cache-index form a cache-sector, or just a sector. A cache-sector and an address-block which are being accessed during a cache lookup are called a reference-sector and a reference-subline, respectively. When a memory reference from the processor hits in the cache, the referenced data resides in the reference-subline. Otherwise, the referenced data is not in the reference-subline but only in the main memory. A memory-sector is a block of data in the main memory, and corresponds to the cache-sector. An adjacent-subline is defined as follows:

- It resides in the reference-sector, but is not the reference-subline.
- Its home location in the main memory is in the same memory-sector as that of the data which is currently being referenced by the processor.
- It has been referenced at least once since it was fetched into the cache.
For example, the VLS cache in Figure 1 has four SRAM subarrays. Assuming that the size of the address-block is 32 bytes, both the cache-sector size and the memory-sector size are 128 bytes. The VLS cache is connected to the main memory by means of a 128-byte (= 32 bytes × 4) on-chip memory bus.

Figure 1: Terms for VLS Caches
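To make the adjacent-subline test concrete, the following C sketch phrases the three conditions above as a predicate over hypothetical per-subline state (the struct and its field names are our own naming for illustration, not the paper's hardware):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-subline cache state for one SRAM subarray entry. */
typedef struct {
    uint32_t tag;       /* tag of the cached address-block            */
    bool     valid;     /* address-block holds valid data             */
    bool     ref_flag;  /* set once the block is referenced again     */
} subline_t;

/* A subline in the reference-sector is an adjacent-subline when:
 *  (1) it is not the reference-subline itself,
 *  (2) its tag matches the referenced address (same memory-sector), and
 *  (3) it has been referenced since it was fetched (ref_flag set).   */
static bool is_adjacent_subline(const subline_t *s, uint32_t ref_tag,
                                int subarray, int ref_subarray)
{
    return subarray != ref_subarray &&
           s->valid && s->tag == ref_tag &&
           s->ref_flag;
}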
3.2 Principle of Operations

The VLS cache has two goals for improving the system performance of merged DRAM/logic LSIs: one is to exploit the spatial locality of references inherent in target programs; the other is to reduce the number of cache conflicts caused by large cache lines without making the cache access time longer. To achieve these goals at the same time, the VLS cache adjusts its transfer-block size according to the characteristics of target programs. When target programs have rich spatial locality, for example, the VLS cache would determine to use larger transfer-blocks, each of which consists of several address-blocks, and could obtain the effect of prefetching. In this case, the high on-chip memory bandwidth is utilized to its maximum, and the advantage of spatial locality is positively exploited. Conversely, when target programs have poor spatial locality, the VLS cache would determine to use smaller transfer-blocks, each of which consists of a single address-block, and could try to avoid cache conflicts. In this case, the high on-chip memory bandwidth is underutilized, but the frequent evictions caused by large cache lines can be avoided. Since the VLS cache tries to avoid cache conflicts not by increasing the cache associativity but by adjusting the transfer-block size to a small one, the access time of the VLS cache (i.e., hit time)

PPRAM-TR-33

| Kyushu University

is shorter than that of conventional caches with higher associativity (we will discuss this issue later in Section 5.1). The VLS cache works as follows:

1. When a memory access takes place, the cache tag array is looked up in the same manner as in normal caches, except that every SRAM subarray has its own tag memory and the lookup is performed on every tag memory.

2. On a cache hit, the hit address-block has the required data, and the memory access is performed on this address-block in the same manner as in normal caches.

3. On a cache miss, a cache refill takes place as follows:

(a) According to the designated transfer-block size, one or more address-blocks are written back from the indexed cache-sector to their home locations in the DRAM main memory.

(b) According to the designated transfer-block size, one or more address-blocks (one of which contains the required data) are fetched from the memory-sector to the cache-sector.

For the example D-VLS cache shown in Figure 1, there are fifteen (= 2^4 − 1) possible combinations of address-blocks which could be involved in a cache replacement. Since treating all these combinations is quite complex, we place some restriction on the combinations to consider, as follows:

2 We assume that the cache employs the write-back policy.


Since the example D-VLS cache in Figure 1 has four (= 2^2) subarrays for the SRAM cache and DRAM main memory, we introduce the following three transfer-block sizes:
- Minimum transfer-block size (= 32 bytes), involving just one (= 2^0) address-block,
- Medium transfer-block size (= 64 bytes), involving two (= 2^1) address-blocks, and
- Maximum transfer-block size (= 128 bytes), involving four (= 2^2) address-blocks.
According to the designated transfer-block size, the address-blocks to be involved in a cache replacement are determined as follows:

- For the minimum transfer-block size, only the designated address-block is involved in the cache replacement (see Figure 2 (a)).
- For the medium transfer-block size, the designated address-block and one of its neighbors in the corresponding sector are involved (see Figure 2 (b)).
- For the maximum transfer-block size, the designated address-block and all of its neighbors in the corresponding sector are involved (see Figure 2 (c)).

Thus, the number of combinations of address-blocks to be involved in a cache replacement is just seven rather than fifteen. The performance of the VLS cache depends heavily on how well cache replacements are performed with the optimal transfer-block size. However, the amount of spatial locality may vary both within and across program executions. There are at least two approaches to optimizing the transfer-block size: static and dynamic. In the static approach, based on static program analysis, a compiler might specify the transfer-block size at any granularity (e.g., program by program, procedure by procedure, code by code, data by data, and so on). This could work well for some categories of programs, such as those with regular access patterns within well-structured loops. However, a number of programs have non-regular access patterns, hence it is worthwhile to consider the other, dynamic, approach. The rest of the paper discusses one such approach, which we call the dynamic VLS, or D-VLS, cache.

Figure 2: Three Different Transfer-Block Sizes on Cache Replacement
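The restricted combinations can be made concrete with a small sketch. The following C function computes, as a 4-bit mask over subarrays 0-3, which address-blocks join a replacement; the aligned pairing used for the medium size (subarrays 0-1 and 2-3) is our assumption, since the text only says "one of its neighbors":

/* Which of the four sublines (subarrays 0..3) take part in a cache
 * replacement, for the three transfer-block sizes of Section 3.2.    */
enum line_size { LINE_32B, LINE_64B, LINE_128B };

static unsigned replacement_mask(enum line_size sz, unsigned ref_subarray)
{
    switch (sz) {
    case LINE_32B:  return 1u << ref_subarray;          /* 4 choices  */
    case LINE_64B:  return 3u << (ref_subarray & ~1u);  /* 2 choices  */
    case LINE_128B: return 0xFu;                        /* 1 choice   */
    }
    return 0;
}
/* 4 + 2 + 1 = 7 combinations in total, matching the text above.      */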
4 Dynamically VLS Cache

4.1 Hardware Organization

Figure 3 illustrates the block diagram of a sample direct-mapped D-VLS cache with four subarrays. The address-block size is 32 bytes, and three transfer-block sizes (32 bytes, 64 bytes, and 128 bytes) are provided. The cache lookup for determining a cache hit or miss is carried out as follows:

1. The address generated by the processor is divided into the following four fields: the byte offset within an address-block, the subarray field designating the subarray, the index field used for indexing the tag memory, and the tag field.

2. Each cache subarray has its own tag memory and comparator, and each can perform the tag-memory lookup using the index and tag fields independently of the others.

3. One of the tag-comparison results is selected by the subarray field of the address, and then the cache hit or miss is determined.

As can be seen from Figure 3, the organization and cache lookup process of a direct-mapped D-VLS cache with four subarrays and 32-byte address-blocks are quite similar to those of a normal 4-way set-associative cache with 32-byte lines. The subarray and cache-sector of D-VLS caches correspond to the way and set of set-associative caches, respectively. The differences are as follows:

- The subarrays of D-VLS caches are indexed explicitly by addresses, while the ways of set-associative caches are not.



- The maximum transfer-block size of D-VLS caches is given by the number of subarrays × the address-block size (e.g., 128 bytes = 4 × 32 bytes), while that of set-associative caches is equal to or less than the line size (e.g., ≤ 32 bytes).

In addition, D-VLS caches provide the following for optimizing the transfer-block sizes at runtime:

- A reference-flag bit per address-block: This flag bit is reset to 0 when the corresponding address-block is fetched into the cache, and is set to 1 when the address-block is accessed by the processor again. It is used for determining whether the corresponding address-block is an adjacent-subline. On a cache lookup, if the tag of the corresponding address-block matches the tag field of the address and the reference-flag bit is 1, then the address-block is an adjacent-subline.

- A line-size specifier (LSS) per cache-sector: This specifies the transfer-block size of the corresponding cache-sector. As described in Section 3.2, each cache-sector is in one of three states: minimum, medium, and maximum transfer-block-size states. To identify these states, every cache-sector provides 2 bits of state information as the line-size specifier.

- A line-size determinator (LSD): On every cache lookup, the LSD determines the state of the line-size specifier of the reference-sector (i.e., the cache-sector indexed by the address used for the lookup). The algorithm is given in the next section.

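To make the lookup concrete, here is a sketch in C of the address decomposition, assuming the 16 KB configuration evaluated in Section 5 (32-byte address-blocks, four subarrays, hence 128 cache-sectors). The field widths follow from those parameters; the exact bit layout is our assumption for illustration:

#include <stdint.h>

#define OFFSET_BITS   5   /* 32-byte address-block        */
#define SUBARRAY_BITS 2   /* 4 subarrays                  */
#define INDEX_BITS    7   /* 16 KB / 128-byte sector      */

typedef struct {
    uint32_t offset;    /* byte offset within the address-block */
    uint32_t subarray;  /* which SRAM subarray to probe         */
    uint32_t index;     /* which cache-sector (tag-memory row)  */
    uint32_t tag;       /* compared against the stored tag      */
} dvls_addr_t;

static dvls_addr_t split_address(uint32_t addr)
{
    dvls_addr_t a;
    a.offset   = addr & ((1u << OFFSET_BITS) - 1);
    a.subarray = (addr >> OFFSET_BITS) & ((1u << SUBARRAY_BITS) - 1);
    a.index    = (addr >> (OFFSET_BITS + SUBARRAY_BITS))
                 & ((1u << INDEX_BITS) - 1);
    a.tag      = addr >> (OFFSET_BITS + SUBARRAY_BITS + INDEX_BITS);
    return a;
}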

Figure 3: Block Diagram of Direct-Mapped D-VLS Cache


4.2 Algorithm for Line-Size Determinator

On every cache lookup, the LSD determines the state of the line-size specifier of the reference-sector as follows:

1. The LSD investigates how many adjacent-sublines exist in the reference-sector, using all the reference-flag bits and the tag-comparison results.

2. Based on the result of this investigation and the current state of the line-size specifier, the LSD determines the next state of the line-size specifier.

The state-transition diagram is shown in Figure 4. This state-transition rule was derived from the following heuristics:

- If there are many adjacent-sublines, the reference-sector has good spatial locality, because the data which is currently being referenced by the processor and the adjacent-sublines were fetched from the same memory-sector, and they have been accessed by the processor recently.3 In this case, the transfer-block size should become larger, so the state transits from the minimum state to the medium state or from the medium state to the maximum state.

- In contrast, if the reference-sector has been accessed only sparsely before the current access, there should be few adjacent-sublines in the reference-sector. This means that the reference-sector has poor spatial locality at that time. In this case, the transfer-block size should become smaller, so the state transits from the maximum state to the medium state or from the medium state to the minimum state.

3 On hits, the referenced data resides in the reference-subline. On misses, the referenced data is fetched into the reference-subline after the cache replacement.

Figure 4: State Transition Diagram
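A minimal sketch of the LSD's decision, in C, under the heuristics above. Figure 4 labels the remaining transitions "other patterns"; the concrete thresholds chosen here (all three non-reference sublines adjacent to grow, none to shrink) are our assumption, not taken from the paper:

enum lss_state { LSS_MIN, LSS_MEDIUM, LSS_MAX };  /* 32B, 64B, 128B   */

static enum lss_state lsd_next_state(enum lss_state cur, int n_adjacent)
{
    /* n_adjacent: adjacent-sublines among the three non-reference
     * sublines, counted from the tag-comparison results and the
     * reference-flag bits.                                            */
    if (n_adjacent == 3) {            /* rich spatial locality: grow   */
        return (cur == LSS_MIN) ? LSS_MEDIUM : LSS_MAX;
    }
    if (n_adjacent == 0) {            /* sparse accesses: shrink       */
        return (cur == LSS_MAX) ? LSS_MEDIUM : LSS_MIN;
    }
    return cur;                       /* other patterns: keep the size */
}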
5 Evaluations

In this section, we discuss the effectiveness of the D-VLS cache. Before presenting the performance improvements produced by the D-VLS cache, we consider the cache access time and the hardware costs. In these evaluations, we allow cache-sectors to share a single LSS in order to reduce the hardware overhead. We denote the conventional and D-VLS caches as follows:

- F32D, F64D, F128D: Conventional direct-mapped caches with fixed line sizes of 32 bytes, 64 bytes, and 128 bytes, respectively.
- V128-32D1, V128-32D8: Direct-mapped D-VLS caches which have three line sizes: 32 bytes, 64 bytes, and 128 bytes. The final numerals, 1 and 8, indicate the number of cache-sectors which share a single LSS.

Unless stated otherwise, we assume that the size of each cache is 16 KB. (We discuss the effects of associativity and capacity on the D-VLS cache performance in Sections 5.3.5 and 5.3.6, respectively.) In addition, it is assumed that each cache has a wide on-chip memory bus between the cache and main memory which operates synchronously with the processor clock. Therefore, the data transfer for a line between the cache and main memory completes in a single processor clock cycle. (In this section, we simply call a "transfer-block" a "line".)

Figure 5: Critical Timing Paths on Conventional and D-VLS Caches: (a) a conventional direct-mapped cache with 128-byte lines; (b) a conventional 4-way set-associative cache with 32-byte lines; (c) a direct-mapped D-VLS cache having three line-sizes

5.1 Access Time and Cycle Time

Cache access time, or hit time, is very sensitive to cache organization. Figure 5 illustrates the critical timing paths of conventional and D-VLS caches. In Figure 5, MatchOut and DataOut are outputs of the caches, both of which are driven by output drivers (MatchDriver and DataDriver). The access time consists of the delays of the decoder, tag read, data read, comparators, multiplexor drivers, multiplexors, and output drivers [13]. In conventional direct-mapped caches, as shown in Figure 5(a), the access time is determined by either the TagSide-Path or the DataSide-Path, while the longer of the MuxSide-Path and the DataSide-Path determines the access time of set-associative caches, as shown in Figure 5(b).

The structure of the D-VLS cache is similar to that of a 4-way set-associative cache with 32-byte lines, as shown in Figure 5(b) and (c). In conventional set-associative caches, the MuxSide-Path often determines the access time, because the control signals for selecting a word of data, i.e., the activation of the multiplexor drivers, are generated after the tag comparisons. However, this critical path does not appear in the direct-mapped D-VLS cache, because the control signals for the data selection are generated directly from the reference address.

In the D-VLS cache, the delay for changing line sizes does not appear on the TagSide-Path or the DataSide-Path. If LSS were realized by an SRAM array, however, two accesses from the LSD to the LSS (one for reading the current line size and one for writing the changed line size) would make the cache cycle time longer, because it is very hard for an SRAM array to complete the two accesses, one read and one write, in a single processor clock cycle. There are two methods to resolve this problem: one is pipelining the accesses to the LSS, and the other is to implement the LSS using flip-flops instead of an SRAM array. In this paper, the latter approach is employed for the implementation of the LSS, because the former approach makes the structure of the LSS more complex and the control for changing the line size harder.
5.2 Hardware Cost

A cache consists mainly of an SRAM portion (data-array and tag-array) and several logic portions (decoders, comparators, and multiplexors). Additionally, the D-VLS cache requires special hardware support, the LSS and the LSD, for adjusting the line size to the memory access behavior of programs. For F32D, V128-32D1, and V128-32D8, we calculated the size of the SRAM portion, and designed the logic portions in order to find the number of transistors. In this design, each cache was described at the RT level in VHDL (VHSIC Hardware Description Language) and translated to a gate-level description by the Synopsys VHDL Compiler. For V128-32D1 and V128-32D8, each tag field includes the 1-bit reference-flag. Since each D-VLS cache has 128 sectors (16 KB / 128 bytes), V128-32D1 and V128-32D8 require 256 bits (2 bits × 128) and 32 bits (2 bits × 16) of flip-flops, respectively.


The LSD is independent of the number of sectors which share the transfer-block size information, and can be composed of simple combinational logic thanks to the simple state machine for changing the line size, as explained earlier in Section 4.2. For example, a condition which causes a state transition of the line-size state machine can be detected by an AND gate with five inputs (one for the current line size and four for the locations of the reference-subline and adjacent-sublines). Table 1 shows the capacity of the tag-array and the number of transistors for each logic portion. In all caches, the size of each data-array is 16 KB. The right-most column gives the total number of transistors including the data-array, where 2 bits of SRAM are assumed to cost one transistor.4

V128-32D8 can be implemented with about 17% higher hardware cost than F32D. Although V128-32D8 requires more transistors than the conventional cache, the hardware overhead may be trivial for merged DRAM/logic LSIs, which have not only the cache but also main memory (DRAM arrays) on the same chip. However, V128-32D1 incurs a hardware overhead of about 32% over F32D: V128-32D1 requires a number of multiplexors in order to select a single LSS, and the delay of the large multiplexor may make the cache cycle time longer. Hence, we regard V128-32D1 and V128-32D8 as an ideal model and a realistic model, respectively.

4 "Overall Roadmap Technology Characteristics" in [11] shows that the ratio of LogicTransistors/cm^2 to CacheSRAMBits/cm^2 from 2001 to 2007 is approximately 1:2.

Table 1: Hardware Costs of Various Cache Models

Model      | Tag [bits] | Logic [Tr] | LSD [Tr] | LSS [Tr] | HW Cost [Tr]
F32D       |      9,216 |      8,354 |        - |        - |       78,498
V128-32D1  |      9,728 |     18,922 |      230 |   14,020 |      103,572
V128-32D8  |      9,728 |     18,922 |      230 |    2,056 |       91,608
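Under the stated assumption that 2 bits of SRAM cost one transistor, the right-most column is consistent with a 16 KB data-array (65,536 transistor-equivalents) plus half the tag bits plus the logic portions; a minimal check:

#include <assert.h>

int main(void)
{
    /* 16 KB data array at 2 SRAM bits per transistor-equivalent */
    const int data_tr = 16 * 1024 * 8 / 2;                         /* 65,536 */
    assert(data_tr + 9216 / 2 + 8354                ==  78498);    /* F32D      */
    assert(data_tr + 9728 / 2 + 18922 + 230 + 14020 == 103572);    /* V128-32D1 */
    assert(data_tr + 9728 / 2 + 18922 + 230 + 2056  ==  91608);    /* V128-32D8 */
    return 0;
}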

5.3 Performance

5.3.1 Benchmarks

In our experiments, two integer programs and a floating-point program from the SPEC92 benchmark suite are executed using SPEC reference inputs. In addition, seven integer programs and four floating-point programs from the SPEC95 benchmark suite are executed using SPEC training inputs and SPEC test inputs, respectively. We also simulated the mpeg2encode and mpeg2decode programs from [8] using verification pictures. These programs were compiled by GNU CC with the "-O2" option, and executed on the UltraSPARC architecture. Furthermore, in order to model more realistic execution on general-purpose processors, four benchmark program sets are used, each of which consists of three programs, as follows:

- mix-int1: 072.sc, 126.gcc, and 134.perl.
- mix-int2: 124.m88ksim, 130.li, and 147.vortex.
- mix-fp: 052.alvinn, 101.tomcatv, and 103.su2cor.
- mix-intfp: 132.ijpeg, 099.go, and 104.hydro2d.

The programs in each set are assumed to run concurrently on a uni-processor system, and a context switch occurs per execution of one million instructions. Mix-int1 and mix-int2 contain integer programs only, while mix-fp consists of three floating-point programs. Mix-intfp is formed by two integer programs and one floating-point program. We captured address traces of each set over the execution of three billion instructions.
5.3.2 Performance Model

Miss rate is the most popular metric of cache performance. However, it is very important to consider not only the miss rate but also the miss penalty, or MP. We introduce a performance model including MP for stricter evaluation. Execution time, or ET, can be approximated as follows:

    ET = IC × CPI × CCT
       = IC × (CPI_ideal + CPI_mem) × CCT

where IC, CPI, and CCT are the instruction count, clocks per instruction, and clock cycle time, respectively. As shown in the above formula, CPI has two parts. One is the CPI of a processor with an ideal memory system, in which any load/store operation always completes in a single processor clock cycle; the other is the memory access penalty per instruction resulting from the real memory system. The former and the latter are referred to as CPI_ideal and CPI_mem, respectively. CPI_mem can be represented as follows:

    CPI_mem = RPI × MR × MP
            = RPI × MR × 2 × (DRAM_stup + ⌈LineSize / BusWidth⌉)

where RPI and MR are the references per instruction and the miss rate, respectively. When the cache detects a miss, two accesses to main memory (one for write-back and one for fetch) are devoted to the cache replacement. As shown in the above formula, MP has two parts: the latency of an access to the main memory (DRAM_stup) and the transfer time for the cache line (⌈LineSize / BusWidth⌉). LineSize and BusWidth denote the line size (transfer-block size) and the width of the bus between the cache and main memory, respectively. As long as LineSize stays within BusWidth, and assuming a constant DRAM_stup, MP does not grow. Technological trends have produced a large and growing gap between processor speed and DRAM speed [1][14]. Hence, we consider that ET will become more sensitive to CPI_mem in future processors. Consequently, we evaluate the D-VLS cache using CPI_mem as the cache performance metric. In this section, it is assumed that the processor speed and DRAM start-up time are 200 MHz and 40 ns, respectively.
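As a worked sketch, the following C function plugs the section's assumptions into the CPI_mem formula: a 200 MHz clock gives a 5 ns cycle, so the 40 ns DRAM start-up time costs 8 cycles, and the 128-byte on-chip bus transfers any supported line in one cycle, giving MP = 2 × (8 + 1) = 18 cycles for every line size. The RPI and miss-rate values in main() are placeholders, not measurements from the paper:

#include <stdio.h>

/* CPI_mem = RPI * MR * 2 * (DRAM_stup + ceil(LineSize/BusWidth)) */
static double cpi_mem(double rpi, double miss_rate,
                      int dram_startup_cycles,
                      int line_size, int bus_width)
{
    int transfer = (line_size + bus_width - 1) / bus_width;  /* ceiling */
    int mp = 2 * (dram_startup_cycles + transfer);  /* write-back + fetch */
    return rpi * miss_rate * mp;
}

int main(void)
{
    /* 40 ns start-up / 5 ns cycle = 8 cycles; on-chip 128-byte bus. */
    printf("CPI_mem = %.3f\n", cpi_mem(0.3, 0.05, 8, 128, 128));
    return 0;
}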
5.3.3 Miss Rates

We measured miss rates for each benchmark program using two cache simulators written in C: one for conventional caches with fixed line-sizes, and the other for D-VLS caches with 32-byte, 64-byte, and 128-byte variable line-sizes. These simulators are fed address traces captured by QPT [6]. Figure 6 presents the simulation results. The left three bars for each benchmark program are the miss rates given by conventional caches with 32-byte, 64-byte, and 128-byte fixed line-sizes, while the remaining bars to the right are the results given by the D-VLS caches, V128-32D1 and V128-32D8. For each benchmark program, simulation results are normalized to the miss rate produced by the conventional cache with the best line-size. As shown in Figure 6, the best line-size is highly application-dependent. For more than half of all benchmark programs, the miss rates of V128-32D1 and V128-32D8 are nearly equal to or lower than that of a conventional cache with the best line-size. Especially for some benchmark programs, 132.ijpeg, 134.perl, 052.alvinn, and 104.hydro2d, the D-VLS caches have significant performance advantages over conventional caches. In all other benchmark programs but one (072.sc), the D-VLS caches produce better results than a conventional cache with the second-best line-size. For floating-point benchmark programs, V128-32D1 always produces better performance than V128-32D8. However, for integer benchmark programs, the miss rates of V128-32D1 are not always lower than those of V128-32D8.
5.3.4 Performance Improvements

We calculated CPI_mem, as described earlier in Section 5.3.2, for both a conventional cache with a 32-byte fixed line-size (F32D) and two D-VLS caches with 32-byte, 64-byte, and 128-byte variable line-sizes (V128-32D1 and V128-32D8). In order to compare the performance of the D-VLS cache with that of a traditional cache which is separated from main memory, we use a new cache model denoted F32D-NB (Narrow Bus). F32D-NB is the same as F32D except that it does not have high on-chip memory bandwidth. We assume that F32D-NB has a 64-bit off-chip narrow memory bus which operates synchronously with the processor clock, so the data transfer time for a line on F32D-NB is four processor clock cycles (32 bytes / 64 bits). In addition, in order to clarify the effectiveness of the D-VLS cache, we define a doubled cache model denoted F32D-double, which is the same as F32D except that its capacity is 32 KB. Furthermore, in order to examine the effects of the dynamically variable line-size, we simulated a VLS cache denoted V128-32D-static, which has three line sizes and also the high on-chip memory bandwidth, like V128-32D1 and V128-32D8. While the D-VLS cache determines an adequate line-size based on dynamic analysis, V128-32D-static determines it based on static analysis. For V128-32D-static, target programs are analyzed individually by three prior simulations, each assuming a conventional cache with a 32-byte, 64-byte, or 128-byte fixed line-size. V128-32D-static executes each program with a single line-size, which is the best line-size for the program, based on the results of the prior simulations. When a context switch occurs, V128-32D-static changes the line size. Namely, V128-32D-static adjusts the line size across programs, whereas V128-32D1 and V128-32D8 can detect the amount of spatial locality and adjust the line size both within and across programs.

Figure 6: Miss Rates
Figure 7 presents the simulation results for the benchmark program sets mix-int1, mix-int2, mix-fp, and mix-intfp. The "average" shown in Figure 7 is the average of CPI_mem for the four benchmark program sets. For each benchmark program set, all results are normalized to the CPI_mem of F32D. For all benchmark program sets, the D-VLS caches (V128-32D1 and V128-32D8) have a significant performance advantage over F32D. Especially for mix-fp and mix-intfp, the CPI_mem of V128-32D1 is smaller than that of F32D by more than 30%. Moreover, in all but one benchmark program set (mix-int1), the performance improvements achieved by the D-VLS caches are almost the same as or larger than that achieved by V128-32D-static. On average, the CPI_mem of V128-32D1 and V128-32D8 is about 28% and 22% smaller than that of F32D, respectively. Also, the D-VLS caches produce almost the same or better performance than the doubled conventional cache (F32D-double). The performance benefits of the high on-chip memory bandwidth and the variable line-size are very dramatic; for example, on average the CPI_mem of V128-32D1 is 45.3% smaller than that of F32D-NB.


Figure 7: CPI_mem for Direct-Mapped Caches

5.3.5 Associativity

In this section, we investigate the effects of cache associativity on the D-VLS cache performance. Figure 8 presents CPI_mem for the four benchmark program sets on 2-way set-associative conventional caches (F128S, F32S, and F32S-double) and VLS caches (V128-32S-static, V128-32S1, and V128-32S8). For each benchmark program set, all results are normalized to the CPI_mem of F32S. On average, the D-VLS caches (V128-32S1 and V128-32S8) produce significant performance improvements over F32S. However, F128S reduces CPI_mem from F32S by about 20%, and its performance is better than that of the doubled F32S (F32S-double) in spite of its fixed line-size. The D-VLS cache aims at producing a performance improvement by obtaining the effect of prefetching given by large lines and, at the same time, eliminating conflict misses using small sublines. Even though F128S has a fixed line-size, it naturally achieves what the D-VLS cache attempts: F128S has large lines to obtain the effect of prefetching and, at the same time, reduces conflict misses not by using small sublines but by the higher associativity. Hence, when we assume that a D-VLS cache and a conventional large-line cache have the same cache associativity,5 increasing the cache associativity reduces the performance improvement produced by the D-VLS cache. However, the effectiveness of the D-VLS caches remains even if the caches have 2-way set-associativity. On average, as shown in Figure 8, V128-32S1 and V128-32S8 produce better performance than F128S by 26.2% and 16.9%, respectively.

5 It is also assumed that the maximum line-size of the D-VLS cache is equal to the conventional cache's line size.

Figure 8: CPI_mem versus Associativity

5.3.6 Cache Size

Generally, in conventional caches, the adequate line-size depends not only on the amount of spatial locality inherent in target programs but also on the cache size. When the cache size is very small, the number of sublines (address-blocks) is very small. In this case, the negative effect of large lines, which is caused by frequent evictions, exceeds the effect of prefetching. Hence, the adequate line-size becomes small in order to reduce conflict misses. In contrast, when the cache has enough capacity to avoid cache conflicts, large lines give a significant performance improvement by virtue of the effect of prefetching.


In order to investigate the effect of the cache size on the D-VLS cache performance, we simulated conventional direct-mapped caches and direct-mapped D-VLS caches, varying the capacity from 4 KB to 128 KB. Figure 9 presents the average of CPI_mem for the four benchmark program sets, mix-int1, mix-int2, mix-fp, and mix-intfp. V128-32D1 and V128-32D8 are superior to the conventional caches across the whole range of cache sizes from 4 KB to 128 KB. Figure 10 shows the breakdown of cache replacements attributed to the replaced line-sizes, as a percentage of the total number of cache replacements for all benchmark program sets. As shown in Figure 10, V128-32D8 can certainly adjust the line size to the varying cache size. When the cache size is small, for example, almost all cache replacements are performed with 32-byte small lines, while a number of cache replacements with 128-byte large lines appear as the cache capacity increases. The D-VLS cache attempts to improve the performance by exploiting the differences in adequate line-sizes across and within target programs. However, since increasing the cache capacity enlarges the adequate line-size, the differences in adequate line-sizes become smaller. As a result, the performance improvement produced by the D-VLS cache becomes small when the cache has enough capacity for the target programs.

Figure 9: CPI_mem versus Cache Size

Figure 10: Breakdown of Replaced Lines (V128-32D8)
6 Conclusions

In this paper, we have described the variable line-size cache (VLS cache) in detail, a novel cache architecture suitable for merged DRAM/logic LSIs. In addition, we have presented a realistic VLS cache, called the dynamically variable line-size cache (D-VLS cache). The D-VLS cache adjusts its line-size



to the characteristics of target application programs, and the line size is modified by the line-size determinator (LSD). The LSD can find adequate line-sizes according to the amount of spatial locality inherent in the target programs. Experimental results showed that a 16 KB D-VLS cache which requires more than 30% hardware overhead reduces the memory access penalty per instruction (CPI_mem) by about 28% compared with a 16 KB conventional cache. In addition, a realistic 16 KB D-VLS cache improves CPI_mem by about 22% while it increases the hardware cost by only 17%, compared to the 16 KB conventional cache. We have also investigated the effects of cache associativity on the D-VLS cache performance; the 2-way set-associative D-VLS caches give better results than conventional 2-way set-associative caches. Furthermore, we observed the performance improvements achieved by the D-VLS caches while varying the cache size from 4 KB to 128 KB. Integrating processors and main memory produces great performance improvements by virtue of the high on-chip memory bandwidth, and merged DRAM/logic LSIs will become core devices of future system LSIs. The VLS cache is applicable to any merged DRAM/logic LSI, so we believe that cache management using the VLS cache is very useful for improving future system LSIs.
References

[1] Burger, D., Goodman, J. R., and Kägi, A., "Memory Bandwidth Limitations of Future Microprocessors," Proc. of the 23rd Annual International Symposium on Computer Architecture, pp.78-89, May 1996.


[2] Dahlgren, F., Dubois, M., and Stenstrom, P., "Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors," Proc. of the 1993 International Conference on Parallel Processing, pp.56-63, Aug. 1993.

[3] Dubnicki, C., and LeBlanc, T. J., "Adjustable Block Size Coherent Caches," Proc. of the 19th Annual International Symposium on Computer Architecture, pp.170-180, May 1992.

[4] Gonzalez, A., Aliagas, C., and Valero, M., "A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality," Proc. of the International Conference on Supercomputing, pp.338-347, July 1995.

[5] Hill, M. D., "A Case for Direct-Mapped Caches," IEEE Computer, vol.21, no.12, pp.25-40, Dec. 1988.

[6] Hill, M. D., Larus, J. R., Lebeck, A. R., Talluri, M., and Wood, D. A., "WARTS: Wisconsin Architectural Research Tool Set," http://www.cs.wisc.edu/~larus/warts.html, University of Wisconsin - Madison.

[7] Johnson, T. L., Merten, M. C., and Hwu, W. W., "Run-time Spatial Locality Detection and Optimization," Proc. of the 30th Annual International Symposium on Microarchitecture, pp.?-?, Dec. 1997.


[8] MPEG Software Simulation Group, "Free MPEG Software: MPEG-2 Encoder / Decoder, Version 1.2," http://www.mpeg.org/tristan/MPEG/MSSG/, July 1996.

[9] Murakami, K., Shirakawa, S., and Miyajima, H., "Parallel Processing RAM Chip with 256Mb DRAM and Quad Processors," 1997 ISSCC Digest of Technical Papers, pp.228-229, Feb. 1997.

[10] Saulsbury, A., Pong, F., and Nowatzyk, A., "Missing the Memory Wall: The Case for Processor/Memory Integration," Proc. of the 23rd Annual International Symposium on Computer Architecture, pp.90-101, May 1996.

[11] Semiconductor Industry Association, The National Technology Roadmap for Semiconductors, 1994.

[12] Wilson, K. M. and Olukotun, K., "Designing High Bandwidth On-Chip Caches," Proc. of the 24th Annual International Symposium on Computer Architecture, pp.121-132, June 1997.

[13] Wilton, S. J. E. and Jouppi, N. P., "CACTI: An Enhanced Cache Access and Cycle Time Model," IEEE Journal of Solid-State Circuits, vol.31, no.5, pp.677-688, May 1996.

[14] Wulf, W. A. and McKee, S. A., "Hitting the Memory Wall: Implications of the Obvious," ACM Computer Architecture News, vol.23, no.1, Mar. 1995.
