
Distributed Prefetch-buffer/Cache Design for High Performance Memory Systems

Thomas Alexander and Gershon Kedem Departments of Computer Science and Electrical Engineering Duke University Durham, NC 27708
Abstract

Microprocessor execution speeds are improving at a rate of 50%-80% per year while DRAM access times are improving at a much lower rate of 5%-10% per year. Computer systems are rapidly approaching the point at which overall system performance is determined not by the speed of the CPU but by the memory system speed. We present a high performance memory system architecture that overcomes the growing speed disparity between high performance microprocessors and current generation DRAMs. A novel prediction and prefetching technique is combined with a distributed cache architecture to build a high performance memory system. We use a table-based prediction scheme with a prediction cache to prefetch data from the on-chip DRAM array to an on-chip SRAM prefetch buffer. By prefetching data we are able to hide the large latency associated with DRAM access and cycle times. Our experiments show that with a small (32 KB) prediction cache we can get an effective main memory access time that is close to the access time of larger secondary caches.

1 Introduction

The combined effect of higher microprocessor clock speeds and super-scalar instruction issue is dramatically improving microprocessor execution speeds. However, main memory access time is not keeping pace with the improvements in the processor's cycle time. As technology improves, the gap between instruction cycle time and memory access time grows. In the last few years microprocessor speed has improved at 50%-80% per year while memory cycle time has improved by 5%-10% per year. We infer from these trends that the processing power of a computer system will be determined by the latency and bandwidth of the memory system and not by the CPU speed. Wulf et al. [21] concur that within a decade the execution time of programs will be limited only by the memory system speed. Presently, programs that require a large amount of memory are especially affected by the latency and bandwidth of the memory system.

Two common techniques are used to reduce the access time to main memory. The first is to build a fast SRAM memory. This approach is used by Cray in its large supercomputer vector machines. However, the cost and power requirements of fast SRAM make it unsuitable for use in desk-top or desk-side systems. The second is to use a caching scheme [18]. While instruction streams exhibit good locality of reference, data references of many programs do not exhibit enough locality to benefit from conventional caching. For example, programs that solve large systems of equations, simulation programs, and placement and routing programs all exhibit similar behavior. These programs repeatedly sweep through large data structures having neither temporal nor spatial locality of access that can be captured by a conventional cache. Even large secondary data caches of 1-4 M-bytes get only a 70%-90% hit rate for such programs.

Data prefetching is one of the most promising techniques to overcome high DRAM latency. Both software and hardware prefetching techniques have been proposed previously. Software prefetching is a static scheme in which the compiler attempts to predict which memory accesses will cause cache misses and embeds the necessary prefetch instructions in the program code [15]. This scheme is inexpensive to implement because the complicated prediction heuristics are performed off line in software. However, since the prediction information is embedded into the program at compile time, this scheme lacks the flexibility to account for dynamic behavior of programs. Many programs do not access memory with regular and static patterns. Sparse matrix computations, graph algorithms, event driven simulations, and other linked-list based programs are difficult to optimize with this technique.

Hardware prefetching can dynamically adjust the prediction according to the program execution but is more expensive to implement. One block lookahead (OBL) [18] [11] and stream buffers [9] work on the assumption that the next most likely reference will be in the next block and prefetch consecutive blocks. These schemes perform poorly with programs that have either large strides or programs that traverse linked lists. Baer and Chen [7] improve upon OBL by incorporating hardware to dynamically detect constant strides. The stride and the present address are used to predict a future reference which is then prefetched into the first level cache if necessary. However, if the program behavior is more complex than constant stride, the prediction accuracy, and consequently the performance, drops drastically. Compiler techniques and programming techniques have been used to increase the locality of data reference [15, 12, 7]. However, these techniques are limited to array data structures and well structured algorithms (linear algebra).

We propose a hardware prefetching scheme that performs well on large scale programs with repeated but complex data references. Unlike earlier hardware prefetching, our scheme captures not only constant stride accesses but also more complex data references. We use a simple prediction algorithm to prefetch data into a distributed array of prefetch buffers that are integrated onto the DRAM array. Existing Enhanced DRAM (EDRAM) and Cached DRAM (CDRAM) technologies are used to design a memory subsystem with a low effective access time. A CDRAM prediction table that is less than 1% of physical memory, a small SRAM prediction cache, and CDRAM or EDRAM main memory are used to build the memory system. Our architecture performs better than a memory system with a large conventional secondary cache.

This paper is organized into five sections. In Section 2 we describe our prediction method. We develop, in Section 3, the memory architecture. We present our experimental framework and simulation results in Section 4. Conclusions and future work are in Section 5.

2 Prediction & Prefetching

Two related problems must be solved to bridge the growing gap between processor and memory speed: the memory bandwidth problem and the memory latency problem. The first problem, memory bandwidth, is currently being addressed by Synchronous DRAM (SDRAM) and RAMBUS. Both design techniques allow the memory to send a block of data to the processor at a very high speed. Both SDRAM and RAMBUS memories fill a cache line of the processor in a few clock cycles. However, these memory designs do not address the large memory latency problem. Both memories require at least 60ns to start sending the data to the processor, i.e., even if the transfer of data took no time at all, the processor still has to wait 60ns before the data in memory is available. Typically, first level cache lines are relatively small (32 bytes). Therefore, the higher bandwidth of SDRAM and RAMBUS does not yield large overall performance gains over conventional memories.

It is harder to compensate for the large latency than it is to provide high bandwidth memory. The most effective method to hide the large DRAM latency is to predict future accesses and prefetch the data into a fast buffer. Typically, the processor references data from main memory every six to thirty cycles. Consequently, the amount of time available to predict and prefetch data is very short. We have developed an effective prediction scheme that is very fast. The scheme uses a table-lookup technique using the current read-request address to predict the next read-request address.

2.1 Table based prediction

Large programs such as simulation programs, scientific computation, CAD and others, have large data structures that the programs traverse repeatedly. As a result the programs repeatedly generate memory read-requests that tend to follow the data structure. While these memory read-request patterns are complex, they exhibit a large amount of repetition [1]. In the past, attempts were made to capture these patterns using hardware stride prediction [7, 10]. We propose a mechanism for predicting memory access patterns that is more general than stride prediction and more powerful. The prediction mechanism is based on the previous execution history of the program. For each memory-read access, the memory access that followed it is stored in a table. Once the access information is stored in the table, if an access to the same address recurs, the address of the last access that followed is used as a prediction of the address to follow.

Several immediate objections arise. First, accesses do not follow a strict pattern, so remembering the one address that follows might not be a good guess. Second, the table that stores all accesses might be huge, making the prediction impractical. Third, the prediction must be fast; reading the prediction using DRAM is too slow. To overcome these potential problems we use the following design. The memory is divided into blocks. Rather than trying to predict individual addresses, block addresses are predicted. Since memory accesses tend to cluster, predicting block addresses works better than predicting individual addresses. Also, since each read-request is used to load a cache line (rather than a single word), it makes sense to make each block at least as large as a cache line. Our experiments [1] show that block sizes of 128 to 512 bytes work best. To improve the prediction accuracy, multiple guesses are stored per block. Our simulations show that four predictions for each entry yield a very high prediction rate. Table 1 shows the correct prediction rate for ten benchmark programs. We used a block size of 256 bytes and a prediction table with 256K entries with four predictions per entry.

The prediction and prefetching must happen before the next read-request arrives. A sufficiently fast access to the prediction table is obtained by using a small SRAM prediction-cache. Our experiments show that a small 32KB cache works well for all our benchmark programs. Assume a main memory of size M bytes, 512-byte blocks and four predictions per entry. A prediction table of size 4 * M * (log2(M) - 9) / (8 * 512) bytes, about 2% of memory, will hold predictions for all the blocks. However, there is no need to have a full size table. Our experiments show that a table that is about 1% of total memory size is sufficient. Moreover, if large prefetch blocks are used, only a prediction cache is needed, and the prediction table can be eliminated (see Section 3).

Figure 1 is used to illustrate how the table based prediction works. Assume that the CPU memory read-requests are for addresses: a, b, c, a, d, a, b, a. Part A shows a table with "random" data all symbolically denoted as x. When the CPU requests address a, the prediction hardware looks up the table entry for block a. The addresses in that entry are x, so block(s) x will be prefetched (prediction miss). The address a is stored in a two word buffer for future use. The next read-address is b. The values in the table entry for block b are used as the prefetch addresses for this access (another prediction miss). Since the prediction for a was a miss, the block address for b replaces one of the entries for a. The next memory read-request is address c (prediction miss). Now c is pushed into the buffer, its table entries are used as prefetching addresses, and the table entry associated with block b is updated to block c. When address a repeats, block b is prefetched and so on. Note that if the prediction table had at least two entries per block address, the third time block address a occurs, the table will correctly predict b as the block address to follow.
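
As a quick check of the table-size estimate above (a worked instance only; the 64 MB figure matches the memory size used later in Section 4): with 512-byte blocks there are M/512 = 2^17 table entries, and each entry holds four block addresses of log2(M) - 9 = 17 bits, so

    4 * M * (log2(M) - 9) / (8 * 512) = 4 * 2^26 * 17 / 4096 bytes = 2^16 * 17 bytes, about 1.06 MB,

which is roughly 1.7% of the 64 MB memory, consistent with the "about 2%" figure quoted above.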

Figure 1: Prediction Table Algorithm
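
The walk-through above (and Figure 1) can be summarized in code. The following C sketch is illustrative only: it assumes a direct-mapped table indexed by block address and uses round-robin replacement among the four guesses for brevity (the experiments reported below use LRU); the names, sizes, and types are not the authors' implementation.

    #include <stdint.h>

    #define BLOCK_SHIFT   8            /* 256-byte prediction blocks, as in Table 1  */
    #define NUM_GUESSES   4            /* four predictions per entry                 */
    #define TABLE_ENTRIES (256 * 1024) /* 256K direct-mapped entries                 */

    typedef struct {
        uint32_t guess[NUM_GUESSES];   /* block addresses seen to follow this block  */
        uint8_t  next_victim;          /* round-robin slot (paper uses LRU)          */
    } PredEntry;

    static PredEntry pred_table[TABLE_ENTRIES];
    static uint32_t  prev_block;       /* block address of the previous read-request */

    /* Called on every read-request that reaches main memory.  Returns the entry
     * whose guesses should be prefetched next, and teaches the previous entry
     * about the transition prev_block -> current block on a prediction miss.   */
    static const PredEntry *predict_and_update(uint64_t miss_addr)
    {
        uint32_t   block = (uint32_t)(miss_addr >> BLOCK_SHIFT);
        PredEntry *prev  = &pred_table[prev_block % TABLE_ENTRIES];

        int predicted = 0;
        for (int i = 0; i < NUM_GUESSES; i++)
            if (prev->guess[i] == block) { predicted = 1; break; }

        if (!predicted) {              /* prediction miss: replace one stale guess   */
            prev->guess[prev->next_victim] = block;
            prev->next_victim = (uint8_t)((prev->next_victim + 1) % NUM_GUESSES);
        }

        prev_block = block;
        return &pred_table[block % TABLE_ENTRIES];
    }
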
Predictions  su2cor  turb3d  tomcatv  swim   mgrid   fft    wave5  ldvpc  meteor  Average
    1        0.893   0.300   0.360    0.801  0.413   0.950  0.628  0.624  0.155   0.569
    2        0.951   0.824   0.724    0.955  0.753   0.954  0.921  0.878  0.301   0.807
    3        0.958   0.830   0.882    0.955  0.793   0.954  0.961  0.948  0.415   0.855
    4        0.959   0.835   0.941    0.955  0.802   0.954  0.975  0.955  0.513   0.877
    5        0.960   0.837   0.942    0.955  0.804   0.954  0.982  0.957  0.602   0.888
    6        0.960   0.838   0.948    0.955  0.805   0.954  0.993  0.958  0.680   0.899

Table 1: Prediction Accuracy vs. Number of Predictions


Experiment parameters:
  Prediction Table Size  - 256K entries
  Cluster Size           - 256 bytes
  Prediction Block Size  - 256 bytes
  Number of Predictions  - variable
  Replacement Policy     - LRU

Table 1 shows the correct prediction ratio for the benchmark programs as a function of the number of guesses per entry. From the table one can see that the number of correct guesses increases with the number of guesses per entry. Also, the table shows that 4 guesses per entry capture most of the benefit of multiple guesses. In the next section we describe how to use prediction and prefetching to build a memory system with existing memory parts.

3 Distributed Prefetch-buffer/Cache Architecture

The previous section explained the prediction methodology used in our memory system. In this section we develop the architecture of the memory system. For the prediction scheme to be effective, large blocks of data must be prefetched within a short period. A conventional memory architecture, where the microprocessor is connected to a second level cache and main memory through a 64-bit data bus, does not provide sufficient bandwidth to effectively prefetch data from the DRAM in main memory into the microprocessor's on-chip L1 cache. Consequently, we use a prediction mechanism to prefetch data from DRAM memory into prefetch buffers that are integrated into the DRAM ICs. We take advantage of the huge on-chip bandwidth to prefetch large blocks of data simultaneously without loading the memory-CPU bus. To the microprocessor the prefetch buffers appear as a distributed L2 cache. To reduce the large disparity between memory cycle time and processor cycle time we combined several ideas together to design a high performance memory system:

- On-chip Prefetch Buffers: We use the huge on-chip bandwidth of DRAM to SRAM communication to prefetch data from the large, but slow, DRAM array into small (but fast) on-chip SRAM prefetch buffers. Since the prefetching is done on-chip, prefetching unused data has only a small impact on overall performance.

- Prediction & Prefetching: We use a simple table based prediction technique to predict the next address request from the CPU. Note that we are predicting cache misses only. We view the primary (and secondary) cache(s) as a filter that lets through only requests that miss the cache.

- Address clusters: Memory is divided into clusters. All references into the same cluster are considered to be the same. The linear clusters are formed by truncating the low order bits of the address (a code sketch of this address carving appears below, after Figure 2).

- Block Prefetch: The prediction algorithm prefetches blocks that are likely to be referenced next rather than the words or cache lines that are likely to be referenced. The prefetch block size depends on the memory part used to implement the memory system. The blocks can be the same size as or larger than a cache line.

- Prediction Cache: We use a small prediction cache together with the prediction table. The prediction-cache holds the most active portion of the prediction table. Caching the prediction table speeds up data prefetching. The prediction-prefetching scheme can also be used with a prediction cache only, without using a prediction table.

- Method of Prediction: For each memory-read request, we store in the prediction table (and/or cache) the four most likely references to follow it. The hardware monitors the read-requests and stores the request following the current request in the prediction cache entry associated with the current read-request. Up to four such requests are stored in each cache entry.

- Memory Interleave: Memory is multi-way interleaved (at least four). Each independent memory module has separate address and control lines. Interleaving the memory increases the memory bandwidth and reduces the prefetching time. The prefetch controller prefetches all four guesses at the same time.

Figure 2 shows a block diagram of the memory system. The system is made up of a DRAM array with integrated prefetch buffers (distributed cache), and a Prediction & Prefetching Unit. The DRAM array is conventional memory with one exception: integrated on each memory IC is a small SRAM buffer that acts as a distributed cache. The distributed cache is built like an ordinary direct-mapped L2 cache. The cache controller has a tag table with an entry for each cache line. When a memory request is issued, the cache controller uses the tag table to check for a cache hit or a miss. On a cache hit the data is provided from the SRAM buffer at SRAM speed. If the cache is missed, a normal row address strobe (RAS) brings the data into the buffer. Unlike a conventional cache, the distributed prefetch-buffer/cache is not used to hold data that was recently referenced, but is used as a prefetch buffer that holds data that is predicted to be accessed next. The data is preloaded into the SRAM prefetch buffers "just in time" before the CPU requests the data. In parallel with the distributed cache controller, the prediction unit uses the current address to get out of the prediction-cache (or prediction table) four block addresses that are likely to be requested next. The prefetching unit initiates simultaneous prefetches of these four blocks from the DRAM array to the SRAM prefetch buffers. In addition, the prediction unit monitors the access predictions. If the current address was incorrectly predicted using the previous address (a prediction miss), the prediction cache and prediction table are updated. More detailed descriptions of each of the units are given below.

Figure 2: Memory Architecture Block Diagram
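
To make the cluster, block, and interleave granularities above concrete, the following minimal C sketch shows one plausible way to carve a physical address. The shift amounts are taken from the EDRAM configuration reported in Section 4.1 (256-byte clusters, 2K-byte prefetch blocks, four-way interleave); the field layout and names are assumptions for illustration, not the authors' implementation.

    #include <stdint.h>

    #define CLUSTER_SHIFT 8    /* 256-byte clusters: drop the low 8 address bits */
    #define BLOCK_SHIFT   11   /* 2K-byte prefetch blocks                        */
    #define NUM_MODULES   4    /* four-way interleaved memory                    */

    /* All addresses that share a cluster are treated as the same reference.     */
    static inline uint64_t cluster_of(uint64_t addr) { return addr >> CLUSTER_SHIFT; }

    /* The prediction/prefetch machinery works on block addresses, not words.    */
    static inline uint64_t block_of(uint64_t addr)   { return addr >> BLOCK_SHIFT; }

    /* Consecutive blocks rotate across modules, so the four predicted blocks
     * can usually be prefetched from different modules in parallel.             */
    static inline unsigned module_of(uint64_t block) { return (unsigned)(block % NUM_MODULES); }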

3.1 Memory Architecture and Distributed Cache

Any prediction scheme, unless 100% accurate, increases the data traffic between DRAM and cache. A critical issue is how to provide sufficient bandwidth so that the extra traffic caused by prediction does not result in bus contention and increase, rather than decrease, the effective access time. EDRAM [17] memory parts provide a cost effective high bandwidth connection between DRAM and an on-chip SRAM buffer. EDRAM integrates a small SRAM prefetch buffer together with a large DRAM memory array. The prefetch buffers and DRAM communicate through a fast and wide internal bus. Essentially, a single DRAM row access (RAS) transfers a whole row from DRAM to SRAM. Since the data bus is internal to each chip, the bandwidth between SRAM and DRAM increases proportionally when a new EDRAM module is added. We use four-way memory interleave to further increase the effective memory bandwidth.

The design presented here is based on Ramtron's EDRAM [17]. The cache size, line size, interleave factor and timing constraints are based on the DM2203/2213 EDRAM architecture. However, we have simulated memory configurations using Mitsubishi CDRAM and other parts and were able to get similar results. Moreover, our architecture, when used with conventional (fast page mode) EDO-DRAMs, yields significant gains. We digress briefly to describe the EDRAM architecture. The memory is organized as four modules, each a 512 x 256 x 8 bit DRAM array, supported by four 256 x 8 bit SRAM prefetch buffers. Access time to the DRAM array is 35ns and access time to the SRAM is 15ns. A row access to memory (RAS) transfers 2K bits from DRAM to SRAM. Once the data is in the SRAM buffer it can be accessed in 15ns. In addition to the memory modules, the memory system has a small direct-mapped tag table that remembers what data is stored in the distributed cache. The memory controller functions as a combined cache and memory controller. When a read-request arrives, the memory controller first reads the cache buffers and the tag table to see if the data is in the cache. If there is a cache miss, the memory controller reads the data from the DRAM array into the SRAM cache and sends the data to the CPU.

Figure 3: EDRAM Architecture
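
A simplified model of this read path, as it might appear in a trace-driven simulator, is sketched below. The 15ns SRAM and 35ns DRAM figures are the EDRAM numbers quoted above; the buffer mapping, the miss-latency accounting, and all names are illustrative assumptions rather than the paper's actual controller.

    #include <stdbool.h>
    #include <stdint.h>

    #define ROW_SHIFT 8        /* one RAS moves a 2K-bit (256-byte) row into SRAM */
    #define NUM_BUFS  4        /* SRAM prefetch buffers per module                */

    typedef struct {
        uint64_t row_tag[NUM_BUFS];   /* which DRAM row each SRAM buffer holds    */
        bool     valid[NUM_BUFS];
    } ModuleBuffers;

    /* Service one read and return its latency in nanoseconds. */
    static unsigned edram_read_ns(ModuleBuffers *m, uint64_t addr)
    {
        uint64_t row = addr >> ROW_SHIFT;
        unsigned buf = (unsigned)(row % NUM_BUFS);    /* assumed direct mapping   */

        if (m->valid[buf] && m->row_tag[buf] == row)
            return 15;                 /* hit in the on-chip SRAM prefetch buffer */

        m->row_tag[buf] = row;         /* RAS: the row is loaded into the buffer  */
        m->valid[buf]   = true;
        return 35;                     /* DRAM array access time (simplified)     */
    }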

3.2 The Prediction Unit

For each read-request, up to four different prefetch block addresses of read-requests that in the past followed the current request are stored in a direct-mapped table. Each prediction address references a different interleaved module. The table is used to predict the block address of the next memory request. When the CPU issues a read-request to main memory, the table is queried for the set of four different block addresses that in the recent past were the next read-request to memory. When the next request arrives, the accuracy of the previous predictions is checked. If all four predicted addresses are wrong, the table is updated. Thus, the scheme dynamically adjusts to different phases within a program and to transient behavior. Note that the prediction scheme only looks at memory misses: all memory references that hit the primary cache(s) are not considered.

The prediction scheme need only predict which block is being accessed next. Clearly, blocks should be larger than the primary (or secondary) cache line. We found that blocks of 512-2048 bytes work well. A memory system with M bytes of DRAM and a block size of B bytes has M/B distinct blocks. Thus, the prediction table size is only a small fraction of the total physical memory size. For example, a 64 Megabyte memory with 2K byte blocks needs a prediction table with 32K 56-bit entries. Moreover, the prediction table does not have to cover the physical address space. The prediction table need only be large enough to capture the working set of the active programs. Our experiments show that a prediction table that is about 1% of physical memory size is very effective.

The prediction scheme is general. It does not depend on a-priori knowledge of the access patterns such as one block look-ahead [18] or stride detection [3, 15]. It predicts well almost any repeated access pattern. For example, the prediction accuracy for 8 of the 9 programs below is about 94%. (See more details in the results section.)

Figure 4 shows a block diagram of the prediction unit. The prediction unit is made of a prediction table (optional), a prediction cache (see next subsection), a prediction correction controller and two buffers: the address buffer and the prediction buffer. When the CPU issues a new read-request, the prediction unit fetches from the prediction cache four guesses for the next address request. If the prediction cache does not have an entry for the current address (a cache miss), the cache is updated from the prediction table. The current address is pushed into the address buffer. At the same time, the prediction update controller compares the current address with the previous guesses to see if there is a match. If there is no match we have a prediction miss. In that case the prediction update controller updates the prediction cache and the prediction table.

Figure 4: Prediction Unit Block Diagram
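
The prediction-cache lookup and refill just described might be organized as in the following sketch. It shows only the cache/table interaction; the real unit also issues the four prefetches and runs the update controller in parallel. All sizes, names, and the direct-mapped layout are assumptions for illustration (the paper specifies a 32KB prediction cache but not its internal organization).

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_GUESSES  4
    #define PCACHE_SETS  1024          /* assumed entry count for a 32KB cache */
    #define PTABLE_SIZE  (256 * 1024)  /* DRAM-resident prediction table       */

    typedef struct {
        uint32_t guess[NUM_GUESSES];   /* predicted next block addresses       */
    } PredSet;

    typedef struct {
        uint32_t tag;
        bool     valid;
        PredSet  preds;
    } PCacheLine;

    static PredSet    ptable[PTABLE_SIZE];  /* optional backing table          */
    static PCacheLine pcache[PCACHE_SETS];  /* small, fast prediction cache    */

    /* Return the four guesses for 'block'.  On a prediction-cache miss the
     * entry is refilled from the DRAM table, much like a TLB refill from the
     * page table (the analogy drawn in the next subsection).                  */
    static PredSet *lookup_predictions(uint32_t block)
    {
        PCacheLine *line = &pcache[block % PCACHE_SETS];
        if (!line->valid || line->tag != block) {
            line->preds = ptable[block % PTABLE_SIZE];
            line->tag   = block;
            line->valid = true;
        }
        return &line->preds;
    }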

3.3 Prediction Cache

To keep the cost of the prediction table down, we use DRAM to store the table and use a small prediction-cache to keep the active working set in fast memory. A small portion of the main system memory may be used for the DRAM storage. The idea is similar to the TLB as a cache for the page table. In our experiments we used a 32KB prediction cache. Since we are caching the block addresses, there is a high locality of reference which is exploited by the cache to obtain fast predictions. Our experiments show that the prediction data exhibits very good locality. The prediction cache hit rate on all the programs we tried is better than 98%. Since the prediction table itself is not essential for the actual operation of the prediction unit, we tried running the prediction unit with only a 32KB prediction cache. For all the programs we tried, the cache-only scheme had essentially identical performance to the scheme that used a prediction table and a prediction-cache.

4 Simulation and Performance Evaluation

To evaluate the performance of the proposed memory subsystem, we used a simulation model approximating a server class machine such as the Sun Microsystems' Sparc-Center SC2000E. We made the following simplifying assumptions:

- The processor executes one instruction every 10ns (100 MHz). We did not account for instructions that take more than one cycle, branching delays, and instruction cache misses.
- There is a separate 16K-byte primary data cache with a 32-byte cache line.
- The secondary cache (L2) bus cycle time is 20ns.
- The data bus is 64 bits wide.

To obtain accurate results we had to overcome two major difficulties: first, most benchmark programs are too small; second, to properly evaluate large scale programs either a hardware emulator has to be used or a huge amount of computing time is needed to get good size traces. Since we did not have access to an emulator, we had to compromise. We traced programs with moderate data sizes and we simulated only 2 billion instructions to evaluate memory performance. We used some proposed SPEC95 programs and additional programs of our own. We present results of seven programs (fft, swim, mgrid, tomcatv, su2cor, turb3d, wave5) being considered for the SPEC95 benchmark suite, a switch-level event-driven simulator (ldvsim [5]) and a graph search algorithm (meteor [2]). The last two programs traverse large linked-list data structures and test our architecture on programs that have dynamic and irregular memory access. All the programs use over 14 M-bytes of memory. Table 2 shows the data size for each program.

Prog      Description                                      MB
ldvsim    VLSI switch level simulator                      17
meteor    Graph Based Theorem prover                       22
wave5     Electromagnetic Simulator - SPEC95               14
turb3d    Turbulence Simulator - SPEC95                    25
su2cor    Quantum Physics Monte-Carlo Simulator - SPEC95   24
fft       Fast Fourier Transform code - SPEC95             37
swim      Finite differences PDE solver - SPEC95           57
mgrid     Multigrid equation solver - SPEC95               56
tomcatv   Fluid dynamics equation solver - SPEC95          30

Table 2: Workload

Each program was traced for 2 billion instructions using the SHADE simulator [8], which simulates the SPARC instruction set [19]. To skip the initialization phase of the programs, memory references were collected starting after one billion instructions. We used a primary cache simulation to compute miss rates, effective execution rates, and to filter out cache hits. We collected first-level cache misses and timing information in large trace files. We used the trace files to drive the multimode, event-driven simulator ldvsim [14]. We used ldvsim to accurately simulate a normal L2 cache and our distributed cache architecture. We assumed an execution rate of one instruction per cycle and processor stalls on data misses only. We simulated only data misses and only considered data read references.

Two measures were used to compare second level cache architectures. The first measure is the average cycles per instruction (CPI),

    CPI = Ttime / Tnumber

where Ttime is the total execution time in cycles and Tnumber is the total number of executed instructions. The second is the average cycles per memory request (CPR),

    CPR = Tacc / Taccnum

where Tacc is the total number of cycles spent on memory accesses and Taccnum is the total number of memory accesses.
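
A small accounting helper in the spirit of these definitions is sketched below (illustrative only; the counter names, the structure, and the interpretation of a "memory request" as a primary-cache data miss are assumptions, not taken from ldvsim).

    #include <stdint.h>

    typedef struct {
        uint64_t instructions;   /* total instructions executed (Tnumber)     */
        uint64_t cycles;         /* total execution time in cycles (Ttime)    */
        uint64_t mem_accesses;   /* requests that reach memory (Taccnum)      */
        uint64_t mem_cycles;     /* cycles spent waiting on memory (Tacc)     */
    } SimStats;

    /* One cycle per instruction, plus a stall only when a data read misses
     * the primary cache (the simplifying assumptions stated above).          */
    static void account_instruction(SimStats *s, int data_miss, unsigned miss_cycles)
    {
        s->instructions += 1;
        s->cycles       += 1;
        if (data_miss) {
            s->mem_accesses += 1;
            s->mem_cycles   += miss_cycles;
            s->cycles       += miss_cycles;
        }
    }

    static double cpi(const SimStats *s) { return (double)s->cycles / (double)s->instructions; }
    static double cpr(const SimStats *s) { return (double)s->mem_cycles / (double)s->mem_accesses; }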

4.1 Results

We simulated many different possible implementations using CDRAM, SDRAM, fast page mode DRAM and EDRAM. We present here results for the architecture implemented using EDRAM and CDRAM. The EDRAM simulations were done using:

- 64 Megabytes of EDRAM main memory.
- Prediction cache of 32K bytes.
- Four different predictions.
- No prediction table.
- Four way interleaved memory.
- Prefetch block size of 2K bytes.
- Cluster size of 256 bytes.

The simulation for the memory system implemented with CDRAM was done using:

- 64 Megabytes of CDRAM main memory.
- Prediction cache of 32K bytes.
- Four different predictions.
- Prediction table with 256K entries.
- Four way interleaved memory.
- Prefetch block size of 64 bytes.
- Cluster size of 64 bytes.

The simulations of conventional L2 cache were done assuming a 20ns cache cycle time. Tables 3 and 4 tabulate the effective memory access time (in 10ns cycles) for a memory system using conventional memory without an external cache, memory with a 256K byte conventional L2 cache, a distributed cache with prediction using EDRAM parts, and a distributed cache using CDRAM parts. We show two entries per architecture: cycles per memory request (CPR) and cycles per instruction (CPI). CPR is a measurement of the memory system performance, while CPI is a measurement of the effect of the memory performance on overall execution speed. Figures 5 and 6 show the same information as a bar graph. We see that the distributed predictive cache architecture can reduce the CPR for fft by a factor of more than 7 and for su2cor by a factor of 3.2 compared to a memory system with a conventional L2 cache. This reduction in effective memory access time translates into a large speedup factor for some of the programs. Table 3 compares the CPI with and without prediction. The prediction scheme doubles the performance for fft and improves the performance for su2cor by 40%. The prediction scheme only marginally improves meteor, turb3d, tomcatv, wave5, swim and mgrid. These programs perform well with traditional second level caches.

Figure 5: Results - Cycles Per Memory Reference

Figure 6: Results - Cycles Per Instruction

Prog      DRAM CPR   DRAM CPI   L2 CPR    L2 CPI
su2cor     15.401     2.066      6.404     1.400
turb3d      6.788     1.040      2.585     1.011
tomcatv     7.853     1.624      2.320     1.120
swim       10.546     3.234      2.410     1.330
mgrid       4.186     1.038      2.508     1.018
fft        14.417     2.315     14.416     2.315
wave5       5.214     1.008      2.073     1.002
ldvsim      8.247     1.797      3.964     1.326
meteor      5.786     1.062      2.474     1.019
Avg         8.72      1.69       4.35      1.28

Table 3: Comparison of the 4 memory schemes

Prog      CDRAM+MAP CPR   CDRAM+MAP CPI   EDRAM+MAP CPR   EDRAM+MAP CPI
su2cor        4.651           1.270           2.031           1.076
turb3d        2.368           1.010           2.035           1.007
tomcatv       2.119           1.102           2.232           1.112
swim          2.058           1.248           2.006           1.235
mgrid         2.217           1.015           2.190           1.014
fft           4.293           1.323           2.047           1.103
wave5         2.205           1.002           2.744           1.003
ldvsim        2.911           1.210           2.335           1.147
meteor        3.021           1.026           2.418           1.018
Avg           2.87            1.13            2.23            1.08

Table 4: Comparison of the 4 memory schemes

5 Conclusion and Future Work

We have designed and evaluated a secondary cache architecture that substantially reduces the effective memory access time for a large class of scientific and other large scale programs. The method is effective for programs that repeatedly sweep through large and complex data structures. We use a simple table-based prediction and prefetching scheme. The success of the prefetching scheme is independent of any a-priori knowledge about the program. Data prefetching allows us to hide large memory access latency. We use EDRAMs to increase memory-to-cache bandwidth and overcome the increase in traffic between main memory and the cache. For some programs, the latency is reduced by half as compared to a conventional data cache. Our scheme is compatible with today's microprocessors, it is transparent to the user, and it can be implemented in today's technology. We demonstrated that our hardware prediction scheme gives high prediction accuracy, up to 97%, even though the prediction alphabet is large.

The experiments described here were limited to one prediction scheme and a fixed distributed cache architecture. We plan to explore and evaluate other alternatives. For example, rather than predicting the next memory access we can predict further into the future. Preliminary simulations show that the prediction accuracy is only slightly affected by predicting two or three accesses into the future. Table 5 shows the prediction accuracy when using our method to predict accesses further into the future. We see that in most cases the prediction accuracy falls only slightly. By predicting further into the future we should have more time to prefetch data and get even better performance. We are also investigating more complex prediction schemes that should yield better prediction consistency.

Prog       1st     2nd     3rd     4th     5th
su2cor    0.959   0.957   0.942   0.940   0.929
turb3d    0.835   0.906   0.783   0.896   0.778
tomcatv   0.941   0.925   0.792   0.781   0.715
swim      0.955   0.946   0.943   0.946   0.937
mgrid     0.802   0.801   0.756   0.731   0.727
fft       0.954   0.939   0.927   0.916   0.912
wave5     0.975   0.923   0.896   0.880   0.868
ldvpc     0.955   0.952   0.950   0.949   0.950
meteor    0.513   0.379   0.322   0.288   0.274
Average   0.877   0.859   0.812   0.814   0.788

Table 5: Prediction Accuracy as a function of the i'th Future Prediction. The following parameters are used: Prediction Table Size: 256K entries; Cluster Size: 256 bytes; Prediction Block Size: 256 bytes; Number of Predictions: 4; Replacement Policy: LRU.

Acknowledgments

This work was supported in part by a research gift from Sun Microsystems. The authors thank Sanjay Vishin of Sun Microsystems for his help in setting up the tracing environment.

References

[1] Thomas Alexander. A Distributed Predictive Cache for High Performance Computer Systems. PhD thesis, Duke University, 1995.

[2] O.L. Astrachan and M.E. Stickel. Caching and lemmaizing in model elimination theorem provers. In Proceedings of the Eleventh International Conference on Automated Deduction. Springer Verlag, 1992.

[3] J.L. Baer and T.F. Chen. An effective on-chip preloading scheme to reduce data access penalty. Supercomputing '91, 1991.

[4] A. Borg and D.W. Wall. Generation and analysis of very long address traces. 17th ISCA, May 1990.

[5] J.V. Briner, J.L. Ellis, and G. Kedem. Breaking the barrier of parallel simulation of digital systems. Proc. 28th Design Automation Conf., June 1991.

[6] H.O. Bugge, E.H. Kristiansen, and B.O. Bakka. Trace-driven simulation for a two-level cache design on the open bus system. 17th ISCA, May 1990.

[7] Tien-Fu Chen and J.-L. Baer. A performance study of software and hardware data prefetching schemes. Proceedings of the 21st International Symposium on Computer Architecture, 1994.

[8] R.F. Cmelik and D. Keppel. SHADE: A fast instruction set simulator for execution profiling. Sun Microsystems, 1993.

[9] K.I. Farkas, N.P. Jouppi, and P. Chow. How useful are non-blocking loads, stream buffers and speculative execution in multiple issue processors. Proceedings of the 1995 1st IEEE Symposium on High Performance Computer Architecture, 1995.

[10] J.W.C. Fu, J.H. Patel, and B.L. Janssens. Stride directed prefetching in scalar processors. SIGMICRO Newsletter, vol. 23, no. 1-2, pp. 102-110, December 1992.

[11] E.H. Gornish. Adaptive and Integrated Data Cache Prefetching for Shared-Memory Multiprocessors. PhD thesis, University of Illinois at Urbana-Champaign, 1995.

[12] M.S. Lam. Locality optimizations for parallel machines. Proceedings of the International Conference on Parallel Processing: CONPAR '94, 1994.

[13] M.S. Lam, E.E. Rothberg, and M.E. Wolf. The cache performance and optimization of blocked algorithms. ASPLOS IV, April 1991.

[14] MCNC. Open Architecture Silicon Implementation Software User Manual. MCNC, 1991.

[15] T.C. Mowry, M.S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. ASPLOS V, 1992.

[16] Betty Prince. Memory in the fast lane. IEEE Spectrum, February 1994.

[17] Ramtron. Speciality Memory Products. Ramtron, 1995.

[18] A.J. Smith. Cache memories. Computing Surveys, September 1982.

[19] The SPARC Architecture Manual, 1992.

[20] W. Wang and J. Baer. Efficient trace-driven simulation methods for cache performance analysis. ACM Transactions on Computer Systems, August 1991.

[21] Wm. A. Wulf and Sally A. McKee. Hitting the memory wall: Implications of the obvious. Computer Architecture News, December 1994.
