
Universitatea Politehnica din Timisoara, Facultatea de Automatică și Calculatoare

Project for Computer Engineering Fundamentals

Caches in multicore systems

Rotar Calin Viorel, Academic year 2010-2011

Introduction

In computer engineering, a cache is a component that transparently stores data so that future requests for that data can be served faster. The data stored in a cache might be values that have been computed earlier or duplicates of original values stored elsewhere. If requested data is contained in the cache (a cache hit), the request can be served by simply reading the cache, which is comparatively fast. Otherwise (a cache miss), the data has to be recomputed or fetched from its original storage location, which is comparatively slow. Hence, the more requests that can be served from the cache, the better the overall system performance.

As opposed to a buffer, which is managed explicitly by a client, a cache stores data transparently: a client requesting data from a system is not aware that the cache exists, which is the origin of the name (from the French "cacher", to conceal). To be cost-efficient and to enable efficient use of data, caches are relatively small. Nevertheless, caches have proven themselves in many areas of computing because access patterns in typical applications exhibit locality of reference. References exhibit temporal locality if data that has recently been requested is requested again. References exhibit spatial locality if the requested data is physically stored close to data that has already been requested. Hardware implements a cache as a block of memory for temporary storage of data likely to be used again; CPUs and hard drives frequently use caches, as do web browsers and web servers.

A cache is made up of a pool of entries. Each entry holds a datum, a copy of the same datum in some backing store, together with a tag, which specifies the identity of the datum in the backing store of which the entry is a copy. When the cache client (a CPU, a web browser, an operating system) needs to access a datum presumed to exist in the backing store, it first checks the cache. If an entry can be found with a tag matching that of the desired datum, the datum in the entry is used instead. This situation is known as a cache hit. For example, a web browser might check its local cache on disk to see whether it has a local copy of the contents of a web page at a particular URL; here the URL is the tag and the content of the page is the datum. The percentage of accesses that result in cache hits is known as the hit rate or hit ratio of the cache.

The alternative situation, when the cache is consulted and found not to contain a datum with the desired tag, is known as a cache miss. The previously uncached datum fetched from the backing store during miss handling is usually copied into the cache, ready for the next access. During a cache miss, the CPU usually evicts some other entry in order to make room for the newly fetched datum. The heuristic used to select the entry to evict is known as the replacement policy. One popular replacement policy, "least recently used" (LRU), replaces the entry that has gone unused for the longest time (see cache algorithms). More elaborate policies also weigh use frequency against the size of the stored contents, as well as the latencies and throughputs of both the cache and the backing store. While this works well for large amounts of data, long latencies and slow throughputs, such as those of a hard drive or the Internet, it is not practical for a CPU cache.
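To make the tag, hit and miss mechanics concrete, the following is a minimal C sketch of the simplest organization, a direct-mapped cache in which each memory block can live in exactly one row (so the replacement decision is trivial; policies such as LRU only matter once a block has several candidate rows, as discussed under Associativity below). The geometry of 16 rows of 64 bytes is invented purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative geometry: 16 lines of 64 bytes (not from any real CPU). */
#define NUM_LINES   16
#define LINE_BYTES  64

typedef struct {
    bool     valid;  /* does this entry hold data at all?          */
    uint64_t tag;    /* identifies which memory block is cached    */
} cache_line_t;

static cache_line_t cache[NUM_LINES];
static unsigned hits, misses;

/* Returns true on a cache hit, false on a miss (and installs the line). */
static bool cache_access(uint64_t address)
{
    uint64_t block = address / LINE_BYTES;   /* strip the offset bits  */
    unsigned index = block % NUM_LINES;      /* which row to look in   */
    uint64_t tag   = block / NUM_LINES;      /* remaining address bits */

    if (cache[index].valid && cache[index].tag == tag) {
        hits++;
        return true;                         /* cache hit              */
    }
    misses++;                                /* cache miss: fetch and  */
    cache[index].valid = true;               /* replace whatever was   */
    cache[index].tag   = tag;                /* in this row before     */
    return false;
}

int main(void)
{
    /* A loop with spatial and temporal locality: most accesses hit. */
    for (int pass = 0; pass < 4; pass++)
        for (uint64_t a = 0; a < 512; a += 8)
            cache_access(a);

    printf("hits=%u misses=%u hit rate=%.2f\n",
           hits, misses, (double)hits / (hits + misses));
    return 0;
}
```

Run over a small loop that keeps revisiting the same addresses, the hit rate quickly approaches 1, which is exactly the locality-of-reference effect described above.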
When a system writes a datum to the cache, it must at some point write that datum to the backing store as well. The timing of this write is controlled by what is known as the write policy. In a write-through cache, every write to the cache causes a synchronous write to the backing store.
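As a rough sketch of the write-through policy just described, reusing the invented direct-mapped layout from the earlier example; the handling of write misses follows the no-write-allocate rule discussed below, and the contrast with write-back (described in the next paragraph) is noted only in a comment.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES  16
#define LINE_BYTES 64

typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  data[LINE_BYTES];
} cache_line_t;

static cache_line_t cache[NUM_LINES];
static uint8_t backing_store[64 * 1024];   /* stands in for main memory */

/* Write-through store: every write is forwarded to the backing store.
 * Addresses are assumed to fall inside backing_store[] in this sketch. */
static void write_through(uint64_t address, uint8_t value)
{
    uint64_t block  = address / LINE_BYTES;
    unsigned index  = block % NUM_LINES;
    uint64_t tag    = block / NUM_LINES;
    unsigned offset = address % LINE_BYTES;

    /* Update the cached copy only if the line is already present
     * (no-write allocation: a write miss does not fill the cache). */
    if (cache[index].valid && cache[index].tag == tag)
        cache[index].data[offset] = value;

    /* The synchronous write to the backing store that defines
     * write-through.  A write-back cache would instead mark the line
     * dirty and defer this write until the line is evicted. */
    backing_store[address] = value;
}

int main(void)
{
    write_through(0x100, 42);   /* write miss: memory updated, cache untouched */
    return 0;
}
```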

Alternatively, in a write-back (or write-behind) cache, writes are not immediately mirrored to the store. Instead, the cache tracks which of its locations have been written over and marks these locations as dirty. The data in these locations is written back to the backing store only when it is evicted from the cache, an effect referred to as a lazy write. For this reason, a read miss in a write-back cache (which requires a block to be replaced by another) often needs two memory accesses to service: one to write the replaced dirty data from the cache to the store, and one to retrieve the needed datum. Other policies may also trigger write-back; for instance, the client may make many changes to a datum in the cache and then explicitly ask the cache to write the datum back.

No-write allocation (also called write-no-allocate) is a cache policy that caches only processor reads; on a write miss:
* data is written directly to memory, and
* data at the missed-write location is not added to the cache.
This avoids the need for write-back or write-through when the old value of the datum was absent from the cache prior to the write.

Entities other than the cache may change the data in the backing store, in which case the copy in the cache may become out-of-date or stale. Conversely, when the client updates the data in the cache, copies of that data in other caches become stale. Communication protocols between the cache managers that keep the data consistent are known as coherency protocols.

Structure

Cache row entries usually have the following structure:

tag | data blocks | valid bit

The data blocks (the cache line) contain the actual data fetched from main memory. The valid bit indicates whether this particular entry holds valid data. An effective memory address is split (from MSB to LSB) into the tag, the index and the displacement (offset):

tag | index | displacement

The index length is ceil(log2(cache_rows)) bits and selects the row the data has been put in. The displacement length is ceil(log2(data_blocks)) bits and selects which of the blocks stored in that row is needed. The tag length is

tag_length = address_length - index_length - displacement_length

and the tag contains the most significant bits of the address, which are checked against the current row (the row retrieved by the index) to see whether it is the one we need or another, irrelevant memory location that happens to have the same index bits as the one we want.
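The decomposition above can be written out directly. Below is a minimal sketch assuming a 32-bit byte-addressed machine with 256 cache rows and 64-byte blocks (all chosen for illustration), treating the displacement as a byte offset within a block.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative geometry (powers of two, so the field widths are exact):
 * 256 rows of 64-byte blocks on a 32-bit byte-addressed machine.       */
#define CACHE_ROWS   256    /* index_length        = log2(256) = 8 bits */
#define BLOCK_BYTES   64    /* displacement_length = log2(64)  = 6 bits */
#define ADDRESS_BITS  32    /* tag_length          = 32 - 8 - 6 = 18    */

#define OFFSET_BITS    6
#define INDEX_BITS     8

int main(void)
{
    uint32_t address = 0x12345678u;

    uint32_t offset = address & (BLOCK_BYTES - 1);                 /* low 6 bits  */
    uint32_t index  = (address >> OFFSET_BITS) & (CACHE_ROWS - 1); /* next 8 bits */
    uint32_t tag    = address >> (OFFSET_BITS + INDEX_BITS);       /* remaining   */

    printf("address 0x%08x -> tag 0x%x, index %u, offset %u\n",
           address, tag, index, offset);
    return 0;
}
```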

Associativity

Associativity is a trade-off. If there are ten places to which the replacement policy could have mapped a memory location, then to check whether that location is in the cache, ten cache entries must be searched. Checking more places takes more power, chip area, and potentially time. On the other hand, caches with more associativity suffer fewer conflict misses, so the CPU wastes less time reading from the slow main memory. The rule of thumb is that doubling the associativity, from direct-mapped to 2-way or from 2-way to 4-way, has about the same effect on hit rate as doubling the cache size. Associativity increases beyond 4-way have much less effect on the hit rate and are generally done for other reasons.

In order of increasing (worse) hit times and decreasing (better) miss rates:
* direct-mapped cache: the best (fastest) hit times, and so the best trade-off for "large" caches;
* 2-way set-associative cache;
* 2-way skewed-associative cache: "the best tradeoff for .... caches whose sizes are in the range 4K-8K bytes" (André Seznec);
* 4-way set-associative cache;
* fully associative cache: the best (lowest) miss rates, and so the best trade-off when the miss penalty is very high.

One of the advantages of a direct-mapped cache is that it allows simple and fast speculation. Once the address has been computed, the single cache index that might hold a copy of that datum is known. That cache entry can be read, and the processor can continue to work with the data before it finishes checking that the tag actually matches the requested address. The idea of having the processor use the cached data before the tag match completes can be applied to associative caches as well: a subset of the tag, called a hint, can be used to pick just one of the possible cache entries mapping to the requested address, and this datum can then be used in parallel with the full tag check. The hint technique works best when used in the context of address translation.

Other schemes have been suggested, such as the skewed cache, where the index for way 0 is direct, as above, but the index for way 1 is formed with a hash function. A good hash function has the property that addresses which conflict with the direct mapping tend not to conflict when mapped with the hash function, so it is less likely that a program will suffer an unexpectedly large number of conflict misses due to a pathological access pattern. The downside is extra latency from computing the hash function. Additionally, when it comes time to load a new line and evict an old one, it may be difficult to determine which existing line was least recently used, because the new line conflicts with data at different indexes in each way; LRU tracking for non-skewed caches is usually done on a per-set basis. Nevertheless, skewed-associative caches have major advantages over conventional set-associative ones.
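To make the skewed indexing concrete, here is a minimal two-way sketch in which way 0 is indexed directly and way 1 through a simple XOR-folding hash. The hash function and the geometry are arbitrary illustrative choices, not taken from any published skewed-cache design.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SETS_PER_WAY 256
#define BLOCK_BYTES   64

typedef struct {
    bool     valid;
    uint64_t tag;   /* full block number: the two ways use different
                       indexes, so index bits alone cannot identify it */
} line_t;

static line_t way0[SETS_PER_WAY];   /* indexed directly         */
static line_t way1[SETS_PER_WAY];   /* indexed through the hash */

/* Illustrative mixing hash for way 1: XOR-fold higher block-number bits
 * into the index, so blocks that conflict in way 0 usually land in
 * different sets of way 1. */
static unsigned skew_hash(uint64_t block)
{
    return (unsigned)((block ^ (block >> 8) ^ (block >> 16)) % SETS_PER_WAY);
}

/* Returns true on a hit; on a miss, installs the block into way 0
 * (a real design would choose a victim way via its replacement policy). */
static bool skewed_access(uint64_t address)
{
    uint64_t block = address / BLOCK_BYTES;
    unsigned idx0  = (unsigned)(block % SETS_PER_WAY);
    unsigned idx1  = skew_hash(block);

    if (way0[idx0].valid && way0[idx0].tag == block)
        return true;
    if (way1[idx1].valid && way1[idx1].tag == block)
        return true;

    way0[idx0].valid = true;
    way0[idx0].tag   = block;
    return false;
}

int main(void)
{
    printf("%s\n", skewed_access(0x4000) ? "hit" : "miss");  /* miss */
    printf("%s\n", skewed_access(0x4000) ? "hit" : "miss");  /* hit  */
    return 0;
}
```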
Multi-level caches

Another issue is the fundamental trade-off between cache latency and hit rate. Larger caches have better hit rates but longer latency. To address this trade-off, many computers use multiple levels of cache, with small fast caches backed up by larger, slower caches. Multi-level caches generally operate by checking the smallest, level 1 (L1) cache first; if it hits, the processor proceeds at high speed. If the smaller cache misses, the next larger cache (L2) is checked, and so on, before external memory is checked. As the latency difference between main memory and the fastest cache has become larger, some processors have begun to use as many as three levels of on-chip cache.

For example, the Alpha 21164 (1995) had 1 to 64 MB of off-chip L3 cache; the IBM POWER4 (2001) had a 256 MB off-chip L3 cache, shared among several processors; the Itanium 2 (2003) had a 6 MB unified level 3 (L3) cache on-die; the Itanium 2 (2003) MX 2 module incorporates two Itanium 2 processors along with a shared 64 MB L4 cache on an MCM that was pin-compatible with a Madison processor; Intel's Xeon MP product code-named "Tulsa" (2006) features 16 MB of on-die L3 cache shared between two processor cores; the AMD Phenom II (2008) has up to 6 MB of on-die unified L3 cache; and the Intel Core i7 (2008) has an 8 MB on-die unified L3 cache that is inclusive and shared by all cores. The benefits of an L3 cache depend on the application's access patterns.

Finally, at the other end of the memory hierarchy, the CPU register file itself can be considered the smallest, fastest cache in the system, with the special characteristic that it is scheduled in software, typically by a compiler as it allocates registers to hold values retrieved from main memory. Register files sometimes also have a hierarchy: the Cray-1 (circa 1976) had 8 address "A" and 8 scalar data "S" registers that were generally usable. There was also a set of 64 address "B" and 64 scalar data "T" registers that took longer to access but were faster than main memory. The "B" and "T" registers were provided because the Cray-1 did not have a data cache (it did, however, have an instruction cache).

The use of large multi-level caches can substantially reduce the memory bandwidth demands of a processor. This has made it possible for several (micro)processors to share the same memory through a shared bus. Caching supports both private and shared data. For private data, once cached, its treatment is identical to that of a uniprocessor. For shared data, the shared value may be replicated in many caches. Replication has several advantages:
* reduced latency and memory bandwidth requirements;
* reduced contention for data items that are read by multiple processors simultaneously.
However, it also introduces a problem: cache coherence.

Cache coherence

With multiple caches, one CPU can modify memory at locations that other CPUs have cached. For example:
* CPU A reads location x, getting the value N.
* Later, CPU B reads the same location, also getting the value N.
* Next, CPU A writes location x with the value N - 1.
* At this point, any read from CPU B will still return the value N, while a read from CPU A will return the value N - 1.
This problem occurs both with write-through caches and (more seriously) with write-back caches.
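The scenario above can be reproduced with two private, uncoordinated caches. The sketch below is purely illustrative (single-entry "caches" and invented names); it exists only to show why a coherence protocol is needed.

```c
#include <stdbool.h>
#include <stdio.h>

/* One shared memory location and two private one-entry "caches",
 * deliberately with no coherence protocol between them. */
static int memory_x = 100;               /* the value N in the example */

typedef struct {
    bool valid;
    bool dirty;
    int  value;
} tiny_cache_t;

static tiny_cache_t cache_a, cache_b;

static int cached_read(tiny_cache_t *c)
{
    if (!c->valid) {                     /* miss: fetch from memory */
        c->value = memory_x;
        c->valid = true;
    }
    return c->value;                     /* hit: possibly stale     */
}

static void cached_write(tiny_cache_t *c, int v)
{
    c->value = v;        /* write-back style: only the local copy is    */
    c->valid = true;     /* updated; the dirty line would reach memory  */
    c->dirty = true;     /* only on eviction, which never happens here  */
}

int main(void)
{
    printf("CPU A reads x: %d\n", cached_read(&cache_a));   /* N          */
    printf("CPU B reads x: %d\n", cached_read(&cache_b));   /* N          */
    cached_write(&cache_a, 99);                             /* N - 1      */
    printf("CPU A reads x: %d\n", cached_read(&cache_a));   /* 99         */
    printf("CPU B reads x: %d\n", cached_read(&cache_b));   /* stale 100  */
    return 0;
}
```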

Coherence defines the behavior of reads and writes to the same memory location. Cache coherence is obtained if the following conditions are met:
1. A read made by a processor P to a location X that follows a write by the same processor P to X, with no writes to X by another processor occurring between the write and the read made by P, must always return the value written by P. This condition is related to program-order preservation and must hold even in uniprocessor architectures.
2. A read made by a processor P1 to location X that follows a write by another processor P2 to X must return the value written by P2 if no other writes to X by any processor occur between the two accesses. This condition defines the concept of a coherent view of memory; if processors can still read the old value after the write made by P2, the memory is incoherent.
3. Writes to the same location must be serialized. In other words, if location X received two different values A and B, in that order, from any two processors, the processors can never read location X as B and then later read it as A. Location X must be seen with the values A and B in that order.

These conditions are defined as if read and write operations were instantaneous. In real hardware this is not the case, because of memory latency and other aspects of the architecture: a write by processor P1 may not be seen by a read from processor P2 if the read is made within a very small time after the write. The memory consistency model defines when a written value must be seen by a subsequent read instruction made by the other processors.

Coherence protocols:

* Directory-based coherence: in a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which a processor must ask permission to load an entry from primary memory into its cache. When an entry is changed, the directory either updates or invalidates the other caches holding that entry.

* Snooping is the process whereby the individual caches monitor address lines for accesses to memory locations that they have cached. When a write to a location that a cache holds a copy of is observed, the cache controller invalidates its own copy of the snooped memory location.

Write invalidate is the most common protocol, both for snooping and for directory schemes. The basic idea is that a write to a location invalidates the copies of that block held in other caches, so reads by other processors of the invalidated data cause cache misses. If two processors write at the same time, one wins and obtains exclusive access. A minimal sketch of this scheme appears at the end of this section.

Write broadcast (write update) is the alternative: all cached copies of a data item are updated when it is written. To reduce bandwidth requirements, this protocol keeps track of whether a word in the cache is shared; if not, no broadcast is necessary.

Performance differences between bus snooping protocols: write invalidate is much more popular, due primarily to performance. Under write update, multiple writes to the same word with no intervening reads require multiple broadcasts, and with multiword cache blocks every word written requires a broadcast; for write invalidate, only the first write to a block needs to send an invalidation. Write invalidate also works on whole blocks, while write broadcast must work on individual words or bytes. On the other hand, the delay between a write by one processor and a read by another is lower in the write-broadcast scheme, since with write invalidate the read causes a miss. Because bus and memory bandwidth are the more critical resources in a bus-based multiprocessor, write invalidation usually performs better.

* Snarfing is where a cache controller watches both address and data lines in an attempt to update its own copy of a memory location when a second master modifies a location in main memory. When a write to a location that the cache holds a copy of is observed, the cache controller updates its own copy of the snarfed memory location with the new data.
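As promised above, here is a hedged sketch of write-invalidate snooping with two single-line caches on a shared bus. All names and sizes are invented, the write path is write-through for brevity, and real protocols such as MESI track considerably more state per line.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_CPUS 2

typedef struct {
    bool     valid;
    uint64_t address;
    int      value;
} snoop_cache_t;

static snoop_cache_t caches[NUM_CPUS];
static int           memory[1024];

/* Every cache watches ("snoops") writes on the shared bus and drops
 * its own copy of the written location. */
static void snoop_invalidate(int writer, uint64_t address)
{
    for (int cpu = 0; cpu < NUM_CPUS; cpu++)
        if (cpu != writer && caches[cpu].valid && caches[cpu].address == address)
            caches[cpu].valid = false;       /* copy is now stale: drop it */
}

static void bus_write(int cpu, uint64_t address, int value)
{
    caches[cpu].valid   = true;              /* writer keeps a fresh copy  */
    caches[cpu].address = address;
    caches[cpu].value   = value;
    memory[address]     = value;             /* write-through for brevity  */
    snoop_invalidate(cpu, address);          /* broadcast seen by others   */
}

static int bus_read(int cpu, uint64_t address)
{
    if (!(caches[cpu].valid && caches[cpu].address == address)) {
        caches[cpu].address = address;       /* miss: refill from memory,  */
        caches[cpu].value   = memory[address]; /* which always holds the   */
        caches[cpu].valid   = true;          /* latest value in this sketch */
    }
    return caches[cpu].value;
}

int main(void)
{
    memory[7] = 100;
    (void)bus_read(0, 7);                    /* CPU 0 caches x = 100        */
    (void)bus_read(1, 7);                    /* CPU 1 caches x = 100        */
    bus_write(0, 7, 99);                     /* CPU 1's copy is invalidated */
    printf("CPU 1 now reads %d\n", bus_read(1, 7)); /* misses, returns 99   */
    return 0;
}
```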

Conclusion

Cache memory is a vital element in allowing more programs to run faster and in parallel threads of execution on multicore processors. The key condition for success is that the different cores sharing data keep a coherent view of it and perform their operations in the correct order.

Bibliography

* Ulrich Drepper, What Every Programmer Should Know About Memory
* Jernej Barbic, course notes on multi-core architectures
* http://www.ece.unm.edu/~jimp/611/slides/chap8_2.html
* David A. Patterson and John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface
* http://en.wikipedia.org/wiki/CPU_cache
* http://en.wikipedia.org/wiki/Cache_coherency
