
Shared-Memory Multiprocessors

UMA (Uniform Memory Access) Model
All processors have equal access time to all memory words. Each processor may use a private cache, and peripherals are also shared in some fashion. These are tightly coupled systems owing to the high degree of resource sharing. The system interconnect takes the form of a common bus, a crossbar switch, or a multistage network. Most computer manufacturers have multiprocessor (MP) extensions of their uniprocessor product lines. The UMA model is suitable for general-purpose and time-sharing applications by multiple users, and it can be used to speed up the execution of a single large program in time-critical applications. To coordinate parallel events, synchronization and communication among processors are done through shared variables in the common memory (a code sketch follows the figure below).

Symmetric: All processors have equal access to all peripheral devices, and all processors are equally capable of running the executive programs, such as the OS kernel and I/O service routines.

Asymmetric: Only one processor, or a subset of processors, is executive-capable. An executive or master processor can execute the OS and handle I/O; the remaining processors have no I/O capability and are therefore called attached processors (APs). APs execute user code under the supervision of the master processor. In both the MP and AP configurations, memory is still shared among the master and attached processors.

[Figure: The UMA multiprocessor model. Processors P1, P2, ..., Pn connect through a system interconnect (bus, crossbar, or multistage network) to shared-memory modules SM1, ..., SMm and shared I/O devices.]
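The shared-variable coordination described above can be made concrete with a small sketch. The following C program is illustrative only: it uses POSIX threads on a single shared-memory machine, and the counter name and thread count are arbitrary choices, not from the text.

```c
#include <pthread.h>
#include <stdio.h>

/* A shared variable in the common memory, visible to all processors. */
static long counter = 0;
/* A mutex serializes updates so concurrent increments are not lost. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   /* synchronize through shared memory */
        counter++;                   /* communicate by writing a shared word */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];                  /* 4 threads standing in for 4 processors */
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(&t[i], NULL);
    printf("counter = %ld\n", counter);  /* expect 400000 */
    return 0;
}
```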

The NUMA (Nonuniform Memory Access) Model
It is a shared-memory system in which the access time varies with the location of the memory word. The shared memory is physically distributed to all processors as local memories, and the collection of all local memories forms a global address space accessible by all processors. It is faster for a processor to access its own local memory; access to remote memory attached to another processor takes longer because of the added delay through the interconnection network.

Besides distributed memories, globally shared memory can be added to a multiprocessor system. In that case there are three memory-access patterns: local memory access, global memory access, and remote memory access. Local memory access is the fastest, and remote memory access is the slowest.

Hierarchically structured multiprocessor: The processors are divided into several clusters, each of which is itself a UMA or a NUMA multiprocessor. The clusters are connected to global shared-memory modules, and the entire system is considered a NUMA multiprocessor. All processors belonging to the same cluster can uniformly access the cluster shared-memory modules, and all clusters have equal access to the global memory. The access time to the cluster memory is shorter than that to the global memory. The access rights among intercluster memories can be specified in various ways.

[Figure legend: P = processor; GSM = global shared memory; CSM = cluster shared memory; CIN = cluster interconnection network.]
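To make the local-versus-remote distinction concrete, here is a hedged sketch using Linux's libnuma allocation calls. It assumes a Linux system with libnuma installed; the buffer size and node choice are arbitrary illustrative values, not from the text.

```c
#include <numa.h>    /* link with -lnuma */
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    size_t size = 64 * 1024 * 1024;  /* 64 MB, arbitrary */

    /* Local memory access: allocate on the node this thread runs on. */
    void *local = numa_alloc_local(size);

    /* Remote memory access: allocate on another node, if one exists;
       touching this buffer crosses the interconnection network. */
    int far_node = numa_max_node();     /* highest-numbered node */
    void *remote = numa_alloc_onnode(size, far_node);
    if (!local || !remote) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    memset(local, 0, size);   /* fast: served by the local memory */
    memset(remote, 0, size);  /* slower whenever far_node is not local */

    numa_free(local, size);
    numa_free(remote, size);
    return 0;
}
```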

Complex Instruction Sets
In the early days of computer history, most computer families started with an instruction set that was rather simple, mainly because of the high cost of hardware. Hardware cost has dropped and software cost has risen steadily over the past three decades. Furthermore, the semantic gap between HLL features and computer architecture has widened. The net result is that more and more functions have been built into the hardware, making instruction sets very large and very complex. The growth of instruction sets was also encouraged by the popularity of microprogrammed control in the 1960s and 1970s; even user-defined instruction sets were implemented using microcode in some processors for special-purpose applications.

A typical CISC instruction set contains approximately 120 to 350 instructions using variable instruction/data formats, uses a small set of 8 to 24 general-purpose registers (GPRs), and executes a large number of memory-reference operations based on more than a dozen addressing modes. Many HLL statements are directly implemented in hardware/firmware in a CISC architecture. This may simplify compiler development, improve execution efficiency, and allow an extension from scalar instructions to vector and symbolic instructions.

Reduced Instruction Sets
Instruction sets started out simple and gradually grew into CISC during the 1960s and 1970s. After two decades of using CISC processors, computer users began to reevaluate the performance relationship between instruction-set architecture and available hardware/software technology. Through many years of program tracing, computer scientists realized that only about 25% of the instructions of a complex instruction set are frequently used, about 95% of the time. This implies that about 75% of hardware-supported instructions are rarely used. A natural question then popped up: why waste valuable chip area on rarely used instructions? With low-frequency, elaborate instructions demanding long microcode sequences to execute them, it may be more advantageous to remove them from the hardware entirely and rely on software to implement them. Even if the software implementation is slow, the net result will still be a plus because of their low frequency of appearance. Pushing rarely used instructions into software vacates chip area for building more powerful RISC or superscalar processors, even with on-chip caches or floating-point units.

A RISC instruction set typically contains fewer than 100 instructions with a fixed instruction format (32 bits). Only three to five simple addressing modes are used. Most instructions are register-based, and memory access is done by load/store instructions only. A large register file (at least 32 registers) is used to support fast context switching among multiple users, and most instructions execute in one cycle with hardwired control. Because of the reduction in instruction-set complexity, the entire processor is implementable on a single VLSI chip. The resulting benefits include a higher clock rate and a lower CPI, which lead to the higher MIPS ratings reported on commercially available RISC/superscalar processors.
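As a hedged illustration of the load/store discipline, consider how a compiler might translate a single C statement. The mnemonics in the comments are generic stand-ins invented for this sketch, not instructions from any particular ISA.

```c
/* How "a = b + c" might be encoded under the two philosophies.
   All mnemonics below are hypothetical illustrations. */
int a, b, c;

void add_example(void) {
    /* RISC (load/store architecture, fixed 32-bit instructions):
         LOAD  R1, b       ; memory -> register
         LOAD  R2, c       ; memory -> register
         ADD   R3, R1, R2  ; register-to-register, one cycle, hardwired
         STORE R3, a       ; register -> memory

       CISC (memory-to-memory, variable-length instruction):
         ADD   a, b, c     ; one instruction, but many cycles of microcode */
    a = b + c;
}
```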

The CPU in the VAX 8600 consists of two functional units for concurrent execution of integer and floating-point instructions. A unified cache is used for holding both instructions and data, and there are 16 GPRs in the instruction unit. Instruction pipelining is built with six stages in the VAX 8600, as in most CISC machines. The instruction unit prefetches and decodes instructions, handles branching operations, and supplies operands to the two functional units in a pipelined fashion. A translation lookaside buffer (TLB) is used in the memory control unit for fast generation of a physical address from a virtual address. Both the integer and floating-point units are pipelined.

The performance of the processor pipelines relies heavily on the cache hit ratio and on minimal branching damage to the pipeline flow. The CPI of a VAX 8600 instruction varies over a wide range, from 2 cycles to as many as 20 cycles. For example, both multiply and divide may tie up the execution unit for a large number of cycles, caused by the use of long sequences of microinstructions to control hardware operations. The general philosophy of designing a CISC processor is to implement useful instructions in hardware/firmware, which may result in shorter program length with lower software overhead. However, this advantage is obtained at the expense of a lower clock rate and a higher CPI, which may not pay off at all. The VAX 8600 was improved from the earlier VAX-11 Series, and the system was later upgraded to the VAX 9000 Series, offering both vector hardware and multiprocessor options. All the VAX Series have used a paging technique to allocate physical memory to user programs.
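The CPI discussion can be tied to the standard performance relation CPU time = instruction count x CPI / clock rate, with the MIPS rating given by f / (CPI x 10^6) for clock rate f. The workload numbers below are invented purely to show the arithmetic; they are not measurements of the VAX 8600.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical workload values, chosen only to show the arithmetic. */
    double instr_count = 1.0e6;   /* instructions executed */
    double cpi         = 8.0;     /* average cycles per instruction (CISC-like) */
    double clock_hz    = 20.0e6;  /* 20 MHz clock, assumed */

    double cpu_time = instr_count * cpi / clock_hz;  /* seconds */
    double mips     = clock_hz / (cpi * 1.0e6);      /* MIPS rating */

    printf("CPU time = %.4f s, MIPS = %.2f\n", cpu_time, mips);
    /* With CPI ranging from 2 to 20 as stated above, the MIPS rating
       at this assumed 20 MHz clock would range from 10 down to 1. */
    return 0;
}
```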

In 1989, Intel Corporation introduced the i860 microprocessor, a 64-bit RISC processor fabricated on a single chip containing more than 1 million transistors. The peak performance of the i860 was designed to achieve 80 Mflops single-precision or 60 Mflops double-precision, or 40 MIPS in 32-bit integer operations, at a 40-MHz clock rate. A schematic block diagram of the major components of the i860 is shown in Figure. There are nine functional units (shown in nine boxes) interconnected by multiple data paths with widths ranging from 32 to 128 bits. All external and internal address buses are 32 bits wide, and the external data path and internal data bus are 64 bits wide; however, the internal RISC integer ALU is only 32 bits wide.

The instruction cache is 4 Kbytes, organized as a two-way set-associative memory with 32 bytes per cache block. It transfers 64 bits per clock cycle, equivalent to 320 Mbytes/s at 40 MHz. The data cache is a two-way set-associative memory of 8 Kbytes. It transfers 128 bits per clock cycle (640 Mbytes/s) at 40 MHz. A write-back policy is used, and caching can be inhibited by software if needed. The bus control unit coordinates the 64-bit data transfers between the chip and the outside world. The MMU implements protected 4-Kbyte paged virtual memory of 2^32 bytes via a TLB. The paging and MMU structure of the i860 is identical to that implemented in the i486, so an i860 and an i486 can be used jointly in a heterogeneous multiprocessor system, permitting the development of compatible OS kernels.

The RISC integer unit executes load, store, integer, bit, and control instructions, and it also fetches instructions for the floating-point control unit. There are two floating-point units, the multiplier unit and the adder unit, which can be used separately or simultaneously under the coordination of the floating-point control unit. Special dual-operation floating-point instructions such as add-and-multiply and subtract-and-multiply use both the multiplier and adder units in parallel. Furthermore, the integer unit and the floating-point control unit can execute concurrently. In this sense, the i860 is also a superscalar RISC processor capable of executing two instructions, one integer and one floating-point, at the same time. The floating-point units conform to the IEEE 754 floating-point standard, operating on single-precision (32-bit) and double-precision (64-bit) operands.

The graphics unit executes integer operations on 8-, 16-, or 32-bit pixel data types. This unit supports three-dimensional drawing in a graphics frame buffer, with color intensity, shading, and hidden-surface elimination. The merge register is used only by vector integer instructions; it accumulates the results of multiple addition operations. The i860 executes 82 instructions, including 42 RISC integer, 24 floating-point, 10 graphics, and 6 assembler pseudo-operations. All the instructions execute in one cycle, which equals 25 ns at the 40-MHz clock rate. The i860 and its successor, the i860XP, are used in floating-point accelerators, graphics subsystems, workstations, multiprocessors, and multicomputers.
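Using the data cache geometry stated above (8 Kbytes, two-way set-associative, 32-byte blocks, so 8192 / (2 x 32) = 128 sets), an address splits into tag, set index, and block offset. The sketch below assumes those parameters and 32-bit addresses; it is a generic illustration, not the i860's actual lookup logic.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 32   /* bytes per cache block  -> 5 offset bits */
#define NUM_SETS   128  /* 8 KB / (2 ways * 32 B) -> 7 index bits  */

int main(void) {
    uint32_t addr = 0x1234ABCDu;  /* arbitrary example address */

    uint32_t offset = addr % BLOCK_SIZE;               /* bits 0-4   */
    uint32_t set    = (addr / BLOCK_SIZE) % NUM_SETS;  /* bits 5-11  */
    uint32_t tag    = addr / (BLOCK_SIZE * NUM_SETS);  /* bits 12-31 */

    printf("addr 0x%08X -> tag 0x%X, set %u, offset %u\n",
           addr, tag, set, offset);
    /* On a lookup, both ways of the chosen set are compared against
       the tag; a match in either way is a cache hit. */
    return 0;
}
```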

Hierarchical Memory Technology


Storage devices such as registers, caches, main memory, disk devices, and tape units are often organized as a hierarchy, as depicted in Figure. The memory technology and storage organization at each level are characterized by five parameters: the access time (t_i), memory size (s_i), cost per byte (c_i), transfer bandwidth (b_i), and unit of transfer (x_i). The access time t_i refers to the round-trip time from the CPU to the ith-level memory. The memory size s_i is the number of bytes or words in level i. The cost of the ith-level memory is estimated by the product c_i * s_i. The bandwidth b_i refers to the rate at which information is transferred between adjacent levels. The unit of transfer x_i refers to the grain size for data transfer between levels i and i + 1.

Memory devices at a lower level are faster to access, smaller in size, and more expensive per byte, have a higher bandwidth, and use a smaller unit of transfer than those at a higher level. In other words, t_{i-1} < t_i, s_{i-1} < s_i, c_{i-1} > c_i, b_{i-1} > b_i, and x_{i-1} < x_i, for i = 1, 2, 3, 4, where i = 0 corresponds to the CPU register level. The cache is at level 1, main memory at level 2, the disks at level 3, and the tape units at level 4.
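The per-level cost c_i * s_i and the effect of hit ratios can be combined in a small calculation. The values below are invented for illustration (they are not the table mentioned later in the text), and the effective-access-time formula assumes each request is satisfied at the first level that hits.

```c
#include <stdio.h>

int main(void) {
    /* Illustrative 3-level hierarchy: cache, main memory, disk.
       All access times (s), sizes (bytes), costs ($/byte), and hit
       ratios are assumed values, not measurements. */
    double t[3] = { 25e-9, 250e-9, 10e-3 };  /* access times t_i      */
    double s[3] = { 64e3, 32e6, 2e9 };       /* capacities s_i        */
    double c[3] = { 1e-4, 1e-6, 1e-8 };      /* cost per byte c_i     */
    double h[3] = { 0.95, 0.999, 1.0 };      /* hit ratios; the last
                                                level always hits     */
    double total_cost = 0.0, t_eff = 0.0, miss = 1.0;
    for (int i = 0; i < 3; i++) {
        total_cost += c[i] * s[i];   /* total cost = sum of c_i * s_i */
        t_eff += miss * h[i] * t[i]; /* level i is reached only when
                                        every level above it misses   */
        miss *= (1.0 - h[i]);
    }
    printf("total cost = $%.2f, effective access time = %.2e s\n",
           total_cost, t_eff);
    return 0;
}
```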

Registers and Caches
The registers and the cache are parts of the processor complex, built either on the processor chip or on the processor board. Register assignment is often made by the compiler, and register transfer operations are directly controlled by the processor after instructions are decoded. Register transfer is conducted at processor speed, usually in one clock cycle; therefore, many designers would not consider registers a level of memory. They are listed here for comparison purposes. The cache is controlled by the MMU and is programmer-transparent. It can be implemented at one or multiple levels, depending on the speed and application requirements.

Main Memory
The main memory is sometimes called the primary memory of a computer system. It is usually much larger than the cache and is often implemented by the most cost-effective RAM chips, such as the 4-Mbit DRAMs used in 1991 and the 64-Mbit DRAMs projected for 1995. The main memory is managed by an MMU in cooperation with the operating system. Options are often provided to extend the main memory by adding more memory boards to a system. Sometimes the main memory is itself divided into two sublevels using different memory technologies.

Disk Drives and Tape Units
Disk drives and tape units are handled by the OS with limited user intervention. Disk storage is considered the highest level of on-line memory; it holds system programs such as the OS and compilers, as well as some user programs and their data sets. Magnetic tape units are off-line memory used for backup storage; they hold copies of present and past user programs, processed results, and files. A typical workstation computer has the cache and main memory on a processor board and hard disks in an attached disk drive. Accessing the magnetic tape units requires user intervention. Table presents representative values of memory parameters for a typical 32-bit mainframe computer built in 1993.

Peripheral Technology
Besides disk drives and tape units, peripheral devices include printers, plotters, terminals, monitors, graphics displays, optical scanners, image digitizers, output microfilm devices, etc. Some I/O devices are tied to special-purpose or multimedia applications. The technology of peripheral devices has improved rapidly in recent years. For example, dot-matrix printers were used in the past; now that laser printers have become so popular, in-house publishing has become a reality. The high demand for multimedia I/O such as image, speech, video, and sonar will further upgrade I/O technology in the future.

Inclusion, Coherence, and Locality
Information stored in a memory hierarchy (M1, M2, ..., Mn) satisfies three important properties: inclusion, coherence, and locality, as illustrated in Figure. We consider cache memory the innermost level M1, which directly communicates with the CPU registers. The outermost level Mn contains all the information words stored. In fact, the collection of all addressable words in Mn forms the virtual address space of a computer. Program locality is characterized below as the foundation for using a memory hierarchy effectively.

Inclusion Property
The inclusion property is stated as M1 ⊂ M2 ⊂ M3 ⊂ ... ⊂ Mn. The set-inclusion relationship implies that all information items are originally stored in the outermost level Mn. During processing, subsets of Mn are copied into Mn-1; similarly, subsets of Mn-1 are copied into Mn-2, and so on. In other words, if an information word is found in Mi, then copies of the same word can also be found in all upper levels Mi+1, Mi+2, ..., Mn. However, a word stored in Mi+1 may not be found in Mi. A word miss in Mi implies that the word is also missing from all lower levels Mi-1, Mi-2, ..., M1. The highest level is the backup storage, where everything can be found.

Information transfer between the CPU and the cache is in terms of words (4 or 8 bytes each, depending on the word length of the machine). The cache (M1) is divided into cache blocks, also called cache lines by some authors. Each block is typically 32 bytes (8 words). Blocks (such as "a" and "b") are the units of data transfer between the cache and the main memory. The main memory (M2) is divided into pages, say, 4 Kbytes each; each page contains 128 blocks in the example in Figure. Pages are the units of information transferred between disk and main memory. Scattered pages are organized as a segment in the disk memory; for example, segment F contains page A, page B, and other pages. The size of a segment varies depending on the user's needs. Data transfer between the disk and the tape unit is handled at the file level, such as segments F and G illustrated in Figure.
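The unit sizes quoted above fit together arithmetically: a 4-Kbyte page holds 4096 / 32 = 128 blocks of 32 bytes, and each 32-byte block holds 8 four-byte words. The sketch below assumes those sizes and shows how one address maps to a word, block, and page; it is a generic illustration, not code from the text.

```c
#include <stdint.h>
#include <stdio.h>

#define WORD_SIZE  4      /* bytes per word              */
#define BLOCK_SIZE 32     /* bytes per block (8 words)   */
#define PAGE_SIZE  4096   /* bytes per page (128 blocks) */

int main(void) {
    uint32_t addr = 0x000A1C64u;  /* arbitrary example address */

    uint32_t page          = addr / PAGE_SIZE;
    uint32_t block_in_page = (addr % PAGE_SIZE) / BLOCK_SIZE;
    uint32_t word_in_block = (addr % BLOCK_SIZE) / WORD_SIZE;

    printf("page %u, block %u of 128, word %u of 8\n",
           page, block_in_page, word_in_block);
    /* CPU <-> cache moves words, cache <-> main memory moves blocks,
       and main memory <-> disk moves pages, as described above. */
    return 0;
}
```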

Coherence Property
The coherence property requires that copies of the same information item at successive memory levels be consistent. If a word is modified in the cache, copies of that word must be updated, immediately or eventually, at all higher levels, so that the hierarchy as a whole remains consistent. Frequently used information is often found in the lower levels in order to minimize the effective access time of the memory hierarchy. In general, there are two strategies for maintaining coherence in a memory hierarchy.

The first method is called write-through (WT), which demands an immediate update in Mi+1 if a word is modified in Mi, for i = 1, 2, ..., n - 1. The second method is write-back (WB), which delays the update in Mi+1 until the word being modified in Mi is replaced or removed from Mi. (A code sketch of the two policies follows this subsection.)

Locality of References
The memory hierarchy was developed based on a program behavior known as locality of references. Memory references are generated by the CPU for either instruction or data access, and these accesses tend to be clustered in certain regions in time, space, and ordering. In other words, most programs favor a certain portion of their address space during any time window. Hennessy and Patterson (1990) have pointed out a 90-10 rule, which states that a typical program may spend 90% of its execution time on only 10% of the code, such as the innermost loop of a nested looping operation.

There are three dimensions of the locality property: temporal, spatial, and sequential. During the lifetime of a software process, a number of pages are used dynamically. The references to these pages vary from time to time; however, they follow certain access patterns, as illustrated in Figure. These memory reference patterns are caused by the following locality properties:

Temporal locality - Recently referenced items (instructions or data) are likely to be referenced again in the near future. This is often caused by program constructs such as iterative loops, process stacks, temporary variables, and subroutines. Once a loop is entered or a subroutine is called, a small code segment will be referenced repeatedly many times. Thus temporal locality tends to cluster accesses in the recently used areas.

Spatial locality - This refers to the tendency of a process to access items whose addresses are near one another. For example, operations on tables or arrays involve accesses to a clustered area of the address space, and program segments, such as routines and macros, tend to be stored in the same neighborhood of the memory space.

Sequential locality - In typical programs, the execution of instructions follows a sequential order (the program order) unless branch instructions create out-of-order execution. The ratio of in-order execution to out-of-order execution is roughly 5 to 1 in ordinary programs. In addition, access to a large data array also follows a sequential order.
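As promised above, here is a hedged sketch of the two update policies for a single cache line. The structure and function names are invented for illustration; real caches implement this logic in hardware.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 32

/* One cache line holding a copy of a block from the next level (Mi+1). */
struct line {
    unsigned char data[BLOCK_SIZE];
    bool dirty;              /* meaningful only under write-back */
};

static unsigned char next_level[BLOCK_SIZE];  /* stands in for Mi+1 */

/* Write-through: update the line and Mi+1 immediately. */
void write_through(struct line *l, int off, unsigned char v) {
    l->data[off] = v;
    next_level[off] = v;     /* immediate update in Mi+1 */
}

/* Write-back: update only the line now and mark it dirty. */
void write_back(struct line *l, int off, unsigned char v) {
    l->data[off] = v;
    l->dirty = true;         /* Mi+1 is now stale */
}

/* On replacement, a dirty line must be flushed to Mi+1. */
void evict(struct line *l) {
    if (l->dirty) {
        memcpy(next_level, l->data, BLOCK_SIZE);
        l->dirty = false;
    }
}

int main(void) {
    struct line l = { {0}, false };
    write_through(&l, 0, 0xAA);   /* both copies consistent at once */
    write_back(&l, 1, 0xBB);      /* copies diverge until eviction  */
    evict(&l);                    /* coherence restored lazily      */
    printf("next_level[1] = 0x%X\n", next_level[1]);
    return 0;
}
```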

Memory Design Implications
The sequentiality in program behavior also contributes to spatial locality, because sequentially coded instructions and array elements are often stored in adjacent locations. Each type of locality affects the design of the memory hierarchy. Temporal locality leads to the popularity of the least recently used (LRU) replacement algorithm. Spatial locality assists in determining the size of the unit of data transfer between adjacent memory levels, and temporal locality also helps determine the size of memory at successive levels. Sequential locality affects the determination of grain size for optimal scheduling (grain packing). Prefetch techniques are heavily affected by the locality properties. The principle of locality guides the design of cache, main memory, and even virtual memory organization.

The Working Sets
Figure shows the memory reference patterns of three programs or three software processes. As a function of time, the virtual address space (identified by page numbers) is clustered into regions due to the locality of references. The subset of addresses (or pages) referenced within a given time window (t, t + Δt) is called the working set by Denning (1968). During the execution of a program, the working set changes slowly and maintains a certain degree of continuity, as demonstrated in Figure. This implies that the working set is often accumulated at the innermost (lowest) level, such as the cache, in the memory hierarchy, which reduces the effective memory-access time through a higher hit ratio at the lowest memory level. The time window Δt is a critical parameter set by the OS kernel; it affects the size of the working set and thus the desired cache size.
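The working set can be computed directly from a page-reference trace: it is the set of distinct pages touched in the last Δt references. The trace and window size below are invented for illustration.

```c
#include <stdio.h>

#define NUM_PAGES 10

/* Count the distinct pages referenced in the window ending at time t. */
int working_set_size(const int *trace, int t, int dt) {
    int seen[NUM_PAGES] = {0}, count = 0;
    int start = (t - dt + 1 > 0) ? t - dt + 1 : 0;
    for (int i = start; i <= t; i++) {
        if (!seen[trace[i]]) { seen[trace[i]] = 1; count++; }
    }
    return count;
}

int main(void) {
    /* An invented reference string showing a slowly drifting locality. */
    int trace[] = {1,2,1,2,3,1,2,3,3,4,5,4,5,6,5,4,6,6,7,8};
    int n = sizeof(trace) / sizeof(trace[0]);
    int dt = 5;  /* window size: a tunable OS parameter */

    for (int t = dt - 1; t < n; t++)
        printf("t=%2d  |W(t,%d)| = %d\n", t, dt,
               working_set_size(trace, t, dt));
    /* The size stays small and changes slowly, which is why the working
       set tends to fit in the innermost level (the cache). */
    return 0;
}
```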

Data Dependence
Five types of data dependence are defined below (a short code sketch of the first three types follows this section):

(1) Flow dependence: A statement S2 is flow-dependent on statement S1 if an execution path exists from S1 to S2 and if at least one output (a variable assigned) of S1 feeds in as input (an operand to be used) to S2. Flow dependence is denoted S1 -> S2.

(2) Antidependence: Statement S2 is antidependent on statement S1 if S2 follows S1 in program order and if the output of S2 overlaps the input of S1. A direct arrow crossed with a bar (written here as S1 -/-> S2) indicates antidependence from S1 to S2.

(3) Output dependence: Two statements are output-dependent if they produce (write) the same output variable. An arrow with a circle (written here as S1 -o-> S2) indicates output dependence from S1 to S2.

(4) I/O dependence: Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.

(5) Unknown dependence: The dependence relation between two statements cannot be determined in the following situations: the subscript of a variable is itself subscripted (indirect addressing); the subscript does not contain the loop index variable; a variable appears more than once with subscripts having different coefficients of the loop variable; or the subscript is nonlinear in the loop index variable.

Resource Dependence
Resource dependence is a conflict in using shared resources, such as integer units, floating-point units, registers, and memory areas, among parallel events. When the conflicting resource is an ALU, it is called ALU dependence; if the conflict involves workplace storage, it is called storage dependence. To avoid storage dependence, parallel events must work on independent storage locations or use protected access. A sequential program can be converted into parallel executable form by the programmer explicitly or by the compiler implicitly. Program partitioning is the process of splitting a program into pieces for parallel execution.

Bernstein's Conditions
Bernstein revealed a set of conditions under which two processes can execute in parallel. A process is a software entity corresponding to the abstraction of a program fragment defined at various levels. Denoting the input and output sets of process Pi by Ii and Oi, two processes P1 and P2 can execute in parallel if I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅.

Hardware Parallelism
Hardware parallelism is often a function of cost (machine architecture) and performance (hardware multiplicity) trade-offs. It displays the resource utilization patterns and the peak performance of the processor resources. A processor that issues k instructions per machine cycle is called a k-issue processor; a conventional processor that takes one or more machine cycles to issue a single instruction is a one-issue machine. For example, the Intel i960CA is a three-issue processor (arithmetic, memory, and branch), and the IBM RISC/System 6000 is a four-issue processor (arithmetic, memory, branch, and floating-point).

Software Parallelism
Software parallelism is defined by the control and data dependence of programs, as revealed in the program profile or the program flow graph. It is a function of algorithm, programming style, and compiler optimization, and it varies during the execution period of a program. It displays the patterns of simultaneously executable operations.
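As referenced above, the following C fragment illustrates the first three dependence types; the statement labels and variables are invented for illustration.

```c
#include <stdio.h>

int main(void) {
    int a, b = 2, c = 3, d, e = 7;

    a = b + c;   /* S1: writes a                                   */
    d = a * 2;   /* S2: reads a  -> flow dependence   S1 -> S2     */
    b = e - 1;   /* S3: writes b, which S1 read
                        -> antidependence  S1 -/-> S3              */
    a = e + 4;   /* S4: writes a again, as S1 did
                        -> output dependence  S1 -o-> S4           */

    /* S2 must follow S1; S3 must not overwrite b before S1 reads it;
       S4 must not complete before S1, or the final value of a flips. */
    printf("a=%d b=%d d=%d\n", a, b, d);
    return 0;
}
```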
