SIMD Supercomputers
The operational model of an SIMD computer is a 5-tuple (N, C, I, M, R).
N = number of processing elements (PEs).
C = set of instructions executed directly by the control unit (including scalar and flow-control instructions).
I = set of instructions broadcast to all PEs for parallel execution.
M = set of masking schemes used to partition the PEs into enabled and disabled subsets.
R = set of data-routing functions enabling inter-PE communication through the interconnection network.
Interconnection Network
Clock Rate
Clock rate: the CPU is driven by a clock with a constant cycle time τ. The clock rate is the inverse of the cycle time, f = 1/τ, in megahertz. For example, a cycle time of τ = 10 ns gives f = 100 MHz.
The clock rate is the rate in cycles per second (measured in hertz) or the frequency of the clock in any synchronous circuit, such as a central processing unit (CPU).
Throughput rate
Reservation Table
Specifies the utilization pattern of successive pipeline stages. Utilization follows a diagonal streamline. A task needs k clock cycles to flow through a k-stage linear pipeline. One result emerges at each cycle if the tasks are independent of one another.
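A minimal sketch of this timing (the stage count and task count below are made-up values, not from the notes): it prints the reservation table of a linear pipeline and shows one result emerging per cycle once the pipe is full.

```python
# Reservation table for a linear (static) pipeline: task i occupies
# stage s during cycle i + s, so utilization follows a diagonal streamline.
def reservation_table(k, n):
    """k = number of stages, n = number of independent tasks."""
    total_cycles = k + n - 1          # first task needs k cycles; then 1 result/cycle
    table = [["." for _ in range(total_cycles)] for _ in range(k)]
    for task in range(n):
        for stage in range(k):
            table[stage][task + stage] = str(task + 1)
    return table

for row in reservation_table(k=4, n=3):
    print(" ".join(row))
# Task 1 emerges after k = 4 cycles; tasks 2 and 3 follow one cycle apart.
```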
Sep 2010 Q6
Explain TLB, Paging and Segmentation in virtual memory
Virtual Memory
To facilitate the use of memory hierarchies, the memory addresses normally generated by modern processors executing application programs are not physical addresses, but are rather virtual addresses of data items and instructions. Physical addresses, of course, are used to reference the available locations in the real physical memory of a system. Virtual addresses must be mapped to physical addresses before they can be used.
Mapping Efficiency
Efficient implementations are more difficult in multiprocessor systems where additional problems such as coherence, protection, and consistency must be addressed.
In the private virtual memory scheme, each processor has a separate virtual address space, but all processors share the same physical address space. In the shared virtual memory scheme, all processors share a single virtual address space, with each processor being given a portion of it.
Memory Allocation
Both the virtual address space and the physical address space are divided into fixed-length pieces. In the virtual address space these pieces are called pages; in the physical address space they are called page frames. The purpose of memory allocation is to allocate pages of virtual memory using the page frames of physical memory.
The virtual address can be used with a hash function to locate the translation map (which is stored in the cache, an associative memory, or main memory). The translation map comprises a translation lookaside buffer, or TLB (usually in associative memory), and a page table (or tables). The virtual address is first sought in the TLB, and if that search succeeds, no further translation is necessary. Otherwise, the page table(s) must be referenced to obtain the translation result. If the virtual address cannot be translated to a physical address because the required page is not present in primary memory, a page fault is reported.
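A minimal sketch of the translation path just described; the page size, TLB contents, and page table below are hypothetical:

```python
PAGE_SIZE = 4096  # hypothetical 4 KB pages

tlb = {5: 12}                 # virtual page number -> page frame number (fast path)
page_table = {5: 12, 6: 30}   # full map; pages absent here are not resident

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                      # TLB hit: no further translation needed
        return tlb[vpn] * PAGE_SIZE + offset
    if vpn in page_table:               # TLB miss: walk the page table, refill TLB
        tlb[vpn] = page_table[vpn]
        return page_table[vpn] * PAGE_SIZE + offset
    raise LookupError("page fault: page %d not in primary memory" % vpn)

print(hex(translate(5 * PAGE_SIZE + 0x10)))  # TLB hit
print(hex(translate(6 * PAGE_SIZE + 0x20)))  # TLB miss, page-table hit
```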
Memory Generalities
It is almost always the case that memories at lower-numbered levels, when compared to those at higher-numbered levels, are faster to access, are smaller in capacity, are more expensive per byte, have a higher bandwidth, and have a smaller unit of transfer. In general, then, t_{i-1} < t_i, s_{i-1} < s_i, c_{i-1} > c_i, b_{i-1} > b_i, and x_{i-1} < x_i, where t is access time, s is capacity, c is cost per byte, b is bandwidth, and x is the unit of transfer.
Hit Ratios
When a needed item (instruction or data) is found in the level of the memory hierarchy being examined, it is called a hit; otherwise it is a miss, and the item must be obtained from a lower level in the hierarchy. The hit ratio h_i for M_i is the probability (between 0 and 1) that a needed data item is found when sought in memory level M_i. The miss ratio is then just 1 - h_i. We assume h_0 = 0 and h_n = 1.
Access Frequencies
The access frequency f_i to level M_i is f_i = (1 - h_1)(1 - h_2) ... (1 - h_{i-1}) h_i. The access frequencies sum to one: Σ_{i=1}^{n} f_i = 1.
A cache miss is typically 2 to 4 times as expensive as a cache hit (assuming success at the next level). A page fault (miss) is 3 to 4 orders of magnitude as costly as a page hit.
The effective access time of the hierarchy is T_eff = Σ_{i=1}^{n} f_i t_i.
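A small numeric sketch of the two formulas above; the hit ratios and access times are hypothetical:

```python
# Effective access time of a memory hierarchy:
#   f_i = (1 - h_1)(1 - h_2) ... (1 - h_{i-1}) h_i,   sum(f_i) = 1
#   T_eff = sum(f_i * t_i)
h = [0.95, 0.98, 1.0]        # hit ratios for cache, main memory, disk (h_n = 1)
t = [2, 50, 5_000_000]       # access times in ns (made-up values)

f, miss_so_far = [], 1.0
for hi in h:
    f.append(miss_so_far * hi)
    miss_so_far *= (1 - hi)

print(sum(f))                                # 1.0: every access resolves somewhere
print(sum(fi * ti for fi, ti in zip(f, t)))  # T_eff in ns
```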
Hardware Parallelism
Hardware parallelism is defined by machine architecture and hardware multiplicity. It can be characterized by the number of instructions that can be issued per machine cycle. If a processor issues k instructions per machine cycle, it is called a k-issue processor; conventional processors are one-issue machines. Examples: the Intel i960CA is a three-issue processor (arithmetic, memory access, branch); the IBM RS/6000 is a four-issue processor (arithmetic, floating-point, memory access, branch). A machine with n k-issue processors should be able to handle a maximum of nk threads simultaneously.
Software Parallelism
Software parallelism is defined by the control and data dependences of programs, and is revealed in the program's flow graph. It is a function of algorithm, programming style, and compiler optimization.
(Figure: software parallelism in a dataflow graph. Cycle 1: four loads L1, L2, L3, L4; cycle 2: two multiplies X1 and X2; cycle 3: an add and a subtract.)
There is an obvious tradeoff between the time spent scheduling and synchronizing parallel grains and the speedup obtained by parallel execution. One approach to the problem is called grain packing.
In a program graph, nodes = { (n, s) }, where n = node name and s = size (larger s means larger grain size). Edges = { (v, d) }, where v = the variable being communicated and d = the communication delay.
Packing two (or more) nodes produces a node with a larger grain size and possibly more edges to other nodes. Packing is done to eliminate unnecessary communication delays or reduce overall scheduling overhead.
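A toy sketch of packing two program-graph nodes; the node sizes and delays below are made up for illustration. Merging A and B removes their internal edge, so its communication delay disappears:

```python
# Program graph: nodes (name -> size); edges (u, v, variable, delay).
nodes = {"A": 2, "B": 3, "C": 4}
edges = [("A", "B", "x", 6), ("A", "C", "y", 4), ("B", "C", "z", 5)]

def pack(nodes, edges, a, b, packed_name):
    """Merge nodes a and b into one larger grain, dropping the
    internal edge (its communication delay is eliminated)."""
    merged = dict(nodes)
    merged[packed_name] = merged.pop(a) + merged.pop(b)
    rename = lambda n: packed_name if n in (a, b) else n
    return merged, [(rename(u), rename(v), var, d)
                    for (u, v, var, d) in edges
                    if {rename(u), rename(v)} != {packed_name}]

nodes2, edges2 = pack(nodes, edges, "A", "B", "AB")
print(nodes2)   # {'C': 4, 'AB': 5} -- a coarser grain
print(edges2)   # the A->B delay of 6 is gone
```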
Scheduling
A schedule is a mapping of nodes to processors and start times such that communication delay requirements are observed and no two nodes execute on the same processor at the same time. Some general scheduling goals: schedule all fine-grain activities in a node to the same processor to minimize communication delays, and select grain sizes for packing to achieve better schedules for a particular parallel machine.
Latency
Latency is the time required for communication between different subsystems in a computer. Memory latency, for example, is the time required by a processor to access memory. Synchronization latency is the time required for two processes to synchronize their execution. Computational granularity and communication latency are closely related.
Levels of Parallelism
Jobs or programs: coarse grain
Subprograms, job steps, or related parts of a program: coarse grain
Procedures, subroutines, tasks, or coroutines: medium grain
Non-recursive loops or unfolded iterations: fine grain
Instructions or statements: fine grain
Loop-level Parallelism
A typical loop has fewer than 500 instructions. If a loop operation is independent between iterations, it can be handled by a pipeline or by a SIMD machine. Loops are the most optimized program construct to execute on a parallel or vector machine, although some loops (e.g. recursive ones) are difficult to handle. Loop-level parallelism is still considered fine-grain computation.
Procedure-level Parallelism
Medium-sized grain; usually fewer than 2000 instructions. Detection of parallelism is more difficult than with smaller grains; interprocedural dependence analysis is difficult and history-sensitive. The communication requirement is less than at the instruction level. SPMD (single procedure, multiple data) is a special case. Multitasking belongs to this level.
Subprogram-level Parallelism
Job step level; a grain typically has thousands of instructions; medium- or coarse-grain level. Job steps can overlap across different jobs. Multiprogramming is conducted at this level. No compilers are currently available to exploit medium- or coarse-grain parallelism.
Summary
Fine grain is exploited at the instruction or loop level, assisted by the compiler. Medium grain (task or job step) requires programmer and compiler support. Coarse grain relies heavily on effective OS support. Shared-variable communication is used at the fine- and medium-grain levels. Message passing can be used for medium- and coarse-grain communication, but fine grain really needs a better technique because of its heavier communication requirements.
Communication Latency
Balancing granularity and latency can yield better performance. The various latencies are attributable to machine architecture, technology, and the communication patterns used. Latency imposes a limiting factor on machine scalability. Example: memory latency increases as memory capacity increases, limiting the amount of memory that can be used within a given tolerance for communication latency.
Degree of Parallelism
The number of processors used at any instant to execute a program is called the degree of parallelism (DOP); this can vary over time. DOP assumes an infinite number of processors are available; this is not achievable in real machines, so some parallel program segments must be executed sequentially as smaller parallel segments. Other resources may impose limiting conditions. A plot of DOP vs. time is called a parallelism profile.
(Figure: a parallelism profile, plotting DOP against time over the interval from t1 to t2.)
Average Parallelism
The average parallelism over the interval [t1, t2] is A = (1 / (t2 - t1)) ∫_{t1}^{t2} DOP(t) dt, and the total work is W = Δ ∫_{t1}^{t2} DOP(t) dt, where Δ is the computing capacity of a single processor.
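A small sketch evaluating A for a piecewise-constant parallelism profile; the DOP values and durations are hypothetical:

```python
# Average parallelism A = (1 / (t2 - t1)) * integral of DOP(t) dt.
# For a piecewise-constant profile the integral is a weighted sum.
profile = [(3, 2.0), (7, 1.0), (4, 3.0), (1, 2.0)]  # (DOP, duration) pairs

work = sum(dop * dt for dop, dt in profile)   # total work W (taking Delta = 1)
span = sum(dt for _, dt in profile)           # t2 - t1
print(work / span)                            # average parallelism A = 3.375
```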
Available Parallelism
Various studies have shown that the potential parallelism in scientific and engineering calculations can be very high (e.g. hundreds or thousands of instructions per clock cycle). But in real machines, the actual parallelism is much smaller (e.g. 10 or 20).
Cache performance is studied in terms of cycle counts, hit ratios, the effect of block size, the effect of set number, and other factors.
Coherence Strategies
Write-through: as soon as a data item in M_i is modified, an immediate update of the corresponding data item(s) in M_{i+1}, M_{i+2}, ..., M_n is required. This is the most aggressive (and expensive) strategy.
Write-back: the update of the data item in M_{i+1} corresponding to a modified item in M_i is not performed until it (or the block/page that contains it) is replaced or removed. This is the most efficient approach, but it cannot be used (without modification) when multiple processors share M_{i+1}, ..., M_n.
Cycle count: the number of machine cycles needed for cache access, update, and coherence. Hit ratio: how effectively the cache reduces the overall memory access time. Program-trace-driven simulation: presents snapshots of program behavior and cache responses. Analytical modeling: provides insight into the underlying processes.
Cycle Counts
Cache speed is affected by the underlying static or dynamic RAM technology, the cache organization, and the hit ratio. Write-through/write-back policies affect the count, as do cache size, block size, set number, and associativity. Cycle count is directly related to hit ratio.
Hit Ratio
The hit ratio is affected by cache size and block size, and increases with increasing cache size. Limited cache size, initial loading, and changes in locality prevent a 100% hit ratio.
With a fixed cache size, block size has an impact: as block size increases, the hit ratio improves due to spatial locality, peaks at an optimum block size, then decreases. If the block is too large, many words brought into the cache are never used.
(Figure: cache performance.)
Mar 2010 Q6
Superpipelined processor of degree n: the pipeline cycle time is 1/n of the base cycle. A fixed-point addition takes one cycle in a base scalar processor but n short cycles in a superpipelined processor. Issue rate = 1, issue latency = 1/n, ILP = n. Requires high-speed clocking.
Superpipelined Performance
Superpipelined Superscalar
Degree (m, n): executes m instructions every cycle with a pipeline cycle of 1/n of the base cycle. Instruction issue latency = 1/n; ILP = mn instructions.
For N independent instructions on a degree (m, n) machine: T(m,n) = k + (N - m) / (mn) base cycles, and the speedup over the base scalar machine is S(m,n) = mn(k + N - 1) / (mnk + N - m).
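A quick numeric check of these two formulas; the values of k, N, m, and n below are hypothetical:

```python
# Execution time T(m,n) in base cycles, and speedup S(m,n) over the
# base scalar machine, for a superpipelined superscalar of degree (m,n).
def T(m, n, k, N):
    return k + (N - m) / (m * n)

def S(m, n, k, N):
    return (m * n * (k + N - 1)) / (m * n * k + N - m)

k, N = 4, 1000          # pipeline depth and instruction count (assumed)
print(T(3, 2, k, N))    # ~170.2 base cycles for a degree (3,2) machine
print(S(3, 2, k, N))    # ~5.89, approaching mn = 6 as N grows
```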
Design Approaches
Superpipelined: emphasizes temporal parallelism; requires faster transistors; the design must minimize the effects of clock skew.
Superscalar: depends on spatial parallelism; requires more transistors; a better match for CMOS technology.
Mar 2010 Q4
Explain buses and interfaces in a multiprocessor system.
Timing: synchronous buses are controlled by a global clock; asynchronous buses use handshaking or interlock mechanisms.
Switching method: in circuit switching, a pair of communicating devices controls the path for the entire duration of the data transfer; in packet switching, large data transfers are broken into smaller pieces, each of which can compete for use of the path.
Network control: with centralized control, a global controller receives and acts on requests; with distributed control, requests are handled by local devices independently.
Digital Buses
Digital buses are the fundamental interconnects adopted in most commercial multiprocessor systems with fewer than 100 processors. The principal limitation of the bus approach is packaging technology. Complete bus specifications include logical, electrical, and mechanical properties, application profiles, and interface requirements.
Bus Systems
A bus system is a hierarchy of buses connecting various system and subsystem components. Each bus has a complement of control, signal, and power lines. There is usually a variety of buses in a system:
Local bus: usually integral to a system board; connects various major system components (chips).
Memory bus: used within a memory board to connect the interface, the controller, and the memory cells.
Data bus: might be used on an I/O board or VLSI chip to connect various components.
Backplane: like a local bus, but with connectors to which other boards can be attached.
Bridges
The term bridge denotes a device used to connect two (or possibly more) buses. The interconnected buses may use the same standards, or they may be different (e.g. the PCI and ISA buses in a modern PC). Bridge functions include communication protocol conversion and interrupt handling across the interconnected buses.
Most systems use a private cache for each processor, with an interconnection network between the caches and main memory. Caches are addressed using either a physical address or a virtual address.
In a physical address cache, the cache is indexed and tagged with the physical address, so cache lookup occurs after address translation in the TLB or MMU (no aliasing). After a cache miss, a block is loaded from main memory, using either a write-back or write-through policy.
Advantages: no cache flushing needed, no aliasing problems, a simplistic design, and little intervention required from the OS kernel.
Disadvantage: slowdown, because the cache cannot be accessed until the MMU/TLB finishes address translation.
In a virtual address cache, the cache is indexed or tagged with the virtual address, so cache access and MMU translation/validation are performed in parallel. The physical address is saved in the tags for write-back. This gives more efficient access to the cache.
Aliasing Problem
Different logically addressed data may have the same index/tag in the cache, causing confusion if two or more processes access the same physical cache location. One remedy is to flush the cache when aliasing occurs, but this leads to slowdown; alternatively, apply special tagging with a process key or with a physical address.
Cache performance depends upon access patterns, cache organization, and management policy. Blocks in the cache are called block frames, denoted B_i (i = 1, ..., m); blocks in main memory are denoted B_j (j = 1, ..., n), with m << n, m = 2^r, and n = 2^s. Each block has b = 2^w words, so the cache holds a total of mb = 2^{r+w} words and main memory nb = 2^{s+w} words.
Direct mapping: n/m memory blocks map to each block frame in the cache. Placement uses a modulo-m function: B_j → B_i if i = j mod m, so there is a unique block frame B_i into which each B_j can load. This is the simplest organization to implement.
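A minimal sketch of the modulo placement rule; the cache geometry and block numbers are made up:

```python
# Direct mapping: memory block B_j can live only in frame B_i, i = j mod m.
m = 8                     # cache block frames (2^r, assumed)
for j in (3, 11, 19):     # memory block numbers
    print(f"block {j} -> frame {j % m}")
# Blocks 3, 11, and 19 all contend for frame 3: rigid mapping, possible thrashing.
```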
Advantages: simple hardware, no associative search, no page replacement policy needed, lower cost, higher speed.
Disadvantages: rigid mapping, poorer hit ratio, and parallel virtual address translation is prohibited; a larger cache with more block frames is needed to avoid contention.
In a fully associative cache, each block in main memory can be placed in any of the available block frames. An s-bit tag is needed in each cache block (s > r). An m-way associative search requires the tag to be compared with all cache block tags; an associative memory achieves a parallel comparison with all tags concurrently.
Advantages: the most flexibility in mapping cache blocks, a higher hit ratio, and a better block replacement policy with reduced block contention.
Disadvantages: higher hardware cost, only moderate cache sizes are practical, and an expensive search process.
In a k-way set-associative cache, the m cache block frames are divided into v = m/k sets, with k blocks per set. Each set is identified by a d-bit set number. The tag is compared only with the k tags within the identified set: B_j → B_f ∈ S_i if j mod v = i.
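A sketch of the k-way placement rule; the geometry and the stand-in FIFO replacement below are assumptions for illustration:

```python
# k-way set-associative: v = m / k sets; block j maps to set j mod v,
# then its tag is compared with the k tags inside that set only.
m, k = 16, 4
v = m // k                          # number of sets
sets = [[] for _ in range(v)]       # each set holds up to k blocks

def place(j):
    s = j % v
    if len(sets[s]) == k:           # set full: a replacement policy evicts here
        sets[s].pop(0)              # FIFO, as a stand-in for LRU etc.
    sets[s].append(j)
    return s

for j in (5, 9, 13, 17, 21):
    print(f"block {j} -> set {place(j)}")  # all five map to set 1; the 5th evicts
```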
Sector mapping: partition the cache and main memory into fixed-size sectors, then use a fully associative search over sector tags; block fields within the sector locate the block. Only the missing block is loaded on a miss. The i-th block in a sector is placed into the i-th block frame of the destined sector frame. A valid/invalid bit is attached to each block frame.
The system bus operates on a contention basis: only one requester is granted access to the bus at a time, so the effective bandwidth available to each processor is inversely proportional to the number of contending processors. Simple and low cost, suitable for roughly 4 to 16 processors.
The bus interconnects processors, data storage, and I/O devices, and must allow communication between these devices. Timing protocols govern arbitration, and operational rules ensure orderly data transfers. Signal lines are grouped into several buses.
The bus is composed of data, address, and control lines. Address lines broadcast the data and device addresses. The number of data lines is proportional to the memory word length. Control lines specify read/write operations, timing, and bus error conditions.
Arbitration assigns control of the data transfer bus (DTB); the requester is the master and the receiving end is the slave. Interrupt lines handle prioritized interrupts. Dedicated lines synchronize parallel activities among processor modules. Utility lines provide periodic timing and coordinate the power-up/power-down sequences. A bus controller board houses the control logic.
Functional Modules
Arbiter: the functional module that performs arbitration. Bus timer: measures the time for data transfers. Interrupter: generates interrupt requests and provides status/ID to the interrupt handler. Location monitor: monitors data transfers. Power monitor: monitors the power source. System clock driver: provides the clock timing signal on the utility bus. Board interface logic: matches signal-line impedance, propagation time, and termination values.
Physical Limitations
Electrical, mechanical, and packaging limitations restrict the number of boards. Multiple backplane buses can be mounted on the same backplane chassis, but it is difficult to scale a bus system due to packaging constraints.
Aug 2009 Q2
How do you classify pipeline processors? Give examples.
Arithmetic Pipelining: The arithmetic logic units of a computer can be segmented for pipeline operations in various data formats. A well-known arithmetic pipeline example is the four-stage pipe used in the Star-100.
Instruction Pipelining: The execution of a stream of instructions can be pipelined by overlapping the execution of the current instruction with the fetch, decode, and operand fetch of subsequent instructions. This technique is also known as instruction lookahead.
Processor Pipelining: This refers to the pipeline processing of the same data stream by a cascade of processors, each of which processes a specific task. The data stream passes through the first processor, with the results stored in a memory block that is also accessible by the second processor. The second processor then passes its refined results to the third, and so on.
Pipeline throughput (the number of tasks completed per unit time; note the equivalence to IPC): H_k = n / ([k + (n - 1)] τ) = nf / (k + n - 1), where τ is the clock period and f = 1/τ.
Example: with k = 4 stages and a 100 ns clock, the pipeline time for 1000 tasks is (1000 + 4 - 1) × 100 ns = 1003 × 100 ns, while the sequential time is 1000 × 280 ns. The speedup approaches 280/100 = 2.8, and the throughput is 1000/1003 results per cycle. What is the problem here? How can performance be improved?
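A short calculator reproducing the numbers in this example (the 280 ns sequential task time and 100 ns clock come from the example above; the function name is ours):

```python
def pipeline_stats(n, k, clock_ns, seq_task_ns):
    pipe_ns = (k + n - 1) * clock_ns   # n tasks through a k-stage pipeline
    seq_ns = n * seq_task_ns           # unpipelined execution time
    return pipe_ns, seq_ns / pipe_ns, n / (k + n - 1)

pipe, speedup, util = pipeline_stats(n=1000, k=4, clock_ns=100, seq_task_ns=280)
print(pipe)      # 100300 ns
print(speedup)   # ~2.79, approaching 280/100 = 2.8 for large n
print(util)      # 1000/1003 results per cycle
```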
Aug 2009 Q4
What do you understand by loose and tight coupling of processors in a multiprocessor system? Explain the salient features of each type. Discuss the processor characteristics for a multiprocessor system.
Characteristics of Multiprocessors
SHARED-MEMORY MULTIPROCESSORS (tightly coupled)
(Figure: processors P ... P connected through an interconnection network to shared memory.)
Characteristics: all processors have equally direct access to one large memory address space.
Example systems: bus- and cache-based systems (Sequent Balance, Encore Multimax); multistage interconnection-network-based systems (Ultracomputer, Butterfly, RP3, HEP); crossbar-switch-based systems (C.mmp, Alliant FX/8).
Limitations: memory access latency; the hot-spot problem.
Characteristics of Multiprocessors
MESSAGE-PASSING MULTIPROCESSORS
(Figure: processors P ... P, each with its own memory M, connected by point-to-point links in a message-passing network.)
Characteristics: interconnected computers; each processor has its own memory, and processors communicate via message passing (loosely coupled).
Example systems: tree structure (Teradata, DADO); mesh-connected (Rediflow, Series 2010, J-Machine); hypercube (Cosmic Cube, iPSC, NCUBE, FPS T Series, Mark III).
Limitations: communication overhead; hard to program.
Aug 2009 / Q5
Write short notes on
Flynn's Classification
Classification based on notions of instruction streams and data streams. Architecture categories:
SISD
SIMD
MISD
MIMD
SISD: a single instruction stream operating on a single data stream; conventional uniprocessors.
(Figure: block diagram of the SISD organization, with one instruction stream (IS) and one data stream (DS).)
SIMD: i. The same instruction is executed by multiple processors using different data streams. ii. Each processor has its own data memory (hence multiple data). iii. There is a single instruction memory and control processor.
(Figure: block diagram of the SIMD organization, with one control unit C broadcasting the IS to multiple processors P, each with its own DS.)
MISD: multiple instruction streams operate on a single data stream.
(Figure: block diagram of the MISD organization, with multiple IS and a single DS through memory M.)
MIMD: multiple instruction streams operating on multiple data streams.
(Figure: block diagram of the MIMD organization, with multiple IS and DS between processors and memory M.)
Each processor fetches its own instructions and operates on its own data. MIMD is the current winner: most designs concentrate on machines with at most 128 processors and use off-the-shelf microprocessors for their cost-performance advantages. MIMD machines are flexible, delivering high performance for a single application or running many tasks simultaneously. Examples: Sun Enterprise 5000, Cray T3D, SGI Origin.
Applications: finite element analysis, computational aerodynamics, remote sensing, artificial intelligence and automation, computer-assisted tomography, genetic engineering, weapon research and defense, and basic research problems.
Sources of failure include intermittent or permanent hardware faults, software and hardware design errors, operator errors, and externally induced upsets or physical damage.
Aug 2009 / Q6
Describe the architectural features of any two of the following:
Pentium
Pentium III
The Pentium III brand refers to Intel's 32-bit x86 desktop and mobile microprocessors based on the sixth-generation P6 microarchitecture, introduced on February 26, 1999. The most notable difference was the addition of the SSE instruction set (to accelerate floating-point and parallel calculations) and the introduction of a controversial serial number embedded in the chip during the manufacturing process.
Power PC
PowerPC (short for Performance Optimization With Enhanced RISC Performance Computing, sometimes abbreviated PPC) is a RISC architecture created by the 1991 Apple, IBM, and Motorola alliance, known as AIM. Originally intended for personal computers, PowerPC CPUs have since become popular as embedded and high-performance processors. PowerPC is largely based on IBM's earlier POWER architecture and retains a high level of compatibility with it.
Aug 2009 / Q7
Explain two bus arbitration schemes for multiprocessors. What is cache coherence? Explain the static coherence check mechanism.
Arbitration
Arbitration is the process of selecting the next bus master; bus tenure is the duration of its control. Arbitration may be done on a fairness or priority basis. On a parallel bus, arbitration competition and bus transactions can take place concurrently over separate lines.
Central Arbitration
Potential masters are daisy-chained: a signal line propagates the bus-grant signal from the first master to the last, and there is only one bus-request line. The bus-grant signal activates the bus-busy line.
(Figure: daisy-chained central arbitration.)
Advantages: a simple scheme, and it is easy to add devices. Disadvantages: the fixed-priority sequence is not fair, propagation of the bus-grant signal is slow, and the scheme is not fault tolerant.
Independent requests and grants: provide separate bus-request and bus-grant signals for each master. This requires a central arbiter, but it can use a priority- or fairness-based policy. More flexible and faster than a daisy-chained policy, but the larger number of lines is costly.
Distributed Arbitration
Each master has its own arbiter and a unique arbitration number. The arbitration number is used to resolve competition: each competitor sends its number to the shared-bus request/grant (SBRG) lines and compares its own number with the resolved SBRG number. This is a priority-based scheme.
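A toy model of the compare-with-SBRG scheme, simplified by assuming the shared lines resolve to the maximum of the competing arbitration numbers; the masters and numbers are hypothetical:

```python
# Distributed arbitration: every competing master drives its arbitration
# number onto the shared SBRG lines; each compares its own number with
# the resolved value, and the master whose number matches wins the bus.
def arbitrate(requesters):
    sbrg = max(requesters.values())               # value seen on the SBRG lines
    return [m for m, num in requesters.items() if num == sbrg]

print(arbitrate({"P1": 0b0101, "P2": 0b1100, "P3": 0b1010}))  # ['P2'] wins
```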
Aug 2009 / Q8
How can data hazards be overcome by dynamic scheduling?
Dynamic Scheduling
3.1 Instruction Level Parallelism: Concepts and Challenges
3.2 Overcoming Data Hazards with Dynamic Scheduling
3.3 Dynamic Scheduling: Examples and the Algorithm
3.4 Reducing Branch Penalties with Dynamic Hardware Prediction
3.5 High Performance Instruction Delivery
3.6 Taking Advantage of More ILP with Multiple Issue
3.7 Hardware-Based Speculation
3.8 Studies of the Limitations of ILP
3.10 The Pentium 4
Dynamic Scheduling
The idea:
Scoreboarding allows an instruction to execute as soon as its issue and read-operand conditions hold, without waiting for prior instructions. A scoreboard is a data structure that provides the information necessary for all pieces of the processor to work together: in-order issue, out-of-order execution, out-of-order commit (also called completion). It was first used in the CDC 6600; the example here is modified for MIPS. The CDC 6600 had 4 FP units, 5 memory reference units, and 7 integer units; our MIPS model has 2 FP multipliers, 1 FP adder, 1 FP divider, and 1 integer unit.
Using A Scoreboard
Scoreboard Implications
Out-of-order completion introduces the possibility of WAR and WAW hazards. Solutions for WAR: queue both the operation and copies of its operands, or read registers only during the Read Operands stage. For WAW, the hazard must be detected and the instruction stalled until the other instruction completes. Multiple instructions must be in the execution phase, which requires multiple execution units or pipelined execution units. The scoreboard keeps track of dependences and the state of operations, and replaces the ID, EX, and WB stages with four stages.
(Slides: successive scoreboard state snapshots for a sample code sequence; the tables are not reproduced here.)
The scoreboard has three parts:
1. Instruction status: indicates which of the four steps the instruction is in.
2. Functional unit status: indicates the state of the functional unit (FU); nine fields per functional unit:
Busy: indicates whether the unit is busy or not.
Op: the operation to perform in the unit (e.g., + or -).
Fi: the destination register.
Fj, Fk: the source-register numbers.
Qj, Qk: the functional units producing source registers Fj, Fk.
Rj, Rk: flags indicating when Fj, Fk are ready.
3. Register result status: indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register.
Using A Scoreboard
Instruction status / wait until / bookkeeping:
Issue: wait until the functional unit is free and no active instruction has the same destination register.
Read operands: wait until Rj and Rk are Yes; then set Rj ← No, Rk ← No.
Execution complete: wait until the functional unit is done.
Write result: wait until ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) and (Fk(f) ≠ Fi(FU) or Rk(f) = No)); then ∀f (if Qj(f) = FU then Rj(f) ← Yes); ∀f (if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No.
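A sketch of just the Write Result test above, using the same field names; the functional-unit records below are hypothetical:

```python
# Write Result may proceed only if no other functional unit f still needs to
# read the destination register Fi(FU) (this is the WAR-hazard check).
def can_write_result(fu, units):
    fi = units[fu]["Fi"]
    return all((u["Fj"] != fi or u["Rj"] == "No") and
               (u["Fk"] != fi or u["Rk"] == "No")
               for name, u in units.items() if name != fu)

units = {
    "Mult1": {"Fi": "F0", "Fj": "F2", "Fk": "F4", "Rj": "No", "Rk": "No"},
    "Add":   {"Fi": "F6", "Fj": "F0", "Fk": "F8", "Rj": "Yes", "Rk": "Yes"},
}
print(can_write_result("Mult1", units))  # False: Add has not yet read F0
```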
Aug 2009 / Q9
With the help of a block diagram, explain the crossbar switch system organization for multiprocessors. Derive expressions for the speedup, efficiency, and throughput of a pipelined processor. What are their ideal values?
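For the pipelined-processor part, a sketch of the standard derivation, consistent with the throughput formula H_k given earlier (k stages, n tasks, clock period τ, frequency f = 1/τ):
A k-stage pipeline finishes n tasks in T_k = [k + (n - 1)] τ, while an equivalent non-pipelined processor needs T_1 = nkτ.
Speedup: S_k = T_1 / T_k = nk / (k + n - 1), with ideal value k as n → ∞.
Efficiency: E_k = S_k / k = n / (k + n - 1), with ideal value 1.
Throughput: H_k = n / ([k + (n - 1)] τ) = nf / (k + n - 1) = E_k f, with ideal value f (one result per clock cycle).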
May 2009 / Q1