
BEST OF ALL QUESTION PAPERS

All question papers, assignments, and tests

Sep 2010 - Question 1


Write the neat functional structure of an SIMD array processor with concurrent scalar processing in the control unit. (10 marks)

SIMD Supercomputers
The operational model of an SIMD machine is a 5-tuple (N, C, I, M, R):
N = the number of processing elements (PEs).
C = the set of instructions (including scalar and flow-control instructions) directly executed by the control unit.
I = the set of instructions broadcast to all PEs for parallel execution.
M = the set of masking schemes used to partition the PEs into enabled and disabled states.
R = the set of data-routing functions enabling inter-PE communication through the interconnection network.

Operational Model of SIMD Computer

[Figure: a control unit (containing the program memory and a scalar processor for scalar and control instructions) broadcasts vector instructions to an array of processing elements, each with its own local memory, and the PEs exchange data through an interconnection network.]

Sep 2010 - Question 2


Define clock rate, CPI, MIPS rate and throughput rate. (10 marks)

Clock Rate
The CPU is driven by a clock with a constant cycle time. The cycle time is represented by T (in nanoseconds), and the inverse of the cycle time is the clock rate, f = 1/T (in megahertz). The clock rate is the rate in cycles per second (measured in hertz), i.e. the frequency of the clock in any synchronous circuit, such as a central processing unit (CPU).

CPI (cycles per instruction)
CPI is the average number of clock cycles needed to execute one instruction: CPI = total clock cycles / instruction count (Ic). It depends on the instruction mix, the processor design, and the memory system.

MIPS (millions of instructions per second)
MIPS rate = Ic / (execution time × 10^6) = f / (CPI × 10^6), where f is the clock rate.

Throughput rate
The throughput rate is the number of programs (or tasks) a system can execute per unit time; for a single program it is 1/(execution time) = f / (Ic × CPI).
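A minimal Python sketch tying these definitions together; all of the numbers below (program size, cycle count, clock rate) are hypothetical:

instruction_count = 2_000_000       # Ic, an assumed program
total_cycles      = 8_000_000       # assumed measured cycle count
clock_rate_hz     = 500e6           # f = 1/T, an assumed 500 MHz clock

cpi        = total_cycles / instruction_count          # cycles per instruction
exec_time  = instruction_count * cpi / clock_rate_hz   # seconds
mips_rate  = clock_rate_hz / (cpi * 1e6)               # f / (CPI * 10^6)
throughput = 1.0 / exec_time                           # programs per second

print(cpi, exec_time, mips_rate, throughput)           # 4.0 0.016 125.0 62.5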

Sep 2010 Question 5


What are the reservation tables in pipelining? Mention its advantages.

Reservation Table
A reservation table specifies the utilization pattern of successive pipeline stages by a task. For a linear (static) pipeline the table follows a diagonal streamline: a task needs k clock cycles to flow through the k stages, and one result emerges at each cycle if the tasks are independent of each other.

Sep 2010 Q 6
Explain TLB, Paging and Segmentation in virtual memory

Virtual Memory
To facilitate the use of memory hierarchies, the memory addresses normally generated by modern processors executing application programs are not physical addresses, but are rather virtual addresses of data items and instructions. Physical addresses, of course, are used to reference the available locations in the real physical memory of a system. Virtual addresses must be mapped to physical addresses before they can be used.

Mapping Efficiency
Efficient implementations are more difficult in multiprocessor systems where additional problems such as coherence, protection, and consistency must be addressed.

Virtual Memory Models (1)


Private Virtual Memory

In this scheme, each processor has a separate virtual address space, but all processors share the same physical address space.

Virtual Memory Models (2)


Shared Virtual Memory

All processors share a single shared virtual address space, with each processor being given a portion of it.

Memory Allocation
Both the virtual address space and the physical address space are divided into fixed-length pieces. In the virtual address space these pieces are called pages; in the physical address space they are called page frames.

The purpose of memory allocation is to allocate pages of virtual memory to the page frames of physical memory.

Address Translation Mechanisms


[Virtual to physical] address translation requires use of a translation map.

The virtual address can be used with a hash function to locate the translation map (which is stored in the cache, in an associative memory, or in main memory). The translation map comprises a translation lookaside buffer, or TLB (usually in associative memory), and a page table (or tables). The virtual address is first sought in the TLB; if that search succeeds, no further translation is necessary. Otherwise, the page table(s) must be referenced to obtain the translation result. If the virtual address cannot be translated to a physical address because the required page is not present in primary memory, a page fault is reported.

Define effective access time of a memory hierarchy

Hierarchical Memory Technology


Memory in a system is usually characterized as appearing at various levels (0, 1, 2, …) in a hierarchy, with level 0 being the CPU registers and level 1 the cache closest to the CPU. Each level is characterized by five parameters:
access time ti (round-trip time from the CPU to the ith level)
memory size si (number of bytes or words in the level)
cost per byte ci
transfer bandwidth bi (rate of transfer between levels)
unit of transfer xi (grain size for transfers)

Memory Generalities
It is almost always the case that memories at lower-numbered levels, when compared to those at higher-numbered levels, are faster to access, are smaller in capacity, are more expensive per byte, have a higher bandwidth, and have a smaller unit of transfer.

In general, then, ti-1 < ti, si-1 < si, ci-1 > ci, bi-1 > bi, and xi-1 < xi.

Hit Ratios
When a needed item (instruction or data) is found in the level of the memory hierarchy being examined, it is called a hit; otherwise (when it is not found), it is called a miss, and the item must be obtained from a lower (higher-numbered) level in the hierarchy. The hit ratio hi for Mi is the probability (between 0 and 1) that a needed item is found when sought in memory level Mi. The miss ratio is just 1 − hi. We assume h0 = 0 and hn = 1.

Access Frequencies
The access frequency fi to level Mi is fi = (1 − h1)(1 − h2)…(1 − h(i−1))·hi.

Note that f1 = h1, and that the access frequencies sum to one: Σ (i = 1..n) fi = 1.

Effective Access Times


There are different penalties associated with misses at different levels in the memory hierarchy.

A cache miss is typically 2 to 4 times as expensive as a cache hit (assuming success at the next level); a page fault (miss) is 3 to 4 orders of magnitude as costly as a page hit.

The effective access time of a memory hierarchy can be expressed as


Teff = Σ (i = 1..n) fi·ti
     = h1·t1 + (1 − h1)h2·t2 + … + (1 − h1)(1 − h2)…(1 − h(n−1))·hn·tn


The first few terms in this expression dominate, but the effective access time is still dependent on program behavior and memory design choices.
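A small Python sketch of this formula for an assumed two-level hierarchy (cache plus main memory); the hit ratios and access times are made up for illustration:

def effective_access_time(hit_ratios, access_times):
    teff, miss_so_far = 0.0, 1.0
    for h, t in zip(hit_ratios, access_times):
        f = miss_so_far * h          # f_i = (1-h1)...(1-h_{i-1}) * h_i
        teff += f * t
        miss_so_far *= (1.0 - h)
    return teff

# assumed: cache h1 = 0.95, t1 = 2 ns; main memory h2 = 1.0, t2 = 50 ns
print(effective_access_time([0.95, 1.0], [2.0, 50.0]))   # 0.95*2 + 0.05*50 = 4.4 ns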

Sep 2010 Q8 (10 marks)


With a suitable example, distinguish between hardware and software parallelism

Hardware Parallelism
Hardware parallelism is defined by machine architecture and hardware multiplicity. It can be characterized by the number of instructions that can be issued per machine cycle. If a processor issues k instructions per machine cycle, it is called a k-issue processor. Conventional processors are one-issue machines. Examples. Intel i960CA is a three-issue processor (arithmetic, memory access, branch). IBM RS-6000 is a four-issue processor (arithmetic, floating-point, memory access, branch). A machine with n k-issue processors should be able to handle a maximum of nk threads simultaneously.

Software Parallelism
Software parallelism is defined by the control and data dependence of programs, and is revealed in the program's flow graph. It is a function of the algorithm, the programming style, and compiler optimization.

Mismatch between software and hardware parallelism - 1

[Figure: maximum software parallelism for an example of eight instructions (L = load, X/+/− = arithmetic) — cycle 1 issues the four loads L1–L4, cycle 2 issues the two arithmetic operations X1 and X2, and cycle 3 issues the final + and − operations, for a software parallelism of 8 instructions in 3 cycles ≈ 2.67.]

Mismatch between software and hardware parallelism - 2


Same problem, but considering the parallelism on a two-issue superscalar processor.

[Figure: with only two instructions issued per cycle (one memory access and one arithmetic operation), the same eight instructions spread over seven cycles (cycle 1 through cycle 7) before results A and B emerge, so the hardware parallelism achieved is only 8/7 ≈ 1.14.]

Sep 2010 Q 9 (10 marks)


What is grain packing? With a suitable example, write a program graph before and after grain packing?

Grain Packing and Scheduling


Two questions:
How can I partition a program into parallel pieces to yield the shortest execution time?
What is the optimal size of parallel grains?

There is an obvious tradeoff between the time spent scheduling and synchronizing parallel grains and the speedup obtained by parallel execution. One approach to the problem is called grain packing.

Program Graphs and Packing


A program graph is similar to a dependence graph:
Nodes = { (n, s) }, where n = node name and s = size (a larger s means a larger grain size).
Edges = { (v, d) }, where v = the variable being communicated and d = the communication delay.

Packing two (or more) nodes produces a node with a larger grain size and possibly more edges to other nodes. Packing is done to eliminate unnecessary communication delays or to reduce overall scheduling overhead.

Scheduling
A schedule is a mapping of nodes to processors and start times such that communication delay requirements are observed and no two nodes are executing on the same processor at the same time. Some general scheduling goals (a toy numeric illustration follows):
Schedule all fine-grain activities in a node to the same processor to minimize communication delays.
Select grain sizes for packing to achieve better schedules for a particular parallel machine.
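A toy Python illustration of the packing tradeoff; the node sizes and edge delay are invented, and the two nodes form a chain (B needs A's result), so packing removes the communication delay at no loss of overlap:

size_a, size_b, comm_delay = 2, 2, 4        # invented grain sizes and edge delay

unpacked_schedule = size_a + comm_delay + size_b   # A on P1, ship result, B on P2
packed_schedule   = size_a + size_b                # packed into one coarse grain on P1

print(unpacked_schedule, packed_schedule)          # 8 4 -> packing wins in this case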

Mar 2010 Q1 (10 marks)


Explain program partitioning and scheduling with example.

Program Partitioning & Scheduling


The size of the parts or pieces of a program that can be considered for parallel execution can vary. The sizes are roughly classified using the term granule size, or simply granularity. The simplest measure, for example, is the number of instructions in a program part. Grain sizes are usually described as fine, medium or coarse, depending on the level of parallelism involved.

Latency
Latency is the time required for communication between different subsystems in a computer. Memory latency, for example, is the time required by a processor to access memory. Synchronization latency is the time required for two processes to synchronize their execution. Computational granularity and communication latency are closely related.

Levels of Parallelism
Jobs or programs — coarse grain
Subprograms, job steps, or related parts of a program — coarse grain
Procedures, subroutines, tasks, or coroutines — medium grain
Non-recursive loops or unfolded iterations — fine grain
Instructions or statements — fine grain

Moving down this list toward the instruction level, the degree of parallelism increases, but so do the communication demand and the scheduling overhead.

Instruction Level Parallelism


This fine-grained, or smallest-granularity, level typically involves fewer than 20 instructions per grain. The number of candidates for parallel execution varies from 2 to thousands, with the average level of parallelism being around five instructions or statements. Advantages:
There are usually many candidates for parallel execution.
Compilers can usually do a reasonable job of finding this parallelism.

Loop-level Parallelism
Typical loop has less than 500 instructions. If a loop operation is independent between iterations, it can be handled by a pipeline, or by a SIMD machine. Most optimized program construct to execute on a parallel or vector machine Some loops (e.g. recursive) are difficult to handle. Loop-level parallelism is still considered fine grain computation.

Procedure-level Parallelism
Medium-sized grain; usually fewer than 2000 instructions. Detection of parallelism is more difficult than with smaller grains: interprocedural dependence analysis is difficult and history-sensitive. The communication requirement is less demanding than at the instruction level. SPMD (single program, multiple data) execution is a special case at this level; multitasking also belongs to this level.

Subprogram-level Parallelism
Job-step level; a grain typically has thousands of instructions (medium- or coarse-grain level). Job steps can overlap across different jobs. Multiprogramming is conducted at this level. No compilers are available to exploit medium- or coarse-grain parallelism at present.

Job or Program-Level Parallelism


Corresponds to execution of essentially independent jobs or programs on a parallel computer. This is practical for a machine with a small number of powerful processors, but impractical for a machine with a large number of simple processors (since each processor would take too long to process a single job).

Summary
Fine-grain parallelism is exploited at the instruction or loop level, assisted by the compiler. Medium-grain parallelism (task or job step) requires programmer and compiler support. Coarse-grain parallelism relies heavily on effective OS support. Shared-variable communication is used at the fine- and medium-grain levels. Message passing can be used for medium- and coarse-grain communication, but fine-grain parallelism really needs better techniques because of its heavier communication requirements.

Communication Latency
Balancing granularity and latency can yield better performance. Various latencies attributed to machine architecture, technology, and communication patterns used. Latency imposes a limiting factor on machine scalability. Ex. Memory latency increases as memory capacity increases, limiting the amount of memory that can be used with a given tolerance for communication latency.

Interprocessor Communication Latency


Needs to be minimized by system designer Affected by signal delays and communication patterns Ex. n communicating tasks may require n (n - 1)/2 communication links, and the complexity grows quadratically, effectively limiting the number of processors in the system.

Mar 2010 Q2 (10 Marks)


Write notes on the following:
Parallelism profile of programs
Harmonic mean performance

Degree of Parallelism
The number of processors used at any instant to execute a program is called the degree of parallelism (DOP); this can vary over time. DOP assumes an infinite number of processors are available; this is not achievable in real machines, so some parallel program segments must be executed sequentially as smaller parallel segments. Other resources may impose limiting conditions. A plot of DOP vs. time is called a parallelism profile.

Example Parallelism Profile

[Figure: a parallelism profile — the degree of parallelism DOP plotted against time over an observation interval (t1, t2), with the average parallelism shown as a horizontal line through the profile.]

Average Parallelism
The total work performed is proportional to the area under the profile curve: W = Δ ∫ from t1 to t2 of DOP(t) dt, where Δ is the work each processor completes per unit time. The average parallelism is A = (1 / (t2 − t1)) ∫ from t1 to t2 of DOP(t) dt.
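A small Python sketch computing the average parallelism of a hypothetical piecewise-constant DOP profile (the profile values are invented):

profile = [(1, 2), (4, 3), (8, 1), (2, 2)]     # (DOP value, duration) segments over (t1, t2)

observation_time = sum(duration for _, duration in profile)          # t2 - t1
work_integral    = sum(dop * duration for dop, duration in profile)  # integral of DOP(t) dt
average_parallelism = work_integral / observation_time

print(average_parallelism)    # (2 + 12 + 8 + 4) / 8 = 3.25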

Available Parallelism
Various studies have shown that the potential parallelism in scientific and engineering calculations can be very high (e.g. hundreds or thousands of instructions per clock cycle). But in real machines, the actual parallelism is much smaller (e.g. 10 or 20).

Mar 2010 Q8 (10 Marks)


Explain cache performance issues.

Cache performance issues: cycle counts, hit ratios, the effect of block size, the effect of set number, and others.

Coherence Strategies
Write-through: as soon as a data item in Mi is modified, immediate update of the corresponding data item(s) in Mi+1, Mi+2, …, Mn is required. This is the most aggressive (and expensive) strategy.
Write-back: the data item in Mi+1 corresponding to a modified item in Mi is not updated until it (or the block/page/etc. in Mi that contains it) is replaced or removed. This is the most efficient approach, but it cannot be used (without modification) when multiple processors share Mi+1, …, Mn.

Cache performance is studied in terms of:
Cycle count: the number of machine cycles needed for cache access, update, and coherence.
Hit ratio: how effectively the cache can reduce the overall memory-access time.
Program-trace-driven simulation: presents snapshots of program behavior and cache responses.
Analytical modeling: provides insight into the underlying processes.

Cycle Counts
Cache speed is affected by the underlying static or dynamic RAM technology, the cache organization, and the hit ratio. The write-through/write-back policy affects the cycle count, as do the cache size, block size, set number, and associativity. The cycle count is directly related to the hit ratio.

Hit Ratio
The hit ratio is affected by the cache size and the block size, and it increases with increasing cache size. Limited cache size, initial loading, and changes in locality prevent a 100% hit ratio.

Effect of Block Size


With a fixed cache size, the block size has a significant impact: as the block size increases, the hit ratio first improves because of spatial locality, peaks at an optimum block size, and then decreases. If the block is too large, many of the words brought into the cache are never used.


Mar 2010 Q6


Super pipelined Design


A superpipelined processor of degree n uses a pipeline cycle time of 1/n of the base cycle. A fixed-point addition that takes one cycle in a base scalar processor takes n short cycles in a superpipelined processor. The issue rate is 1, the issue latency is 1/n, and the achievable ILP is n. Superpipelining requires high-speed clocking.

Superpipelined Performance
For N instructions on a k-stage superpipelined processor of degree n:
T(1,n) = k + (N − 1)/n base cycles
S(1,n) = n(k + N − 1) / (nk + N − 1)

Superpipelined Superscalar
A superpipelined superscalar processor of degree (m, n) executes m instructions every cycle, with a pipeline cycle equal to 1/n of the base cycle. The instruction issue latency is 1/n, and the achievable ILP is mn instructions.

Superscalar Superpipelined Performance


For N independent instructions on a machine of degree (m, n):
T(m,n) = k + (N − m)/(mn) base cycles
S(m,n) = mn(k + N − 1) / (mnk + N − m)
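A quick numeric sketch of the two formulas above in Python; the values of N, k, m, and n are illustrative only:

def t_base(N, k):          return k + (N - 1)              # base scalar time (base cycles)
def t_mn(N, k, m, n):      return k + (N - m) / (m * n)    # T(m, n)
def speedup(N, k, m, n):   return t_base(N, k) / t_mn(N, k, m, n)

N, k = 1000, 4
print(round(speedup(N, k, 1, 3), 2))   # superpipelined, degree n = 3        -> ~2.98
print(round(speedup(N, k, 3, 1), 2))   # superscalar, degree m = 3           -> ~2.98
print(round(speedup(N, k, 3, 3), 2))   # superscalar-superpipelined (3, 3)   -> ~8.74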

Design Approaches
Superpipelined designs emphasize temporal parallelism; they need faster transistors, and the design must minimize the effects of clock skew. Superscalar designs depend on spatial parallelism; they need more transistors, and are a better match for CMOS technology.

Mar 2010 Q4
Explain buses and interfaces in multiprocessor system.

Generalized Multiprocessor System


Each processor Pi is attached to its own local memory and private cache. Multiple processors connected to shared memory through interprocessor memory network (IPMN). Processors share access to I/O and peripherals through processor-I/O network (PION). Both IPMN and PION are necessary in a shared-resource multiprocessor. An optional interprocessor communication network (IPCN) can permit processor communication without using shared memory.

Interconnection Network Choices

Timing: synchronous (controlled by a global clock) or asynchronous (using handshaking or interlock mechanisms).
Switching method: circuit switching (a pair of communicating devices controls the path for the entire duration of the data transfer) or packet switching (large data transfers are broken into smaller pieces, each of which can compete for use of the path).
Network control: centralized (a global controller receives and acts on requests) or distributed (requests are handled by local devices independently).

Digital Buses
Digital buses are the fundamental interconnects adopted in most commercial multiprocessor systems with less than 100 processors. The principal limitation to the bus approach is packaging technology. Complete bus specifications include logical, electrical and mechanical properties, application profiles, and interface requirements.

Bus Systems
A bus system is a hierarchy of buses connecting various system and subsystem components. Each bus has a complement of control, signal, and power lines. There is usually a variety of buses in a system:

Local bus (usually integral to a system board) — connects the major system components (chips). Memory bus — used within a memory board to connect the interface, the controller, and the memory cells. Data bus — might be used on an I/O board or VLSI chip to connect various components. Backplane — like a local bus, but with connectors to which other boards can be attached.

Hierarchical Bus Systems


There are numerous ways in which buses, processors, memories, and I/O devices can be organized. One organization has processors (and their caches) as leaf nodes in a tree, with the buses (and caches) to which these processors connect forming the interior nodes. This generic organization, with appropriate protocols to ensure cache coherency, can model most hierarchical bus organizations.

Bridges
The term bridge denotes a device used to connect two (or possibly more) buses. The interconnected buses may use the same standards, or they may be different (e.g. the PCI and ISA buses in a modern PC). Bridge functions include communication protocol conversion, interrupt handling, and serving as cache and memory agents.
Cache addressing model

Cache Addressing Models


Most systems use a private cache for each processor, with an interconnection network between the caches and main memory. Caches can be addressed using either the physical address or the virtual address, giving the two addressing models below.

Physical Address Caches


The cache is indexed and tagged with the physical address, so cache lookup occurs after address translation in the TLB or MMU (and there is no aliasing). After a cache miss, a block is loaded from main memory, using either a write-back or a write-through policy.

Physical Address Caches


Advantages: no cache flushing, no aliasing problems, a simple design, and little intervention required from the OS kernel.
Disadvantage: a slowdown in accessing the cache, because the access cannot complete until the MMU/TLB finishes translation.

Physical Address Models

[Figure: physical address cache organizations.]

Virtual Address Caches


The cache is indexed or tagged with the virtual address, so cache lookup and MMU translation/validation are performed in parallel. The physical address is saved in the tags for write-back. This gives more efficient access to the cache.

Virtual Address Model

[Figure: virtual address cache organization.]

Aliasing Problem
Different logically addressed data may end up with the same index/tag in the cache, causing confusion if two or more processes (or processors) access the same physical cache location. The cache can be flushed whenever aliasing occurs, but this leads to slowdown; alternatively, special tagging with a process key or with a physical address can be applied.

Block Placement Schemes


Performance depends on cache access patterns, organization, and management policy. Blocks in the cache are called block frames, denoted Bi (i = 1..m); blocks in main memory are denoted Bj (j = 1..n), with n >> m, m = 2^r, and n = 2^s. Each block has b = 2^w words, so the cache holds a total of m·b = 2^(r+w) words and main memory holds n·b = 2^(s+w) words.

Direct Mapping Cache


Direct mapping maps n/m memory blocks to each block frame in the cache, using the modulo-m function: Bj → Bi if i = j mod m, so each Bj has a unique block frame Bi into which it can be loaded. This is the simplest organization to implement; a small sketch of the mapping follows.
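A tiny Python sketch of the modulo-m placement rule; the frame count and block numbers are made up:

m = 8                                   # number of cache block frames (assumed)
for j in (3, 5, 11, 19):                # main-memory block numbers (assumed)
    print(f"block B{j} -> frame B{j % m}")
# B3, B11 and B19 all map to frame B3 and contend for it — the rigid-mapping drawback.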

Direct Mapping Cache

[Figure: direct-mapped cache organization — the memory address is divided into tag, block, and word fields, and the block field selects the unique frame.]

Direct Mapping Cache


Advantages: simple hardware, no associative search, no page replacement policy needed, lower cost, and higher speed.
Disadvantages: rigid mapping, a poorer hit ratio, and it prohibits parallel virtual address translation; a larger cache with more block frames must be used to avoid contention.

Fully Associative Cache


Each block in main memory can be placed in any of the available block frames, so an s-bit tag is needed in each cache block (s > r). An m-way associative search requires the tag to be compared with all cache block tags; an associative memory is used to achieve this parallel comparison with all tags concurrently.

Fully Associative Cache

[Figure: fully associative cache organization — any memory block can be loaded into any block frame, located by an associative search over all tags.]

Fully Associative Caches


Advantages: offers the most flexibility in mapping cache blocks, a higher hit ratio, and allows a better block replacement policy with reduced block contention.
Disadvantages: higher hardware cost, only a moderate cache size is practical, and the search process is expensive.

Set Associative Caches


In a k-way set-associative cache, the m cache block frames are divided into v = m/k sets, with k blocks per set. Each set is identified by a d-bit set number, and the tag is compared only with the k tags within the identified set: Bj → Bf ∈ Si if j mod v = i. A small sketch follows.
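A tiny Python sketch of k-way set-associative placement; the frame counts and block numbers are invented, matching the direct-mapped sketch earlier:

m, k = 8, 2                 # 8 frames, 2-way set-associative -> v = m/k = 4 sets
v = m // k
for j in (3, 11, 19):
    s = j % v                               # set number
    frames = list(range(s * k, s * k + k))  # the k candidate frames in set s
    print(f"block B{j} -> set S{s}, candidate frames {frames}")
# The three blocks still share set S3, but now two of them can be resident at once.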


Sector Mapping Cache


Partition both the cache and main memory into fixed-size sectors, then use a fully associative search over the sector tags; the block field within the sector locates the block. On a miss, only the missing block is loaded. The ith block in a sector is placed into the ith block frame of the destination sector frame, and a valid/invalid bit is attached to each block frame.

Backplane bus system

Backplane Bus Systems


The system bus operates on a contention basis: only one requester is granted access to the bus at a time, so the effective bandwidth available to each processor is inversely proportional to the number of contending processors. The approach is simple and low cost, and suits systems of about 4 to 16 processors.

Backplane Bus Specification


The backplane bus interconnects processors, data storage, and I/O devices. It must allow communication between devices, define timing protocols for arbitration, and impose operational rules for orderly data transfers. The signal lines are grouped into several buses.

Backplane Multiprocessor System

[Figure: a multiprocessor system built around a backplane bus, with processor, memory, and I/O boards attached through connectors.]

Data Transfer Bus


The data transfer bus (DTB) is composed of data, address, and control lines. The address lines broadcast the data and device addresses; their number is proportional to the logarithm of the address-space size. The number of data lines is proportional to the memory word length. The control lines specify read/write operations, timing, and bus error conditions.

Bus Arbitration and Control


Arbitration assigns control of the DTB: the requester is the master, and the receiving end is the slave. Interrupt lines support prioritized interrupts, dedicated lines synchronize parallel activities among processor modules, and utility lines provide periodic timing and coordinate the power-up/power-down sequences. A bus controller board houses the control logic.

Functional Modules
Arbiter: functional module that performs arbitration.
Bus timer: measures the time taken by data transfers.
Interrupter: generates interrupt requests and provides status/ID to the interrupt handler.
Location monitor: monitors data transfers.
Power monitor: monitors the power source.
System clock driver: provides the clock timing signal on the utility bus.
Board interface logic: matches signal-line impedance, propagation time, and termination values.

Physical Limitations
Electrical, mechanical, and packaging limitations restrict the number of boards. Multiple backplane buses can be mounted on the same backplane chassis, but it is difficult to scale a bus system because of packaging constraints.

Aug 2009 Q2
How do you classify pipelining processor? Give examples

Classification of Pipeline Processors


Arithmetic pipelining: the arithmetic logic units of a computer can be segmented for pipelined operation on various data formats. Well-known arithmetic pipeline examples are the four-stage pipes used in the Star-100.

Instruction pipelining: the execution of a stream of instructions can be pipelined by overlapping the execution of the current instruction with the fetch, decode, and operand fetch of subsequent instructions. This technique is also known as instruction lookahead.

Processor pipelining: this refers to the pipelined processing of the same data stream by a cascade of processors, each of which performs a specific task. The data stream passes through the first processor, with results stored in a memory block that is also accessible by the second processor; the second processor then passes its refined results to the third, and so on.

Next question part


Prove that a k-stage linear pipeline can be up to k times faster than an equivalent non-pipelined processor.

Speedup and Efficiency


A k-stage pipeline processes n tasks in k + (n − 1) clock cycles: k cycles for the first task and n − 1 cycles for the remaining n − 1 tasks.
Total time to process n tasks: Tk = [k + (n − 1)]·τ, where τ is the clock period.
For the equivalent non-pipelined processor: T1 = n·k·τ.
Speedup factor: Sk = T1 / Tk = n·k·τ / ([k + (n − 1)]·τ) = nk / [k + (n − 1)], which approaches k as n becomes large.

Efficiency and Throughput

Efficiency of the k-stage pipeline: Ek = Sk / k = n / [k + (n − 1)].
Pipeline throughput (the number of tasks completed per unit time; note the equivalence to IPC): Hk = n / ([k + (n − 1)]·τ) = n·f / [k + (n − 1)], where f = 1/τ.

Pipeline Performance: Example


A task has 4 subtasks with times t1 = 60, t2 = 50, t3 = 90, and t4 = 80 ns, and the latch delay is 10 ns.
Pipeline cycle time = 90 + 10 = 100 ns (set by the slowest stage plus the latch delay).

For non-pipelined execution, time per task = 60 + 50 + 90 + 80 = 280 ns.

Speedup for the above case is 280/100 = 2.8, not 4. Pipelined time for 1000 tasks = (1000 + 4 − 1) cycles = 1003 × 100 ns = 100,300 ns; sequential time = 1000 × 280 ns = 280,000 ns; throughput = 1000/1003 ≈ 0.997 tasks per cycle, i.e. about 9.97 million tasks per second. What is the problem here? The stage delays are unbalanced, so the cycle time is dictated by the slowest stage. How to improve performance? Balance the stages (for example, subdivide the 90 ns stage) so that the pipeline cycle time can be reduced.
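A short Python sketch reproducing the numbers in this example:

k, cycle_ns, per_task_ns, n_tasks = 4, 100, 280, 1000   # values from the example above

t_pipelined  = (k + n_tasks - 1) * cycle_ns     # (4 + 999) * 100 = 100_300 ns
t_sequential = n_tasks * per_task_ns            # 280_000 ns
print(t_pipelined, t_sequential, round(t_sequential / t_pipelined, 2))   # speedup ~2.79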

Aug 2009 Q4
What do you understand by loosely and tightly coupling of processors in a multi-processor system? Explain the salient features of each type. Discuss the processor characteristics for multiprocessor system.

Characteristics of Multiprocessors

SHARED-MEMORY MULTIPROCESSORS

[Figure: processors P connected to memory modules M through an interconnection network (buses, a multistage interconnection network, or a crossbar switch).]

Characteristics: all processors have equally direct access to one large memory address space.
Example systems:
- Bus- and cache-based systems: Sequent Balance, Encore Multimax
- Multistage IN-based systems: Ultracomputer, Butterfly, RP3, HEP
- Crossbar-switch-based systems: C.mmp, Alliant FX/8
Limitations: memory-access latency; hot-spot problem.

Characteristics of Multiprocessors

MESSAGE-PASSING MULTIPROCESSORS

[Figure: processors P, each with its own local memory M, connected by a message-passing network of point-to-point links.]

Characteristics: interconnected computers; each processor has its own memory, and processors communicate via message passing.
Example systems:
- Tree structure: Teradata, DADO
- Mesh-connected: Rediflow, Series 2010, J-Machine
- Hypercube: Cosmic Cube, iPSC, NCUBE, FPS T Series, Mark III
Limitations: communication overhead; hard to program.

Aug 2009 / Q5
Write short notes on:
Flynn's architectural classification
Applications of parallel processing
Fault-tolerant computing

Taxonomy of Parallel Architectures


Flynn (1966) classified architectures based on instruction and data streams.
SISD (single instruction stream, single data stream): the conventional uniprocessor system; still offers a lot of intra-CPU parallelism options.
SIMD (single instruction stream, multiple data streams): vector and array style computers; the first widely accepted multiple-PE style of system; has now fallen behind the MIMD option.
MISD (multiple instruction streams, single data stream): no commercial products.
MIMD (multiple instruction streams, multiple data streams): intrinsic parallel computers; lots of options; today's winner.

Legends in Flynn's Classification

CU: control unit; PU: processor unit; MM: memory module; SM: shared memory; IS: instruction stream; DS: data stream.

Flynn's Classification
The classification is based on the notions of instruction streams and data streams, giving four architecture categories: SISD, SIMD, MISD, and MIMD.

SISD
[Figure: a single control unit issues one instruction stream (IS) to one processing unit operating on a single data stream (DS).]
Uniprocessors.

SIMD
[Figure: one control unit broadcasts a single instruction stream to multiple processing units, each operating on its own data stream.]
Processors execute the same instruction on multiple pieces of data: the same instruction is executed by multiple processing elements using different data streams; each processing element has its own data memory (hence multiple data), but there is a single instruction memory and control processor.

MISD
[Figure: multiple control units issue different instruction streams to processing units that all operate on the same data stream.]
Multiple instruction streams operate on a single data stream; as noted above, no commercial products of this type exist.

MIMD
[Figure: several processors, each with its own control unit, fetch their own instruction streams and operate on their own data streams, connected to memory modules.]
Each processor fetches its own instructions and operates on its own data. MIMD is the current winner; the major design emphasis is on machines with up to about 128 processors, built from off-the-shelf microprocessors for cost-performance advantages. The approach is flexible: it can deliver high performance for a single application or run many tasks simultaneously. Examples: Sun Enterprise 5000, Cray T3D, SGI Origin.

Applications of parallel processing - 1


Predictive modeling and simulations: numerical weather forecasting, oceanography and astrophysics, socioeconomics and government use.

Engineering design and automation: finite element analysis, computational aerodynamics, remote sensing applications, artificial intelligence and automation.

Applications of parallel processing - 2


Energy resources exploration: seismic exploration, reservoir modeling, plasma fusion power, nuclear reactor safety.

Medical, military, and basic research: computer-assisted tomography, genetic engineering, weapon research and defense, basic research problems.

Fault Tolerant Computing 1/3


Fault-tolerant computing is the art and science of building computing systems that continue to operate satisfactorily in the presence of faults. A fault-tolerant system may be able to tolerate one or more fault types, including:
transient, intermittent, or permanent hardware faults
software and hardware design errors
operator errors
externally induced upsets or physical damage

Fault Tolerant Computing Basic Concept 2/3


Hardware fault tolerance: each module is backed up with protective redundancy.
Fault masking: a number of identical modules execute the same functions, and their outputs are voted on to remove errors created by a faulty module.
Dynamic recovery: involves automated self-repair.
Software fault tolerance: efforts to attain software that can tolerate software design faults (programming errors) have made use of static and dynamic redundancy approaches similar to those used for hardware faults.

Aug 2009 / Q6
Describe the architectural features of any two of the following:
Pentium III
SPARC architecture
PowerPC

Pentium III
The Pentium III brand refers to Intel's 32-bit x86 desktop and mobile microprocessors based on the sixth-generation P6 microarchitecture, introduced on February 26, 1999. The most notable differences from its predecessor were the addition of the SSE instruction set (to accelerate floating-point and parallel calculations) and the introduction of a controversial serial number embedded in the chip during the manufacturing process.

RISC Scalar Processors


RISC scalar processors are designed to issue one instruction per cycle. RISC and CISC scalar processors should have the same performance if the clock rate and program lengths are equal; RISC moves less frequent operations into software, thus dedicating hardware resources to the most frequently used operations. Representative systems: Sun SPARC, Intel i860, Motorola M88100, AMD 29000.

Power PC
PowerPC (short for Performance Optimization With Enhanced RISC – Performance Computing, sometimes abbreviated PPC) is a RISC architecture created by the 1991 Apple–IBM–Motorola alliance, known as AIM. Originally intended for personal computers, PowerPC CPUs have since become popular as embedded and high-performance processors. PowerPC is largely based on IBM's earlier POWER architecture, and retains a high level of compatibility with it.

Aug 2009 / Q7
Explain two bus arbitration schemes for multiprocessor. What is cache coherence? Explain the static coherence check mechanism

Arbitration
Arbitration is the process of selecting the next bus master; bus tenure is the duration of its control. Arbitration can be done on a fairness or a priority basis, and arbitration competition and bus transactions can take place concurrently on a parallel bus over separate lines.

Central Arbitration
In a daisy-chained (central) arbitration scheme, all potential masters are daisy-chained in sequence: a single bus-grant signal line propagates from the first master to the last, there is only one bus-request line, and the bus-grant line activates the bus-busy line. A small sketch of the grant logic follows.
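A minimal Python sketch of the fixed-priority behaviour of a daisy chain (the request pattern is invented): the grant stops at the first requesting master closest to the arbiter:

def daisy_chain_grant(requests):
    """requests[i] is True if master i (i = 0 is closest to the arbiter) wants the bus."""
    for position, wants_bus in enumerate(requests):
        if wants_bus:
            return position            # the grant signal stops propagating here
    return None

print(daisy_chain_grant([False, True, False, True]))   # -> 1: master 1 always beats master 3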

Central Arbitration

[Figure: daisy-chained central arbitration — the central arbiter's bus-grant signal threads through the potential masters in a fixed priority order.]

Central Arbitration
The scheme is simple and it is easy to add devices, but the fixed-priority sequence is not fair, propagation of the bus-grant signal is slow, and it is not fault tolerant.

Independent Requests and Grants


Independent bus-request and bus-grant signals are provided for each master. This requires a central arbiter, but the arbiter can use a priority-based or fairness-based policy. The scheme is more flexible and faster than a daisy-chained policy, but the larger number of lines is costly.


Distributed Arbitration
Each master has its own arbiter and a unique arbitration number, which is used to resolve competition: each competing master sends its number onto the shared SBRG (shared bus request/grant) lines and compares its own number with the resulting SBRG value, giving a priority-based scheme. A small sketch follows.
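A minimal Python sketch of the competition, assuming (as is common in such schemes) that the larger arbitration number wins; the numbers are invented:

def distributed_arbitrate(arbitration_numbers):
    """Each competing master drives its number onto the SBRG lines; the highest wins."""
    return max(arbitration_numbers)

print(bin(distributed_arbitrate([0b0101, 0b1100, 0b0111])))   # -> 0b1100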

Aug 2009 / Q8
How can data hazards be overcome by dynamic scheduling?

Dynamic Scheduling
3.1 Instruction-Level Parallelism: Concepts and Challenges; 3.2 Overcoming Data Hazards with Dynamic Scheduling; 3.3 Dynamic Scheduling: Examples and the Algorithm; 3.4 Reducing Branch Penalties with Dynamic Hardware Prediction; 3.5 High-Performance Instruction Delivery; 3.6 Taking Advantage of More ILP with Multiple Issue; 3.7 Hardware-Based Speculation; 3.8 Studies of the Limitations of ILP; 3.10 The Pentium 4.

Advantages of Dynamic Scheduling


Dynamic scheduling handles cases where dependences are unknown at compile time (for example, because they may involve a memory reference), it simplifies the compiler, and it allows code compiled for one pipeline to run efficiently on a different pipeline. Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling.

Dynamic Scheduling
HW Schemes: Instruction Parallelism
Why do this in hardware at run time? It works when real dependences cannot be known at compile time, it keeps the compiler simpler, and code compiled for one machine runs well on another.
Key idea: allow instructions behind a stall to proceed, so that instructions execute in parallel on the multiple execution units available. For example, in the sequence
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
the ADDD stalls waiting for the DIVD result in F0, but the SUBD has no dependences and can run. This enables out-of-order execution and therefore out-of-order completion.

Dynamic Scheduling
Out-of-order execution divides the ID stage into two:
1. Issue — decode instructions and check for structural hazards.
2. Read operands — wait until there are no data hazards, then read operands.
A scoreboard allows an instruction to execute whenever conditions 1 and 2 hold, without waiting for prior instructions; it is a data structure that provides the information necessary for all pieces of the processor to work together. We use in-order issue, out-of-order execution, and out-of-order commit (also called completion). Scoreboarding was first used in the CDC 6600; the example here is modified for MIPS. The CDC 6600 had 4 FP units, 5 memory-reference units, and 7 integer units, while the MIPS example has 2 FP multipliers, 1 FP adder, 1 FP divider, and 1 integer unit.

Dynamic Scheduling
Using a Scoreboard

Scoreboard Implications
Out-of-order completion introduces WAR and WAW hazards. Solutions for WAR: queue both the operation and copies of its operands, or read registers only during the Read Operands stage. For WAW, the hazard must be detected and the instruction stalled until the other instruction completes. Multiple instructions must be in the execution phase at once, which requires multiple execution units or pipelined execution units. The scoreboard keeps track of dependencies and the state of operations, and it replaces the ID, EX, and WB stages with four stages.

Dynamic Scheduling
Four Stages of Scoreboard Control
1. Issue — decode instructions and check for structural hazards (ID1). If a functional unit for the instruction is free and no other active instruction has the same destination register (no WAW hazard), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, the instruction issue stalls, and no further instructions will issue until these hazards are cleared.

Dynamic Scheduling
2. Read operands — wait until no data hazards, then read operands (ID2). A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.

Dynamic Scheduling
3. Execution — operate on the operands (EX). The functional unit begins execution upon receiving its operands; when the result is ready, it notifies the scoreboard that it has completed execution.
4. Write result — finish execution (WB). Once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards: if none exists it writes the result, and if a WAR hazard exists it stalls the instruction. Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
The scoreboard would stall SUBD's write until ADDD has read its operands.

Dynamic Scheduling
Three Parts of the Scoreboard
1. Instruction status — which of the four steps the instruction is in.
2. Functional unit status — indicates the state of the functional unit (FU); nine fields for each functional unit:
Busy — indicates whether the unit is busy or not
Op — operation to perform in the unit (e.g., + or −)
Fi — destination register
Fj, Fk — source-register numbers
Qj, Qk — functional units producing source registers Fj, Fk
Rj, Rk — flags indicating when Fj, Fk are ready
3. Register result status — indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register.

Dynamic Scheduling
Detailed Scoreboard Pipeline Control (← denotes assignment; f ranges over all functional units):
Issue — wait until: the required functional unit is not busy and Result(D) is empty. Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU.
Read operands — wait until: Rj and Rk. Bookkeeping: Rj ← No; Rk ← No.
Execution complete — wait until: the functional unit is done.
Write result — wait until: for all f, (Fj(f) ≠ Fi(FU) or Rj(f) = No) and (Fk(f) ≠ Fi(FU) or Rk(f) = No). Bookkeeping: for all f, if Qj(f) = FU then Rj(f) ← Yes; for all f, if Qk(f) = FU then Rk(f) ← Yes; Result(Fi(FU)) ← 0; Busy(FU) ← No.
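A minimal Python sketch of just the issue-stage check described above (structural hazard = the functional unit is busy, WAW hazard = another active instruction targets the same destination register). The data structures are simplified stand-ins, not a full scoreboard:

def can_issue(fu, dest, busy, register_result):
    structural_hazard = busy.get(fu, False)                 # functional unit already in use?
    waw_hazard = register_result.get(dest) is not None      # another instruction will write dest?
    return (not structural_hazard) and (not waw_hazard)

busy            = {"FPdiv": True, "FPadd": False}
register_result = {"F0": "FPdiv"}            # DIVD F0,F2,F4 still executing

print(can_issue("FPadd", "F10", busy, register_result))   # ADDD F10,F0,F8 -> True, can issue
print(can_issue("FPadd", "F0",  busy, register_result))   # would be a WAW on F0 -> False, stall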

Aug 2009 / Q9
With the help of a block diagram, explain the crossbar switch system organization for multiprocessors. Derive expressions for the speedup, efficiency, and throughput of a pipelined processor. What are their ideal values?

May 2009 / Q1
