SIMD Supercomputers
The operational model of an SIMD computer is a 5-tuple (N, C, I, M, R).
N = number of processing elements (PEs).
C = set of instructions executed directly by the control unit (including scalar and flow-control instructions).
I = set of instructions broadcast to all PEs for parallel execution.
M = set of masking schemes used to partition the PEs into enabled and disabled subsets.
R = set of data-routing functions enabling inter-PE communication through the interconnection network.
Interconnection Network
Clock Rate
Clock rate: the CPU is driven by a clock with a constant cycle time τ. The clock rate is the inverse of the cycle time, f = 1/τ, in megahertz. For example, a cycle time of τ = 10 ns gives f = 100 MHz.
The clock rate is the rate in cycles per second (measured in hertz) or the frequency of the clock in any synchronous circuit, such as a central processing unit (CPU).
Throughput rate
Reservation Table
Specifies the utilization pattern of successive pipeline stages. Utilization follows a diagonal streamline. A task needs k clock cycles to flow through a k-stage linear pipeline. One result emerges at each cycle if the tasks are independent of one another.
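A minimal sketch of this timing (the stage count and task count below are made-up values, not from the notes): it prints the reservation table of a linear pipeline and shows one result emerging per cycle once the pipe is full.

```python
# Reservation table for a linear (static) pipeline: task i occupies
# stage s during cycle i + s, so utilization follows a diagonal streamline.
def reservation_table(k, n):
    """k = number of stages, n = number of independent tasks."""
    total_cycles = k + n - 1          # first task needs k cycles; then 1 result/cycle
    table = [["." for _ in range(total_cycles)] for _ in range(k)]
    for task in range(n):
        for stage in range(k):
            table[stage][task + stage] = str(task + 1)
    return table

for row in reservation_table(k=4, n=3):
    print(" ".join(row))
# Task 1 emerges after k = 4 cycles; tasks 2 and 3 follow one cycle apart.
```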
Sep 2010 Q6
Explain TLB, Paging and Segmentation in virtual memory
Virtual Memory
To facilitate the use of memory hierarchies, the memory addresses normally generated by modern processors executing application programs are not physical addresses, but are rather virtual addresses of data items and instructions. Physical addresses, of course, are used to reference the available locations in the real physical memory of a system. Virtual addresses must be mapped to physical addresses before they can be used.
Mapping Efficiency
Efficient implementations are more difficult in multiprocessor systems where additional problems such as coherence, protection, and consistency must be addressed.
In the private virtual memory scheme, each processor has a separate virtual address space, but all processors share the same physical address space. In the shared virtual memory scheme, all processors share a single virtual address space, with each processor being given a portion of it.
Memory Allocation
Both the virtual address space and the physical address space are divided into fixed-length pieces. In the virtual address space these pieces are called pages; in the physical address space they are called page frames. The purpose of memory allocation is to allocate pages of virtual memory using the page frames of physical memory.
The virtual address can be used with a hash function to locate the translation map (which is stored in the cache, an associative memory, or main memory). The translation map comprises a translation lookaside buffer, or TLB (usually in associative memory), and a page table (or tables). The virtual address is first sought in the TLB, and if that search succeeds, no further translation is necessary. Otherwise, the page table(s) must be referenced to obtain the translation result. If the virtual address cannot be translated to a physical address because the required page is not present in primary memory, a page fault is reported.
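A minimal sketch of the translation path just described; the page size, TLB contents, and page table below are hypothetical:

```python
PAGE_SIZE = 4096  # hypothetical 4 KB pages

tlb = {5: 12}                 # virtual page number -> page frame number (fast path)
page_table = {5: 12, 6: 30}   # full map; pages absent here are not resident

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                      # TLB hit: no further translation needed
        return tlb[vpn] * PAGE_SIZE + offset
    if vpn in page_table:               # TLB miss: walk the page table, refill TLB
        tlb[vpn] = page_table[vpn]
        return page_table[vpn] * PAGE_SIZE + offset
    raise LookupError("page fault: page %d not in primary memory" % vpn)

print(hex(translate(5 * PAGE_SIZE + 0x10)))  # TLB hit
print(hex(translate(6 * PAGE_SIZE + 0x20)))  # TLB miss, page-table hit
```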
Memory Generalities
It is almost always the case that memories at lower-numbered levels, when compared to those at higher-numbered levels, are faster to access, are smaller in capacity, are more expensive per byte, have a higher bandwidth, and have a smaller unit of transfer. In general, then, t_{i-1} < t_i, s_{i-1} < s_i, c_{i-1} > c_i, b_{i-1} > b_i, and x_{i-1} < x_i, where t is access time, s is capacity, c is cost per byte, b is bandwidth, and x is the unit of transfer.
Hit Ratios
When a needed item (instruction or data) is found in the level of the memory hierarchy being examined, it is called a hit; otherwise it is a miss, and the item must be obtained from a lower level in the hierarchy. The hit ratio h_i for M_i is the probability (between 0 and 1) that a needed data item is found when sought in memory level M_i. The miss ratio is then just 1 - h_i. We assume h_0 = 0 and h_n = 1.
Access Frequencies
The access frequency f_i to level M_i is f_i = (1 - h_1)(1 - h_2) ... (1 - h_{i-1}) h_i. The access frequencies sum to one: Σ_{i=1}^{n} f_i = 1.
A cache miss is typically 2 to 4 times as expensive as a cache hit (assuming success at the next level). A page fault (miss) is 3 to 4 orders of magnitude as costly as a page hit.
The effective access time of the hierarchy is T_eff = Σ_{i=1}^{n} f_i t_i.
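A small numeric sketch of the two formulas above; the hit ratios and access times are hypothetical:

```python
# Effective access time of a memory hierarchy:
#   f_i = (1 - h_1)(1 - h_2) ... (1 - h_{i-1}) h_i,   sum(f_i) = 1
#   T_eff = sum(f_i * t_i)
h = [0.95, 0.98, 1.0]        # hit ratios for cache, main memory, disk (h_n = 1)
t = [2, 50, 5_000_000]       # access times in ns (made-up values)

f, miss_so_far = [], 1.0
for hi in h:
    f.append(miss_so_far * hi)
    miss_so_far *= (1 - hi)

print(sum(f))                                # 1.0: every access resolves somewhere
print(sum(fi * ti for fi, ti in zip(f, t)))  # T_eff in ns
```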
Hardware Parallelism
Hardware parallelism is defined by machine architecture and hardware multiplicity. It can be characterized by the number of instructions that can be issued per machine cycle. If a processor issues k instructions per machine cycle, it is called a k-issue processor; conventional processors are one-issue machines. Examples: the Intel i960CA is a three-issue processor (arithmetic, memory access, branch); the IBM RS/6000 is a four-issue processor (arithmetic, floating-point, memory access, branch). A machine with n k-issue processors should be able to handle a maximum of nk threads simultaneously.
Software Parallelism
Software parallelism is defined by the control and data dependences of programs, and is revealed in the program's flow graph. It is a function of algorithm, programming style, and compiler optimization.
(Figure: software parallelism in a dataflow graph. Cycle 1: four loads L1, L2, L3, L4; cycle 2: two multiplies X1 and X2; cycle 3: an add and a subtract.)
There is an obvious tradeoff between the time spent scheduling and synchronizing parallel grains and the speedup obtained by parallel execution. One approach to the problem is called grain packing.
In a program graph, nodes = { (n, s) }, where n = node name and s = size (larger s means larger grain size). Edges = { (v, d) }, where v = the variable being communicated and d = the communication delay.
Packing two (or more) nodes produces a node with a larger grain size and possibly more edges to other nodes. Packing is done to eliminate unnecessary communication delays or reduce overall scheduling overhead.
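A toy sketch of packing two program-graph nodes; the node sizes and delays below are made up for illustration. Merging A and B removes their internal edge, so its communication delay disappears:

```python
# Program graph: nodes (name -> size); edges (u, v, variable, delay).
nodes = {"A": 2, "B": 3, "C": 4}
edges = [("A", "B", "x", 6), ("A", "C", "y", 4), ("B", "C", "z", 5)]

def pack(nodes, edges, a, b, packed_name):
    """Merge nodes a and b into one larger grain, dropping the
    internal edge (its communication delay is eliminated)."""
    merged = dict(nodes)
    merged[packed_name] = merged.pop(a) + merged.pop(b)
    rename = lambda n: packed_name if n in (a, b) else n
    return merged, [(rename(u), rename(v), var, d)
                    for (u, v, var, d) in edges
                    if {rename(u), rename(v)} != {packed_name}]

nodes2, edges2 = pack(nodes, edges, "A", "B", "AB")
print(nodes2)   # {'C': 4, 'AB': 5} -- a coarser grain
print(edges2)   # the A->B delay of 6 is gone
```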
Scheduling
A schedule is a mapping of nodes to processors and start times such that communication delay requirements are observed and no two nodes execute on the same processor at the same time. Some general scheduling goals: schedule all fine-grain activities in a node to the same processor to minimize communication delays, and select grain sizes for packing to achieve better schedules for a particular parallel machine.
Latency
Latency is the time required for communication between different subsystems in a computer. Memory latency, for example, is the time required by a processor to access memory. Synchronization latency is the time required for two processes to synchronize their execution. Computational granularity and communication latency are closely related.
Levels of Parallelism
Jobs or programs: coarse grain
Subprograms, job steps, or related parts of a program: coarse grain
Procedures, subroutines, tasks, or coroutines: medium grain
Non-recursive loops or unfolded iterations: fine grain
Instructions or statements: fine grain
Loop-level Parallelism
A typical loop has fewer than 500 instructions. If a loop operation is independent between iterations, it can be handled by a pipeline or by a SIMD machine. Loops are the most optimized program construct to execute on a parallel or vector machine, although some loops (e.g. recursive ones) are difficult to handle. Loop-level parallelism is still considered fine-grain computation.
Procedure-level Parallelism
Medium-sized grain; usually fewer than 2000 instructions. Detection of parallelism is more difficult than with smaller grains; interprocedural dependence analysis is difficult and history-sensitive. The communication requirement is less than at the instruction level. SPMD (single procedure, multiple data) is a special case. Multitasking belongs to this level.
Subprogram-level Parallelism
Job step level; a grain typically has thousands of instructions; medium- or coarse-grain level. Job steps can overlap across different jobs. Multiprogramming is conducted at this level. No compilers are currently available to exploit medium- or coarse-grain parallelism.
Summary
Fine grain is exploited at the instruction or loop level, assisted by the compiler. Medium grain (task or job step) requires programmer and compiler support. Coarse grain relies heavily on effective OS support. Shared-variable communication is used at the fine- and medium-grain levels. Message passing can be used for medium- and coarse-grain communication, but fine grain really needs a better technique because of its heavier communication requirements.
Communication Latency
Balancing granularity and latency can yield better performance. The various latencies are attributable to machine architecture, technology, and the communication patterns used. Latency imposes a limiting factor on machine scalability. Example: memory latency increases as memory capacity increases, limiting the amount of memory that can be used within a given tolerance for communication latency.
Degree of Parallelism
The number of processors used at any instant to execute a program is called the degree of parallelism (DOP); this can vary over time. DOP assumes an infinite number of processors are available; this is not achievable in real machines, so some parallel program segments must be executed sequentially as smaller parallel segments. Other resources may impose limiting conditions. A plot of DOP vs. time is called a parallelism profile.
(Figure: a parallelism profile, plotting DOP against time over the interval from t1 to t2.)
Average Parallelism
The average parallelism over the interval [t1, t2] is A = (1 / (t2 - t1)) ∫_{t1}^{t2} DOP(t) dt, and the total work is W = Δ ∫_{t1}^{t2} DOP(t) dt, where Δ is the computing capacity of a single processor.
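A small sketch evaluating A for a piecewise-constant parallelism profile; the DOP values and durations are hypothetical:

```python
# Average parallelism A = (1 / (t2 - t1)) * integral of DOP(t) dt.
# For a piecewise-constant profile the integral is a weighted sum.
profile = [(3, 2.0), (7, 1.0), (4, 3.0), (1, 2.0)]  # (DOP, duration) pairs

work = sum(dop * dt for dop, dt in profile)   # total work W (taking Delta = 1)
span = sum(dt for _, dt in profile)           # t2 - t1
print(work / span)                            # average parallelism A = 3.375
```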
Available Parallelism
Various studies have shown that the potential parallelism in scientific and engineering calculations can be very high (e.g. hundreds or thousands of instructions per clock cycle). But in real machines, the actual parallelism is much smaller (e.g. 10 or 20).
Cache performance is studied in terms of cycle counts, hit ratios, the effect of block size, the effect of set number, and other factors.
Coherence Strategies
Write-through: as soon as a data item in M_i is modified, an immediate update of the corresponding data item(s) in M_{i+1}, M_{i+2}, ..., M_n is required. This is the most aggressive (and expensive) strategy.
Write-back: the update of the data item in M_{i+1} corresponding to a modified item in M_i is not performed until it (or the block/page that contains it) is replaced or removed. This is the most efficient approach, but it cannot be used (without modification) when multiple processors share M_{i+1}, ..., M_n.
Cycle count: the number of machine cycles needed for cache access, update, and coherence. Hit ratio: how effectively the cache reduces the overall memory access time. Program-trace-driven simulation: presents snapshots of program behavior and cache responses. Analytical modeling: provides insight into the underlying processes.
Cycle Counts
Cache speed is affected by the underlying static or dynamic RAM technology, the cache organization, and the hit ratio. Write-through/write-back policies affect the count, as do cache size, block size, set number, and associativity. Cycle count is directly related to hit ratio.
Hit Ratio
The hit ratio is affected by cache size and block size, and increases with increasing cache size. Limited cache size, initial loading, and changes in locality prevent a 100% hit ratio.
With a fixed cache size, block size has an impact: as block size increases, the hit ratio improves due to spatial locality, peaks at an optimum block size, then decreases. If the block is too large, many words brought into the cache are never used.
(Figure: cache performance.)
Mar 2010 Q6
Superpipelined processor of degree n: the pipeline cycle time is 1/n of the base cycle. A fixed-point addition takes one cycle in a base scalar processor but n short cycles in a superpipelined processor. Issue rate = 1, issue latency = 1/n, ILP = n. Requires high-speed clocking.
Superpipelined Performance
Superpipelined Superscalar
Degree (m, n): executes m instructions every cycle with a pipeline cycle of 1/n of the base cycle. Instruction issue latency = 1/n; ILP = mn instructions.
For N independent instructions on a degree (m, n) machine: T(m,n) = k + (N - m) / (mn) base cycles, and the speedup over the base scalar machine is S(m,n) = mn(k + N - 1) / (mnk + N - m).
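A quick numeric check of these two formulas; the values of k, N, m, and n below are hypothetical:

```python
# Execution time T(m,n) in base cycles, and speedup S(m,n) over the
# base scalar machine, for a superpipelined superscalar of degree (m,n).
def T(m, n, k, N):
    return k + (N - m) / (m * n)

def S(m, n, k, N):
    return (m * n * (k + N - 1)) / (m * n * k + N - m)

k, N = 4, 1000          # pipeline depth and instruction count (assumed)
print(T(3, 2, k, N))    # ~170.2 base cycles for a degree (3,2) machine
print(S(3, 2, k, N))    # ~5.89, approaching mn = 6 as N grows
```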
Design Approaches
Superpipelined: emphasizes temporal parallelism; requires faster transistors; the design must minimize the effects of clock skew.
Superscalar: depends on spatial parallelism; requires more transistors; a better match for CMOS technology.
Mar 2010 Q4
Explain buses and interfaces in a multiprocessor system.
Timing: synchronous buses are controlled by a global clock; asynchronous buses use handshaking or interlock mechanisms.
Switching method: in circuit switching, a pair of communicating devices controls the path for the entire duration of the data transfer; in packet switching, large data transfers are broken into smaller pieces, each of which can compete for use of the path.
Network control: with centralized control, a global controller receives and acts on requests; with distributed control, requests are handled by local devices independently.
Digital Buses
Digital buses are the fundamental interconnects adopted in most commercial multiprocessor systems with fewer than 100 processors. The principal limitation of the bus approach is packaging technology. Complete bus specifications include logical, electrical, and mechanical properties, application profiles, and interface requirements.
Bus Systems
A bus system is a hierarchy of buses connecting various system and subsystem components. Each bus has a complement of control, signal, and power lines. There is usually a variety of buses in a system:
Local bus: usually integral to a system board; connects various major system components (chips).
Memory bus: used within a memory board to connect the interface, the controller, and the memory cells.
Data bus: might be used on an I/O board or VLSI chip to connect various components.
Backplane: like a local bus, but with connectors to which other boards can be attached.
Bridges
The term bridge denotes a device used to connect two (or possibly more) buses. The interconnected buses may use the same standards, or they may be different (e.g. the PCI and ISA buses in a modern PC). Bridge functions include communication protocol conversion and interrupt handling across the interconnected buses.
Most systems use a private cache for each processor, with an interconnection network between the caches and main memory. Caches are addressed using either a physical address or a virtual address.
In a physical address cache, the cache is indexed and tagged with the physical address, so cache lookup occurs after address translation in the TLB or MMU (no aliasing). After a cache miss, a block is loaded from main memory, using either a write-back or write-through policy.
Advantages: no cache flushing needed, no aliasing problems, a simplistic design, and little intervention required from the OS kernel.
Disadvantage: slowdown, because the cache cannot be accessed until the MMU/TLB finishes address translation.
In a virtual address cache, the cache is indexed or tagged with the virtual address, so cache access and MMU translation/validation are performed in parallel. The physical address is saved in the tags for write-back. This gives more efficient access to the cache.
Aliasing Problem
Different logically addressed data may have the same index/tag in the cache, causing confusion if two or more processes access the same physical cache location. One remedy is to flush the cache when aliasing occurs, but this leads to slowdown; alternatively, apply special tagging with a process key or with a physical address.
Cache performance depends upon access patterns, cache organization, and management policy. Blocks in the cache are called block frames, denoted B_i (i = 1, ..., m); blocks in main memory are denoted B_j (j = 1, ..., n), with m << n, m = 2^r, and n = 2^s. Each block has b = 2^w words, so the cache holds a total of mb = 2^{r+w} words and main memory nb = 2^{s+w} words.
Direct mapping: n/m memory blocks map to each block frame in the cache. Placement uses a modulo-m function: B_j → B_i if i = j mod m, so there is a unique block frame B_i into which each B_j can load. This is the simplest organization to implement.
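A minimal sketch of the modulo placement rule; the cache geometry and block numbers are made up:

```python
# Direct mapping: memory block B_j can live only in frame B_i, i = j mod m.
m = 8                     # cache block frames (2^r, assumed)
for j in (3, 11, 19):     # memory block numbers
    print(f"block {j} -> frame {j % m}")
# Blocks 3, 11, and 19 all contend for frame 3: rigid mapping, possible thrashing.
```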
Advantages: simple hardware, no associative search, no page replacement policy needed, lower cost, higher speed.
Disadvantages: rigid mapping, poorer hit ratio, and parallel virtual address translation is prohibited; a larger cache with more block frames is needed to avoid contention.
In a fully associative cache, each block in main memory can be placed in any of the available block frames. An s-bit tag is needed in each cache block (s > r). An m-way associative search requires the tag to be compared with all cache block tags; an associative memory achieves a parallel comparison with all tags concurrently.
Advantages: the most flexibility in mapping cache blocks, a higher hit ratio, and a better block replacement policy with reduced block contention.
Disadvantages: higher hardware cost, only moderate cache sizes are practical, and an expensive search process.
In a k-way set-associative cache, the m cache block frames are divided into v = m/k sets, with k blocks per set. Each set is identified by a d-bit set number. The tag is compared only with the k tags within the identified set: B_j → B_f ∈ S_i if j mod v = i.
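A sketch of the k-way placement rule; the geometry and the stand-in FIFO replacement below are assumptions for illustration:

```python
# k-way set-associative: v = m / k sets; block j maps to set j mod v,
# then its tag is compared with the k tags inside that set only.
m, k = 16, 4
v = m // k                          # number of sets
sets = [[] for _ in range(v)]       # each set holds up to k blocks

def place(j):
    s = j % v
    if len(sets[s]) == k:           # set full: a replacement policy evicts here
        sets[s].pop(0)              # FIFO, as a stand-in for LRU etc.
    sets[s].append(j)
    return s

for j in (5, 9, 13, 17, 21):
    print(f"block {j} -> set {place(j)}")  # all five map to set 1; the 5th evicts
```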
Sector mapping: partition the cache and main memory into fixed-size sectors, then use a fully associative search over sector tags; block fields within the sector locate the block. Only the missing block is loaded on a miss. The i-th block in a sector is placed into the i-th block frame of the destined sector frame. A valid/invalid bit is attached to each block frame.
The system bus operates on a contention basis: only one requester is granted access to the bus at a time, so the effective bandwidth available to each processor is inversely proportional to the number of contending processors. Simple and low cost, suitable for roughly 4 to 16 processors.
The bus interconnects processors, data storage, and I/O devices, and must allow communication between these devices. Timing protocols govern arbitration, and operational rules ensure orderly data transfers. Signal lines are grouped into several buses.
The bus is composed of data, address, and control lines. Address lines broadcast the data and device addresses. The number of data lines is proportional to the memory word length. Control lines specify read/write operations, timing, and bus error conditions.
Arbitration assigns control of the data transfer bus (DTB); the requester is the master and the receiving end is the slave. Interrupt lines handle prioritized interrupts. Dedicated lines synchronize parallel activities among processor modules. Utility lines provide periodic timing and coordinate the power-up/power-down sequences. A bus controller board houses the control logic.
Functional Modules
Arbiter: the functional module that performs arbitration. Bus timer: measures the time for data transfers. Interrupter: generates interrupt requests and provides status/ID to the interrupt handler. Location monitor: monitors data transfers. Power monitor: monitors the power source. System clock driver: provides the clock timing signal on the utility bus. Board interface logic: matches signal-line impedance, propagation time, and termination values.
Physical Limitations
Electrical, mechanical, and packaging limitations restrict the number of boards. Multiple backplane buses can be mounted on the same backplane chassis, but it is difficult to scale a bus system due to packaging constraints.
Aug 2009 Q2
How do you classify pipeline processors? Give examples.
Arithmetic Pipelining: The arithmetic logic units of a computer can be segmented for pipeline operations in various data formats. A well-known arithmetic pipeline example is the four-stage pipe used in the Star-100.
Instruction Pipelining: The execution of a stream of instructions can be pipelined by overlapping the execution of the current instruction with the fetch, decode, and operand fetch of subsequent instructions. This technique is also known as instruction lookahead.
Processor Pipelining: This refers to the pipeline processing of the same data stream by a cascade of processors, each of which processes a specific task. The data stream passes through the first processor, with the results stored in a memory block that is also accessible by the second processor. The second processor then passes its refined results to the third, and so on.
Pipeline throughput (the number of tasks completed per unit time; note the equivalence to IPC): H_k = n / ([k + (n - 1)] τ) = nf / (k + n - 1), where τ is the clock period and f = 1/τ.
Example: with k = 4 stages and a 100 ns clock, the pipeline time for 1000 tasks is (1000 + 4 - 1) × 100 ns = 1003 × 100 ns, while the sequential time is 1000 × 280 ns. The speedup approaches 280/100 = 2.8, and the throughput is 1000/1003 results per cycle. What is the problem here? How can performance be improved?
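A short calculator reproducing the numbers in this example (the 280 ns sequential task time and 100 ns clock come from the example above; the function name is ours):

```python
def pipeline_stats(n, k, clock_ns, seq_task_ns):
    pipe_ns = (k + n - 1) * clock_ns   # n tasks through a k-stage pipeline
    seq_ns = n * seq_task_ns           # unpipelined execution time
    return pipe_ns, seq_ns / pipe_ns, n / (k + n - 1)

pipe, speedup, util = pipeline_stats(n=1000, k=4, clock_ns=100, seq_task_ns=280)
print(pipe)      # 100300 ns
print(speedup)   # ~2.79, approaching 280/100 = 2.8 for large n
print(util)      # 1000/1003 results per cycle
```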
Aug 2009 Q4
What do you understand by loose and tight coupling of processors in a multiprocessor system? Explain the salient features of each type. Discuss the processor characteristics for a multiprocessor system.
Characteristics of Multiprocessors
SHARED-MEMORY MULTIPROCESSORS (tightly coupled)
(Figure: processors P ... P connected through an interconnection network to shared memory.)
Characteristics: all processors have equally direct access to one large memory address space.
Example systems: bus- and cache-based systems (Sequent Balance, Encore Multimax); multistage interconnection-network-based systems (Ultracomputer, Butterfly, RP3, HEP); crossbar-switch-based systems (C.mmp, Alliant FX/8).
Limitations: memory access latency; the hot-spot problem.
Characteristics of Multiprocessors
MESSAGE-PASSING MULTIPROCESSORS
(Figure: processors P ... P, each with its own memory M, connected by point-to-point links in a message-passing network.)
Characteristics: interconnected computers; each processor has its own memory, and processors communicate via message passing (loosely coupled).
Example systems: tree structure (Teradata, DADO); mesh-connected (Rediflow, Series 2010, J-Machine); hypercube (Cosmic Cube, iPSC, NCUBE, FPS T Series, Mark III).
Limitations: communication overhead; hard to program.
Aug 2009 / Q5
Write short notes on
Flynn's Classification
Classification based on notions of instruction streams and data streams. Architecture categories:
SISD
SIMD
MISD
MIMD
SISD: a single instruction stream operating on a single data stream; conventional uniprocessors.
(Figure: block diagram of the SISD organization, with one instruction stream (IS) and one data stream (DS).)
SIMD: i. The same instruction is executed by multiple processors using different data streams. ii. Each processor has its own data memory (hence multiple data). iii. There is a single instruction memory and control processor.
(Figure: block diagram of the SIMD organization, with one control unit C broadcasting the IS to multiple processors P, each with its own DS.)
MISD: multiple instruction streams operate on a single data stream.
(Figure: block diagram of the MISD organization, with multiple IS and a single DS through memory M.)
MIMD: multiple instruction streams operating on multiple data streams.
(Figure: block diagram of the MIMD organization, with multiple IS and DS between processors and memory M.)
Each processor fetches its own instructions and operates on its own data. MIMD is the current winner: most designs concentrate on machines with at most 128 processors and use off-the-shelf microprocessors for their cost-performance advantages. MIMD machines are flexible, delivering high performance for a single application or running many tasks simultaneously. Examples: Sun Enterprise 5000, Cray T3D, SGI Origin.
Applications: finite element analysis, computational aerodynamics, remote sensing, artificial intelligence and automation, computer-assisted tomography, genetic engineering, weapon research and defense, and basic research problems.
Sources of failure include intermittent or permanent hardware faults, software and hardware design errors, operator errors, and externally induced upsets or physical damage.
Aug 2009 / Q6
Describe the architectural features of any two of the following:
Pentium
Pentium III
The Pentium III brand refers to Intel's 32-bit x86 desktop and mobile microprocessors based on the sixth-generation P6 microarchitecture, introduced on February 26, 1999. The most notable difference was the addition of the SSE instruction set (to accelerate floating-point and parallel calculations) and the introduction of a controversial serial number embedded in the chip during the manufacturing process.
Power PC
PowerPC (short for Performance Optimization With Enhanced RISC Performance Computing, sometimes abbreviated PPC) is a RISC architecture created by the 1991 Apple, IBM, and Motorola alliance, known as AIM. Originally intended for personal computers, PowerPC CPUs have since become popular as embedded and high-performance processors. PowerPC is largely based on IBM's earlier POWER architecture and retains a high level of compatibility with it.
Aug 2009 / Q7
Explain two bus arbitration schemes for multiprocessors. What is cache coherence? Explain the static coherence check mechanism.
Arbitration
Arbitration is the process of selecting the next bus master; bus tenure is the duration of its control. Arbitration may be done on a fairness or priority basis. On a parallel bus, arbitration competition and bus transactions can take place concurrently over separate lines.
Central Arbitration
Potential masters are daisy-chained: a signal line propagates the bus-grant signal from the first master to the last, and there is only one bus-request line. The bus-grant signal activates the bus-busy line.
(Figure: daisy-chained central arbitration.)
Advantages: a simple scheme, and it is easy to add devices. Disadvantages: the fixed-priority sequence is not fair, propagation of the bus-grant signal is slow, and the scheme is not fault tolerant.
Independent requests and grants: provide separate bus-request and bus-grant signals for each master. This requires a central arbiter, but it can use a priority- or fairness-based policy. More flexible and faster than a daisy-chained policy, but the larger number of lines is costly.
Distributed Arbitration
Each master has its own arbiter and a unique arbitration number. The arbitration number is used to resolve competition: each competitor sends its number to the shared-bus request/grant (SBRG) lines and compares its own number with the resolved SBRG number. This is a priority-based scheme.
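A toy model of the compare-with-SBRG scheme, simplified by assuming the shared lines resolve to the maximum of the competing arbitration numbers; the masters and numbers are hypothetical:

```python
# Distributed arbitration: every competing master drives its arbitration
# number onto the shared SBRG lines; each compares its own number with
# the resolved value, and the master whose number matches wins the bus.
def arbitrate(requesters):
    sbrg = max(requesters.values())               # value seen on the SBRG lines
    return [m for m, num in requesters.items() if num == sbrg]

print(arbitrate({"P1": 0b0101, "P2": 0b1100, "P3": 0b1010}))  # ['P2'] wins
```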
Aug 2009 / Q8
How can data hazards be overcome by dynamic scheduling?
Dynamic Scheduling
3.1 Instruction Level Parallelism: Concepts and Challenges
3.2 Overcoming Data Hazards with Dynamic Scheduling
3.3 Dynamic Scheduling: Examples and the Algorithm
3.4 Reducing Branch Penalties with Dynamic Hardware Prediction
3.5 High Performance Instruction Delivery
3.6 Taking Advantage of More ILP with Multiple Issue
3.7 Hardware-Based Speculation
3.8 Studies of the Limitations of ILP
3.10 The Pentium 4
Dynamic Scheduling
The idea:
Scoreboarding allows an instruction to execute as soon as its issue and read-operand conditions hold, without waiting for prior instructions. A scoreboard is a data structure that provides the information necessary for all pieces of the processor to work together: in-order issue, out-of-order execution, out-of-order commit (also called completion). It was first used in the CDC 6600; the example here is modified for MIPS. The CDC 6600 had 4 FP units, 5 memory reference units, and 7 integer units; our MIPS model has 2 FP multipliers, 1 FP adder, 1 FP divider, and 1 integer unit.
Using A Scoreboard
Scoreboard Implications
Out-of-order completion introduces the possibility of WAR and WAW hazards. Solutions for WAR: queue both the operation and copies of its operands, or read registers only during the Read Operands stage. For WAW, the hazard must be detected and the instruction stalled until the other instruction completes. Multiple instructions must be in the execution phase, which requires multiple execution units or pipelined execution units. The scoreboard keeps track of dependences and the state of operations, and replaces the ID, EX, and WB stages with four stages.
(Slides: successive scoreboard state snapshots for a sample code sequence; the tables are not reproduced here.)
The scoreboard has three parts:
1. Instruction status: indicates which of the four steps the instruction is in.
2. Functional unit status: indicates the state of the functional unit (FU); nine fields per functional unit:
Busy: indicates whether the unit is busy or not.
Op: the operation to perform in the unit (e.g., + or -).
Fi: the destination register.
Fj, Fk: the source-register numbers.
Qj, Qk: the functional units producing source registers Fj, Fk.
Rj, Rk: flags indicating when Fj, Fk are ready.
3. Register result status: indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register.
Using A Scoreboard
Instruction status / wait until / bookkeeping:
Issue: wait until the functional unit is free and no active instruction has the same destination register.
Read operands: wait until Rj and Rk are Yes; then set Rj ← No, Rk ← No.
Execution complete: wait until the functional unit is done.
Write result: wait until ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) and (Fk(f) ≠ Fi(FU) or Rk(f) = No)); then ∀f (if Qj(f) = FU then Rj(f) ← Yes); ∀f (if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No.
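A sketch of just the Write Result test above, using the same field names; the functional-unit records below are hypothetical:

```python
# Write Result may proceed only if no other functional unit f still needs to
# read the destination register Fi(FU) (this is the WAR-hazard check).
def can_write_result(fu, units):
    fi = units[fu]["Fi"]
    return all((u["Fj"] != fi or u["Rj"] == "No") and
               (u["Fk"] != fi or u["Rk"] == "No")
               for name, u in units.items() if name != fu)

units = {
    "Mult1": {"Fi": "F0", "Fj": "F2", "Fk": "F4", "Rj": "No", "Rk": "No"},
    "Add":   {"Fi": "F6", "Fj": "F0", "Fk": "F8", "Rj": "Yes", "Rk": "Yes"},
}
print(can_write_result("Mult1", units))  # False: Add has not yet read F0
```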
Aug 2009 / Q9
With the help of a block diagram, explain the crossbar switch system organization for multiprocessors. Derive expressions for the speedup, efficiency, and throughput of a pipelined processor. What are their ideal values?
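For the pipelined-processor part, a sketch of the standard derivation, consistent with the throughput formula H_k given earlier (k stages, n tasks, clock period τ, frequency f = 1/τ):
A k-stage pipeline finishes n tasks in T_k = [k + (n - 1)] τ, while an equivalent non-pipelined processor needs T_1 = nkτ.
Speedup: S_k = T_1 / T_k = nk / (k + n - 1), with ideal value k as n → ∞.
Efficiency: E_k = S_k / k = n / (k + n - 1), with ideal value 1.
Throughput: H_k = n / ([k + (n - 1)] τ) = nf / (k + n - 1) = E_k f, with ideal value f (one result per clock cycle).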
May 2009 / Q1