ACA T1 Solutions

TEST-1 SOLUTIONS
Subject: Advanced Computer Architecture
PART-1
Answer any one full question.
1) Give Flynn's classification of various computer architectures. Clearly explain the features
of each with conceptual diagrams.
(10 Marks)
Sol: Michael Flynn introduced a classification of computer architectures based on the
notions of instruction streams and data streams. The four classes are:
1. SISD (single instruction stream over a single data stream)
2. SIMD (single instruction stream over multiple data streams)
3. MIMD (multiple instruction streams over multiple data streams)
4. MISD (multiple instruction streams over a single data stream)
1. SISD (single instruction stream over a single data stream):
Conventional sequential machines are called SISD computers, as shown in Fig 1a.
CU = control unit, PU = processing unit, MU = memory unit
IS = instruction stream, DS = data stream
(Diagram: CU, PU, MU and I/O connected by the IS and DS paths.)
Fig 1a: SISD Uniprocessor architecture

2. SIMD (single instruction stream over multiple data streams):
Vector computers equipped with scalar and vector hardware are called SIMD
computers, as shown in Fig 1b.
PE = Processing Element, LM = Local Memory
(Diagram: one CU sends the IS to PE1 ... PEn; each PE exchanges DS with its own local memory LM1 ... LMn. The program and the data sets are loaded from the host.)
Fig 1b: SIMD architecture (with distributed memory)
3. MIMD (multiple instruction streams over multiple data streams):
The term parallel computer is reserved for MIMD machines, as shown in Fig 1c.
(Diagram: control units CU1 ... CUn each send an IS to PU1 ... PUn; every PU exchanges DS with the shared memory; I/O attached.)
Fig 1c: MIMD architecture (with shared memory)

4. MISD (multiple instruction streams over a single data stream):
MISD machines are modeled in Fig 1d.
The same data stream flows through an array of processors executing different
instruction streams.
(Diagram: control units CU1 ... CUn each send a separate IS to PU1 ... PUn; one DS flows from the memory (program and data) through PU1, PU2, ..., PUn in sequence; I/O attached.)
Fig 1d: MISD architecture (the systolic array)
Of the four machine models, most parallel computers assume the MIMD model
for general-purpose computations.
The SIMD and MISD models are more suitable for special-purpose computations.
Therefore MIMD is the most popular model, SIMD is next and MISD is the
least popular model.
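The classification above depends only on how many instruction streams and data streams a machine handles. A minimal sketch of that rule follows; the function name `flynn_class` and its integer arguments are illustrative assumptions, not part of the original solution.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn class for the given stream counts.

    One stream maps to "S" (single), more than one maps to "M" (multiple).
    """
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instr}I{data}D"


if __name__ == "__main__":
    # A uniprocessor, an n-PE array machine, a systolic array, a multiprocessor.
    for streams in [(1, 1), (1, 8), (8, 1), (8, 8)]:
        print(streams, flynn_class(*streams))
```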
2) a) A 40 MHz processor was supposed to execute 200000 instructions with the following
instruction mix and the CPI needed for each instruction type.
Instruction type      CPI   Instruction mix
Integer arithmetic    2     60%
Data transfer         4     18%
Floating point        6     12%
Control transfer      5     10%
Determine the effective CPI, MIPS rate and execution time.

(5 Marks)
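Sol:
CPI = (number of clock cycles) / (number of instructions)
Effective CPI:
\[ \text{CPI} = \frac{(0.60\,I_c)(2) + (0.18\,I_c)(4) + (0.12\,I_c)(6) + (0.10\,I_c)(5)}{I_c} = 3.14 \text{ clock cycles/instruction} \]
MIPS rate, with cycle time \( \tau = 1/f = 1/(40 \times 10^6) \) s:
\[ \text{MIPS rate} = \frac{I_c}{T \times 10^6} = \frac{I_c}{I_c \times \text{CPI} \times \tau \times 10^6} = \frac{1}{3.14 \times \frac{1}{40 \times 10^6} \times 10^6} \approx 12.74 \text{ MIPS} \]
Execution time:
\[ T = I_c \times \text{CPI} \times \tau = 200 \times 10^3 \times 3.14 \times \frac{1}{40 \times 10^6} = 15.7 \text{ ms} \]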
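The arithmetic above can be checked with a short sketch. The CPI values 2, 4, 6 and 5 are the ones used in the effective CPI expression of the solution; the names `MIX`, `effective_cpi`, `mips_rate` and `execution_time` are illustrative assumptions.

```python
CLOCK_HZ = 40e6              # 40 MHz clock
INSTRUCTION_COUNT = 200_000  # Ic

# instruction type -> (fraction of the instruction count, CPI)
MIX = {
    "integer arithmetic": (0.60, 2),
    "data transfer": (0.18, 4),
    "floating point": (0.12, 6),
    "control transfer": (0.10, 5),
}


def effective_cpi(mix):
    """Weighted average of the per-type CPI values."""
    return sum(fraction * cpi for fraction, cpi in mix.values())


def mips_rate(clock_hz, cpi):
    """MIPS = f / (CPI * 10^6)."""
    return clock_hz / (cpi * 1e6)


def execution_time(instruction_count, cpi, clock_hz):
    """T = Ic * CPI * tau, with tau = 1 / f, in seconds."""
    return instruction_count * cpi / clock_hz


if __name__ == "__main__":
    cpi = effective_cpi(MIX)
    print(f"effective CPI  = {cpi:.2f}")
    print(f"MIPS rate      = {mips_rate(CLOCK_HZ, cpi):.2f}")
    print(f"execution time = {execution_time(INSTRUCTION_COUNT, cpi, CLOCK_HZ) * 1e3:.1f} ms")
```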
2) b) Differentiate between implicit and explicit parallelism with a neat sketch.
(5 Marks)
Sol:
Implicit parallelism:
An implicit approach uses a conventional language, such as C, Fortran, Lisp or Pascal,
to write the source program.
The sequentially coded source program is translated into parallel object code by a
parallelizing compiler.
As shown in Fig 5 (a), this compiler must be able to detect parallelism and assign
target machine resources.
This compiler approach has been applied in programming shared memory
multiprocessors.
This approach requires less effort on the part of the programmer.
(Diagram: Programmer -> Source code written in sequential languages C, Fortran, Lisp or Pascal -> Parallelizing compiler -> Parallel object code -> Execution by runtime system)
Fig 5 (a): Implicit Parallelism

Explicit parallelism:
This approach, as shown in Fig 5 (b), requires more effort by the programmer to
develop a source program.
Parallelism is explicitly specified in the user program.
This significantly reduces the burden on the compiler to detect parallelism.
Instead, the compiler needs to preserve parallelism and, where possible, assign target
machine resources.
(Diagram: Programmer -> Source code written in concurrent dialects of C, Fortran, Lisp or Pascal -> Concurrency preserving compiler -> Concurrent object code -> Execution by runtime system)
Fig 5 (b): Explicit Parallelism
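The difference in programmer effort can be sketched in Python, with the caveat that Python has no parallelizing compiler, so the first function only stands for the sequential source a programmer would hand to one. The second function shows explicit parallelism: the programmer partitions the work and names the concurrent tasks. The function names and the chunking scheme are assumptions made for illustration, and in standard CPython these threads may not actually speed up the arithmetic; the point is how the parallelism is expressed.

```python
from concurrent.futures import ThreadPoolExecutor


def sequential_sum_of_squares(values):
    """Plain sequential code: any parallelism would have to be found by a tool."""
    total = 0
    for v in values:
        total += v * v
    return total


def explicit_sum_of_squares(values, workers=4):
    """Explicit parallelism: the programmer splits the data and submits the tasks."""
    # Strided split into one chunk per worker (an arbitrary choice for this sketch).
    chunks = [values[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = pool.map(sequential_sum_of_squares, chunks)
        return sum(partial_sums)


if __name__ == "__main__":
    data = list(range(1000))
    print(sequential_sum_of_squares(data), explicit_sum_of_squares(data))
```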
PART-2
3) Explain the UMA and NUMA models of shared-memory multiprocessors with a neat diagram.
(10 Marks)
Sol:
The multiprocessor parallel models are
i) Uniform memory access model [UMA].
ii) Non-uniform memory access model [NUMA].
i) Uniform memory access model [UMA]:
In this model, physical memory is uniformly shared by all processors.
All processors have equal access time to all memory words.
Each processor uses a private cache.
Such multiprocessors are called tightly coupled systems due to the high degree of resource sharing.
The system interconnect takes the form of a common bus, a crossbar switch or a
multistage network.
The UMA model is suitable for general-purpose, time-sharing applications by multiple
users.
Coordination of parallel events, synchronization and communication among
processors are done through shared variables.
When all the processors have equal access time to all the peripherals, the system
is said to be a symmetric multiprocessor.
In this case, all the processors are equally capable of running the executive programs.
In an asymmetric multiprocessor, only one or a subset of the processors are executive
capable.
An executive or master processor can execute the OS and handle I/O.
The remaining processors have no I/O capability and thus are called attached
processors.
Attached processors execute user codes under the supervision of the master processor.
(Diagram: processors P1, P2, ..., Pn connect through the system interconnect (bus, crossbar or multistage network) to I/O and to the shared memory modules SM1 ... SMn.)
Fig 2: The UMA multiprocessor model.
ii) Non-uniform memory access model [NUMA]:
A NUMA multiprocessor is a shared memory system in which the access time varies
with the location of the memory word.
Two NUMA machine models are shown in Fig 3 (a) and (b).
In the shared local memory model (Fig 3a), the shared memory is physically distributed
to all processors as local memories.
The collection of all local memories forms a global address space accessible by all
processors.
It is faster for a processor to access its own local memory. Access to remote
memory attached to other processors takes longer due to the added delay through the
interconnection network.
Besides distributed memories, globally shared memory can be added to a
multiprocessor system.
In this case there are three memory access patterns. They are
a. Local memory access (fastest).
b. Global memory access.
c. Remote memory access (slowest).
In the hierarchical cluster model (Fig 3b), processors are divided into several clusters.
Each cluster is itself a UMA or a NUMA multiprocessor.
The clusters are connected to global shared memory modules. The entire system is
considered a NUMA multiprocessor.
All processors belonging to the same cluster are allowed to uniformly access the
cluster shared memory modules. All clusters have equal access to the global memory.
The access time to the cluster memory is shorter than that to the global memory.
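(Diagram a, shared local memories: each processor P1 ... Pn has a local memory LM1 ... LMn, and all of them are joined by the interconnection network.)
(Diagram b, a hierarchical cluster model: each cluster, Cluster 1 ... Cluster N, contains processors (P) and cluster shared memory modules (CSM) joined by a cluster interconnection network (CIN); the clusters reach the global shared memory modules (GSM) through the global interconnect network.)
Fig 3: Two NUMA models for multiprocessor systems.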
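The three access patterns of the NUMA model can be turned into a simple average access time estimate. The sketch below is only an illustration: the latency figures in `LATENCY_NS` and the access fractions in the example are hypothetical numbers chosen to respect the ordering local < global < remote, and the function name is an assumption.

```python
# Hypothetical latencies in nanoseconds; only their ordering comes from the text.
LATENCY_NS = {"local": 100, "global": 300, "remote": 500}


def average_access_time(fractions, latency=LATENCY_NS):
    """Weighted mean latency for a mix of local, global and remote accesses."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return sum(share * latency[kind] for kind, share in fractions.items())


if __name__ == "__main__":
    # A program that keeps most of its data in local memory.
    mostly_local = {"local": 0.8, "global": 0.1, "remote": 0.1}
    print(average_access_time(mostly_local))

    # In a UMA machine every access costs the same, so the mix does not matter.
    uma = {"local": 200, "global": 200, "remote": 200}
    print(average_access_time(mostly_local, latency=uma))
```

The closer the access mix is to purely local, the closer the estimate gets to the local latency, which is why data placement matters on NUMA machines.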
Answer any two full questions.

4) Explain the architecture of vector super computer with a neat diagram.
Sol:
The architecture of a vector supercomputer is shown in Fig 4.
The vector processor is built on top of the scalar processor.
The vector processor is attached to the scalar processor as an optional feature.
Program and data are first loaded into the main memory through a host computer.
All instructions are first decoded by the scalar control unit.
If the decoded instruction is a scalar operation or a program control operation, it
is directly executed by the scalar processor using the scalar functional
pipelines.
If the instruction is decoded as a vector operation, it is sent to the vector
control unit. This control unit supervises and coordinates the flow of vector data
between the main memory and the vector functional pipelines. A number of vector
functional pipelines may be built into a vector processor.
The vector processor of a vector supercomputer can be built on one of two
architectures, namely
1. Register-to-register architecture
2. Memory-to-memory architecture
Register-to-register architecture:
Here vector registers are used to hold the vector operands and the intermediate and
final vector results.
The vector functional pipelines retrieve operands from and put results into the
vector registers.
All vector registers are programmable in user instructions.
Each vector register is equipped with a component counter, which keeps track
of the component registers used in successive pipeline cycles.
In general, there is a fixed number of vector registers and functional pipelines
in a vector processor.
Memory-to-memory architecture:
In this architecture, the vector operands and intermediate results are stored directly
in the main memory and are retrieved from the memory as and when required.
(Diagram: host computer with mass storage and I/O (user); main memory (program and data); scalar processor; vector processor; scalar and vector instruction paths and data paths between them.)
Fig 4: The architecture of vector supercomputer.
5) a) Explain different types of data dependency with an example

b) Draw the data dependency graph for the following.
S1: Load R1, M(100)
S2: Move R2, R1
S3: Inc R1
S4: Add R2, R1
S5: Store M(100), R1
(5+5Marks)
Sol:
a) There are 5 types of data dependencies. They are as follows:
(1) Flow dependence:
A statement S2 is flow-dependent on statement S1 if an execution path exists
from S1 to S2 and if at least one output of S1 feeds in as input to S2.
Ex: S1: Load R1, A
    S2: Add R2, R1
(2) Anti dependence:
Statement S2 is anti-dependent on statement S1 if S2 follows S1 in program order
and if the output of S2 overlaps the input to S1.
Ex: S1: Add R2, R1
    S2: Move R1, R3
(3) Output dependence:
Two statements are output-dependent if they produce the same output variable.
Ex: S1: Load R1, A
    S2: Move R1, R3
(4) I/O dependence:
Read and write are I/O statements. I/O dependence occurs not because the
same variable is involved but because the same file is referenced by both I/O
statements.
(5) Unknown dependence:
The dependence relation between two statements cannot be determined in the
following situations:
The subscript of a variable is itself subscripted.
The subscript does not contain the loop index variable.
A variable appears more than once with subscripts having different coefficients of
the loop variable.
The subscript is nonlinear in the loop index variable.
When one or more of these conditions exist, a conservative assumption is to
claim unknown dependence among the statements involved.
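The first three definitions can be checked mechanically from the read and write sets of two statements. The following is a minimal sketch under stated assumptions: the first operand of each instruction is taken to be the destination (so Add R2, R1 reads R1 and R2 and writes R2), and the function name `classify` is illustrative.

```python
def classify(first, second):
    """Classify the dependence of `second` on `first`.

    Each statement is a (reads, writes) pair of sets, and `first` precedes
    `second` in program order.
    """
    reads1, writes1 = first
    reads2, writes2 = second
    kinds = []
    if writes1 & reads2:   # an output of S1 feeds an input of S2
        kinds.append("flow")
    if reads1 & writes2:   # the output of S2 overlaps an input of S1
        kinds.append("anti")
    if writes1 & writes2:  # both statements write the same variable
        kinds.append("output")
    return kinds


# Read/write sets for the three examples, under the operand assumption above.
LOAD_R1_A = ({"A"}, {"R1"})
ADD_R2_R1 = ({"R1", "R2"}, {"R2"})
MOVE_R1_R3 = ({"R3"}, {"R1"})

if __name__ == "__main__":
    print(classify(LOAD_R1_A, ADD_R2_R1))   # example (1)
    print(classify(ADD_R2_R1, MOVE_R1_R3))  # example (2)
    print(classify(LOAD_R1_A, MOVE_R1_R3))  # example (3)
```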
b) Draw the data dependency graph for the following.

S1: Load R1, M(100)
S2: Move R2, R1
S3: Inc R1
S4: Add R2, R1
S5: Store M(100), R1
Sol: The data dependence graph is as shown below.
(Diagram: dependence graph with nodes S1, S2, S3, S4 and S5.)
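One way to derive the edges of such a graph is to walk the statements in program order while tracking the last writer of each location and the readers since that write. The sketch below does this for the five statements of the question. It rests on assumptions that are not stated in the source: the first operand is the destination, Add R2, R1 computes R2 <- R2 + R1, Inc R1 both reads and writes R1, and M(100) is treated as a single location named `M100`.

```python
from collections import defaultdict

# (label, reads, writes) for each statement, under the assumptions above.
PROGRAM = [
    ("S1", {"M100"}, {"R1"}),      # Load R1, M(100)
    ("S2", {"R1"}, {"R2"}),        # Move R2, R1
    ("S3", {"R1"}, {"R1"}),        # Inc R1
    ("S4", {"R1", "R2"}, {"R2"}),  # Add R2, R1
    ("S5", {"R1"}, {"M100"}),      # Store M(100), R1
]


def dependence_edges(program):
    """Return a set of (source, target, kind) dependence edges."""
    last_writer = {}                         # location -> most recent writer
    readers_since_write = defaultdict(list)  # location -> readers of the current value
    edges = set()
    for label, reads, writes in program:
        for loc in reads:
            if loc in last_writer:
                edges.add((last_writer[loc], label, "flow"))
        for loc in writes:
            for reader in readers_since_write[loc]:
                edges.add((reader, label, "anti"))
            if loc in last_writer:
                edges.add((last_writer[loc], label, "output"))
        # Record this statement's own accesses only after the checks above,
        # so a statement never depends on itself.
        for loc in reads:
            readers_since_write[loc].append(label)
        for loc in writes:
            last_writer[loc] = label
            readers_since_write[loc] = []
    return edges


if __name__ == "__main__":
    for edge in sorted(dependence_edges(PROGRAM)):
        print(edge)
```

Under these assumptions the sketch reports flow edges S1->S2, S1->S3, S2->S4, S3->S4 and S3->S5, anti edges S2->S3 and S1->S5, and output edges S1->S3 and S2->S4. This is the output of the sketch and may differ in detail from the hand-drawn graph if other operand conventions are intended.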
PART-3
Answer any two full questions.
6) Trace out the following program to detect the parallelism using Bernstein's conditions.
P1: C = D x E
P2: M= G + C
P3: A = B + C
P4: C = L + M
P5: F = G / E
Assume that each step requires one cycle to execute and two adders are available.
Compare between serial and parallel execution of the above program
(10 Marks)
Sol: Bernstein revealed a set of conditions based on which two processes can execute in
parallel.
P1, P2 - processes
I1, I2 - input sets
O1, O2 - output sets
P1 || P2 if and only if
\[ I_1 \cap O_2 = \emptyset, \qquad I_2 \cap O_1 = \emptyset, \qquad O_1 \cap O_2 = \emptyset \]
P1 || P2 || . . . . || Pk if and only if Pi || Pj for all \( i \neq j \).
(Diagrams: dataflow graphs of P1 to P5 using the multiplier, the adders +1, +2, +3 and the divider.)
Fig (a): Sequential execution in 5 steps
Fig (b): Parallel execution in 3 steps
The pairs that satisfy Bernstein's conditions are:
P1 || P5, P2 || P3, P2 || P5, P3 || P5, P4 || P5
Collectively, P2 || P3 || P5, because P2 || P3, P2 || P5 and P3 || P5.
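Bernstein's three conditions can be tested pair by pair from the input and output sets of P1 to P5. The following is a minimal sketch; the input and output sets are read off the five statements of the question, while the names `PROCESSES`, `can_run_in_parallel` and `parallel_pairs` are illustrative.

```python
from itertools import combinations

# process -> (input set, output set), read off the statements P1 to P5
PROCESSES = {
    "P1": ({"D", "E"}, {"C"}),  # C = D x E
    "P2": ({"G", "C"}, {"M"}),  # M = G + C
    "P3": ({"B", "C"}, {"A"}),  # A = B + C
    "P4": ({"L", "M"}, {"C"}),  # C = L + M
    "P5": ({"G", "E"}, {"F"}),  # F = G / E
}


def can_run_in_parallel(p, q):
    """True when all three of Bernstein's intersections are empty."""
    (in1, out1), (in2, out2) = p, q
    return not (in1 & out2) and not (in2 & out1) and not (out1 & out2)


def parallel_pairs(processes):
    """List every pair of processes that may execute in parallel."""
    return [
        (a, b)
        for a, b in combinations(processes, 2)
        if can_run_in_parallel(processes[a], processes[b])
    ]


if __name__ == "__main__":
    for a, b in parallel_pairs(PROCESSES):
        print(f"{a} || {b}")
```

Because the relation is not transitive, a group such as P2 || P3 || P5 has to be confirmed by checking every pair inside it, which is what the list of pairs makes possible.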
7)
Explain hardware and software parallelism with an example.
(10Marks)
Sol:
Hardware parallelism:
This refers to the parallelism defined by the machine architecture and hardware multiplicity.
One way to characterize the parallelism is by the number of instruction issues per
machine cycle.
If a processor issues k instructions per machine cycle, it is called a k-issue processor.
A conventional processor takes one or more machine cycles to issue a single instruction.
Such processors are called one-issue machines, with a single instruction pipeline in the processor.
A multiprocessor system built with n k-issue processors should be able to handle a
maximum of nk threads of instructions simultaneously.
Software parallelism:
It is defined by the control and data dependences of programs.
The degree of parallelism is revealed in the program profile or in the program flow graph.
Software parallelism depends on the algorithm, the programming style and compiler
optimization.
Parallelism in a program varies during the execution period.
Control parallelism:
It is a kind of software parallelism. It appears in the form of pipelining or multiple
functional units. Both pipelining and functional parallelism are handled by the hardware,
so the programmer need not take any special action to invoke them.
Data parallelism:
It offers the highest potential for concurrency.
It is practiced in both SIMD and MIMD modes on MPP systems.
Data parallel code is easier to write and to debug than control parallel code.
Synchronization in SIMD data parallelism is handled by the hardware.
Data parallelism exploits parallelism in proportion to the quantity of data involved.
Example:
(a) Assuming two multiplier units and two add/subtract units, calculate the average software
parallelism.
(b) Assuming two multiplier units, two add/subtract units and a 2-issue processor in which
one memory access (load/store) and one arithmetic operation can execute simultaneously,
calculate the average hardware parallelism.
(Diagram: loads L1 to L4 followed by the arithmetic operations that produce the results A and B.)
Fig (a): Software parallelism (8 operations in 3 cycles)
s/w parallelism = 8/3 = 2.67 instructions per cycle.
(Diagram: the same 8 operations scheduled cycle by cycle on the 2-issue processor.)
Fig (b): Hardware parallelism (8 operations in 7 cycles)
H/w parallelism = 8/7 = 1.14 instructions per cycle.
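Both figures use the same measure: the number of operations divided by the number of cycles needed to complete them. A minimal sketch of that ratio follows, with the operation and cycle counts taken from the two results above; the function name is an assumption.

```python
def average_parallelism(operations: int, cycles: int) -> float:
    """Average number of instructions completed per cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return operations / cycles


if __name__ == "__main__":
    software = average_parallelism(8, 3)  # 8 operations in 3 cycles
    hardware = average_parallelism(8, 7)  # 8 operations in 7 cycles
    print(f"software parallelism = {software:.2f} instructions/cycle")
    print(f"hardware parallelism = {hardware:.2f} instructions/cycle")
```

The gap between the two values is the mismatch between the parallelism available in the program and what the 2-issue hardware can actually exploit.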
8) Explain how grain packing can be done to compute the sum of the 4 elements in the
resulting product matrix C = A x B Where A and B are 2x2 Matrices. Assume grain size
for multiplication is 101 and the grain size for addition is 8.
(10Marks)
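Sol:
Grain size of each multiplication node = 101
Grain size of each addition node = 8
C = A x B, where
\[ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \quad B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}, \quad C = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} \]
\[ C_{11} = A_{11}B_{11} + A_{12}B_{21} \]
\[ C_{12} = A_{11}B_{12} + A_{12}B_{22} \]
\[ C_{21} = A_{21}B_{11} + A_{22}B_{21} \]
\[ C_{22} = A_{21}B_{12} + A_{22}B_{22} \]
\[ \text{SUM} = C_{11} + C_{12} + C_{21} + C_{22} \]
Fine grain graph:
(Diagram: 8 multiplication nodes and 7 addition nodes, ending in SUM.)
Coarse grain graph:
(Diagram: packed nodes U, V, W, X and Y, ending in SUM.)
Each of U, V, W and X packs the two multiplications and one addition of one element of C: 101 + 101 + 8 = 210.
Y packs the three additions of the final sum: 8 + 8 + 8 = 24.
Grain size of U = 210
Grain size of V = 210
Grain size of W = 210
Grain size of X = 210
Grain size of Y = 24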
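The packed grain sizes follow directly from the per-operation grain sizes given in the question. The sketch below assumes that each of U, V, W and X packs two multiplications and one addition, and that Y packs the three additions of the final sum, which is the packing that reproduces the figures 210 and 24; the names used are illustrative.

```python
MULTIPLY_GRAIN = 101  # grain size of one multiplication node
ADD_GRAIN = 8         # grain size of one addition node


def packed_grain_size(multiplies: int, adds: int) -> int:
    """Grain size of a coarse node that packs the given fine-grain nodes."""
    return multiplies * MULTIPLY_GRAIN + adds * ADD_GRAIN


# Assumed packing: one coarse node per element of C, plus one node for the sum.
COARSE_NODES = {
    "U": packed_grain_size(2, 1),
    "V": packed_grain_size(2, 1),
    "W": packed_grain_size(2, 1),
    "X": packed_grain_size(2, 1),
    "Y": packed_grain_size(0, 3),
}

if __name__ == "__main__":
    for name, size in COARSE_NODES.items():
        print(f"Grain size of {name} = {size}")
    # Packing regroups the work without changing the total amount of it.
    print("total work =", sum(COARSE_NODES.values()))
```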
****
