Department of Computer Science & Engineering
CSL718 Architecture of High Performance Systems
Major Test Solution
Date: 5.5.2005 Time: 0800 – 1000 Max. Marks: 40
1. A program containing the following loop is to be executed on a 4-issue VLIW processor
with 4 execution units, each capable of doing any instruction. Translate this loop into
assembly code without any predicated instructions or loop unrolling or speculation. Show
what improvements can be obtained by these three techniques. Take care of preserving
exception behaviour while doing speculation.
    for (i = 0; i < N; i++)
        if (A[i] < 0) A[i] = A[i] + b;
(10)
Solution:
Write assembly program and give.
[2 marks]
Create a schedule for 4 issue VLIW processor.
[2 marks]
Re-write the program with predicated instructions and reschedule.
[2 marks]
Do loop unrolling and reschedule.
[2 marks]
Write the program with speculation and reschedule.
[2 marks]
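The three transformations can be sanity-checked at the source level before any scheduling is attempted. The sketch below is only a model: Python stands in for the assembly, the function names are hypothetical, and the unroll factor of 4 is an assumption chosen to match the issue width. It shows that the if-converted body and the unrolled body compute the same result as the branching loop, and that unrolling exposes four independent load/compare/add chains per pass. Speculation and its exception handling are not modelled here.

```python
# Python stands in for the assembly here; all names are illustrative.

def loop_branch(A, b):
    """Baseline: a compare and a conditional branch per element."""
    A = list(A)
    for i in range(len(A)):
        if A[i] < 0:
            A[i] = A[i] + b
    return A


def loop_predicated(A, b):
    """If-converted body: no branch inside the iteration."""
    A = list(A)
    for i in range(len(A)):
        x = A[i]                 # load
        p = x < 0                # compare sets predicate p
        t = x + b                # add is issued regardless of p
        A[i] = t if p else x     # models a store that takes effect only when p is true
    return A


def loop_unrolled4(A, b):
    """Predicated body unrolled 4 times, plus a remainder loop."""
    A = list(A)
    n = len(A)
    main = n - n % 4
    for i in range(0, main, 4):
        x = A[i:i + 4]                       # four independent loads
        p = [v < 0 for v in x]               # four independent compares
        t = [v + b for v in x]               # four independent adds
        A[i:i + 4] = [tv if pv else xv for xv, pv, tv in zip(x, p, t)]
    for i in range(main, n):                 # leftover iterations when N is not a multiple of 4
        if A[i] < 0:
            A[i] = A[i] + b
    return A
```

In the unrolled form, each group of four like operations has no mutual dependences, so in principle it can fill one 4-wide instruction word.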
2. A multiprocessor system with 3 processors, each with its own cache, has a shared memory
connected through a shared bus. A simple 3 state invalidate protocol is followed for
maintaining cache coherence. Write procedures for acquiring and releasing a lock in a shared
memory multiprocessor system using instructions “load linked” and “store conditional”.
Each processor is running a process which simply performs the operation A = A + X
repeatedly, where A is a shared variable and X is a local variable. The lock needs to be
acquired before doing this operation each time. After doing this operation, the lock is
released. Assume “sequential consistency” model for the memory. Show a possible sequence
of operations (including cache coherence protocol steps) as seen over the shared bus,
covering a period in which each processor gets a chance to do this operation once.
Suggest some changes which will improve the performance.
(12)
Solution:
Let the three processors be labelled as P1, P2 and P3. The process being run by each
processor including procedures for acquiring and releasing a lock L is as follows.
acquire:   LL   r2, L              ;read lock using load linked
           if(r2 ≠ 0) →acquire     ;not available
           r2 ← 1
           SC   r2, L              ;try locking using store conditional
           if(r2 = 0) →acquire     ;already locked
operation: LD   r1, A              ;ordinary load
           ADD  r1, r1, X          ;assume X is in a register
           ST   r1, A              ;ordinary store
release:   r2 ← 0
           ST   r2, L              ;ordinary store
[2 marks for acquire and 1 mark for release procedure]
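The acquire and release procedures can be exercised with a small simulation. This is a sketch under stated assumptions, not a model of any real machine: the `Memory` class, the per-processor link table, and the round-robin scheduler in `run` are all hypothetical. A store conditional succeeds only if no other processor has written the lock word since the matching load linked, and it returns 0 on failure as in the listing above.

```python
class Memory:
    """Shared memory with one link per processor (a simplification)."""

    def __init__(self):
        self.words = {"L": 0, "A": 0}
        self.links = {}                      # processor id -> linked address

    def load(self, addr):
        return self.words[addr]

    def ll(self, pid, addr):
        self.links[pid] = addr               # remember the linked address
        return self.words[addr]

    def store(self, pid, addr, value):
        self.words[addr] = value
        # a write to addr breaks every other processor's link on it
        for p in [p for p, a in self.links.items() if a == addr and p != pid]:
            del self.links[p]

    def sc(self, pid, addr, value):
        if self.links.get(pid) != addr:
            return 0                         # link broken: store conditional fails
        self.store(pid, addr, value)
        del self.links[pid]
        return 1


def process(mem, pid, x, log):
    """One pass of acquire, A = A + X, release; yields after each memory operation."""
    while True:                              # acquire
        r2 = mem.ll(pid, "L"); log.append(f"LL{pid}"); yield
        if r2 != 0:
            continue                         # lock not available
        ok = mem.sc(pid, "L", 1); log.append(f"SC{pid}"); yield
        if ok:
            break                            # lock acquired
    r1 = mem.load("A"); log.append(f"LD{pid}"); yield
    mem.store(pid, "A", r1 + x); log.append(f"ST{pid}"); yield
    mem.store(pid, "L", 0); log.append(f"ST{pid}"); yield   # release


def run(xs):
    """Round-robin scheduler: one memory operation per processor per turn."""
    mem, log = Memory(), []
    procs = [process(mem, pid, x, log) for pid, x in enumerate(xs, start=1)]
    while procs:
        for p in list(procs):
            try:
                next(p)
            except StopIteration:
                procs.remove(p)
    return mem, log
```

Calling `run([5, 7, 11])` leaves `A` equal to the sum of the three local values and the lock released; the returned `log` lists the operations in the order they were issued.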
The sequence of memory operations performed by each processor can be represented as
follows, where the braces denote one or more repetitions.
{{LL} SC} LD ST ST
For example, if the outer loop is repeated 2 times and the inner loop is repeated 4 times on
the first occasion and 3 times on the second occasion, then the sequence will appear as
LL LL LL LL SC LL LL LL SC LD ST ST
Now let us assume that the order in which the three processors are able to acquire the lock is
P1 followed by P2 followed by P3. Since a sequential consistency model is followed for
the memory, all memory operations are atomic and their order is maintained as per the
program. Therefore, the three sequences of memory operations performed are as shown.
The store conditionals which fail (red in the original figure) are marked "fail" and those
which succeed (green in the original figure) are marked "ok".
P1        P2        P3
LL        LL        LL
SC ok     SC fail   SC fail
LD        LL        LL
ST        LL        LL
ST        LL        LL
          SC ok     SC fail
          LD        LL
          ST        LL
          ST        LL
                    SC ok
                    LD
                    ST
                    ST
Because of sequential consistency model, all the operations would appear as atomic and
interleaved on the bus in some manner. Atomicity means that all the steps involved in
one operation (these steps are shown later) are completed before the next operation is taken
up. A possible sequence is shown here with round robin interleaving, that is, one operation
of P1, then one operation of P2 and then one operation of P3 and so on. The operations are
shown with a digit suffixed to them to indicate the processor number. It is also indicated which
operations result in a miss (denoted by “m”) and which ones in a hit (denoted by “h”). The
misses here include the compulsory misses (denoted by “c”) and the misses resulting from
the invalidations (denoted by “i”). The bus order is obtained by reading the rows from top to
bottom and each row from left to right.
P1          P2          P3
LL1 m c     LL2 m c     LL3 m c
SC1 h       SC2 m i     SC3 m i
LD1 m c     LL2 h       LL3 h
ST1 h       LL2 h       LL3 h
ST1 h       LL2 m i     LL3 m i
            SC2 h       SC3 m i
            LD2 m i     LL3 h
            ST2 h       LL3 h
            ST2 h       LL3 m i
                        SC3 h
                        LD3 m i
                        ST3 h
                        ST3 h
[3 marks for the operation sequence]

Steps involved in various operations are as shown below. Acknowledgement messages are
there to ensure atomicity. It is assumed that dirty blocks do not get replaced by other blocks
and hence no write back transfers are required.
Operation                            Steps on the bus
read hit                             none
read miss (compulsory)               • read request • mem to requesting cache • ack
read miss (due to invalidation)      • read request • owner cache to mem • mem to requesting cache • ack
write hit                            • invalidate • ack
write miss (compulsory)              • write request • mem to requesting cache • ack
write miss (due to invalidation)     • write request • owner cache to mem • mem to requesting cache • ack
[4 marks for protocol steps]
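As a rough cost estimate, the protocol steps can be tallied over the 27-operation round-robin sequence given earlier. The sketch below is an illustration, not part of the marking scheme: it assumes that LL and LD count as reads, that SC and ST count as writes (including a store conditional that fails), and that every write hit costs an invalidate plus an acknowledgement as the step list states. The names `STEPS`, `KIND`, and `bus_messages` are hypothetical.

```python
# Bus steps per operation type, copied from the step list above.
STEPS = {
    ("read", "h"): [],
    ("read", "m c"): ["read request", "mem to requesting cache", "ack"],
    ("read", "m i"): ["read request", "owner cache to mem",
                      "mem to requesting cache", "ack"],
    ("write", "h"): ["invalidate", "ack"],
    ("write", "m c"): ["write request", "mem to requesting cache", "ack"],
    ("write", "m i"): ["write request", "owner cache to mem",
                       "mem to requesting cache", "ack"],
}

# Assumption: LL/LD behave as reads and SC/ST as writes on the bus.
KIND = {"LL": "read", "LD": "read", "SC": "write", "ST": "write"}

# The round-robin sequence from the solution, in bus order.
SEQUENCE = [entry.strip() for entry in """
    LL1 m c, LL2 m c, LL3 m c, SC1 h, SC2 m i, SC3 m i, LD1 m c, LL2 h, LL3 h,
    ST1 h, LL2 h, LL3 h, ST1 h, LL2 m i, LL3 m i, SC2 h, SC3 m i, LD2 m i,
    LL3 h, ST2 h, LL3 h, ST2 h, LL3 m i, SC3 h, LD3 m i, ST3 h, ST3 h
""".split(",")]


def bus_messages(sequence):
    """Total number of bus steps for a list of entries such as 'LL1 m c'."""
    total = 0
    for entry in sequence:
        op, _, outcome = entry.partition(" ")    # 'LL1', 'm c'
        total += len(STEPS[(KIND[op[:2]], outcome)])
    return total


def misses(sequence):
    """Number of entries marked as a miss."""
    return sum(1 for entry in sequence if " m" in entry)
```

Under these assumptions the spinning LL operations that hit in the cache add no bus traffic; the cost comes from the misses and from the invalidations triggered by write hits.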
Performance can be improved by the following changes.
• A better coherence protocol such as Berkeley or Illinois protocol may be used.
• A more relaxed memory model, for example, TSO or Processor Consistency, may be
used.
[2 marks]
3. Consider a two stage dynamic network shown in the figure with each stage consisting of k
cross-bar switches of size k × k. Is it a blocking network or non-blocking network?
Illustrate your answer.
Suppose each input to the 1st stage has same arrival rate of messages with Poisson
distribution. There are buffers to queue up the requests at the inputs of the 1st stage but no
buffers elsewhere. Derive an expression for the throughput of this network, assuming that all
messages are of same size.
[Figure: two stage network, with switches numbered 1, 2, ..., k in the 1st stage and in the 2nd stage]
(12)
Solution:
The network can connect any input to any output individually, but when multiple messages need
to be routed, blocking may occur. Let us use the notation <i, j> to denote jth port (at input or
output side, as the case may be) of the ith switch (in the 1st stage or the 2nd stage, as the case may
be). The route from <i1, j1> at the primary input to <i2, j2> at the primary output, has to go
through <i1, i2> at the 1st stage output and <i2, i1> at the 2nd stage input. This requires i1th
switch in the 1st stage to switch from j1 to i2 and i2th switch in the 2nd stage to switch from i1 to
j2. This will conflict with all messages requiring routing from <i1, x> to <i2, y> or from <x, y>
to <i2, j2> for any x and y.
[4 marks for this part]
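The conflict condition can be written as a small predicate. This is a minimal sketch with a hypothetical function name, assuming (as the routing description implies) that there is exactly one link between each 1st stage switch and each 2nd stage switch. A route is given as a pair of <switch, port> tuples.

```python
def conflicts(route_a, route_b):
    """True if two routes cannot be set up at the same time.

    Each route is ((i1, j1), (i2, j2)): input port j1 of 1st stage switch i1
    to output port j2 of 2nd stage switch i2.
    """
    (a_i1, _a_j1), (a_i2, a_j2) = route_a
    (b_i1, _b_j1), (b_i2, b_j2) = route_b
    same_link = (a_i1, a_i2) == (b_i1, b_i2)        # both need the single link i1 -> i2
    same_output = (a_i2, a_j2) == (b_i2, b_j2)      # both need the same primary output
    return same_link or same_output
```

For example, `conflicts(((1, 1), (2, 1)), ((1, 2), (2, 2)))` is true even though the two messages use different inputs and different outputs, which is what makes the network blocking.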
Now let r be the probability of arrival of a message at any input of the 1st stage in one cycle,
where one cycle corresponds to the service time of a message (assumed to be a constant because
of uniform message size).
Probability of i simultaneous messages arriving at the inputs of a switch in the 1st stage
\[ q(i) = \binom{k}{i}\, r^{i} (1 - r)^{k-i} \]
Expected no. of requests accepted out of these i requests
\[ E(i) = \left[ 1 - \left( \frac{k-1}{k} \right)^{i} \right] k \]
Throughput at the input of the 1st stage = \( k^{2} r \)
Expected throughput at the output of the 1st stage
\[ k \sum_{i=0}^{k} E(i)\, q(i) = k^{2} \left[ 1 - \left( 1 - \frac{r}{k} \right)^{k} \right] \]
[3 marks for these basic expressions]
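The closed form for the 1st stage follows from the binomial theorem. The derivation below is a worked sketch that reads E(i) as k[1 − ((k − 1)/k)^i], which is the reading consistent with the k² prefactor used in the rest of the solution.

```latex
\begin{aligned}
\sum_{i=0}^{k} q(i)\left[1-\left(\frac{k-1}{k}\right)^{i}\right]
  &= 1-\sum_{i=0}^{k}\binom{k}{i}\left(r\,\frac{k-1}{k}\right)^{i}(1-r)^{k-i} \\
  &= 1-\left(1-r+r\,\frac{k-1}{k}\right)^{k} \\
  &= 1-\left(1-\frac{r}{k}\right)^{k}
\end{aligned}
```

Multiplying by k for E(i) and by k again for the k switches of the stage gives the k² factor.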
Let us use s to denote
\[ s = 1 - \left( 1 - \frac{r}{k} \right)^{k} \]
Throughput at the output of the 1st stage = \( k^{2} s \)
= Throughput at the input of the second stage.
By similar arguments, throughput at the output of the 2nd stage
\[ = k^{2} \left[ 1 - \left( 1 - \frac{s}{k} \right)^{k} \right] \]
[3 marks for extending the analysis to 2 stages]
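The two-stage expression is easy to evaluate numerically. The sketch below uses hypothetical function names and ignores resubmissions; it reads E(i) as k[1 − ((k − 1)/k)^i] and also computes the 1st stage throughput directly from the sum, so that the closed form can be compared against it.

```python
from math import comb


def arrivals_pmf(i, r, k):
    """q(i): probability that i of the k inputs of a switch carry a message."""
    return comb(k, i) * r**i * (1 - r) ** (k - i)


def accepted(i, k):
    """E(i): expected number of distinct outputs requested by i messages."""
    return k * (1 - ((k - 1) / k) ** i)


def stage_fraction(rate, k):
    """Per-output rate after one stage: 1 - (1 - rate/k)^k."""
    return 1 - (1 - rate / k) ** k


def stage1_throughput_by_sum(r, k):
    """k switches, each contributing sum over i of E(i) q(i)."""
    return k * sum(accepted(i, k) * arrivals_pmf(i, r, k) for i in range(k + 1))


def two_stage_throughput(r, k):
    """Throughput at the output of the 2nd stage, without resubmissions."""
    s = stage_fraction(r, k)
    return k * k * stage_fraction(s, k)
```

For instance, with k = 2 and r = 1 the offered load is 4 messages per cycle, s = 0.75, and `two_stage_throughput(1.0, 2)` evaluates to 2.4375.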
This does not include the effects of resubmission (resulting from conflicts and rejections) and
the queuing. The effect of resubmissions is to increase the throughput at the input of the
network. Queuing affects the delay but not the throughput.
Let us say, r effectively increases to r’, where
\[ r' = \frac{r}{r + P_A (1 - r)} \]
Then s increases to s’, where
\[ s' = 1 - \left( 1 - \frac{r'}{k} \right)^{k} \]
and the overall throughput increases to
\[ k^{2} \left[ 1 - \left( 1 - \frac{s'}{k} \right)^{k} \right] \]
Here
\[ P_A = \frac{1}{r} \left[ 1 - \left( 1 - \frac{s'}{k} \right)^{k} \right] \]
The queuing buffers will allow this throughput to be maintained. In absence of the buffers, the
message sources would have slowed down.
[2 marks for accounting for resubmissions]
4. Show a systolic array to multiply two band matrices a and b, where each matrix has a band
of width 3 (aij = 0 and bij = 0 for i < j-1 or i > j+1). The dimension of each matrix is n × n.
Find the number of steps required for computation, starting from the time when the first
element of each matrix enters the systolic array.
(6)
Solution:
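The band matrices here have only 3 non-zero diagonals as shown.
\[
[C] =
\begin{bmatrix}
A_{11} & A_{12} & 0 & 0 & 0 & 0 \\
A_{21} & A_{22} & A_{23} & 0 & 0 & 0 \\
0 & A_{32} & A_{33} & A_{34} & 0 & 0 \\
0 & 0 & A_{43} & A_{44} & A_{45} & 0 \\
0 & 0 & 0 & A_{54} & A_{55} & A_{56} \\
0 & 0 & 0 & 0 & A_{65} & A_{66}
\end{bmatrix}
\bullet
\begin{bmatrix}
B_{11} & B_{12} & 0 & 0 & 0 & 0 \\
B_{21} & B_{22} & B_{23} & 0 & 0 & 0 \\
0 & B_{32} & B_{33} & B_{34} & 0 & 0 \\
0 & 0 & B_{43} & B_{44} & B_{45} & 0 \\
0 & 0 & 0 & B_{54} & B_{55} & B_{56} \\
0 & 0 & 0 & 0 & B_{65} & B_{66}
\end{bmatrix}
\]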
Therefore, the systolic array to compute the product will have only 3 × 3 = 9 processing
elements as shown.
[Figure: systolic array of 9 processing elements, with the elements of A and B lined up along their diagonals at the array inputs]
The position of various matrix elements at T = 0 when the first elements of each matrix enter the
systolic array is as shown next.
[Figure: positions of the elements of A and B at T = 0]
Position at T = 1, T = 2 and T = 3 is shown next.
[Figure: positions of the elements of A and B at T = 1, T = 2 and T = 3; C11 appears at T = 3]
The first element of the result C11 comes out at T = 3. The next main diagonal element C22 will
come out at T = 6. Clearly, the last element Cnn comes out at time 3n.
[2 marks for the systolic array, 2 marks for showing/describing the operations and 2
marks for the number of steps.]
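One way to read the 3 × 3 = 9 count is that each processing element pairs one non-zero diagonal of A with one non-zero diagonal of B. The sketch below is an interpretation, not something the solution states: it forms the product using only those 9 diagonal pairs and checks nothing about the cycle-by-cycle data movement. The function names are hypothetical, and `diagonal_output_step` simply restates the timing given above (C11 at T = 3, C22 at T = 6, Cnn at 3n).

```python
def band_product(a, b):
    """Product of two n x n tridiagonal matrices using only the 3 x 3 diagonal pairs."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for da in (-1, 0, 1):              # diagonal of a: entries a[i][i + da]
        for db in (-1, 0, 1):          # diagonal of b: entries b[m][m + db]
            for i in range(n):
                m = i + da
                j = m + db
                if 0 <= m < n and 0 <= j < n:
                    c[i][j] += a[i][m] * b[m][j]   # one multiply-accumulate
    return c


def diagonal_output_step(i):
    """Step at which the main diagonal element C_ii leaves the array (1-based i)."""
    return 3 * i
```

The result has at most 3 + 3 − 1 = 5 non-zero diagonals, since each pair (da, db) contributes only to diagonal da + db of the product.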
