1. Introduction
Loop Unrolling
Although you can apply this algorithm most of the time, in some cases you will need to modify it slightly. For example, the "Min" program contains a branch inside the loop, so replicating the loop body requires creating new labels. Programs such as "Inner" may also require some additional register renaming.
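For "Inner", the renaming mentioned above can be pictured in C. When the body is replicated, each copy gets its own accumulator, just as each copy in assembly would use its own floating-point registers. This is a minimal sketch that assumes an unroll factor of 4 and a hypothetical function name, inner_unrolled4:

```c
#include <stddef.h>

/* Kernel 3 (inner product) unrolled by a factor of 4.
 * Each replicated copy of the body accumulates into its own partial sum
 * (q0..q3), the C analogue of renaming registers so that the copies do
 * not serialize on a single accumulator. */
double inner_unrolled4(const double *z, const double *x, size_t n)
{
    double q0 = 0.0, q1 = 0.0, q2 = 0.0, q3 = 0.0;
    size_t k = 0;

    /* Unrolled loop: four copies of the original body per pass. */
    for (; k + 4 <= n; k += 4) {
        q0 += z[k]     * x[k];
        q1 += z[k + 1] * x[k + 1];
        q2 += z[k + 2] * x[k + 2];
        q3 += z[k + 3] * x[k + 3];
    }

    /* Remainder loop: the last n % 4 elements. */
    for (; k < n; k++) {
        q0 += z[k] * x[k];
    }

    /* Combine the partial sums. */
    return (q0 + q1) + (q2 + q3);
}
```

Because the partial sums are added in a different order, the floating-point result may differ from the sequential loop in the last bits. With integer-valued test data it is identical.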
Software Pipelining
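Figure: order of the stages in one pass of the loop body (original loop: LOADS, EXECS, STORES; software-pipelined loop: STORES, EXEC, LOADS).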
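The stage order shown above reflects the usual idea of software pipelining. The original body performs the loads, the computation and the stores of a single iteration. The pipelined body instead issues the stores of an older iteration, the computation of the next one and the loads of the newest one. The sketch below shows that structure in C for Kernel 1 of the appendix. It assumes a three-stage split and uses a hypothetical function name, hydro_swp. A compiler is free to reschedule the code, so it only illustrates the shape of the prologue, steady state and epilogue:

```c
#include <stddef.h>

/* Software-pipelined sketch of Kernel 1 (hydro fragment):
 *   x[k] = q + y[k]*(r*z[k+10] + t*z[k+11])
 * Each original iteration is split into LOAD, EXEC and STORE stages.
 * One steady-state pass performs STORE(k-2), EXEC(k-1) and LOAD(k),
 * so the three stages belong to three different iterations.
 * z must hold at least n + 11 elements. */
void hydro_swp(double *x, const double *y, const double *z,
               double q, double r, double t, size_t n)
{
    double ya, za, zb;  /* operands loaded for the newest iteration */
    double res;         /* result computed, waiting to be stored */
    size_t k;

    if (n < 2) {        /* too short to pipeline: plain loop */
        for (k = 0; k < n; k++)
            x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);
        return;
    }

    /* Prologue: LOAD(0), EXEC(0), LOAD(1). */
    ya = y[0]; za = z[10]; zb = z[11];
    res = q + ya * (r * za + t * zb);
    ya = y[1]; za = z[11]; zb = z[12];

    /* Steady state: STORE(k-2), EXEC(k-1), LOAD(k). */
    for (k = 2; k < n; k++) {
        x[k - 2] = res;
        res = q + ya * (r * za + t * zb);
        ya = y[k]; za = z[k + 10]; zb = z[k + 11];
    }

    /* Epilogue: STORE(n-2), EXEC(n-1), STORE(n-1). */
    x[n - 2] = res;
    res = q + ya * (r * za + t * zb);
    x[n - 1] = res;
}
```

In assembly, the benefit is that the computation in one pass uses operands loaded in the previous pass, so it does not wait on loads issued in the same pass.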
2. Results
1) Number of instructions
2) Number of cycles
3) IPC of the whole program
4) Number of stalls (bubbles):
a. RAW
b. WAW
c. Structural
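Items 1 to 3 are related by IPC = instructions / cycles (equivalently, CPI = cycles / instructions). A small C helper can keep the derived numbers consistent across the versions of each kernel. This sketch assumes the raw counts are copied by hand from the simulator's statistics. The struct, its fields and the function names are hypothetical. It also computes the speedup of an optimized version over the original, a comparison the report will probably need:

```c
/* Hypothetical container for the raw counts read from the simulator
 * after a run; the field names are illustrative only. */
typedef struct {
    unsigned long instructions;  /* instructions executed */
    unsigned long cycles;        /* total clock cycles */
    unsigned long raw_stalls;    /* stalls caused by RAW hazards */
    unsigned long waw_stalls;    /* stalls caused by WAW hazards */
    unsigned long struct_stalls; /* structural stalls */
} run_stats;

/* IPC = instructions / cycles (0 if no cycles were recorded). */
double ipc(const run_stats *s)
{
    return s->cycles ? (double)s->instructions / (double)s->cycles : 0.0;
}

/* Total stall cycles over the three hazard classes listed above. */
unsigned long total_stalls(const run_stats *s)
{
    return s->raw_stalls + s->waw_stalls + s->struct_stalls;
}

/* Speedup of an optimized version (e.g. unrolled) over the original,
 * as the ratio of cycle counts for the same kernel and the same n. */
double speedup(const run_stats *original, const run_stats *optimized)
{
    return optimized->cycles
        ? (double)original->cycles / (double)optimized->cycles
        : 0.0;
}
```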
Before the kernel itself, you will need to write some code to prepare the data for your simulation. Submit a report with your data and conclusions.
3. Appendix
The original version of the Livermore Loops (officially known as the Livermore Fortran Kernels) was written in Fortran by Frank McMahon of Lawrence Livermore National Laboratory. The Loops are used to benchmark the floating-point performance of a computer. The benchmark was created by extracting the DO loops that consumed most of the run time from several physics simulation programs at LLNL. These loops are all one-dimensional, and the data they use is small enough to fit in the cache memory of most current (1998) computers. The results of the Loops have been a good predictor of performance on complex physics simulation programs, provided that those programs use the cache efficiently. Many 2D and 3D simulation programs at LLNL have data sets much larger than cache memory but are written to achieve high cache utilization, and the Loops are a good predictor of their performance as well. The Loops use a single processor, so other benchmarks are required to measure the performance of parallel computers.
/* Kernel 3 -- inner product */
q = 0.0;
for ( k=0 ; k<n ; k++ ) {
q += z[k]*x[k];
}
;;inner
LOOP:
l.d F1, 0(R2)
l.d F0, 0(R3)
mul.d F1, F1, F0
l.d F0, 0(R4)
add.d F0, F0, F1
s.d F0, 0(R4)
daddi R3, R3, 8
daddi R31, R31, 1
ld R1, 0(R5)
slt R1, R31, R1
bne R1, R0, LOOP
;; filled delay slot:
daddi R2, R2, 8
/*
*******************************************************
Kernel 5 -- tri-diagonal elimination, below diagonal
*********************************************************
* DO 5 i = 2,n
* 5 X(i)= Z(i)*(Y(i) - X(i-1))
*/
;;tridag
LOOP:
l.d F2, 0(R31)
l.d F0, 0(R2)
sub.d F2, F2, F0
l.d F0, 0(R3)
mul.d F0, F0, F2
s.d F0, 0(R4)
daddi R4, R4, 8
daddi R3, R3, 8
daddi R2, R2, 8
daddi R5, R5, 1
ld R1, 0(R7)
slt R1, R5, R1
bne R1, R0, LOOP
;; filled delay slot:
daddi R31,R31,8
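The C fragment for Kernel 5 is not given above. The sketch below follows the pattern of the other kernels. It assumes 0-based arrays, so the Fortran loop from 2 to n becomes i = 1 to n-1. The function name tridag only mirrors the ;;tridag label. The comment marks the loop-carried RAW dependence on x[i-1], which unrolling cannot remove. In the assembly above, that dependence appears as a store to 0(R4) followed, in the next iteration, by a load of the same address through 0(R2):

```c
#include <stddef.h>

/* Kernel 5 -- tri-diagonal elimination, below diagonal.
 * Fortran: DO 5 i = 2,n;  X(i) = Z(i)*(Y(i) - X(i-1))
 * With 0-based C arrays the loop starts at i = 1; x[0] is an input. */
void tridag(double *x, const double *y, const double *z, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        /* RAW dependence: x[i] needs x[i-1] from the previous iteration. */
        x[i] = z[i] * (y[i] - x[i - 1]);
    }
}
```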
/*
*******************************************************
* Kernel 11 -- first sum
*********************************************************
* X(1)= Y(1)
* DO 11 k = 2,n
* 11 X(k)= X(k-1) + Y(k)
*/
x[0] = y[0];
for ( k=1 ; k<n ; k++ ) {
x[k] = x[k-1] + y[k];
}
;;sum
L17:
l.d F0, 0(R2)
l.d F1, 0(R3)
add.d F0, F0, F1
s.d F0, 0(R4)
daddi R4, R4, 8
daddi R3, R3, 8
daddi R31, R31, 1
ld R1, 0(R6)
slt R1, R31, R1
bne R1, R0, L17
;; filled delay slot:
daddi R2, R2, 8
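In Kernel 11, every x[k] depends on x[k-1], so unrolling exposes little parallelism in the additions. It can still cut loop overhead (counter updates and branches). The sketch below unrolls the loop by 2 and keeps the running value in a variable instead of reloading x[k-1] from memory, as the assembly above does with l.d F0, 0(R2). The unroll factor and the function name first_sum_unrolled2 are assumptions:

```c
#include <stddef.h>

/* Kernel 11 (first sum) unrolled by 2.
 * The running value x[k-1] is kept in s, so each copy of the body loads
 * only y[k]; the RAW chain through the additions remains, because every
 * x[k] still depends on x[k-1]. */
void first_sum_unrolled2(double *x, const double *y, size_t n)
{
    if (n == 0) return;
    double s = y[0];
    x[0] = s;
    size_t k = 1;
    for (; k + 2 <= n; k += 2) {
        s += y[k];     x[k] = s;      /* copy 1 */
        s += y[k + 1]; x[k + 1] = s;  /* copy 2 */
    }
    if (k < n) {                      /* remainder when n - 1 is odd */
        s += y[k];     x[k] = s;
    }
}
```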
/*
*********************************************************
* Kernel 24 -- find location of first minimum in array
*********************************************************
* X( n/2)= -1.0E+10
* m= 1
* DO 24 k= 2,n
* IF( X(k).LT.X(m)) m= k
* 24 CONTINUE
*/
x[n/2] = -1.0e+10;
m = 0;
for ( k=1 ; k<n ; k++ ) {
if ( x[k] < x[m] ) m = k;
}
;;min
LOOP:
ld R1, 0(R4)
dsll R1, R1, 3 ; m*8: byte offset of x[m] (doubles are 8 bytes)
dadd R1, R1, R5
l.d F1, 0(R31)
l.d F0, 0(R1)
c.lt.d F1, F0
bc1f GREAT
nop ; not filled.
sd R2, 0(R4)
GREAT:
daddi R2, R2, 1
slt R1, R2, R3
bne R1, R0, LOOP
;; filled delay slot (advances R31 to the next x[k]):
daddi R31, R31, 8
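As noted in the introduction, the branch in Kernel 24 means each replicated copy of the body needs its own compare, branch and target label. The C sketch below unrolls the search by 2; the unroll factor and the function name first_min_unrolled2 are assumptions. Each if corresponds to one c.lt.d/bc1f pair in assembly, which would need distinct skip labels such as GREAT1 and GREAT2 (hypothetical names). The second copy must compare against the m possibly updated by the first, so the copies stay in program order:

```c
#include <stddef.h>

/* Kernel 24 (location of first minimum) unrolled by 2.
 * Each copy keeps its own compare-and-branch; the strict < preserves
 * the "first" minimum, as in the original loop. */
size_t first_min_unrolled2(const double *x, size_t n)
{
    size_t m = 0;
    size_t k = 1;
    for (; k + 2 <= n; k += 2) {
        if (x[k] < x[m]) m = k;          /* copy 1 (label GREAT1) */
        if (x[k + 1] < x[m]) m = k + 1;  /* copy 2 (label GREAT2) */
    }
    if (k < n && x[k] < x[m]) m = k;     /* remainder iteration */
    return m;
}
```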
/*
*********************************************************
*
* Kernel 1 -- hydro fragment
*********************************************************
*
* DO 1 L = 1,Loop
* DO 1 k = 1,n
* 1 X(k)= Q + Y(k)*(R*ZX(k+10) + T*ZX(k+11))
*/
cvt.d.l F5,F0
daddiu $sp,$sp,-24048
li R5,20 # 0x14
li R4,1 # 0x1
mov.d F4,F5
mov.d F3,F5
move R2,$sp
$L15:
li R3,99 # 0x63
.align 3
L8:
l.d F8,16120(R2) #F8 = z[k+11]
daddiu R3,R3,-1
l.d F7,16112(R2) #F7 = z[k+10]
l.d F6,8016(R2) #F6 = y[k]
mul.d F0,F3,F8 #F0 = t*z[k+11]
madd.d F1,F0,F4,F7 #F1 = r*z[k+10] + t*z[k+11]
madd.d F2,F5,F6,F1 #F2 = q + y[k]*F1
s.d F2,0(R2) #x[k] = F2
bgez R3,L8
daddiu R2,R2,8
daddiu R4,R4,1
slt R2,R5,R4
beq R2,R0,L15
move R2,$sp
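The C loop for Kernel 1 is not given above. The direct translation below can serve to check simulated results. The function name hydro and its parameters are illustrative, and ZX is written as z. In the assembly, the inner count-down loop (li R3,99 with bgez) appears to run 100 times, and the outer loop, with its limit of 20, appears to run 20 times. That would correspond to n = 100 and Loop = 20. Also, q, r and t are all copied from F5, so in that build they seem to hold the same value:

```c
#include <stddef.h>

/* Kernel 1 -- hydro fragment, repeated `loops` times as in
 *   DO 1 L = 1,Loop / DO 1 k = 1,n
 * z must hold at least n + 11 elements (the body reads z[k+11]). */
void hydro(double *x, const double *y, const double *z,
           double q, double r, double t, size_t n, int loops)
{
    for (int l = 0; l < loops; l++) {
        for (size_t k = 0; k < n; k++) {
            x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);
        }
    }
}
```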