Pipelining Parallel Processing

Digital VLSI Architectures: Pipelining & Parallel Processing
Mahdi Shabany
Sharif University of Technology

Architectural Techniques: Critical Path

The critical path in any design is the longest path between:
1. Any two internal latches/flip-flops
2. An input pad and an internal latch
3. An internal latch and an output pad
4. An input pad and an output pad

Use FFs right after/before input/output pads to avoid the last three cases (off-chip and packaging delay).

The maximum delay between any two sequential elements in a design determines the maximum clock speed.

[Figure: input pad, combinational logic between flip-flops, and output pad, with the numbered paths marked]

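
The advice to place flip-flops right after input pads and right before output pads can be sketched as follows. This is a minimal illustration, not code from the slides: the module name, the port names, the 8-bit width, and the placeholder logic are all assumptions.

```verilog
// Sketch: register the pad-facing signals so that every timing path
// that contains logic starts and ends at an on-chip flip-flop.
// Module name, port names, and the 8-bit width are assumptions.
module registered_io (
    input  wire       clk,
    input  wire [7:0] pad_in,   // driven from an input pad
    output reg  [7:0] pad_out   // drives an output pad
);
    reg  [7:0] in_q;            // FF right after the input pad
    wire [7:0] result;

    // Placeholder combinational logic between the two registers
    assign result = in_q + 8'd1;

    always @(posedge clk) begin
        in_q    <= pad_in;      // pad-to-FF path contains no logic
        pad_out <= result;      // FF-to-pad path contains no logic
    end
endmodule
```

With both registers in place, the only path that contains logic is the FF-to-FF path through `result`, which is the case that does not include off-chip or packaging delay.
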
Digital Design Metrics

Three primary physical characteristics of a digital design:
- Speed
    - Throughput
    - Latency
    - Timing
- Area
- Power

Digital Design Metrics: Speed

- Throughput: the amount of data that is processed per clock cycle (bits per clock, or bits per second at a given clock rate)
- Latency: the time between data input and processed data output (clock cycles)
- Timing: the logic delays between sequential elements (clock period)

When a design does not meet timing, the delay of the critical path is greater than the target clock period.

Maximum Clock Frequency: Critical Path

Maximum clock frequency:

$$F_{max} = \frac{1}{T_{clkq} + T_{logic} + T_{setup} + T_{routing} - T_{skew}}$$

- Tclkq: time from clock arrival until data arrives at Q
- Tlogic: propagation delay through logic between flip-flops
- Trouting: routing delay between flip-flops
- Tsetup: minimum time data must arrive at D before the next rising edge of clock
- Tskew: propagation delay of clock between the launch flip-flop and the capture flip-flop

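
A short worked example may help fix the roles of the five terms. The delay values below are hypothetical, chosen only for illustration, and the calculation assumes that the skew term is subtracted (the capture clock arriving later than the launch clock relaxes the constraint).

```latex
\[
\begin{aligned}
&T_{clkq} = 0.5\,\text{ns},\quad T_{logic} = 3.0\,\text{ns},\quad
 T_{setup} = 0.3\,\text{ns},\quad T_{routing} = 1.0\,\text{ns},\quad
 T_{skew} = 0.2\,\text{ns} \\
&T_{min} = T_{clkq} + T_{logic} + T_{setup} + T_{routing} - T_{skew}
         = 0.5 + 3.0 + 0.3 + 1.0 - 0.2 = 4.6\,\text{ns} \\
&F_{max} = \frac{1}{T_{min}} = \frac{1}{4.6\,\text{ns}} \approx 217\,\text{MHz}
\end{aligned}
\]
```

In this example the logic delay is the largest term, which is why the techniques that follow concentrate on shortening the logic between flip-flops.
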
Pipelining (to Improve Throughput)

Pipelining:
- Comes from the idea of a water pipe: keep sending water without waiting for the water already in the pipe to come out
- Used to reduce the critical path of the design

Advantages:
- Reduction in the critical path
- Higher throughput (number of computed results in a given time)
- Increases the clock speed (or sampling speed)
- Reduces the power consumption at the same speed

Architectural Techniques: Pipelining

Pipelining:
- Very similar to the assembly line in the auto industry
- The beauty of a pipelined design is that new data can begin processing before the prior data has finished, much like cars are processed on an assembly line.

Original system: critical path = τ1, max operating freq: f1 = 1/τ1
[Figure: combinational logic between two registers clocked by Clk; the output appears 2 cycles later]

Pipelined version: critical path = τ2, max operating freq: f2 = 1/τ2
[Figure: the combinational logic split into two blocks by a pipelining register; the output appears 3 cycles later]

- Smaller critical path → higher throughput (τ2 < τ1 → f2 > f1)
- Longer latency

Architectural Techniques: Pipeline Depth

Pipeline depth: 0 (no pipeline)
Critical path: 3 adders
Latency: 0

    wire w1, w2;
    assign w1 = X + a;
    assign w2 = w1 + b;
    assign Y  = w2 + c;

    time:   t1    t2    t3
    X:      X(1)  X(2)  X(3)
    Y:      Y(1)  Y(2)  Y(3)

Pipeline depth: 1 (one pipeline register added)
Critical path: 2 adders
Latency: 1

[Figure: adder chain with inputs X(n), a(n), b(n), c(n), internal nodes w1 and w2, and output Y(n)]

    wire w1;
    reg  w2;
    assign w1 = X + a;
    assign Y  = w2 + c;
    always @(posedge Clk)
        w2 <= w1 + b;

    time:   t1    t2    t3    t4
    X:      X(1)  X(2)  X(3)
    Y:            Y(1)  Y(2)  Y(3)

Pipeline depth: 2 (two pipeline registers added)
Critical path: 1 adder
Latency: 2

    reg w1, w2;
    assign Y = w2 + c;
    always @(posedge Clk)
    begin
        w1 <= X + a;
        w2 <= w1 + b;
    end

    time:   t1    t2    t3    t4    t5
    X:      X(1)  X(2)  X(3)
    Y:                  Y(1)  Y(2)  Y(3)

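
The slide fragments leave out port declarations and vector widths. A complete module for the depth-2 case might look like the sketch below. The module name, the 8-bit operands, and the 10-bit result are assumptions. The sketch also assumes that a, b, and c arrive in the same cycle as X, so it delays b and c to keep them aligned with their sample; the slide code does not do this.

```verilog
// Sketch of the depth-2 adder chain as a complete module.
// Assumptions (not in the slides): 8-bit operands, a 10-bit result,
// and a, b, c arriving in the same cycle as X, so b and c are delayed
// to stay aligned with the sample they belong to.
module add3_pipe2 (
    input  wire       Clk,
    input  wire [7:0] X, a, b, c,
    output wire [9:0] Y
);
    reg [8:0] w1;          // stage 1 result: X + a
    reg [9:0] w2;          // stage 2 result: X + a + b
    reg [7:0] b_d1;        // b delayed by one cycle
    reg [7:0] c_d1, c_d2;  // c delayed by one and by two cycles

    always @(posedge Clk) begin
        w1   <= X + a;     // one adder between registers
        b_d1 <= b;
        c_d1 <= c;
        w2   <= w1 + b_d1; // one adder between registers
        c_d2 <= c_d1;
    end

    // Final adder, corresponding to `assign Y = w2 + c;` in the slide
    assign Y = w2 + c_d2;
endmodule
```

Two clock edges after a set of inputs is applied, Y holds the sum of those four inputs, which matches the latency of 2 stated above. If a, b, and c are constants, the extra delay registers are unnecessary.
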
Clock period and throughput as a function of pipeline depth:

Clock period: $T_n = \frac{T_{Clk}}{n}$
Throughput: $Th_n = \frac{1}{T_n}$

Adding register layers improves timing by dividing the critical path into two paths of smaller delay.

[Figure: clock period and throughput plotted against pipeline depth]

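
Applied to the three-adder chain above, the effect of each added register can be worked out directly. This is a sketch under stated assumptions: every adder has the same delay, written here as T_A (a symbol introduced for this example), and the register overhead (clock-to-Q and setup time) is ignored.

```latex
\[
\begin{aligned}
\text{depth } 0:&\quad T_{clk} \ge 3T_A \\
\text{depth } 1:&\quad T_{clk} \ge \max(2T_A,\; T_A) = 2T_A \\
\text{depth } 2:&\quad T_{clk} \ge \max(T_A,\; T_A,\; T_A) = T_A
\end{aligned}
\]
```

Relative to depth 0, the achievable throughput therefore grows by a factor of 1.5 at depth 1 and by a factor of 3 at depth 2. The depth-1 case shows that an unbalanced split gives less than the ideal gain, since the longest remaining stage sets the clock period.
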
General rule:
Pipelining latches can only be placed across feed-forward cutsets of the circuit.

Cutset:
A set of paths of a circuit such that if these paths are removed, the circuit becomes disjoint (i.e., two separate pieces).

Feed-forward cutset:
A cutset is called a feed-forward cutset if the data move in the forward direction on all the paths of the cutset.

Example: FIR filter
[Figure: 3-tap FIR filter with X(n), X(n-1), X(n-2) and output Y(n). Three feed-forward cutsets are shown, together with one cut that is NOT a feed-forward cutset. Critical path: 1M + 2A]
[Figure: the FIR filter pipelined across a feed-forward cutset. Critical path: 2A]

Original FIR filter (critical path: 1M + 2A):

    assign w1 = a * Xn;
    assign w2 = b * Xn_1;
    assign w3 = w1 + w2;
    assign w4 = c * Xn_2;
    assign Y  = w3 + w4;
    always @(posedge Clk)
    begin
        Xn_1 <= Xn;
        Xn_2 <= Xn_1;
    end

Pipelined FIR filter (critical path: 2A):

    assign Y  = r3 + w1;
    assign w1 = r1 + r2;
    always @(posedge Clk)
    begin
        Xn_1 <= Xn;
        Xn_2 <= Xn_1;
        r1   <= a * Xn;
        r2   <= b * Xn_1;
        r3   <= c * Xn_2;
    end

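
It is worth checking that the pipelined filter computes the same function as the original, only one cycle later. The derivation below follows the register names r1, r2, r3 in the pipelined code; Y_p is a name introduced here for the pipelined output, and n counts clock cycles.

```latex
\[
\begin{aligned}
r_1(n) &= a\,X(n-1), \qquad r_2(n) = b\,X(n-2), \qquad r_3(n) = c\,X(n-3) \\
Y_p(n) &= r_1(n) + r_2(n) + r_3(n) \\
       &= a\,X(n-1) + b\,X(n-2) + c\,X(n-3) \\
       &= Y(n-1), \qquad \text{where } Y(n) = a\,X(n) + b\,X(n-1) + c\,X(n-2)
\end{aligned}
\]
```

The extra delay appears on every path from input to output, which is what placing the registers across a feed-forward cutset guarantees.
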
Timing of the original FIR filter (one row per clock cycle):

    Input  w1      w2      w3            w4      Sum                 Output
    X(0)   aX(0)           aX(0)                 aX(0)               Y(0)
    X(1)   aX(1)   bX(0)   aX(1)+bX(0)           aX(1)+bX(0)         Y(1)
    X(2)   aX(2)   bX(1)   aX(2)+bX(1)   cX(0)   aX(2)+bX(1)+cX(0)   Y(2)
    X(3)   aX(3)   bX(2)   aX(3)+bX(2)   cX(1)   aX(3)+bX(2)+cX(1)   Y(3)

Timing of the pipelined FIR filter (one row per clock cycle):

    Input  r1      r2      w1            r3      Sum                 Output
    X(0)
    X(1)   aX(0)           aX(0)                 aX(0)               Y(0)
    X(2)   aX(1)   bX(0)   aX(1)+bX(0)           aX(1)+bX(0)         Y(1)
    X(3)   aX(2)   bX(1)   aX(2)+bX(1)   cX(0)   aX(2)+bX(1)+cX(0)   Y(2)

Even more pipelining (one row per clock cycle):

    Input  a product  b product  Partial sum   c product  Sum                 Output
    X(0)
    X(1)   aX(0)
    X(2)   aX(1)      bX(0)      aX(0)                    aX(0)               Y(0)
    X(3)   aX(2)      bX(1)      aX(1)+bX(0)              aX(1)+bX(0)         Y(1)
    X(4)   aX(3)      bX(2)      aX(2)+bX(1)   cX(0)      aX(2)+bX(1)+cX(0)   Y(2)

Architectural Techniques: Fine-Grain Pipelining

- Pipelining at the operation level
- Break the multiplier into two parts

[Figure: fine-grain pipelined FIR filter with inputs X(n), X(n-1), X(n-2); each multiplier is split into two parts, m1 and m2; output Y(n)]

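
One possible way to break a multiplier into two parts is sketched below. The slides do not say where the multiplier is split, so the split point used here (two 8x4 partial products in the first part, one addition in the second) is an assumption, as are the module name, the port names, and the unsigned 8-bit operands.

```verilog
// Sketch: an 8x8 unsigned multiplier split into two pipeline stages.
// The split point (two 8x4 partial products, then one addition) and all
// names and widths are assumptions made for illustration.
module mult8_2stage (
    input  wire        clk,
    input  wire [7:0]  x,
    input  wire [7:0]  coeff,
    output reg  [15:0] p
);
    reg [11:0] pp_lo;   // x * coeff[3:0]
    reg [11:0] pp_hi;   // x * coeff[7:4]

    always @(posedge clk) begin
        // Part m1: form the two partial products
        pp_lo <= x * coeff[3:0];
        pp_hi <= x * coeff[7:4];
        // Part m2: weight the high partial product by 16 and add
        p <= {pp_hi, 4'b0000} + pp_lo;
    end
endmodule
```

The product x * coeff appears on p two clock edges after the operands are applied. Each stage now contains less logic than a full 8x8 multiplier, at the cost of one more cycle of latency and the extra partial-product registers.
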
Unrolling the Loop Using Pipelining

Calculation of X^3, iterative implementation:
- Throughput = 8/3, or 2.7 bits/clock
- Latency = 3 clocks
- Timing = one multiplier in the critical path
- No new computations can begin until the previous computation has completed

    module power3(
        output reg [7:0] X3,
        output           finished,
        input      [7:0] X,
        input            clk, start);

        reg [7:0] ncount;
        reg [7:0] XPower, Xin;

        assign finished = (ncount == 0);

        always @(posedge clk)
            if (start) begin
                XPower <= X; Xin <= X;
                ncount <= 2;
                X3     <= XPower;   // capture the previous result
            end
            else if (!finished) begin
                ncount <= ncount - 1;
                XPower <= XPower * Xin;
            end
    endmodule

Unrolling the Loop Using Pipelining

Calculation of X^3, pipelined implementation:
- Throughput = 8/1, or 8 bits/clock (3X improvement)
- Latency = 3 clocks
- Timing = one multiplier in the critical path
- Penalty: more area

Unrolling an algorithm with n iterative loops increases throughput by a factor of n.

[Figure: three register stages clocked by Clk; X[0:7] is registered as xpower1, squared into xpower2 alongside X2, then multiplied to give xpower]

    module power3(
        output reg [7:0] XPower,
        input            clk,
        input      [7:0] X);

        reg [7:0] XPower1, XPower2;
        reg [7:0] X2;

        always @(posedge clk) begin
            // Pipeline stage 1
            XPower1 <= X;
            // Pipeline stage 2
            XPower2 <= XPower1 * XPower1;
            X2      <= XPower1;
            // Pipeline stage 3
            XPower  <= XPower2 * X2;
        end
    endmodule
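
The throughput figures quoted for the two X^3 implementations follow from the 8-bit input width and the number of clocks between accepted inputs. The short calculation below restates them; Th is a symbol introduced here for throughput.

```latex
\[
\begin{aligned}
Th_{iterative} &= \frac{8\ \text{bits}}{3\ \text{clocks}} \approx 2.7\ \text{bits/clock} \\
Th_{pipelined} &= \frac{8\ \text{bits}}{1\ \text{clock}} = 8\ \text{bits/clock} \\
\frac{Th_{pipelined}}{Th_{iterative}} &= 3 = n \quad \text{(the number of loop iterations that were unrolled)}
\end{aligned}
\]
```

Latency is 3 clocks in both cases: unrolling changes how often a new input can be accepted, not how long one result takes.
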

Removing Pipeline Registers (to Improve Latency)

Calculation of X^3:
- Throughput = 8 bits/clock (3X improvement)
- Latency = 0 clocks
- Timing = two multipliers in the critical path

Latency can be reduced by removing pipeline registers.

    module power3(
        output [7:0] XPower,
        input  [7:0] X);

        reg [7:0] XPower1, XPower2;
        reg [7:0] X1, X2;

        always @*
            XPower1 = X;

        always @(*)
        begin
            X2      = XPower1;
            XPower2 = XPower1 * XPower1;
        end

        assign XPower = XPower2 * X2;
    endmodule

Architectural Techniques: Parallel Processing

In parallel processing the same hardware is duplicated. This:
- Increases the throughput without changing the critical path
- Increases the silicon area

[Figure: an adder chain with inputs X(n), a(n), b(n) and output Y(n), shown in its original form, pipelined, and duplicated for parallel processing with inputs X(2k), a(2k), b(2k) and X(2k+1), a(2k+1), b(2k+1) and outputs Y(2k), Y(2k+1)]

    System               Clock freq   Throughput
    Original             f            M samples
    Pipelining           2f           2M samples
    Parallel processing  f            2M samples

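Architectural Techniques: Parallel Processing

Parallel processing for a 3-tap FIR filter
- The parallel filter and the original filter have the same critical path (M + 2A)
- Parallel factor: 3

$$y(3k) = a\,x(3k) + b\,x(3k-1) + c\,x(3k-2)$$
$$y(3k+1) = a\,x(3k+1) + b\,x(3k) + c\,x(3k-1)$$
$$y(3k+2) = a\,x(3k+2) + b\,x(3k+1) + c\,x(3k)$$

Not a simple duplication!

[Figure: 3-parallel FIR filter with inputs X(3k), X(3k+1), X(3k+2), delayed samples X(3k-1), X(3k-2), and outputs Y(3k), Y(3k+1), Y(3k+2)]
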
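
The three output equations of the 3-parallel FIR filter can be written out as hardware. The sketch below is an illustration, not code from the slides: the module name, the port names, the width parameter W, and the unsigned arithmetic are all assumptions.

```verilog
// Sketch of the 3-parallel, 3-tap FIR filter.
// Names, the width parameter W, and unsigned arithmetic are assumptions.
module fir3_parallel3 #(parameter W = 8) (
    input  wire             clk,
    input  wire [W-1:0]     a, b, c,       // filter coefficients
    input  wire [W-1:0]     x0, x1, x2,    // x(3k), x(3k+1), x(3k+2)
    output wire [2*W+1:0]   y0, y1, y2     // y(3k), y(3k+1), y(3k+2)
);
    reg [W-1:0] xm1;   // x(3k-1): last sample of the previous block
    reg [W-1:0] xm2;   // x(3k-2): middle sample of the previous block

    // Only two samples cross the block boundary, so only two registers
    // are needed for all three outputs.
    always @(posedge clk) begin
        xm1 <= x2;
        xm2 <= x1;
    end

    assign y0 = a * x0 + b * xm1 + c * xm2;
    assign y1 = a * x1 + b * x0  + c * xm1;
    assign y2 = a * x2 + b * x1  + c * x0;
endmodule
```

The multipliers and adders are tripled, but the delay registers are shared between the three outputs, which is why the structure is not a simple duplication. Each output still passes through one multiplier and two adders, so the critical path stays at M + 2A.
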
Sample Period vs. Clock Period

Pipelined system: Tclk = Tsample

$$T_{sample} = T_{Clk} \ge T_M$$

Parallel system: Tclk ≠ Tsample

$$T_{sample} = \frac{1}{3}\,T_{Clk} \ge \frac{1}{3}\,(T_M + 2T_A)$$

Higher sample rate than the clock rate.

[Figure: 3-parallel FIR filter with inputs X(3k), X(3k+1), X(3k+2), delayed samples X(3k-1), X(3k-2), and outputs Y(3k), Y(3k+1), Y(3k+2)]

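
A numeric example makes the gap between clock rate and sample rate in the 3-parallel system concrete. The multiplier and adder delays below are hypothetical values, chosen only for illustration.

```latex
\[
\begin{aligned}
&T_M = 10\,\text{ns}, \qquad T_A = 2\,\text{ns} \\
&T_{Clk} \ge T_M + 2T_A = 14\,\text{ns}
  \quad\Rightarrow\quad f_{Clk} \le \frac{1}{14\,\text{ns}} \approx 71.4\,\text{MHz} \\
&T_{sample} = \frac{1}{3}\,T_{Clk} \ge \frac{14}{3}\,\text{ns} \approx 4.67\,\text{ns}
  \quad\Rightarrow\quad f_{sample} \le \frac{3}{14\,\text{ns}} \approx 214\,\text{MS/s}
\end{aligned}
\]
```

The hardware is clocked at about 71 MHz, yet three samples are consumed and produced per clock, so the sample rate is three times the clock rate.
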
Complete Parallel System with S/P and P/S

A complete parallel system: S/P and P/S blocks

[Figure: S/P converter]
[Figure: P/S converter. Y(3k+2), Y(3k+1), Y(3k) are loaded in parallel and shifted out through T/3 delays to give Y(n); sample period T/3]

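
A serial-to-parallel converter for a parallel factor of 3 could be sketched as below. The slides only show the converters as figures, so everything here is an assumption: the module and port names, the synchronous reset, the width parameter W, and the use of a single sample-rate clock with a `valid` strobe instead of a separate divided clock.

```verilog
// Sketch of a 1-to-3 serial-to-parallel (S/P) converter.
// Assumptions: one clock at the sample rate (period T/3), a synchronous
// reset, and a `valid` strobe that pulses once per block of 3 samples.
module s2p3 #(parameter W = 8) (
    input  wire         clk_sample,
    input  wire         rst,
    input  wire [W-1:0] x,              // serial input x(n)
    output reg  [W-1:0] x0, x1, x2,     // x(3k), x(3k+1), x(3k+2)
    output reg          valid
);
    reg [1:0]   cnt;      // position of the current sample in its block
    reg [W-1:0] s0, s1;   // first two samples of the block being collected

    always @(posedge clk_sample) begin
        if (rst) begin
            cnt   <= 2'd0;
            valid <= 1'b0;
        end else begin
            valid <= 1'b0;            // default; overridden in the last case
            case (cnt)
                2'd0: begin s0 <= x; cnt <= 2'd1; end
                2'd1: begin s1 <= x; cnt <= 2'd2; end
                default: begin
                    // Third sample arrived: present the whole block at once
                    x0    <= s0;
                    x1    <= s1;
                    x2    <= x;
                    valid <= 1'b1;
                    cnt   <= 2'd0;
                end
            endcase
        end
    end
endmodule
```

The parallel outputs change once every three sample clocks, so the filter behind them can run with a clock period of T while samples arrive every T/3. A P/S converter would do the reverse: load three results in parallel and shift one out per sample clock.
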
When Pipelining, When Parallelism?

- Pipelining is used when the critical path is inside the design (paths 1, 2, 3, 4).
- Parallelism is used when the critical path is bounded by communication or I/O (path 5), a.k.a. communication bounded. Pipelining does not help in this case!

[Figure: input pad, combinational logic blocks between flip-flops, and output pad, with the numbered paths marked]
