M. Shabany, Digital VLSI Architectures
Architectural Techniques: Critical Path
The critical path in any design is the longest path between:
1. Any two internal latches/flip-flops
2. An input pad and an internal latch
3. An internal latch and an output pad
4. An input pad and an output pad
Use FFs right after input pads and right before output pads to avoid the last three cases (off-chip and packaging delay).
The maximum delay between any two sequential elements in a design determines the maximum clock speed.
[Figure: input pad, Comb. Logic, and output pad, with the numbered paths marked]
Digital Design Metrics
Three primary physical characteristics of a digital design:
  Speed
    Throughput
    Latency
    Timing
  Area
  Power
Digital Design Metrics
Speed
  Throughput: the amount of data that is processed per clock cycle (commonly quoted in bits per second)
  Latency: the time between data input and processed data output (measured in clock cycles)
  Timing: the logic delays between sequential elements (compared against the clock period)
When a design does not meet timing, the delay of the critical path is greater than the target clock period.
Maximum Clock Frequency: Critical Path
Maximum clock frequency:
$$F_{max} = \frac{1}{T_{\text{clk-q}} + T_{logic} + T_{setup} + T_{routing} - T_{skew}}$$
  $T_{\text{clk-q}}$: time from clock arrival until data arrives at Q
  $T_{logic}$: propagation delay through logic between flip-flops
  $T_{routing}$: routing delay between flip-flops
  $T_{setup}$: minimum time data must arrive at D before the next rising edge of the clock
  $T_{skew}$: propagation delay of the clock between the launch flip-flop and the capture flip-flop
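A short numeric sketch of the maximum-clock-frequency relation may help here. It assumes the usual setup-limited form, in which the clock-to-Q, logic, setup, and routing delays add up and the clock skew between launch and capture flip-flops is subtracted. The function name `fmax_hz` and all delay values are hypothetical, chosen only for illustration.

```python
def fmax_hz(t_clk_q, t_logic, t_setup, t_routing, t_skew):
    """Return the maximum clock frequency in Hz; all delays are in seconds."""
    # Assumed form: skew is subtracted from the sum of the other delays.
    min_period = t_clk_q + t_logic + t_setup + t_routing - t_skew
    if min_period <= 0:
        raise ValueError("delays must add up to a positive clock period")
    return 1.0 / min_period


NS = 1e-9  # one nanosecond, in seconds

# Hypothetical delays for a single register-to-register path.
example_fmax = fmax_hz(
    t_clk_q=0.5 * NS,
    t_logic=3.0 * NS,
    t_setup=0.3 * NS,
    t_routing=1.0 * NS,
    t_skew=0.2 * NS,
)
print(f"Fmax = {example_fmax / 1e6:.1f} MHz")
```

With these made-up numbers the minimum period is 4.6 ns, which corresponds to roughly 217 MHz. Shortening the logic delay, which is what pipelining does, is usually the largest lever.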
Pipelining (to Improve Throughput)
Pipelining:
  Comes from the idea of a water pipe: keep sending water without waiting for the water already in the pipe to come out
  Used to reduce the critical path of the design
Advantages:
  Reduction in the critical path
  Higher throughput (number of computed results in a given time)
  Increases the clock speed (or sampling speed)
  Reduces the power consumption at the same speed
Architectural Techniques: Pipelining
Pipelining:
  Very similar to the assembly line in the auto industry
  The beauty of a pipelined design is that new data can begin processing before the prior data has finished, much like cars are processed on an assembly line.
Architectural Techniques: Pipelining
Original system: critical path = $\tau_1$, max operating frequency $f_1 = 1/\tau_1$
Pipelined version: critical path = $\tau_2$, max operating frequency $f_2 = 1/\tau_2$
  Smaller critical path, so higher throughput ($\tau_2 < \tau_1 \Rightarrow f_2 > f_1$)
  Longer latency
[Figure: original system with one Comb. Logic block between clocked registers, output 2 cycles later; pipelined version with the Comb. Logic split in two by a pipelining register, output 3 cycles later]
Architectural Techniques: Pipeline Depth
Pipeline depth: 0 (no pipeline)
Critical path: 3 adders
  // bit widths are omitted, as on the slide
  wire w1, w2;
  assign w1 = X + a;
  assign w2 = w1 + b;
  assign Y  = w2 + c;
Latency: 0
  time  t1    t2    t3
  X     X(1)  X(2)  X(3)
  Y     Y(1)  Y(2)  Y(3)
Architectural Techniques: Pipeline Depth
Pipeline depth: 1 (one pipeline register added)
Critical path: 2 adders
[Figure: adder chain with inputs X(n), a(n), b(n), c(n), internal nodes w1 and w2, and output Y(n)]
  // bit widths are omitted, as on the slide
  wire w1;
  reg  w2;
  assign w1 = X + a;
  assign Y  = w2 + c;
  always @(posedge Clk)
    w2 <= w1 + b;
Latency: 1
  time  t1    t2    t3    t4
  X     X(1)  X(2)  X(3)
  Y           Y(1)  Y(2)  Y(3)
Architectural Techniques: Pipeline Depth
Pipeline depth: 2 (two pipeline registers added)
Critical path: 1 adder
  // bit widths are omitted, as on the slide
  reg w1, w2;
  assign Y = w2 + c;
  always @(posedge Clk)
  begin
    w1 <= X + a;
    w2 <= w1 + b;
  end
Latency: 2
  time  t1    t2    t3    t4    t5
  X     X(1)  X(2)  X(3)
  Y                 Y(1)  Y(2)  Y(3)
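The effect of pipeline depth on latency can be checked with a small cycle-by-cycle model of the three-adder chain. This is only a sketch in Python, not the slide's Verilog: the function `run_adder_chain` is hypothetical, `a`, `b`, and `c` are treated as constants, and registers are assumed to start empty (shown as `None` until a valid result arrives).

```python
def run_adder_chain(xs, a, b, c, depth):
    """Return the Y value visible in each clock cycle for a given pipeline depth.

    depth 0: Y = X + a + b + c, purely combinational
    depth 1: w2 is registered (w2 <= w1 + b)
    depth 2: w1 and w2 are both registered
    """
    if depth not in (0, 1, 2):
        raise ValueError("depth must be 0, 1 or 2")
    w1_reg = None  # register contents; None means "not yet valid"
    w2_reg = None
    ys = []
    for x in xs:
        if depth == 0:
            ys.append(x + a + b + c)
            continue
        # Y is combinational from the w2 register: Y = w2 + c
        ys.append(None if w2_reg is None else w2_reg + c)
        if depth == 1:
            w2_reg = (x + a) + b  # w1 is a wire, w2 is clocked
        else:
            # Both updates use the values held before the clock edge.
            next_w2 = None if w1_reg is None else w1_reg + b
            w1_reg = x + a
            w2_reg = next_w2
    return ys


for depth in (0, 1, 2):
    print(depth, run_adder_chain([1, 2, 3, 4], 10, 10, 10, depth))
```

Each extra register delays the first valid output by one clock, while one new result still appears every clock after that.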
Architectural Techniques: Pipelining
Clock period and throughput as a function of pipeline depth $n$:
  Clock period: $T_{Clk} \propto \frac{1}{n}$
  Throughput: $T \propto n$
Adding register layers improves timing by dividing the critical path into two paths of smaller delay.
[Figure: clock period and throughput plotted against pipeline depth]
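The trend of clock period and throughput against pipeline depth can be sketched numerically. The model below is an idealization, not something taken from the slide: it assumes the combinational delay splits evenly across the stages, and it adds an optional per-register overhead `t_reg_ns` to show why the gain flattens in practice. The function name and all numbers are hypothetical.

```python
def pipeline_metrics(t_logic_ns, stages, t_reg_ns=0.0, bits_per_result=8):
    """Return (clock period in ns, throughput in bits per ns) for a pipeline.

    Assumes the logic delay divides evenly across the stages.
    """
    if stages < 1:
        raise ValueError("stages must be at least 1")
    clock_period_ns = t_logic_ns / stages + t_reg_ns
    throughput = bits_per_result / clock_period_ns  # one result per clock
    return clock_period_ns, throughput


for stages in (1, 2, 4):
    period, rate = pipeline_metrics(12.0, stages, t_reg_ns=1.0)
    print(f"{stages} stage(s): period {period:.1f} ns, {rate:.2f} bits/ns")
```

With zero register overhead the period scales as 1/n and the throughput as n. With the 1 ns overhead used in the loop, four stages give a 4 ns period rather than 3 ns.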
Architectural Techniques: Pipelining
General rule:
  Pipelining latches can only be placed across feed-forward cutsets of the circuit.
Cutset:
  A set of paths of a circuit such that if these paths are removed, the circuit becomes disjoint (i.e., two separate pieces)
Feed-forward cutset:
  A cutset is called a feed-forward cutset if the data move in the forward direction on all the paths of the cutset
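The two definitions above can be turned into a small check on a directed graph. This is a sketch under stated assumptions: the circuit is modeled as a list of directed edges, "disjoint" is taken to mean exactly two connected pieces once the cut edges are removed, and "forward" means every cut edge goes from the same piece to the other. The function `is_feed_forward_cutset` and the node names are hypothetical.

```python
def is_feed_forward_cutset(nodes, edges, cut):
    """Check whether `cut` (a set of directed edges) is a feed-forward cutset."""
    remaining = [e for e in edges if e not in cut]
    # Connectivity ignores direction, so build an undirected adjacency map.
    adj = {n: set() for n in nodes}
    for u, v in remaining:
        adj[u].add(v)
        adj[v].add(u)
    # Label each node with the piece it belongs to.
    piece = {}
    pieces = 0
    for start in nodes:
        if start in piece:
            continue
        piece[start] = pieces
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in piece:
                    piece[nxt] = pieces
                    stack.append(nxt)
        pieces += 1
    if pieces != 2:
        return False  # not a cutset: the circuit did not split in two
    # Feed-forward: all cut edges cross between the pieces in one direction.
    directions = {(piece[u], piece[v]) for u, v in cut}
    return len(directions) == 1 and all(s != d for s, d in directions)


# A 3-tap FIR data-flow graph (node names are illustrative).
fir_nodes = ["x", "d1", "d2", "ma", "mb", "mc", "add1", "add2", "y"]
fir_edges = [
    ("x", "d1"), ("d1", "d2"),
    ("x", "ma"), ("d1", "mb"), ("d2", "mc"),
    ("ma", "add1"), ("mb", "add1"), ("add1", "add2"), ("mc", "add2"),
    ("add2", "y"),
]
after_multipliers = {("ma", "add1"), ("mb", "add1"), ("mc", "add2")}
print(is_feed_forward_cutset(fir_nodes, fir_edges, after_multipliers))

# A feedback loop: cutting both edges splits the graph, but data crosses both ways.
loop_edges = [("add", "reg"), ("reg", "add")]
print(is_feed_forward_cutset(["add", "reg"], loop_edges, set(loop_edges)))
```

The cut after the three multipliers passes the check, so pipeline registers could be placed there. The feedback-loop cut fails it, because one of its edges carries data backward.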
Architectural Techniques: Pipelining
Example: FIR filter
Three feed-forward cutsets are shown.
[Figure: 3-tap FIR filter with X(n), X(n-1), X(n-2), and output Y(n); three feed-forward cutsets are marked, along with one cut labeled "NOT a feed-forward cutset"]
Architectural Techniques: Pipelining
[Figure: FIR filter with X(n), X(n-1), X(n-2), internal nodes w1 to w4, and output Y(n), before and after pipelining]
Original filter, critical path: 1M + 2A
  assign w1 = a * Xn;
  assign w2 = b * Xn_1;
  assign w3 = w1 + w2;
  assign w4 = c * Xn_2;
  assign Y  = w3 + w4;
  always @(posedge Clk)
  begin
    Xn_1 <= Xn;
    Xn_2 <= Xn_1;
  end
Pipelined filter, critical path: 2A
  assign w1 = r1 + r2;
  assign Y  = r3 + w1;
  always @(posedge Clk)
  begin
    Xn_1 <= Xn;
    Xn_2 <= Xn_1;
    // pipeline registers on the multiplier outputs
    r1 <= a * Xn;
    r2 <= b * Xn_1;
    r3 <= c * Xn_2;
  end
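A cycle-level model makes it easy to confirm that registering the multiplier outputs changes only the latency of the FIR filter, not its results. The sketch below is in Python rather than Verilog. It assumes all registers reset to zero, which the Verilog above does not specify, and both function names are hypothetical.

```python
def fir_direct(xs, a, b, c):
    """Unpipelined 3-tap FIR: Y(n) = a*X(n) + b*X(n-1) + c*X(n-2)."""
    xn_1 = xn_2 = 0  # delay-line registers, assumed to reset to zero
    ys = []
    for xn in xs:
        ys.append(a * xn + b * xn_1 + c * xn_2)
        xn_2, xn_1 = xn_1, xn  # clock edge
    return ys


def fir_pipelined(xs, a, b, c):
    """Same filter with registers r1, r2, r3 after the multipliers."""
    xn_1 = xn_2 = 0
    r1 = r2 = r3 = 0  # pipeline registers, assumed to reset to zero
    ys = []
    for xn in xs:
        ys.append(r3 + (r1 + r2))  # Y = r3 + w1, with w1 = r1 + r2
        # Clock edge: every right-hand side uses the pre-edge values,
        # which mirrors nonblocking assignment.
        r1, r2, r3 = a * xn, b * xn_1, c * xn_2
        xn_2, xn_1 = xn_1, xn
    return ys


samples = [1, 2, 3, 4]
print(fir_direct(samples, 2, 3, 5))
print(fir_pipelined(samples, 2, 3, 5))
```

The pipelined output is the direct output shifted by one clock, with the reset value appearing in the first cycle.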
Architectural Techniques: Pipelining
[Figure: FIR filter with X(n), X(n-1), X(n-2), marked nodes 1 and 2, and output Y(n)]
Unpipelined filter, one row per clock cycle:
  Input  Node values listed          Output
  X(0)   aX(0)                       Y(0) = aX(0)
  X(1)   aX(1), bX(0), aX(1)+bX(0)   Y(1) = aX(1)+bX(0)
  X(2)   cX(0)                       Y(2) = aX(2)+bX(1)+cX(0)
  X(3)   cX(1)                       Y(3) = aX(3)+bX(2)+cX(1)
Architectural Techniques: Pipelining
[Figure: pipelined FIR filter with X(n), X(n-1), X(n-2), and output Y(n)]
Pipelined filter, one row per clock cycle (the output lags the input by one clock):
  Input  Node values listed                 Output
  X(0)   -                                  -
  X(1)   aX(0)                              Y(0) = aX(0)
  X(2)   aX(1), bX(0), aX(1)+bX(0)          Y(1) = aX(1)+bX(0)
  X(3)   aX(2), bX(1), aX(2)+bX(1), cX(0)   Y(2) = aX(2)+bX(1)+cX(0)
Architectural Techniques: Pipelining
Even more pipelining, one row per clock cycle (the output lags the input by two clocks):
  Input  Node values listed                 Output
  X(0)   -                                  -
  X(1)   aX(0)                              -
  X(2)   aX(1), bX(0)                       Y(0) = aX(0)
  X(3)   aX(2), bX(1)                       Y(1) = aX(1)+bX(0)
  X(4)   aX(3), bX(2), aX(2)+bX(1), cX(0)   Y(2) = aX(2)+bX(1)+cX(0)
Architectural Techniques: Fine-Grain Pipelining
Pipelining at the operation level
Break the multiplier into two parts
[Figure: FIR filter with X(n), X(n-1), X(n-2), and output Y(n); fine-grain pipelining splits each multiplier into two parts, m1 and m2]
Unrolling the Loop Using Pipelining
Calculation of X^3
Throughput = 8/3, or 2.7 bits/clock
Latency = 3 clocks
Timing = one multiplier in the critical path
Iterative implementation:
  No new computations can begin until the previous computation has completed
  module power3(
    output reg [7:0] X3,
    output finished,
    input [7:0] X,
    input clk, start);
    reg [7:0] ncount;
    reg [7:0] XPower, Xin;
    assign finished = (ncount == 0);
    always @(posedge clk)
      if (start) begin
        XPower <= X; Xin <= X;
        ncount <= 2;
        X3 <= XPower;  // latch the previous result
      end
      else if (!finished) begin
        ncount <= ncount - 1;
        XPower <= XPower * Xin;
      end
  endmodule
Unrolling the Loop Using Pipelining
Calculation of X^3
Throughput = 8/1, or 8 bits/clock (3X improvement)
Latency = 3 clocks
Timing = one multiplier in the critical path
Penalty: more area
Unrolling an algorithm with n iterative loops increases throughput by a factor of n
[Figure: pipelined datapath from X[0:7] through xpower1, xpower2, and X2 to xpower, with each stage clocked by Clk]
  module power3(
    output reg [7:0] XPower,
    input clk,
    input [7:0] X);
    reg [7:0] XPower1, XPower2;
    reg [7:0] X2;
    always @(posedge clk) begin
      // Pipeline stage 1
      XPower1 <= X;
      // Pipeline stage 2
      XPower2 <= XPower1 * XPower1;
      X2 <= XPower1;
      // Pipeline stage 3
      XPower <= XPower2 * X2;
    end
  endmodule
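The three-stage pipeline can be modeled cycle by cycle to confirm both the 3-clock latency and the one-result-per-clock throughput. This is a Python sketch of the register transfers, not a replacement for simulation. It assumes registers reset to zero and truncates products to 8 bits to mirror the `[7:0]` declarations; the function name `power3_pipelined` is hypothetical.

```python
MASK = 0xFF  # 8-bit registers, matching the [7:0] declarations


def power3_pipelined(xs):
    """Return the XPower register value visible in each clock cycle."""
    xpower1 = xpower2 = x2 = xpower = 0  # assumed reset state
    outputs = []
    for x in xs:
        outputs.append(xpower)
        # Clock edge: every right-hand side uses the pre-edge values,
        # which mirrors nonblocking assignment.
        xpower, xpower2, x2, xpower1 = (
            (xpower2 * x2) & MASK,           # stage 3: X^2 * X
            (xpower1 * xpower1) & MASK,      # stage 2: X * X
            xpower1,                         # stage 2: carry X forward
            x & MASK,                        # stage 1: capture the input
        )
    return outputs


print(power3_pipelined([2, 3, 4, 5, 6, 0, 0]))
```

A new input is accepted on every clock, and each cube appears three clocks after its input was applied. Cubes larger than 255 wrap around because of the 8-bit truncation.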
Removing Pipeline Registers (to Improve Latency)
Calculation of X^3
Throughput = 8 bits/clock (3X improvement)
Latency = 0 clocks
Timing = two multipliers in the critical path
Latency can be reduced by removing pipeline registers
  module power3(
    output [7:0] XPower,
    input [7:0] X);
    reg [7:0] XPower1, XPower2;
    reg [7:0] X1, X2;
    // purely combinational: no clock, so no pipeline registers
    always @*
      XPower1 = X;
    always @(*)
    begin
      X2 = XPower1;
      XPower2 = XPower1 * XPower1;
    end
    assign XPower = XPower2 * X2;
  endmodule
Architectural Techniques: Parallel Processing
In parallel processing the same hardware is duplicated, which:
  Increases the throughput without changing the critical path
  Increases the silicon area
[Figure: the original system with inputs a(n), b(n), X(n) and output Y(n); a pipelined version; and a 2-parallel version with inputs a(2k), b(2k), X(2k), a(2k+1), b(2k+1), X(2k+1) and outputs Y(2k), Y(2k+1)]
  System               Clock freq  Throughput
  Original             f           M samples
  Pipelining           2f          2M samples
  Parallel processing  f           2M samples
Architectural Techniques: Parallel Processing
Parallel processing for a 3-tap FIR filter
Both the original and the parallel version have the same critical path (M + 2A)
[Figure: 3-parallel FIR filter with inputs X(3k), X(3k+1), X(3k+2), delayed inputs X(3k-2), X(3k-1), and outputs Y(3k+1), Y(3k+2); parallel factor: 3]
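Sample Period vs. Clock Period
Pipelined system: $T_{clk} = T_{sample}$
$$T_{sample} = T_{clk} \ge T_M$$
Parallel system: $T_{clk} \ne T_{sample}$
$$T_{sample} = \frac{1}{3} T_{clk} \ge \frac{1}{3}\,(T_M + 2T_A)$$
[Figure: 3-parallel FIR filter with inputs X(3k), X(3k+1), X(3k+2), delayed inputs X(3k-2), X(3k-1), and outputs Y(3k), Y(3k+1), Y(3k+2)]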
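The relation between sample period and clock period in a parallel system can be illustrated with a few lines of arithmetic. The sketch assumes the critical path is one multiplier plus two adders and that the clock runs at exactly that limit. The function name `sample_period_ns` and the delay values are hypothetical.

```python
def sample_period_ns(t_mult_ns, t_add_ns, parallel_factor=1):
    """Return the minimum sample period for a critical path of M + 2A.

    With parallel_factor L, L samples are processed per clock, so the
    sample period is the clock period divided by L.
    """
    if parallel_factor < 1:
        raise ValueError("parallel_factor must be at least 1")
    t_clk_ns = t_mult_ns + 2 * t_add_ns  # clock period set by the critical path
    return t_clk_ns / parallel_factor


# Hypothetical delays: 10 ns multiplier, 2 ns adder.
print(sample_period_ns(10.0, 2.0))                     # serial system
print(sample_period_ns(10.0, 2.0, parallel_factor=3))  # 3-parallel system
```

In both cases the clock period is 14 ns. The 3-parallel system accepts three samples per clock, so its effective sample period is one third of that.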
Complete Parallel System with S/P and P/S
A complete parallel system:
[Figure: block diagram of the complete parallel system]
S/P and P/S Blocks
S/P converter:
[Figure: serial-to-parallel converter]
P/S converter:
[Figure: parallel-to-serial converter taking Y(3k+2), Y(3k+1), Y(3k) and producing Y(n), with sample period T/3]
When Pipelining, When Parallelism?
Pipelining is used when the critical path is inside the design (paths numbered 1, 2, 3, 4)
Parallelism is used when the critical path is bounded by communication or I/O (path number 5)
  Pipelining does not help in this case
  a.k.a. communication bounded
[Figure: Comb. Logic blocks with the numbered paths marked]