
Digital VLSI Architectures:
Pipelining & Parallel Processing
Mahdi Shabany
Sharif University of Technology

M. Shabany, Digital VLSI Architectures

Architectural Techniques: Critical Path
The critical path in any design is the longest path between:
1. Any two internal latches/flip-flops
2. An input pad and an internal latch
3. An internal latch and an output pad
4. An input pad and an output pad

Use flip-flops right after input pads and right before output pads to avoid the last three cases (off-chip and packaging delay).
The maximum delay between any two sequential elements in a design determines the maximum clock speed.

[Figure: an input pad and an output pad connected through combinational logic, with paths 2 and 3 labeled]

Digital Design Metrics
Three primary physical characteristics of a digital design:
Speed
  Throughput
  Latency
  Timing
Area
Power

Digital Design Metrics
Speed
Throughput: the amount of data processed per clock cycle (commonly measured in bits per second)
Latency: the time between data input and processed data output (measured in clock cycles)
Timing: the logic delays between sequential elements (compared against the clock period)
When a design does not meet timing, the delay of the critical path is greater than the target clock period.

Maximum Clock Frequency: Critical Path
Maximum clock frequency:

$$F_{max} = \frac{1}{T_{clkq} + T_{logic} + T_{setup} + T_{routing} - T_{skew}}$$

$T_{clkq}$: time from clock arrival until data arrives at Q
$T_{logic}$: propagation delay through logic between flip-flops
$T_{routing}$: routing delay between flip-flops
$T_{setup}$: minimum time data must arrive at D before the next rising edge of the clock
$T_{skew}$: propagation delay of the clock between the launch flip-flop and the capture flip-flop
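
A quick worked example may help make the terms concrete. It assumes the usual form of the relation, in which the clock skew is subtracted from the sum of the other delays; all delay values below are illustrative assumptions, not figures from the slides.

```latex
\begin{aligned}
T_{clkq} &= 0.5\,\text{ns},\quad T_{logic} = 3.0\,\text{ns},\quad T_{setup} = 0.3\,\text{ns},\quad
T_{routing} = 1.0\,\text{ns},\quad T_{skew} = 0.2\,\text{ns} \\
T_{min} &= 0.5 + 3.0 + 0.3 + 1.0 - 0.2 = 4.6\,\text{ns} \\
F_{max} &= \frac{1}{4.6\,\text{ns}} \approx 217\,\text{MHz}
\end{aligned}
```

With these numbers the logic delay dominates the sum, which is why the techniques that follow concentrate on shortening the logic between flip-flops.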


Pipelining (to Improve Throughput)
Pipelining:
Comes from the idea of a water pipe: keep sending water without waiting for the water already in the pipe to come out.
Used to reduce the critical path of the design.

Advantages:
Reduction in the critical path
Higher throughput (number of computed results in a given time)
Increases the clock speed (or sampling speed)
Reduces the power consumption at the same speed

Architectural Techniques: Pipelining
Pipelining:
Very similar to the assembly line in the auto industry.
The beauty of a pipelined design is that new data can begin processing before the prior data has finished, much like cars are processed on an assembly line.

Architectural Techniques: Pipelining
Original system: critical path = $\tau_1$, max operating frequency $f_1 = 1/\tau_1$
[Figure: one combinational logic block between registers clocked by Clk; the output appears 2 cycles later]

Pipelined version: critical path = $\tau_2$, max operating frequency $f_2 = 1/\tau_2$
Smaller critical path, higher throughput ($\tau_2 < \tau_1 \Rightarrow f_2 > f_1$)
Longer latency
[Figure: the combinational logic split into two blocks by a pipelining register, all clocked by Clk; the output appears 3 cycles later]

Architectural Techniques: Pipeline Depth
Pipeline depth: 0 (no pipeline)
Critical path: 3 adders

wire w1, w2;
assign w1 = X + a;
assign w2 = w1 + b;
assign Y = w2 + c;

Latency: 0
[Timing diagram: X(1), X(2), X(3) enter at t1, t2, t3; Y(1), Y(2), Y(3) appear in the same cycles]

Architectural Techniques: Pipeline Depth
Pipeline depth: 1 (one pipeline register added)
Critical path: 2 adders

wire w1;
reg w2;
assign w1 = X + a;
assign Y = w2 + c;

always @(posedge Clk)
  w2 <= w1 + b;

Latency: 1
[Figure: adder chain with inputs X(n), a(n), b(n), c(n), internal nodes w1 and w2, and output Y(n)]
[Timing diagram: X(1), X(2), X(3) enter at t1, t2, t3; Y(1), Y(2), Y(3) appear one cycle later, at t2, t3, t4]

Architectural Techniques: Pipeline Depth
Pipeline depth: 2 (two pipeline registers added)
Critical path: 1 adder

reg w1, w2;
assign Y = w2 + c;
always @(posedge Clk)
begin
  w1 <= X + a;
  w2 <= w1 + b;
end

Latency: 2
[Timing diagram: X(1), X(2), X(3) enter at t1, t2, t3; Y(1), Y(2), Y(3) appear two cycles later, at t3, t4, t5]
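
The snippets on these slides leave operand timing implicit. If a, b and c change with every sample (as the a(n), b(n), c(n) labels in the figure suggest), then b and c would need to be delayed so that they meet the matching partial sum. The following is a minimal sketch of the depth-2 version under that assumption. The module name, the WIDTH parameter and the delay registers b_d1, c_d1 and c_d2 are hypothetical additions, not part of the slides.

```verilog
// Depth-2 pipelined adder chain: Y(n) = X(n) + a(n) + b(n) + c(n).
// Assumption: a, b, c are per-sample inputs, so they are delayed to stay
// aligned with the partial sums. Sums are truncated to WIDTH bits.
module add3_pipe2 #(parameter WIDTH = 8) (
    input  wire             Clk,
    input  wire [WIDTH-1:0] X, a, b, c,
    output wire [WIDTH-1:0] Y
);
    reg [WIDTH-1:0] w1, w2;      // pipeline registers after adders 1 and 2
    reg [WIDTH-1:0] b_d1;        // b delayed by one cycle to meet w1
    reg [WIDTH-1:0] c_d1, c_d2;  // c delayed by two cycles to meet w2

    always @(posedge Clk) begin
        // Stage 1: first adder, plus operand delays
        w1   <= X + a;
        b_d1 <= b;
        c_d1 <= c;
        // Stage 2: second adder, plus second delay of c
        w2   <= w1 + b_d1;
        c_d2 <= c_d1;
    end

    // Third adder stays combinational, so the critical path is one adder
    assign Y = w2 + c_d2;
endmodule
```

With this arrangement Y corresponds to the inputs applied two clock edges earlier. If a, b and c are constants, the delay registers can be dropped, which reduces the sketch to the code on the slide.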

Architectural Techniques: Pipelining
Clock period and throughput as a function of pipeline depth $n$:

Clock period: $T_n \approx T_{Clk}/n$
Throughput: $\propto 1/T_n$

Adding register layers improves timing by dividing the critical path into two paths of smaller delay.
[Figure: clock period and throughput plotted against pipeline depth]
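
A common first-order way to reason about this trade-off is sketched below. It assumes that the logic splits evenly across $n$ stages and that every stage pays a fixed register overhead $T_{reg}$ (clock-to-Q plus setup). The symbols $T_{logic}$ and $T_{reg}$ and the numbers are illustrative assumptions, not values from the slides.

```latex
\begin{aligned}
T_n &= \frac{T_{logic}}{n} + T_{reg}, \qquad \text{throughput} \propto \frac{1}{T_n} \\
\text{with } T_{logic} &= 12\,\text{ns},\ T_{reg} = 1\,\text{ns}: \\
T_1 &= 13\,\text{ns},\quad T_2 = 7\,\text{ns},\quad T_4 = 4\,\text{ns},\quad T_{12} = 2\,\text{ns}
\end{aligned}
```

Under these assumptions, going from 1 to 4 stages shortens the clock period by a factor of 3.25 rather than 4, and no pipeline depth can push the period below $T_{reg}$, so the gain per added register layer shrinks as the pipeline gets deeper.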

Architectural Techniques: Pipelining
General rule:
Pipelining latches can only be placed across feed-forward cutsets of the circuit.
Cutset:
A set of paths of a circuit such that if these paths are removed, the circuit becomes disjoint (i.e., two separate pieces).
Feed-forward cutset:
A cutset is called a feed-forward cutset if the data move in the forward direction on all the paths of the cutset.

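Architectural Techniques: Pipelining
Example: FIR filter
Three feed-forward cutsets are shown.
[Figure: 3-tap FIR filter with input X(n), delayed samples X(n-1) and X(n-2), and output Y(n); three feed-forward cutsets are marked, together with one cut that is NOT a feed-forward cutset]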

Architectural Techniques: Pipelining
Original FIR filter, critical path: 1M + 2A
[Figure: FIR filter with taps X(n), X(n-1), X(n-2), internal nodes w1 to w4, and output Y(n)]

assign w1 = a*Xn;
assign w2 = b*Xn_1;
assign w3 = w1 + w2;
assign w4 = c*Xn_2;
assign Y = w3 + w4;
always @(posedge Clk)
begin
  Xn_1 <= Xn;
  Xn_2 <= Xn_1;
end

Pipelined FIR filter, critical path: 2A

assign Y = r3 + w1;
assign w1 = r1 + r2;
always @(posedge Clk)
begin
  Xn_1 <= Xn;
  Xn_2 <= Xn_1;
  r1 <= a*Xn;
  r2 <= b*Xn_1;
  r3 <= c*Xn_2;
end

Architectural Techniques: Pipelining
[Figure: 3-tap FIR filter with input X(n), delayed samples X(n-1) and X(n-2), and output Y(n); locations 1 and 2 are marked]

Node values, one row per clock cycle:

Input   Intermediate nodes (products and partial sums)                   Output
X(0)    aX(0)   -       aX(0)          -       aX(0)                    Y(0)
X(1)    aX(1)   bX(0)   aX(1)+bX(0)    -       aX(1)+bX(0)              Y(1)
X(2)    aX(2)   bX(1)   aX(2)+bX(1)    cX(0)   aX(2)+bX(1)+cX(0)        Y(2)
X(3)    aX(3)   bX(2)   aX(3)+bX(2)    cX(1)   aX(3)+bX(2)+cX(1)        Y(3)

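Architectural Techniques: Pipelining
[Figure: pipelined 3-tap FIR filter with input X(n), delayed samples X(n-1) and X(n-2), and output Y(n)]

Node values, one row per clock cycle:

Input   Intermediate nodes (products and partial sums)                   Output
X(0)    -       -       -              -       -                        -
X(1)    aX(0)   -       aX(0)          -       aX(0)                    Y(0)
X(2)    aX(1)   bX(0)   aX(1)+bX(0)    -       aX(1)+bX(0)              Y(1)
X(3)    aX(2)   bX(1)   aX(2)+bX(1)    cX(0)   aX(2)+bX(1)+cX(0)        Y(2)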

Architectural Techniques: Pipelining
Even more pipelining

Node values, one row per clock cycle:

Input   Intermediate nodes (products and partial sums)                   Output
X(0)    -       -       -              -       -                        -
X(1)    aX(0)   -       -              -       -                        -
X(2)    aX(1)   bX(0)   aX(0)          -       aX(0)                    Y(0)
X(3)    aX(2)   bX(1)   aX(1)+bX(0)    -       aX(1)+bX(0)              Y(1)
X(4)    aX(3)   bX(2)   aX(2)+bX(1)    cX(0)   aX(2)+bX(1)+cX(0)        Y(2)

Architectural Techniques: Fine-Grain Pipelining
Pipelining at the operation level
Break the multiplier into two parts
[Figure: FIR filter with input X(n), delayed samples X(n-1) and X(n-2), and output Y(n); each multiplier is split into two parts, m1 and m2, separated by fine-grain pipelining registers]

Unrolling the Loop Using Pipelining
Calculation of $X^3$
Throughput = 8/3, or 2.7 bits/clock
Latency = 3 clocks
Timing = one multiplier in the critical path

Iterative implementation:
No new computations can begin until the previous computation has completed.

module power3(
  output reg [7:0] X3,
  output finished,
  input [7:0] X,
  input clk, start);
  reg [7:0] ncount;          // remaining multiplications
  reg [7:0] XPower, Xin;     // running product and stored input
  assign finished = (ncount == 0);
  always @(posedge clk)
    if (start) begin
      XPower <= X; Xin <= X; // load a new input
      ncount <= 2;           // two multiplications give X^3
      X3 <= XPower;          // capture the result of the previous run
    end
    else if (!finished) begin
      ncount <= ncount - 1;
      XPower <= XPower * Xin;
    end
endmodule

Unrolling the Loop Using Pipelining
Calculation of $X^3$
Throughput = 8/1, or 8 bits/clock (3X improvement)
Latency = 3 clocks
Timing = one multiplier in the critical path

Penalty: more area
Unrolling an algorithm with n iterative loops increases throughput by a factor of n.
[Figure: three-stage pipeline clocked by Clk, with input X[0:7] and registers XPower1, XPower2, X2 and XPower]

module power3(
  output reg [7:0] XPower,
  input clk,
  input [7:0] X);
  reg [7:0] XPower1, XPower2;
  reg [7:0] X2;
  always @(posedge clk) begin
    // Pipeline stage 1
    XPower1 <= X;
    // Pipeline stage 2
    XPower2 <= XPower1 * XPower1;
    X2 <= XPower1;
    // Pipeline stage 3
    XPower <= XPower2 * X2;
  end
endmodule
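
One way to see the 3-clock latency and the one-result-per-clock throughput is to simulate the pipeline. The sketch below repeats the pipelined module under the hypothetical name power3_pipe and adds a hypothetical testbench, power3_pipe_tb. The clock period, the input sequence and the message format are all assumptions chosen for illustration.

```verilog
`timescale 1ns/1ps

// Copy of the three-stage pipeline, renamed to avoid a clash with power3.
module power3_pipe(
    output reg [7:0] XPower,
    input            clk,
    input      [7:0] X);
    reg [7:0] XPower1, XPower2, X2;
    always @(posedge clk) begin
        XPower1 <= X;                  // stage 1: capture the input
        XPower2 <= XPower1 * XPower1;  // stage 2: X^2
        X2      <= XPower1;            //          carry X along
        XPower  <= XPower2 * X2;       // stage 3: X^3
    end
endmodule

// Hypothetical testbench: applies X = n before rising clock edge n.
module power3_pipe_tb;
    reg        clk = 1'b0;
    reg  [7:0] X   = 8'd0;
    wire [7:0] XPower;
    integer    n;

    power3_pipe dut(.XPower(XPower), .clk(clk), .X(X));

    always #5 clk = ~clk;              // 10 ns clock period (assumed)

    initial begin
        for (n = 1; n <= 8; n = n + 1) begin
            @(negedge clk);
            X = n;                     // new input every clock cycle
            @(posedge clk);
            #1;                        // let the registers update
            // After edge n the output holds the cube of the input
            // applied before edge n-2, i.e. (n-2)^3.
            if (n >= 3)
                $display("edge %0d: XPower = %0d, expected %0d",
                         n, XPower, (n-2)*(n-2)*(n-2));
        end
        $finish;
    end
endmodule
```

The inputs are kept small so that every cube that is checked fits in the 8-bit output. A new input is accepted on every clock, while each result appears three edges after its input was applied.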

Removing Pipeline Registers (to Improve Latency)
Calculation of $X^3$
Throughput = 8 bits/clock (3X improvement)
Latency = 0 clocks
Timing = two multipliers in the critical path

Latency can be reduced by removing pipeline registers.

module power3(
  output [7:0] XPower,
  input [7:0] X);
  reg [7:0] XPower1, XPower2;
  reg [7:0] X1, X2;
  always @*
    XPower1 = X;
  always @(*)
  begin
    X2 = XPower1;
    XPower2 = XPower1 * XPower1;
  end
  assign XPower = XPower2 * X2;
endmodule

Architectural Techniques: Parallel Processing
In parallel processing the same hardware is duplicated. This:
Increases the throughput without changing the critical path
Increases the silicon area

Original system (inputs a(n), b(n), X(n), output Y(n)): clock freq f, throughput M samples
Pipelining: clock freq 2f, throughput 2M samples
Parallel processing (one copy computes Y(2k) from a(2k), b(2k), X(2k), the other computes Y(2k+1) from a(2k+1), b(2k+1), X(2k+1)): clock freq f, throughput 2M samples

Architectural Techniques: Parallel Processing
Parallel processing for a 3-tap FIR filter
Both have the same critical path (M + 2A)
Parallel factor: 3

$$y(3k) = a\,x(3k) + b\,x(3k-1) + c\,x(3k-2)$$
$$y(3k+1) = a\,x(3k+1) + b\,x(3k) + c\,x(3k-1)$$
$$y(3k+2) = a\,x(3k+2) + b\,x(3k+1) + c\,x(3k)$$

Not a simple duplication!
[Figure: 3-parallel FIR filter with inputs X(3k), X(3k+1), X(3k+2), stored samples X(3k-1), X(3k-2), and outputs Y(3k), Y(3k+1), Y(3k+2)]
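
The 3-parallel FIR filter can be sketched in Verilog roughly as follows. This is an illustrative sketch rather than the circuit from the slide: the module name, port names, the W parameter, unsigned arithmetic and the absence of a reset are all assumptions.

```verilog
// 3-parallel 3-tap FIR: three outputs per clock from three new inputs.
// Only two samples from the previous block need to be stored.
module fir3_par3 #(parameter W = 8) (
    input  wire           Clk,
    input  wire [W-1:0]   a, b, c,            // filter coefficients
    input  wire [W-1:0]   x3k, x3k1, x3k2,    // x(3k), x(3k+1), x(3k+2)
    output wire [2*W+1:0] y3k, y3k1, y3k2     // y(3k), y(3k+1), y(3k+2)
);
    reg [W-1:0] xm1, xm2;   // x(3k-1) and x(3k-2) from the previous block

    always @(posedge Clk) begin
        xm1 <= x3k2;        // x(3k+2) becomes x(3(k+1)-1)
        xm2 <= x3k1;        // x(3k+1) becomes x(3(k+1)-2)
    end

    // Each output has one multiplier and two adders in its critical path.
    // The 2W+2-bit outputs hold the sum of three 2W-bit products.
    assign y3k  = a*x3k  + b*xm1  + c*xm2;
    assign y3k1 = a*x3k1 + b*x3k  + c*xm1;
    assign y3k2 = a*x3k2 + b*x3k1 + c*x3k;
endmodule
```

The three output expressions share input samples instead of each owning a private delay line, which is what "not a simple duplication" refers to: only two registers are needed for the whole block.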

Sample Period vs. Clock Period
Pipelined system: $T_{clk} = T_{sample}$

$$T_{sample} = T_{Clk} \ge T_M$$

Parallel system: $T_{clk} \ne T_{sample}$

$$T_{sample} = \frac{1}{3} T_{Clk} \ge \frac{1}{3}\,(T_M + 2T_A)$$

Higher sample rate than the clock rate
[Figure: 3-parallel FIR filter with inputs X(3k), X(3k+1), X(3k+2), stored samples X(3k-1), X(3k-2), and outputs Y(3k), Y(3k+1), Y(3k+2)]
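
A short numeric example for the 3-parallel case, using assumed delays of $T_M = 10$ ns for a multiplier and $T_A = 2$ ns for an adder (illustrative values, not taken from the slides):

```latex
\begin{aligned}
T_{Clk} &\ge T_M + 2T_A = 10 + 2 \cdot 2 = 14\,\text{ns}
  &&\Rightarrow\quad f_{Clk} \le \frac{1}{14\,\text{ns}} \approx 71.4\,\text{MHz} \\
T_{sample} &= \frac{T_{Clk}}{3} \ge \frac{14}{3}\,\text{ns} \approx 4.67\,\text{ns}
  &&\Rightarrow\quad f_{sample} \le \frac{3}{14\,\text{ns}} \approx 214\,\text{MHz}
\end{aligned}
```

The clock limit is set by the unchanged critical path, while the sample rate is three times higher because three samples are processed in every clock cycle.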

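Complete Parallel System with S/P and P/S
A complete parallel system:
[Figure: complete parallel system with serial-to-parallel and parallel-to-serial converters]

S/P and P/S Blocks
S/P converter:
[Figure: serial-to-parallel converter]

P/S converter:
[Figure: parallel-to-serial converter with inputs Y(3k), Y(3k+1), Y(3k+2), delays of T/3, and serial output Y(n) at sample period T/3]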
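
The converter circuits themselves did not survive in the text above, so the following is only one plausible way to build the serial-to-parallel side for a parallel factor of 3, not necessarily the structure on the slide. It assumes a sample-rate clock (period T/3), a synchronous reset and a valid pulse to mark complete blocks; all module, port and signal names are hypothetical.

```verilog
// Serial-to-parallel converter, factor 3: collects three consecutive
// samples x(3k), x(3k+1), x(3k+2) and presents them together.
module s2p3 #(parameter W = 8) (
    input  wire         clk_s,            // sample-rate clock (period T/3)
    input  wire         rst,              // synchronous reset
    input  wire [W-1:0] x,                // serial input x(n)
    output reg  [W-1:0] x3k, x3k1, x3k2,  // parallel output block
    output reg          valid             // high for one cycle per block
);
    reg [1:0]   phase;                    // position inside the block: 0, 1, 2
    reg [W-1:0] s0, s1;                   // holding registers for x(3k), x(3k+1)

    always @(posedge clk_s) begin
        valid <= 1'b0;                    // default, overridden below
        if (rst) begin
            phase <= 2'd0;
        end else begin
            case (phase)
                2'd0: begin s0 <= x; phase <= 2'd1; end
                2'd1: begin s1 <= x; phase <= 2'd2; end
                default: begin
                    // Third sample arrives: publish the whole block at once
                    x3k   <= s0;
                    x3k1  <= s1;
                    x3k2  <= x;
                    valid <= 1'b1;
                    phase <= 2'd0;
                end
            endcase
        end
    end
endmodule
```

The parallel outputs change once every three sample clocks, so the processing block behind them can run at the slower clock of period T. A parallel-to-serial converter would do the reverse: load three results at once and shift one out every T/3.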

When Pipelining, When Parallelism?
Pipelining is used when the critical path is inside the design (Numbers 1, 2, 3, 4).
Parallelism is used when the critical path is bounded by communication or I/O (Number 5), a.k.a. communication bounded.
Pipelining does not help in this case!
[Figure: two combinational logic blocks, with path 2 labeled]
