Escolar Documentos
Profissional Documentos
Cultura Documentos
Eylon Caspi
University of California, Berkeley
eylon@cs.berkeley.edu
4/8/05
Outline
♦ Why Queues?
♦ Queue Connected Systems
• “Streaming Systems”
♦ Queue Implementations
♦ Stream Enabled Pipelining
performance
♦ How to • Register File AA BA CA
connect / Memory
modules?
RF
A
• Point to pt. AA Q BA Q’ CA
channels
PA CA
• Data every cycle
PA V CA
• Data valid signal
D
♦ Consumer may stall PA CA
• Back-pressure signal B
D
♦ Either may stall PA V CA
• Valid + Back-pressure B
PA V CA
B 0 4 8 12
0 4 8 12
♦ Busrty communication
• P + C each have average throughput 1/2
• Producer is bursty • Consumer is steady
PA V QA V’ CA
B B’ 0 4 8 12
♦ Queue contents: c d d
a b b c c d d
0 4 8 12
X
• Reorder: P:{x,y} C:{y,x}
PA CA
• Store: P:{x*,y} C:{x,y} Y
♦ For convenience
• Buffer up / packetize to communicate with microprocessor
• Off chip • Time multiplexed computation (SCORE)
4/8/05 Eylon Caspi 8
Streaming Systems
♦ Suppose queues were the only form of IPC
• Stream = FIFO channel with buffering (queue)
• Every compute module (process) must stall waiting for
♦ Input data ♦ Output buffer space
Clk
D (Data)
D
V (Valid)
Producer Consumer V
B
B (Back-pressure)
♦ Shift register
• Put: shift all entries
• Get: tail pointer
♦ Circular buffer
• Memory with
head / tail pointers
4/8/05 Eylon Caspi 12
Enabled Register Queue
♦ Systolic, depth-1 stage iB iV iD
♦ If ( ! ob ) then produce
• Else (neither full nor empty) Non-
Empty
♦ If ( iv ^ ob ) then consume full
♦ If ( ! iv ^ ob ) then idle
♦ If ( ! iv ^ ! ob ) then produce
oB oV oD
4/8/05 Eylon Caspi 16
Shift Register Queue – Verilog 1
module Q_srl (clock, reset, i_d, i_v, i_b, o_d, o_v, o_b);
parameter depth = 16; // - greatest #items in queue (2 <= depth <= 256)
parameter width = 16; // - width of data (i_d, o_d)
input clock;
input reset;
endmodule // Q_srl
Min of LUT-FFs
10000
Width
1000 1
2
4
100 8
16
32
10
64
128
1
2 4 8 16 32 64 128
Q_srl.v
Name Depth
Min of LUT-FFs/bit
3.5
3 Width
1
2.5 2
2 4
8
1.5 16
1 32
64
0.5 128
0
2 4 8 16 32 64 128
Q_srl.v
Name Depth
300 Width
1
250 2
200 4
8
150 16
100 32
64
50 128
0
2 4 8 16 32 64 128
Q_srl.v
Name Depth
Q_srl.v
Name Depth
0.6 Width
1
0.5 2
0.4 4
8
0.3 16
0.2 32
64
0.1 128
0
2 4 8 16 32 64 128
Q_srl.v
Name Depth
3 1
2
2.5
4
2 8
1.5 16
32
1
64
0.5 128
0
2 4 8 16 32 64 128
Q_srl.v
Name Depth
♦ Clk-to-B slows due to (1) wider addr cmp, (2) higher addr fanout
4/8/05 Eylon Caspi 32
SRL Queue with Data Output Reg. (SRL+D)
iB iV iD
♦ Registered data out
• od (clock-to-Q delay)
• Non-retimable
♦ Data output register 0 +1 -1
extends shift register
♦ Bypass shift register Empty
♦ Address = number
of stored elements =Depth-2 =0
minus 2 More
full
zero
♦ Flow control Data Out
• ov = !(State==Empty)
• ib = (Address
==Depth-2)
oB oV oD
4/8/05 Eylon Caspi 33
SRL+D Queue – Control
iB iV iD
♦ State Empty (consume into D out reg.)
♦ If ( iv ) then consume
♦ If ( ! iv ) then idle
♦ State One (consume into shift reg.)
♦ If ( iv ^ ob ) then consume
♦ If ( iv ^ ! ob ) then consume + produce
Empty
♦ If ( ! iv ^ ob ) then idle
♦ If ( ! iv ^ ! ob ) then produce
♦ State More (consume into shift reg.) One
• If (full) then
♦ If ( ob ) then idle
♦ If ( ! ob ) then produce
More full
• Else (neither full nor empty)
♦ If ( iv ^ ob ) then consume zero
♦ If ( iv ^ ! ob ) then consume + produce
♦ If ( ! iv ^ ob ) then idle
♦ If ( ! iv ^ ! ob ) then produce
oB oV oD
4/8/05 Eylon Caspi 34
SRL+D: Data Delay
Revision rev_4__200mhz Device XC3S1000
0.6 Width
1
0.5 2
0.4 4
8
0.3 16
0.2 32
64
0.1 128
0
2 4 8 16 32 64 128
Q_srl_d.v
Name Depth
1.2 Width
1
1 2
0.8 4
8
0.6 16
0.4 32
64
0.2 128
0
2 4 8 16 32 64 128
Name Depth
Min of LUT-FFs
2.5
Width
2 1
2
1.5 4
8
1 16
32
0.5 64
128
0
2 4 8 16 32 64 128
Name Depth
♦ Larger area from data out reg. – cannot pack with shift reg.
4/8/05 Eylon Caspi 37
SRL+D Queue: Valid Delay
Revision rev_4__200mhz Device XC3S1000
Q_srl_d.v
Name Depth
• ib = (Address
Address
en
Shift Reg
==Depth-2)
One
=Depth-2 =0
More
full
zero
Data Out
oB oV oD
4/8/05 Eylon Caspi 39
SRL+DV: Valid Delay
Revision rev_4__200mhz Device XC3S1000
0.6 Width
1
0.5 2
0.4 4
8
0.3 16
0.2 32
64
0.1 128
0
2 4 8 16 32 64 128
Q_srl_dv.v
Name Depth
3 1
2
2.5
4
2 8
1.5 16
32
1
64
0.5 128
0
2 4 8 16 32 64 128
Q_srl_dv.v
Name Depth
♦ Clk-to-B slows w/depth, (1) wider addr cmp, (2) higher addr fanout
4/8/05 Eylon Caspi 41
Pre-Computed Back-Pressure (SRL+DVB)
iB iV iD
♦ Registered back-
pressure out
• ob (clock-to-Q delay)
• Non-retimable
0 +1 -1
♦ Based on pre-
computed fullness Empty
• full_next = en
(Address_next Address Shift Reg
==Depth-2) One
♦ Flow control
• ov_next = !(State_next full_next
=Depth-2 =0
==Empty) More
full
zero
• ib_next = Data Out
(Address_next
==Depth-2)
oB oV oD
4/8/05 Eylon Caspi 42
SRL+DVB: Back-Pressure Delay
Revision rev_4__200mhz Device XC3S1000
0.6 Width
1
0.5 2
0.4 4
8
0.3 16
0.2 32
64
0.1 128
0
2 4 8 16 32 64 128
Q_srl_dvb.v
Name Depth
300 Width
1
250 2
200 4
8
150 16
100 32
64
50 128
0
2 4 8 16 32 64 128
Q_srl_dvb.v
Name Depth
iB iV iD
♦ SRL+DVB critical loop
full, FSM,
Address update,
Address compare 0 +1 -1
♦ Speed up full
Empty
pre-computation en
by special-casing Address Shift Reg
full_next for One
each state full
• ib_next = full_next
♦ zero pre-computation
is less critical oB oV oD
4/8/05 Eylon Caspi 45
SRL+DVBF: Speedup
Revision rev_4__200mhz Device XC3S1000
1.2 Width
1
1 2
0.8 4
8
0.6 16
0.4 32
64
0.2 128
0
2 4 8 16 32 64 128
Name Depth
300 Width
1
250 2
200 4
8
150 16
100 32
64
50 128
0
2 4 8 16 32 64 128
Q_srl_dvbf.v
Name Depth
Min of LUT-FFs
1.2
Width
1
1
0.8 2
4
0.6 8
16
0.4 32
64
0.2
128
0
2 4 8 16 32 64 128
Name Depth
Min of LUT-FFs
10000
Width
1000 1
2
4
100 8
16
32
10
64
128
1
2 4 8 16 32 64 128
Q_srl_dvbf.v
Name Depth
Long distance
D D D
V V Original V
Producer Q2 Consumer
Queue
B B B
Long distance
D (Data) D
V (Valid) Queue V
Producer with Consumer
2N Reserve
B B
(Backpressure)
4/8/05 Eylon Caspi 52
Logic Pipelining
♦ Add N pipeline registers to D, V
♦ Retime backwards
• This pipelines feed-forward parts of producer’s data-path
♦ Stale flow control may overflow queue (by N)
♦ Modify queue to emit back-pressure when empty slots ≤ N
♦ No manual modification of processes!
Retime
D (Data) D
V (Valid) Queue V
Producer with Consumer
N Reserve
B (Backpressure) B
D D D
V V Original V
Producer Consumer
en Queue
B B B