
PIPELINING

Advanced Computer Architecture


Dr. M Faisal Iqbal
RECALL: Latency vs. Throughput
• Latency: Time it takes to complete one instruction
• Throughput: Number of instructions completed per unit time
• Rate of completion of instructions
Pipelining
• Hardware technique to improve throughput of the system
• Exploits instruction level parallelism
• Multiple instructions execute in an overlapped fashion
• Cut the datapath into multiple stages (5 stages in our pipeline)
• Nothing magical about 5 stages. Pentium 4 has 22 stages
• One instruction in each stage in each cycle
• Ideal CPI = 1: an instruction enters and leaves the pipeline every cycle
Pipeline Diagram
• Pipeline diagram: shows which pipeline stage an insn is in at each instant of
time
• Across: cycles
• Down: insns
• Convention: the E below means load R4,[R5+8] finishes the Execute stage and
writes into the M latch at the end of cycle 4

                 1  2  3  4  5  6  7  8  9
add R3,R2,R1     F  D  E  M  W
load R4,[R5+8]      F  D  E  M  W
store R6,[R7]          F  D  E  M  W
Pipelining in a Processor
        1   2   3   4   5   6   7   8   9
I1:     F1  D1  E1  M1  W1
I2:         F2  D2  E2  M2  W2
I3:             F3  D3  E3  M3  W3
I4:                 F4  D4  E4  M4  W4
I5:                     F5  D5  E5  M5  W5

• It takes some time to fill the pipeline, but once it is filled,
instructions complete at a rate of one per cycle
• Example: If we have n = 10 instructions and k = 5 stages
• No pipeline: n × k = 10 × 5 = 50 cycles
• With pipelining: k + (n − 1) = 5 + 9 = 14 cycles
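A minimal sketch of the cycle-count arithmetic above (the function names are illustrative, not from the slides):

```python
def unpipelined_cycles(n_insns: int, n_stages: int) -> int:
    """Each instruction uses the whole datapath before the next starts."""
    return n_insns * n_stages

def pipelined_cycles(n_insns: int, n_stages: int) -> int:
    """The first instruction fills the pipeline (n_stages cycles);
    each later instruction completes one cycle after the previous one."""
    return n_stages + (n_insns - 1)

print(unpipelined_cycles(10, 5))  # 50
print(pipelined_cycles(10, 5))    # 14
```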
Five stage pipeline datapath

[Figure: five-stage pipeline datapath — PC with a +4 incrementer, Insn Mem, Register File (ports s1, s2, d), ALU, and Data Mem, separated by the inter-stage latches B1(D), B2(E), B3(M), B4(W); the PC and IR are carried along the pipeline]
Question
• Why is pipeline clock period > (Data path delay)/(Number of stages)?

• Three reasons
• Latches add delay
• Pipeline stages have different delays, clock period is max delay
• Extra datapath and logic for pipelining (e.g., data forwarding)

• These factors have implications for ideal number of pipeline stages


• Diminishing clock frequency gains for longer (deeper) pipelines
What is the optimal number of pipeline stages?
• Clocking constraints determine the physical limit to the depth of the
pipeline

• Maximum pipeline depth may not be the optimal design. Why?


• Consider cost or overhead of pipelining
• Tradeoff between cost and performance must be considered
• A model for cost/performance tradeoff has been proposed by Peter Kogge
(1981)
Cost Performance Tradeoff of a Pipeline
• Cost of the non-pipelined design = G
• Gate count, transistor count, silicon area, etc.
• Cost of a k-stage pipeline: C = G + k × L
• L is the cost of each added inter-stage latch or buffer

• Latency of the non-pipelined system = T

• Performance of the non-pipelined design = 1/T
• Performance of the pipelined design: P = 1 / (T/k + S)
• S is the delay added by each latch
Cost Performance Tradeoff of a Pipeline
• Expression for the cost-to-performance ratio:
• C/P = (G + kL) × (T/k + S) = LT + GS + LSk + GT/k
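Minimizing C/P over the number of stages k gives Kogge's optimal pipeline depth; a short derivation, consistent with the expression above:

```latex
\[
\frac{C}{P} = LT + GS + LSk + \frac{GT}{k}, \qquad
\frac{d}{dk}\!\left(\frac{C}{P}\right) = LS - \frac{GT}{k^{2}} = 0
\;\;\Longrightarrow\;\; k_{\text{opt}} = \sqrt{\frac{GT}{LS}}
\]
```

Intuitively, cheap fast latches (small L, S) relative to a large slow datapath (large G, T) favor a deeper pipeline.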
Why is Pipeline CPI > 1 ?
• Initial fill: Not a big problem (as instruction count → ∞, CPI → 1)
• We have billions of instructions in our program, so we can ignore the first few
instructions and say that at steady state we complete one instruction
every cycle
• Pipeline stalls:
• Analogy: a car assembly line can also stall
• Instruction pipelines have stalls too
• Stalls are used to resolve hazards
• Hazards: Conditions that jeopardize the sequential illusion in a pipeline
• Stall: Pipeline delay added to restore the sequential illusion
Pipelining Issues
• The instruction pipeline stalls because of the following issues:
• Data Dependence
• Memory Delays
• Branch Delays
Data Dependence
• Independent operations
• I1: Add R3, R2, R1
• I2: Add R5, R4, R6

• Dependent Operations
• I1: Add R3, R2, R1
• I2: Sub R4, R3, 30

        1  2  3  4  5  6  7  8  9
I1:     F  D  E  M  W
I2:        F  D  D  D  D  E  M  W
I3:           F  F  F  F  D  E  M
Resolution of Data Dependence
• Software technique
• Hardware Interlock
• Data Forwarding
Handling Data Dependence in Software
• The compiler detects a data dependence between two successive
instructions and inserts explicit NOP (no-operation) instructions
between them
• The following sequence of instructions is generated:
• I1: Add R3,R2,R1
• I2: NOP
• I3: NOP
• I4: NOP
• I5: Sub R4,R3,30

                 1  2  3  4  5  6  7  8  9
Add R3,R2,R1     F  D  E  M  W
NOP                 F  D  E  M  W
NOP                    F  D  E  M  W
NOP                       F  D  E  M  W
Sub R4,R3,30                 F  D  E  M  W

• + Simplifies the hardware
• - Code size increases
• - Execution time is not reduced
Hardware Interlock
• Hardware inserts NOP between dependent instructions
• C1: I1 is fetched and put in B1
• C2: I1 moves to B2, I2 moves to B1
• C3: I2 tries to read regs but … should not
• Contents of B1 should not change
• B2 should get a NOP
• PC should not get incremented
[Figure: the five-stage datapath as before, with latches B1(D), B2(E), B3(M), B4(W)]
Hardware Interlock

[Figure: the same datapath augmented with an Interlock unit — it gates PC_enable (freezing the PC and B1) and injects a NOP into B2(E) when a dependence is detected]
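A minimal sketch of the interlock decision (the instruction representation is an illustrative assumption, not from the slides):

```python
# Stall the Decode stage when the insn being decoded reads a register
# that an older insn still in flight (in E or M) will write.
def must_stall(decode_insn, in_flight):
    """decode_insn and in_flight entries are (dest, src1, src2) or None."""
    if decode_insn is None:
        return False
    _, src1, src2 = decode_insn
    for older in in_flight:          # insns currently in E and M
        if older is None:
            continue
        dest = older[0]
        if dest is not None and dest in (src1, src2):
            return True
    return False

# When must_stall(...) is true: PC_enable = 0, B1 holds its contents,
# and a NOP is written into B2(E); otherwise the pipeline advances.
```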
Bypassing
[Figure: execute-stage bypass — the ALU output held in the M latch is fed back through a mux to ALU input A; shown for add R4,R3,30 (in Execute) depending on add R3,R2,R1 (one stage ahead)]

• Bypassing
• Reading a value from an intermediate (microarchitectural) source
• Not waiting until it is available from the primary source
• Here, we are bypassing the register file
• Also called forwarding
WE Bypassing

[Figure: writeback-stage bypass — the value in the W latch is fed back through the ALU-input-A mux; shown for add R4,R3,R2 (in Execute) depending on add R3,R2,R1 (two stages ahead)]

• What about this combination?


• Add another bypass path and MUX (multiplexer) input
• The first one was an ME bypass (M latch → Execute stage)
• This one is a WE bypass (W latch → Execute stage)
ALUinB Bypassing
[Figure: the same bypass paths routed to ALU input B instead of A; shown for add R4,R2,R3 (in Execute) depending on add R3,R2,R1]

• Can also bypass to ALU input B


WM Bypassing?

[Figure: a candidate path from the W latch into the M stage of the Data Mem; shown for store R4,[R3+4] following load R3,[R2+8]]

• Does WM bypassing work?

• Not to the address input (why not? the address is computed back in the
Execute stage, before the load's value is available)
store R4,[R3+4] load R3,[R2+8]   ✗
• But to the store data input, yes
store R3,[R4+4] load R3,[R2+8]   ✓
Bypassing Logic

[Figure: the same datapath with a bypass-control block driving the ALU-input-A mux]

• Each multiplexer has its own select logic; here it is for ALUinA:

(X.IR.RegSrc1 == M.IR.RegDest) => 0   (ME bypass)
(X.IR.RegSrc1 == W.IR.RegDest) => 1   (WE bypass)
else                           => 2   (register file)
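The same select logic as an executable sketch (field names such as reg_src1/reg_dest are illustrative assumptions):

```python
def alu_in_a_select(x_ir, m_ir, w_ir) -> int:
    """Mux select for ALU input A: 0 = ME bypass, 1 = WE bypass,
    2 = register file. Each *_ir is the insn latched in that stage (or None)."""
    if m_ir is not None and x_ir.reg_src1 == m_ir.reg_dest:
        return 0  # youngest matching value, from the M latch
    if w_ir is not None and x_ir.reg_src1 == w_ir.reg_dest:
        return 1  # older value still sitting in the W latch
    return 2      # no match: read the register file as usual
```

Note the priority: the M latch is checked first because it holds the most recent write to the register.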
Memory Delays
• Memory Access stage may take more than 1 cycle
• For example: A cache miss
        1  2  3  4  5  6  7  8  9
Ij:     F  D  E  M  M  M  W
Ij+1:      F  D  E  E  E  M  W
Ij+2:         F  D  D  D  E  M  W

• Data dependence involving a load instruction


• Draw the pipeline execution diagram for the following instructions, with and
without data forwarding (one possible solution is sketched below)
• I1: Load R2, [R3]
• I2: Add R3, R2, 20
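A possible solution, assuming the five-stage pipeline used above (register values are read in D; forwarding delivers the load data from the M latch into E):

                     1  2  3  4  5  6  7  8  9
Without forwarding:
I1: Load R2,[R3]     F  D  E  M  W
I2: Add R3,R2,20        F  D  D  D  D  E  M  W

With forwarding:
I1: Load R2,[R3]     F  D  E  M  W
I2: Add R3,R2,20        F  D  D  E  M  W

Even with forwarding, one stall cycle remains, because the load data only becomes available at the end of M.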
Performance impact of Load/Use Penalty
• Assume the following statistics for a program's instructions
• Branch: 20%, load: 20%, store: 10%, other: 50%
• 50% of loads are followed by a dependent instruction
• These require a 1-cycle stall (i.e., insertion of 1 NOP)

• Calculate CPI
• CPI = 1 + (1 × 20% × 50%) = 1.1
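The same arithmetic as a tiny sketch (the helper name is illustrative; it also reproduces the branch CPIs computed later in the deck):

```python
def cpi(stall_frac: float, penalty: float, base: float = 1.0) -> float:
    """Steady-state CPI = base + (fraction of insns that stall) * penalty."""
    return base + stall_frac * penalty

print(cpi(0.20 * 0.50, 1))  # load-use stalls: 1.1
print(cpi(0.20 * 0.70, 2))  # taken branches, 2-cycle penalty: ~1.28
```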
Branch Delays
• In pipelined execution, a new instruction is fetched every cycle
• The preceding instruction is still being decoded
• Branch instructions can alter the sequence of instructions
• But they must first be decoded & executed to determine whether and where
to branch
Unconditional Branches
• Steps involved in the execution of an unconditional branch, e.g., jmp [PC+x]
1. MemAd = PC, ReadMem, IR = memData, PC = PC + 4
2. Decode
3. PC = PC + x
4. No action
5. No action

[Figure: the five-stage datapath as before; the target PC = PC + x is computed and written back to the PC from the Execute stage]
Conditional Branches
• Steps involved in the execution of a conditional branch,
e.g., BEQ R5,R6, PC+x
1. MemAd = PC, ReadMem, IR = memData, PC = PC + 4
2. Decode
3. Compare R5 and R6; if R5 == R6, PC = PC + x
4. No action
5. No action

[Figure: the five-stage datapath as before; the comparison and the PC update happen in the Execute stage]
Branch Delays

        t0          t1          t2          t3          t4    t5
Insth   IF(PC)      ID          ALU         MEM
Insti               IF(PC+4)    ID          ALU
Instj                           IF(PC+8)    ID
Instk                                       IF(target)

Insth is a branch.
• The branch condition and target are evaluated in the ALU stage
• When the branch resolves, the branch target (Instk) is fetched
• All instructions fetched since Insth (so-called "wrong-path"
instructions) must be flushed
• This two-cycle penalty is called the branch penalty
Pipeline Flush on a Misprediction

        t0          t1          t2          t3          t4    t5
Insth   IF(PC)      ID          ALU         MEM         WB
Insti               IF(PC+4)    ID          killed
Instj                           IF(PC+8)    killed
Instk                                       IF(target)  ID    ALU
Instl                                                   IF    ID

Insth is a branch.
Performance Analysis
• Correct guess → no penalty
• Incorrect guess → 2 bubbles
• Assume
• no data hazards
• 20% of dynamic instructions are control-flow instructions
• 70% of control-flow instructions are taken
• Calculate CPI (fetch always continues on the not-taken path)
• CPI = 1 + (0.20 × 0.7) × 2 = 1 + 0.14 × 2 = 1.28
• 0.20 × 0.7 is the probability of a wrong guess; 2 is the penalty for a wrong guess

Can we reduce either of the two penalty terms?
Reducing Branch Misprediction Penalty
• Compute the branch target address earlier in the pipeline
• Compute the target address and update the PC in the Decode stage, i.e., the
correct target instruction can be fetched one cycle earlier
• An additional comparator is needed in the Decode stage
• The adder for target-address calculation also moves to the Decode stage

[Figure: the datapath with the comparator and the target-address adder moved into the Decode stage, so the PC is redirected from D instead of E]
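With the branch resolved in Decode, the misprediction penalty drops from 2 bubbles to 1; with the same statistics as before, CPI = 1 + (0.20 × 0.7) × 1 = 1.14.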


The Branch Delay Slot
• Original sequence of instructions:      • With the Add moved into the delay slot:
  Add R7, R8, R9                            BEQ R3,0, TARGET
  BEQ R3,0, TARGET                          Add R7, R8, R9
  Ij+1                                      Ij+1
  .                                         .
  .                                         .
  Target: Ik                                Target: Ik

• In case the branch is taken, Ij+1 is quashed

• If the branch is not taken, Ij+1 is correctly executed
• In both cases, Ij+1 is always fetched
• IDEA: Put an instruction right after the branch that is executed whether the branch is
taken or not. The position right after the branch is called the Branch Delay Slot
• Logically, the branch's effect is as if it were placed after the Add instruction
• Experimental data shows the delay slot can be filled with a useful instruction in 70% or
more of cases
Branch Prediction

• Problem: Need to determine the next fetch address when the branch
is fetched (to avoid a pipeline bubble)
• Requires three things to be predicted at the fetch stage:
• Whether the fetched instruction is a branch
• (Conditional) branch direction
• Branch target address (if taken)
Branch Prediction
• Observation: The target address remains the same for a conditional direct
branch across dynamic instances
• Idea: Store the target address from the previous instance and access it with the
PC
• The structure that stores it is called the Branch Target Buffer (BTB) or Branch Target Address Cache
Fetch Stage with BTB and Direction Prediction
[Figure: fetch stage with a direction predictor ("taken?") and a cache of target addresses (BTB: Branch Target Buffer), both indexed by the Program Counter (the address of the current branch); on a BTB hit, the next fetch address is the predicted target if taken, otherwise PC + inst size]
Simple Branch Direction Prediction Schemes
• Compile time (static)
• Always not taken
• Always taken
• BTFN (backward taken, forward not taken) — see the sketch below

• Run time (dynamic)
• Last-time prediction (single bit)
• Two-bit-counter-based prediction
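A minimal sketch of the BTFN heuristic (loop-closing branches usually jump backward and are usually taken):

```python
def btfn_predict(branch_pc: int, target_pc: int) -> bool:
    """Static BTFN: predict taken iff the branch jumps backward,
    i.e., its target is at a lower address than the branch itself."""
    return target_pc < branch_pc
```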
Dynamic Branch Prediction
• Idea: Predict branches based on dynamic information (collected at
run-time)

• Advantages
+ Prediction based on history of the execution of branches
+ It can adapt to dynamic changes in branch behavior
+ No need for static profiling: input set representativeness problem goes away

• Disadvantages
-- More complex (requires additional hardware)
Last Time Predictor
• Last-time predictor
• Single bit per branch (stored in the BTB)
• Indicates which direction the branch went the last time it executed
TTTTTTTTTTNNNNNNNNNN → 90% accuracy

• Always mispredicts the last iteration and the first iteration of
a loop branch
• Accuracy for a loop with N iterations = (N-2)/N

+ Good for loop branches of loops with a large number of iterations

-- Bad for loop branches of loops with a small number of iterations
TNTNTNTNTNTNTNTNTNTN → 0% accuracy

Last-time predictor CPI = 1 + (0.20 × 0.15) × 2 = 1.06 (assuming 85% accuracy)
Implementing the Last-Time Predictor
[Figure: the PC is split into a tag and a BTB index; the indexed BTB entry holds an N-bit tag and one history bit per branch; on a tag match, the bit ("taken?") selects between the stored target and PC+4 for nextPC]

The 1-bit BHT (Branch History Table) entry is updated with
the correct outcome after each execution of a branch
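A minimal sketch of such a one-bit table (the table size and PC hashing are illustrative assumptions; tag matching is omitted):

```python
class LastTimePredictor:
    """One history bit per entry, indexed by (part of) the PC."""
    def __init__(self, entries: int = 1024):
        self.table = [False] * entries        # False = not taken

    def _index(self, pc: int) -> int:
        return (pc >> 2) % len(self.table)    # drop byte offset, then mod

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)]    # predict the last outcome

    def update(self, pc: int, taken: bool) -> None:
        self.table[self._index(pc)] = taken   # remember the new outcome
```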
State Machine for Last-Time Prediction

[Figure: two-state FSM — in "predict not taken", an actually-taken branch moves the state to "predict taken"; in "predict taken", an actually-not-taken branch moves it back; otherwise the state is unchanged]
Improving the Last Time Predictor

• Problem: A last-time predictor changes its prediction from
T→N or N→T too quickly
• even though the branch may be mostly taken or mostly not taken

• Solution idea: Add hysteresis to the predictor so that the
prediction does not change on a single different outcome
• Use two bits to track the history of predictions for a branch instead
of a single bit
• Can have 2 states for T and 2 states for NT, instead of 1 state for each

• Smith, "A Study of Branch Prediction Strategies," ISCA 1981.
Two-Bit Counter Based Prediction
• Each branch is associated with a two-bit counter
• The extra bit provides hysteresis
• A strong prediction does not change on one single
different outcome

• Accuracy for a loop with N iterations = (N-1)/N

TNTNTNTNTNTNTNTNTNTN → 50% accuracy
(assuming initialization to weakly taken)

+ Better prediction accuracy
-- More hardware cost (but the counter can be part of a BTB entry)

2BC predictor CPI = 1 + (0.20 × 0.10) × 2 = 1.04 (assuming 90% accuracy)
State Machine for 2-bit Saturating Counter
• Counter using saturating arithmetic

[Figure: four-state FSM — states 00 and 01 predict not taken, 10 and 11 predict taken; "actually taken" moves the counter up (saturating at 11), "actually not taken" moves it down (saturating at 00)]
State Machine for 2-bit Saturating Counter
• Counter using saturating arithmetic

[Figure: the same FSM with named states — SNT (strong not taken), LNT (likely not taken), LT (likely taken), ST (strong taken)]
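A minimal sketch of the 2-bit saturating counter (the 0–3 state encoding is an assumption matching the 00–11 figure):

```python
class TwoBitCounter:
    """0 = strong not taken, 1 = likely not taken,
    2 = likely taken, 3 = strong taken."""
    def __init__(self, state: int = 2):          # init to weakly taken
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2                   # upper half predicts taken

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(self.state + 1, 3)  # saturate at strong taken
        else:
            self.state = max(self.state - 1, 0)  # saturate at strong not taken
```

On the alternating pattern TNTN..., this counter oscillates between states 2 and 3 and mispredicts every N, reproducing the 50% accuracy quoted above.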
