Cycle:            1  2  3  4  5  6  7  8  9
add   R3,R2,R1    F  D  E  M  W
load  R4,[R5+8]      F  D  E  M  W
store R6,[R7]           F  D  E  M  W
Pipelining in a Processor
Cycle:  1   2   3   4   5   6   7   8   9   10
I1:     F1  D1  E1  M1  W1
I2:         F2  D2  E2  M2  W2
I3:             F3  D3  E3  M3  W3
I4:                 F4  D4  E4  M4  W4
I5:                     F5  D5  E5  M5  W5
• It takes a few cycles to fill the pipeline, but once it is full, one
instruction completes every cycle
• Example: a program with 10 instructions on a 5-stage pipeline
• No pipelining: 50 cycles
• With pipelining: 14 cycles
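The cycle counts above follow the standard formula: with k stages and n instructions, an unpipelined machine needs k × n cycles, while a pipelined one needs k + (n − 1) (k cycles to fill the pipe, then one completion per cycle). A minimal sketch:

```python
def cycles_unpipelined(n_insns: int, n_stages: int) -> int:
    # Each instruction occupies the whole datapath for n_stages cycles.
    return n_insns * n_stages

def cycles_pipelined(n_insns: int, n_stages: int) -> int:
    # n_stages cycles to fill the pipeline, then one completion per cycle.
    return n_stages + (n_insns - 1)

print(cycles_unpipelined(10, 5))  # 50
print(cycles_pipelined(10, 5))    # 14
```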
Five stage pipeline datapath

[Figure: five-stage datapath — PC with +4 incrementer, instruction memory,
register file (ports s1, s2, d), ALU, and data memory, separated by pipeline
latches B1(D), B2(E), B3(M), B4(W); the PC and IR are carried down the
pipeline in each latch]
Question
• Why is the pipeline clock period > (datapath delay)/(number of stages)?
• Three reasons:
• Pipeline latches add delay
• Pipeline stages have different delays, and the clock period must cover the slowest stage
• Pipelining adds extra datapath and logic (e.g., data forwarding)
• Dependent Operations
• I1: Add R3, R2, R1
• I2: Sub R4, R3, 30
Cycle:  1  2  3  4  5  6  7  8  9
I1:     F  D  E  M  W
I2:        F  D  D  D  D  E  M  W
I3:           F  F  F  F  D  E  M
Resolution of Data Dependence
• Software technique
• Hardware Interlock
• Data Forwarding
Handling Data Dependence in Software
• The compiler detects a data dependence between two successive
instructions and inserts explicit NOP (no-operation) instructions
between them
• The following sequence of instructions is generated:
• I1: Add R3,R2,R1
• I2: NOP
• I3: NOP
• I4: NOP
• I5: Sub R4, R3, 30

Cycle:          1  2  3  4  5  6  7  8  9
Add R3,R2,R1    F  D  E  M  W
NOP                F  D  E  M  W
NOP                   F  D  E  M  W
NOP                      F  D  E  M  W
Sub R4,R3,30                F  D  E  M  W

• + Simplifies the hardware
• - Code size increases
• - Execution time is not reduced
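The compiler pass described above can be sketched as a toy model. The tuple instruction format, the dependence check, and the fixed 3-NOP distance are illustrative assumptions based on the 5-stage pipeline shown:

```python
def insert_nops(program, nop_count=3):
    """Insert NOPs after any instruction whose destination register
    is read by the immediately following instruction (toy check)."""
    out = []
    for i, insn in enumerate(program):
        out.append(insn)
        if i + 1 < len(program):
            dest = insn[1]                  # (op, dest, src1, src2)
            next_srcs = program[i + 1][2:]
            if dest in next_srcs:
                out.extend([("NOP",)] * nop_count)
    return out

prog = [("ADD", "R3", "R2", "R1"), ("SUB", "R4", "R3", "30")]
print(insert_nops(prog))
```

A real compiler would instead try to fill these slots with independent instructions scheduled from elsewhere in the program, falling back to NOPs only when none are available.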
Hardware Interlock
• The hardware inserts a NOP between dependent instructions
• C1: I1 is fetched and put in B1
• C2: I1 moves to B2, I2 moves to B1
• C3: I2 tries to read its source registers, but it must not yet:
• The contents of B1 must not change
• B2 should receive a NOP
• The PC must not be incremented
[Figure: five-stage pipeline datapath, repeated for reference]
Hardware Interlock
[Figure: the same datapath with an interlock unit that gates PC_enable
(freezing the PC and B1) and muxes a NOP into B2(E) while the hazard
persists]
Bypassing
[Figure: datapath with bypass multiplexers on the ALU inputs; values held
in the M-stage and W-stage pipeline latches can be steered back to the
execute stage]
• Bypassing
• Reading a value from an intermediate (microarchitectural) source
• Not waiting until it is available from the primary source
• Here, we are bypassing the register file
• Also called forwarding
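The bypass decision is essentially a comparison of register numbers across pipeline latches. A minimal sketch (latch names like `ex_mem_dest` are illustrative, not from the slides):

```python
def select_operand(src_reg, ex_mem_dest, ex_mem_val,
                   mem_wb_dest, mem_wb_val, regfile):
    # Forward from the newest producer still in flight;
    # fall back to the architectural register file otherwise.
    if src_reg == ex_mem_dest:
        return ex_mem_val      # M -> X bypass (youngest producer wins)
    if src_reg == mem_wb_dest:
        return mem_wb_val      # W -> X bypass
    return regfile[src_reg]    # no hazard: read the register file

regs = {"R1": 5, "R3": 0}
# R3 is being produced by the instruction currently in EX/MEM (value 7):
print(select_operand("R3", "R3", 7, None, None, regs))  # 7
```

Note the priority: the EX/MEM latch is checked first because it holds the most recent write to the register, exactly as the hardware muxes are wired.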
Bypassing paths

[Figures: three copies of the datapath, each highlighting a different
forwarding path into the execute stage (e.g., from the M-stage latch and
from the W-stage latch back to the ALU inputs)]
• Calculate CPI
• CPI = 1 + (1 bubble × 20% frequency × 50% probability) = 1.1
Branch Delays
• In pipelined execution, a new instruction is fetched every cycle
• The preceding instruction is still being decoded
• Branch instructions can alter the sequence of instructions
• But they must first be decoded & executed to determine whether and where
to branch
Unconditional Branches
• Steps involved in the execution of an unconditional branch, e.g., jmp [PC+x]:
1. MemAd = PC, ReadMem, IR = memData, PC = PC + 4
2. Decode
3. PC = PC + x
4. No action
5. No action

[Figure: five-stage pipeline datapath, repeated for reference]
Conditional Branches
• Steps involved in the execution of a conditional branch, e.g., BEQ R5, R6, PC+x:
1. MemAd = PC, ReadMem, IR = memData, PC = PC + 4
2. Decode
3. Compare R5 and R6; if R5 = R6, PC = PC + x
4. No action
5. No action

[Figure: five-stage pipeline datapath, repeated for reference]
Branch Delays

        t0        t1        t2        t3          t4
Insth   IF(PC)    ID        ALU       MEM
Insti             IF(PC+4)  ID        ALU
Instj                       IF(PC+8)  ID
Instk                                 IF(target)
Instl                                             IF

Insth is a branch; its condition and target are evaluated in the ALU stage.
When the branch resolves:
• the branch target (Instk) is fetched
• all instructions fetched since Insth (the so-called "wrong-path"
instructions) must be flushed
• this two-cycle delay is called the branch penalty
Pipeline Flush on a Misprediction

        t0        t1        t2        t3          t4    t5
Insth   IF(PC)    ID        ALU       MEM         WB
Insti             IF(PC+4)  ID        killed
Instj                       IF(PC+8)  killed
Instk                                 IF(target)  ID    ALU
Instl                                             IF    ID
                                                        IF

Insth is a branch.
Performance Analysis
• Correct guess: no penalty
• Incorrect guess: 2 bubbles
• Assume:
• no data hazards
• 20% of dynamic instructions are control-flow instructions
• 70% of control-flow instructions are taken
• Calculate CPI
• CPI = 1 + (0.20 × 0.70) × 2 = 1 + 0.14 × 2 = 1.28
• (0.20 × 0.70 is the probability of a wrong guess; 2 is the penalty
for a wrong guess)
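Both this calculation and the earlier 1.1 figure follow the same pattern: CPI = 1 + stall frequency × stall penalty. A sketch for the branch case:

```python
def cpi_with_branch_penalty(branch_frac, mispredict_frac, penalty):
    # Base CPI of 1, plus `penalty` bubbles for every mispredicted branch.
    return 1 + branch_frac * mispredict_frac * penalty

# 20% branches, 70% mispredicted (taken under predict-not-taken), 2 bubbles:
print(round(cpi_with_branch_penalty(0.20, 0.70, 2), 2))  # 1.28
```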
[Figure: pipeline datapath augmented with a comparator so the branch can be
resolved earlier in the pipeline]
• Problem: we need to determine the next fetch address when the branch
is fetched (to avoid a pipeline bubble)
• This requires three things to be predicted at the fetch stage:
• Whether the fetched instruction is a branch
• The (conditional) branch direction
• The branch target address (if taken)
Branch Prediction
• Observation: Target address remains the same for a conditional direct
branch across dynamic instances
• Idea: Store the target address from previous instance and access it with the
PC
• Called Branch Target Buffer (BTB) or Branch Target Address Cache
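A BTB indexed by the PC can be sketched as a small direct-mapped table (the size and field names here are illustrative assumptions):

```python
class BranchTargetBuffer:
    def __init__(self, n_entries=16):
        self.n = n_entries
        self.entries = [None] * n_entries   # each entry: (tag, target)

    def update(self, pc, target):
        # Record the target observed for the branch at `pc`.
        self.entries[pc % self.n] = (pc, target)

    def predict(self, pc):
        # Hit only if the stored tag matches the full PC.
        e = self.entries[pc % self.n]
        if e is not None and e[0] == pc:
            return e[1]
        return None          # miss: assume not a branch, fall through

btb = BranchTargetBuffer()
btb.update(0x400, 0x480)          # branch at 0x400 jumped to 0x480
print(hex(btb.predict(0x400)))    # 0x480
```

Because the table is accessed with only the PC, the target is available in the fetch stage, before the instruction is even decoded.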
Fetch Stage with BTB and Direction Prediction

[Figure: fetch stage in which the address of the current branch indexes
both a direction predictor (taken?) and a BTB that supplies the target
address]
Simple Branch Direction Prediction Schemes
• Compile time (static)
• Always not taken
• Always taken
• BTFN (Backward taken, forward not taken)
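Of the static schemes above, BTFN can be expressed as a one-line rule on the branch target (a sketch; "backward" means the target address is below the branch's own PC, the common shape of a loop branch):

```python
def btfn_predict(branch_pc, target_pc):
    # Backward branches (loops) predicted taken; forward branches not taken.
    return target_pc < branch_pc

print(btfn_predict(0x1000, 0x0F80))  # True  (backward -> predict taken)
print(btfn_predict(0x1000, 0x1040))  # False (forward  -> predict not taken)
```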
Dynamic Branch Prediction
• Idea: Predict branches based on dynamic information (collected at
run-time)
• Advantages
+ Prediction based on history of the execution of branches
+ It can adapt to dynamic changes in branch behavior
+ No need for static profiling: input set representativeness problem goes away
• Disadvantages
-- More complex (requires additional hardware)
Last Time Predictor
• Last time predictor
• Single bit per branch (stored in the BTB)
• Indicates which direction the branch went the last time it executed
• Example outcome stream: TTTTTTTTTTNNNNNNNNNN → 90% accuracy
• Always mispredicts the last iteration and the first iteration of a
loop branch
• Accuracy for a loop with N iterations = (N − 2)/N
[Figure: BTB with an N-bit tag and one direction bit per branch; on a tag
match (PC = tag) the bit selects between the stored target and PC+4 as
the next PC]
The 1-bit BHT (Branch History Table) entry is updated with
the correct outcome after each execution of a branch
State Machine for Last-Time Prediction

[Figure: two-state FSM with states "predict taken" and "predict not taken";
an actually-taken outcome moves to (or stays in) "predict taken", an
actually-not-taken outcome moves to (or stays in) "predict not taken"]
Improving the Last Time Predictor

[Figure: four-state FSM over states 00, 01, 10, 11; states 00 and 01
predict not taken, states 10 and 11 predict taken; actually-taken outcomes
move the state toward 11, actually-not-taken outcomes move it toward 00]
State Machine for 2-bit Saturating Counter
• Counter using saturating arithmetic

[Figure: four states SNT (strongly not taken), LNT (likely not taken),
LT (likely taken), ST (strongly taken); actually-taken outcomes move
SNT → LNT → LT → ST, actually-not-taken outcomes move in the reverse
direction, saturating at each end]
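The four-state machine above is just a counter that saturates at 0 and 3. A sketch, encoding SNT=0, LNT=1, LT=2, ST=3 and predicting taken when the counter is 2 or more:

```python
class TwoBitCounter:
    SNT, LNT, LT, ST = 0, 1, 2, 3   # strongly/likely not taken, likely/strongly taken

    def __init__(self, state=LNT):
        self.state = state

    def predict(self):
        return self.state >= self.LT          # taken if in LT or ST

    def update(self, taken):
        # Saturating increment/decrement on the actual outcome.
        if taken:
            self.state = min(self.state + 1, self.ST)
        else:
            self.state = max(self.state - 1, self.SNT)

c = TwoBitCounter()
c.update(True); c.update(True)   # two taken outcomes: LNT -> LT -> ST
print(c.predict())               # True
c.update(False)                  # one not-taken: ST -> LT
print(c.predict())               # True (hysteresis: still predicts taken)
```

The hysteresis is the whole point: unlike the 1-bit scheme, a single anomalous outcome (e.g., a loop exit) no longer flips the prediction, so the first iteration of the next loop execution is still predicted correctly.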