07 MIPS Pipelining CH4

MIPS Pipelining
Chapter 4
Sections 4.5 – 4.8
Dr. Iyad F. Jafar

Outline
 Introduction
 Why Pipelining?
 MIPS Pipelined Datapath
 MIPS Pipelined Control
 Pipelining Hazards
 Structural Hazards
 Data Hazards
 Control Hazards
 Exceptions and Interrupts
 Fallacies and Pitfalls
 Reading Assignment
Introduction
 Single-cycle datapath
 Simple!
 Hardware replication?
 Cycle time?
 Multi-cycle datapath
 More involved
 Less HW replication of major units
 Better performance if the delay of major functional
units is balanced!
 Can we do any better?
 Pipelining!
Introduction
 Pipelining
 In multi-cycle, only one major unit is used in each cycle while the other units are idle!
 Why not use them to do something else?
 Basically, start the next instruction before the current one is finished!

        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
LW      IFetch   Dec      Exec     Mem      WB
SW               IFetch   Dec      Exec     Mem      WB
R-Type                    IFetch   Dec      Exec     Mem      WB
Introduction
 Pipelining
 The time required to execute one instruction
(Instruction latency) is not affected!
 However, the number of instructions finished per
unit time (Throughput) is increased
 Thus, Pipelining improves the throughput not
latency!
 Most modern processors are pipelined!
 Notes
 As in multi-cycle, the cycle time is determined by
the slowest unit!
 However, similar to single-cycle, we can get one
instruction done every cycle!
 It is assumed that all instructions take the same
number of cycles!
Introduction
Single Cycle Implementation:
Cycle 1 Cycle 2
Clk
lw sw Waste R-type
Multiple Cycle Implementation:
Cycles 1-5    lw      IFetch Dec Exec Mem WB
Cycles 6-9    sw      IFetch Dec Exec Mem
Cycle 10      R-type  IFetch ...
Pipeline Implementation:
Cycles 1-5    lw      IFetch Dec Exec Mem WB
Cycles 2-6    sw      IFetch Dec Exec Mem WB
Cycles 3-7    R-type  IFetch Dec Exec Mem WB
Why Pipelining?
 For Performance!
 Once the pipeline is full, one instruction is completed every cycle, so CPI = 1 (similar to single-cycle)
[Figure: pipeline diagram of Inst 1 to Inst 5 in program order over time (clock cycles), each passing through IM, Reg, ALU, DM, Reg; the first cycles are marked as the time to fill the pipeline]

Why Pipelining?
 Example 1. Comparing pipelining to single-cycle
Consider a program consisting of a large number of LOAD instructions only, executed on a single-cycle CPU and on a 5-stage pipelined CPU. The operation time of each major unit (memory, ALU, and register file) is 200 ps in both cases.
1) Determine the time required to finish executing 1,000,000 LOAD instructions and compute the speedup of pipelining.
2) Determine the time required to finish executing the first 3 LOAD instructions.
3) Repeat (1) and (2) if the delay of the register file is 100 ps instead of 200 ps.
Cycle times for the two implementations
CCSC = 200 + 200 + 200 + 200 + 200 = 1000 ps
CCPP = 200 ps
Why Pipelining?
1) Determine the time required to finish executing 1,000,000 LOAD instructions and compute the speedup of pipelining.
Single-cycle
TimeSC = 1000 ps x 1,000,000 = 1,000,000,000 ps
Pipelining
TimePP = 1000 ps + 200 ps x 999,999 = 200,000,800 ps
Speedup = 1,000,000,000 / 200,000,800 = 4.99998
(very close to the number of stages)
Why Pipelining?
2) Determine the time required to finish executing the first 3 LOAD instructions and compute the speedup of pipelining
Single-cycle
TimeSC = 1000 x 3 = 3000 ps
Pipelining
TimePP = 200 x 5 + 200 + 200 = 1400 ps
Speedup = 3000 / 1400 = 2.14
(less than the number of stages)
Why Pipelining?
3) Repeat (1) and (2) if the delay of the register file is 100 ps.
CCSC = 200 + 100 + 200 + 200 + 100 = 800 ps
CCPP = 200 ps
For 1,000,000 instructions
TimeSC = 800 x 1,000,000 = 800,000,000 ps
TimePP = 1000 + 200 x 999,999 = 200,000,800 ps
Speedup = 800,000,000 / 200,000,800 = 3.99998 (<5)
For 3 instructions
TimeSC = 800 x 3 = 2400 ps
TimePP = 1000 + 200 x 2 = 1400 ps
Speedup = 2400 / 1400 = 1.71 (<5)

Why Pipelining?
 Example 1. Summary
 Ideally, the pipelined implementation is n times faster than the single-cycle one, where n is the number of pipeline stages.
 In the 5-stage MIPS, the pipelined version would be 5 times faster.
 When the pipeline is full, the throughput is one instruction per cycle
 Many factors affect pipelining performance
 Time to fill and empty the pipeline
 Number of instructions to execute
 Unbalanced delay of pipeline stages
 Instruction mix
 Pipeline hazards
 Ideally, the number of cycles required to finish M instructions in an N-stage pipeline is N + M - 1
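The N + M - 1 rule and the Example 1 speedups can be checked with a few lines of Python. This is only a sketch for an ideal pipeline with no hazards; the function names `pipeline_cycles` and `speedup` are illustrative and do not come from the slides.

```python
def pipeline_cycles(num_stages: int, num_instructions: int) -> int:
    """Cycles to finish M instructions in an ideal N-stage pipeline: N + M - 1."""
    return num_stages + num_instructions - 1


def speedup(single_cycle_ps: int, stage_ps: int,
            num_stages: int, num_instructions: int) -> float:
    """Single-cycle execution time divided by ideal pipelined execution time."""
    time_single = single_cycle_ps * num_instructions
    time_pipelined = pipeline_cycles(num_stages, num_instructions) * stage_ps
    return time_single / time_pipelined


# Balanced stages: 1000 ps single-cycle clock, 200 ps pipeline clock.
print(pipeline_cycles(5, 3) * 200)                 # 1400 (ps for 3 loads)
print(round(speedup(1000, 200, 5, 1_000_000), 5))  # 4.99998
print(round(speedup(1000, 200, 5, 3), 2))          # 2.14
# Register file at 100 ps: the single-cycle clock drops to 800 ps,
# but the pipeline clock is still set by the slowest stage (200 ps).
print(round(speedup(800, 200, 5, 1_000_000), 5))   # 3.99998
```

The last line shows why unbalanced stages hurt: the pipelined time does not change while the single-cycle time improves, so the speedup falls toward 4.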
Pipelined MIPS Datapath
 What do we need to implement pipelining?
 We need to consider the following:
1. The execution of instructions is divided into 5 stages
(cycles): Instruction fetch (IF) , Instruction decode (ID),
Execute (EX), Memory Access (MEM), Write Back (WB)
2. Instruction flow is from left to right except in two cases
 In the write-back stage where the result is written into the register
file in the middle of the datapath
 Choosing between the incremented PC and the branch address in
the MEM stage
3. In pipelining, all units are operating in every cycle; thus we
have to duplicate hardware where needed
4. Since the execution is over multiple cycles, we need to add State (Pipeline) registers between stages to preserve intermediate data and control for each instruction.
 These registers hold the values to be used in later stages as long as they are needed.
[Figure: pipelined datapath divided into the IF, ID, EX, MEM, and WB stages by the pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB; it contains the PC, instruction memory, register file, sign extend (16 to 32 bits), shift left 2, adders, ALU, and data memory, all driven by the system clock]
Any problem?
[Figure: the same pipelined datapath, repeated]
Need to preserve the destination register!

 Example 2. Execution of LW instruction
(1) Instruction Fetch: Put the PC and the fetched instruction in the IF/ID register
(2) Instruction Decode and Read Registers: Store Reg[rs], Reg[rt], the sign-extended offset, rd, rt, and the updated PC (why?) in the ID/EX register
(3) Execute or Address Calculation: Store the branch address, Reg[rt], the ALU result, and the zero flag in the EX/MEM register
(4) Memory Access: Store the data read from memory in the MEM/WB register
(5) Write Back: Copy the loaded data from the MEM/WB register to the register file
 Required data fields in the pipelining registers
 Data fields are moved from one pipeline register to
another every clock cycle until they are no longer
needed
Pipeline Register   Data Fields                                             Register Size
IF/ID               Instruction and PC                                      64 bits (32 + 32)
ID/EX               PC, Reg[rs], Reg[rt], sign-extended offset, rt, rd      138 bits (4 x 32 + 2 x 5)
EX/MEM              Branch address, Zero, ALU result, Reg[rt],              102 bits (32 + 1 + 32 + 32 + 5)
                    destination register address (rt or rd)
MEM/WB              ALU result, data from memory,                           69 bits (32 + 32 + 5)
                    destination register address
Pipelined MIPS Control
 All control signals can be determined during Decode stage while they
are needed in later stages!
 Solution! Expand the pipeline registers to store and move the control
signals between stages until they are needed
 Define the control signals and generate them in the decode stage
 For the time being, no explicit write signals are required for the pipeline registers since they are updated every cycle
 Control signals needed in each stage
Pipeline Stage   Control signals
IF               None
ID               None
EX               RegDst, ALUOp1, ALUOp0, ALUSrc
MEM              Branch, MemRead, MemWrite
WB               MemtoReg, RegWrite
 Control signal values based on instruction type
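The table of control signal values did not survive in these notes, so the sketch below fills it in with the usual textbook values for the four basic instruction classes. Treat the values as an assumption to be checked against the textbook figure. The names `CONTROL`, `CARRIED`, and `carried_signals` are illustrative only; `None` marks a don't-care.

```python
# Control values grouped by the stage that consumes them (assumed textbook values).
CONTROL = {
    "R-type": {"EX": {"RegDst": 1, "ALUOp": 0b10, "ALUSrc": 0},
               "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
               "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":     {"EX": {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
               "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
               "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":     {"EX": {"RegDst": None, "ALUOp": 0b00, "ALUSrc": 1},
               "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
               "WB": {"RegWrite": 0, "MemtoReg": None}},
    "beq":    {"EX": {"RegDst": None, "ALUOp": 0b01, "ALUSrc": 0},
               "MEM": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
               "WB": {"RegWrite": 0, "MemtoReg": None}},
}

# Each pipeline register only carries the groups that are still needed downstream.
CARRIED = {"ID/EX": ("EX", "MEM", "WB"), "EX/MEM": ("MEM", "WB"), "MEM/WB": ("WB",)}


def carried_signals(instr_class: str, pipeline_register: str) -> dict:
    """Control signals stored in the given pipeline register for an instruction class."""
    signals = {}
    for stage in CARRIED[pipeline_register]:
        signals.update(CONTROL[instr_class][stage])
    return signals


print(carried_signals("lw", "MEM/WB"))  # {'RegWrite': 1, 'MemtoReg': 1}
```

The `CARRIED` map mirrors the idea of expanding the pipeline registers: signals are dropped as soon as the stage that uses them has passed.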
MIPS Pipeline
 Example 3. Given the code segment and the register contents below, show the contents of the data and control fields in the pipeline registers after the sixth instruction has been fetched (i.e. at the beginning of cycle 7)

Address      Instruction
0x00000000   lw  $10, 20($1)
0x00000004   sub $11, $1, $2
0x00000008   add $12, $3, $4
0x0000000c   lw  $13, 24($1)
0x00000010   add $3, $2, $1
0x00000014   sub $1, $5, $6

Register   Contents
$1         1
$2         5
$3         3
$4         -6
$5         2
$6         7
$11        12
$12        -15
$13        10
MIPS Pipeline
 Example 3. Multi-cycle diagram
[Figure: multi-cycle pipeline diagram over time; the six instructions lw $10,20($1), sub $11,$1,$2, add $12,$3,$4, lw $13,24($1), add $3,$2,$1, and sub $1,$5,$6 enter in program order, each passing through IM, Reg, ALU, DM, Reg]
MIPS Pipeline
 Example 3. Single-cycle diagram
IF: sub $1,$5,$6 | ID: add $3,$2,$1 | EX: lw $13,24($1) | MEM: add $12,$3,$4 | WB: sub $11,$1,$2
MIPS Pipeline
 Example 3.
At the beginning of cycle 7, the sixth instruction is stored in the IF/ID register while the data and control for earlier instructions have been pushed to the next pipeline registers and the register file. Thus,
 IF/ID register
 No control signals are stored
 Stores the instruction sub $1,$5,$6 and PC+4
 IF/ID.Instruction = 0x00A60822
 IF/ID.PC = 0x00000018
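The value of IF/ID.Instruction can be reproduced by packing the R-type fields (opcode, rs, rt, rd, shamt, funct) into a 32-bit word. A minimal sketch, with `encode_r_type` as an illustrative helper name; 0x22 is the standard MIPS funct code for sub.

```python
def encode_r_type(rs: int, rt: int, rd: int, shamt: int, funct: int) -> int:
    """Pack an R-type instruction: opcode(6)=0 | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)."""
    opcode = 0
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


FUNCT_SUB = 0x22

# sub $1,$5,$6  ->  rd = $1, rs = $5, rt = $6
word = encode_r_type(rs=5, rt=6, rd=1, shamt=0, funct=FUNCT_SUB)
print(f"0x{word:08X}")  # 0x00A60822
```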
MIPS Pipeline
 Example 3.
 ID/EX register
 Store the information of add $3,$2,$1 and PC+4
 ID/EX.PC = 0x00000014
 ID/EX.RegRsContents = 0x00000005
 ID/EX.RegRtContents = 0x00000001
 ID/EX.RegRt = (00001)2
 ID/EX.RegRd = (00011)2
 ID/EX.SignExtend = 0x00001820
 Control Information
 ID/EX.MemToReg = 0
 ID/EX.RegWrite = 1
 ID/EX.MemRead = 0
 ID/EX.MemWrite = 0
 ID/EX.Branch = 0
 ID/EX.ALUSrc = 0
 ID/EX.RegDst = 1
 ID/EX.ALUOp = (10)2
MIPS Pipeline
 Example 3.
 EX/MEM register
 Stores the information of lw $13,24($1), the branch address, and the memory address
 EX/MEM.BranchAddress = 0x00000070
 EX/MEM.ALUOut = 0x00000019
 EX/MEM.Zero = 0
 EX/MEM.RegDestination = (01101)2
 EX/MEM.RegRtContents = 0x0000000A
 EX/MEM.MemToReg = 1
 EX/MEM.RegWrite = 1
 EX/MEM.MemRead = 1
 EX/MEM.MemWrite = 0
 EX/MEM.Branch = 0
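The two EX/MEM data values for lw $13,24($1) follow directly from the EX-stage arithmetic: the ALU adds the base register to the sign-extended offset, and the branch adder adds the offset shifted left by 2 to PC+4 (it does this for every instruction, whether or not it is a branch). A small sketch, where `sign_extend16` is an illustrative helper:

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


pc_plus_4 = 0x0000000C + 4      # lw $13,24($1) is at address 0x0000000c
offset = sign_extend16(24)      # immediate field of the instruction
reg_rs = 1                      # contents of $1 from the register table

alu_out = (reg_rs + offset) & 0xFFFFFFFF
branch_address = (pc_plus_4 + (offset << 2)) & 0xFFFFFFFF

print(f"0x{alu_out:08X}")         # 0x00000019
print(f"0x{branch_address:08X}")  # 0x00000070
```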
MIPS Pipeline
 Example 3.
 MEM/WB register
 Stores the information of add $12,$3,$4, the addition result, and the data memory output
 MEM/WB.RegDestination = (01100)2
 MEM/WB.ALUOut = 0xFFFFFFFD
 MEM/WB.MemoryData = XXXX
 MEM/WB.MemToReg = 0
 MEM/WB.RegWrite = 1
 For sub $11,$1,$2
 It will be writing (1 - 5) = -4 to $11
Pipelining Hazards
 In general, pipelining is effective!
 The MIPS ISA makes it even easier
 All instructions are of the same length (32 bits)
 Can fetch the next instruction while the current one is being decoded
 Few instruction formats with symmetry across them
 Can read the register file in the 2nd stage
 Memory access is only through the Load and Store instructions
 Can use the execute stage to compute the address
 Each MIPS instruction writes at most one result, in the MEM or WB stage
 Is it that easy? Any complications?
 YES!
 PIPELINING HAZARDS !
Pipelining Hazards
 Hazards - problems that might occur during pipeline operation
 Three basic sources
 Structural Hazards
 In pipelining, all functional units are used in every cycle
 What if two instructions use the same functional unit in the same cycle?
 Data Hazards
 In pipelining, execution of instructions is overlapped
 What if the operand(s) of some instruction come from an earlier instruction that is still in the pipeline?
 Control Hazards
 In pipelining, an instruction is fetched every cycle
 What if an instruction is a jump, or a branch that evaluates to true? The following instruction(s) in the pipeline might not be the correct ones!
 Simple Solution?
 Wait until the issue is resolved!
Structural Hazards
 Single Memory! Reading from memory twice in the same cycle!
[Figure: pipeline diagram over time (clock cycles) of lw followed by Inst 1 to Inst 4, each using the same memory (Mem) for instruction fetch and for data access]
Solution: Use two memories; Data and Instruction!
Structural Hazards
 Single Register File! One instruction is writing and the other is reading the register file?
[Figure: pipeline diagram over time (clock cycles) of add $1, followed by Inst 1, Inst 2, and add $2,$1, with two clock edges marked: the edge that controls register writing and the edge that controls loading of the pipeline state registers]
Solution: Design the register file to write in the first half of the cycle and read in the second half!
Data Hazards
[Figure: pipeline diagram of add $1, followed by instructions that read $1, including and $6,$1,$7, or $8,$1,$9, and xor $4,$1,$5]
• Dependencies backward in time cause hazards
• This is called a Read-after-Write (RAW) data hazard
• Register-use data hazard
Solution?
Data Hazards
 Simply, wait for the earlier instruction to finish! This is called stalling the pipeline! However, this affects the CPI!
[Figure: pipeline diagram of add $1, followed by two stall cycles and then sub $4,$1,$5]
Do we need two stalls all the time?

Data Hazards
[Figure: pipeline diagram of lw $1,5($s1) followed by instructions that read $1, including or $8,$1,$9 and xor $4,$1,$5]
• Dependencies backward in time cause hazards
• It is a Read-after-Write (RAW) data hazard
• Load-use data hazard
Solution?
Data Hazards
 Again, wait for the LW instruction to finish by stalling the pipeline! However, this affects the CPI!
[Figure: pipeline diagram of lw $1, followed by two stall cycles and then sub $4,$1,$5]
Data Hazards
 Example 4. How many cycles are actually required to execute the following code? Assume the pipeline is already full.

add $1, $2, $5
add $5, $3, $1     # register-use data hazard: adds 2 cycles by stalls
sub $10, $7, $8
sub $5, $6, $7
lw  $3, 45($9)
add $3, $3, $8     # load-use data hazard: adds 2 cycles by stalls

Ideally, and since the pipeline is full, each instruction requires 1 cycle. Thus, we need 6 cycles (CPI = 6/6 = 1). However, the two hazards add 4 stall cycles.
Thus, 10 cycles are needed.
CPI = 10/6 = 1.667 ??
Performance ??
Can we do any better?
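The stall count in Example 4 can be reproduced with a small scheduling model. The sketch below assumes no forwarding and a register file that writes in the first half of the cycle and reads in the second half, so a dependent instruction may decode in the same cycle in which its producer is in WB (three cycles after the producer's decode). The function name and the `(dest, sources)` tuple format are illustrative, not part of the slides.

```python
def cycles_without_forwarding(program):
    """Cycles needed once the pipeline is full, for a list of (dest, sources) tuples."""
    writer_decode = {}   # register -> decode cycle of the latest instruction writing it
    prev_decode = -1
    for dest, sources in program:
        decode = prev_decode + 1                  # in-order, one instruction per cycle
        for src in sources:
            if src in writer_decode:
                # The producer reaches WB three cycles after its decode (ID, EX, MEM, WB).
                decode = max(decode, writer_decode[src] + 3)
        if dest is not None:
            writer_decode[dest] = decode
        prev_decode = decode
    return prev_decode + 1                        # decode cycles are numbered from 0


example4 = [
    ("$1",  ["$2", "$5"]),   # add $1, $2, $5
    ("$5",  ["$3", "$1"]),   # add $5, $3, $1
    ("$10", ["$7", "$8"]),   # sub $10, $7, $8
    ("$5",  ["$6", "$7"]),   # sub $5, $6, $7
    ("$3",  ["$9"]),         # lw  $3, 45($9)
    ("$3",  ["$3", "$8"]),   # add $3, $3, $8
]
print(cycles_without_forwarding(example4))  # 10
```

The model also answers the earlier question about whether two stalls are always needed: if one independent instruction sits between producer and consumer, only one stall remains.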

Data Hazards
 Fixing Register-use Hazard by Forwarding
 Note that data produced by an instruction and needed by a later instruction is pushed through the pipeline registers until it is saved into the register file!
 Why not read the data from the pipeline registers before it is stored?
 This is called forwarding!
 What is required?
 Need to detect the hazard
 Is any of the source registers of the instruction the same as the destination register of an earlier instruction that is still in the pipeline?
 Need to create a path to pass the data between pipeline stages
 Instead of reading the source registers of the instruction from the register file, read them from the pipeline registers
Data Hazards
 Fixing Register-use Hazard by Forwarding
[Figure: pipeline diagram of add $1, followed by sub $4,$1,$5, and $6,$1,$7, or $8,$1,$9, and xor $4,$1,$5]
No Stalls!
Data Hazards
 Forwarding Hardware implementation
Note that forwarding could be from EX/MEM or from MEM/WB! Why?

Data Hazards
 Inside the forwarding unit
(1) Forwarding from EX/MEM (MEM Stage)
if (EX/MEM.RegWrite
and (EX/MEM.RegRd != 0)
and (EX/MEM.RegRd == ID/EX.RegRs))
then ForwardA = From EX/MEM
if (EX/MEM.RegWrite
and (EX/MEM.RegRd != 0)
and (EX/MEM.RegRd == ID/EX.RegRt))
then ForwardB = From EX/MEM
 Why check the RegWrite signal?
 Why check for the Zero register?
Data Hazards
 Inside the forwarding unit
(2) Forwarding from MEM/WB (WB Stage)
if (MEM/WB.RegWrite
and (MEM/WB.RegRd != 0)
and (MEM/WB.RegRd == ID/EX.RegRs))
then ForwardA = From MEM/WB
if (MEM/WB.RegWrite
and (MEM/WB.RegRd != 0)
and (MEM/WB.RegRd == ID/EX.RegRt))
then ForwardB = From MEM/WB
 If both EX/MEM and MEM/WB match the same source register, forward from EX/MEM since it holds the more recent result
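The two sets of conditions can be combined into one selection function per ALU operand. The sketch below is not the slide's hardware; it is an illustrative Python model in which the EX/MEM check comes first, on the assumption that the instruction in MEM always holds the more recent value when both stages match. The name `forward_source` and the returned strings are made up for the example.

```python
def forward_source(source_reg, ex_mem_reg_write, ex_mem_rd, mem_wb_reg_write, mem_wb_rd):
    """Pick where one ALU operand comes from for the instruction currently in EX."""
    # Most recent result first: the instruction now in the MEM stage.
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == source_reg:
        return "EX/MEM"
    # Otherwise the instruction now in the WB stage.
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == source_reg:
        return "MEM/WB"
    return "RegisterFile"


# sub $4,$1,$5 is in EX while an add that writes $1 is in MEM.
forward_a = forward_source(1, True, 1, False, 0)   # operand rs = $1
forward_b = forward_source(5, True, 1, False, 0)   # operand rt = $5
print(forward_a, forward_b)  # EX/MEM RegisterFile
```

Checking RegWrite avoids forwarding from instructions such as sw or beq that never write a register, and checking for register 0 avoids forwarding a value "written" to $zero, which must always read as 0.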
Data Hazards
 Can the forwarding hardware be used with Load-use data hazard?
[Figure: pipeline diagram of lw $1,4($2) followed by and $6,$1,$7, or $8,$1,$9, and xor $4,$1,$5]
We still need 1 stall for the instruction following the load!
Data Hazards
 How to stall the pipeline?
 A stall is required when the instruction in the EX stage is a Load and the one in the ID stage depends on the loaded value
 The Load instruction moves normally to EX/MEM on the next cycle
 The conflicting instruction (the instruction following the load) should stay in the decode stage. How?
 Don’t write the IF/ID register -> need IF/IDWrite signal
 Don’t update the PC -> need PCWrite signal
 The control signals of the instruction in the decode stage are stored as 0’s (WHY?) in the ID/EX -> need a multiplexor for the control signals
 Controlling the process requires a special unit; Hazard Detection Unit
Data Hazards
 Stall Implementation
Data Hazards
 Stall Implementation
 Inside hazard detection unit
if (ID/EX.MemRead
and [(ID/EX.RegRt == IF/ID.RegRs) or
(ID/EX.RegRt == IF/ID.RegRt)])
then
PCWrite = 0
IF/IDWrite = 0
Select 0’s as control signals
Any Problem?
Do we need to stall in all cases?
How about j and jal that come immediately after a load, with the bits in their rs and/or rt field positions being the same as the rt field of the load?
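The hazard detection condition can be written as a small function that returns the three outputs. This is only a sketch of the condition as stated; the key `ZeroControls` is a made-up name for the select line of the control-signal multiplexor.

```python
def hazard_detection(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Outputs of the hazard detection unit for a possible load-use hazard."""
    stall = bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "PCWrite": 0 if stall else 1,       # freeze the PC during a stall
        "IF/IDWrite": 0 if stall else 1,    # keep the dependent instruction in ID
        "ZeroControls": 1 if stall else 0,  # hypothetical name: insert a bubble into ID/EX
    }


# lw $1,... is in EX (MemRead = 1, rt = 1) and sub $4,$1,$5 is in ID (rs = 1, rt = 5).
print(hazard_detection(True, 1, 1, 5))
# {'PCWrite': 0, 'IF/IDWrite': 0, 'ZeroControls': 1}
```

Because the function only compares bit fields, it would also stall for an instruction such as j whose bits in the rs/rt positions happen to match, which is exactly the unnecessary stall the questions above point at.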
Data Hazards
 Example 5. Consider the following code segment in C
A = B + E;
C = B + F;
(1) Generate the MIPS code assuming that variables A, B, C, E, and F are in memory and addressable with offsets 0, 4, 8, 12, and 16 from $t0
(2) Find all the data hazards and determine the number of cycles required to run the code. Assume forwarding is implemented.
(3) Can you reorder the code to reduce the stalls?
Data Hazards
 Example 5.
lw  $t1, 4($t0)      # loads B
lw  $t2, 12($t0)     # loads E
add $t3, $t1, $t2    # A = B + E, load-use data hazard: adds 1 cycle as a stall
sw  $t3, 0($t0)      # stores A
lw  $t4, 16($t0)     # loads F
add $t5, $t1, $t4    # C = B + F, load-use data hazard: adds 1 cycle as a stall
sw  $t5, 8($t0)      # stores C
Ideally, each instruction requires 1 cycle after the pipeline is full. Thus, we need (5 + 7 - 1) = 11 cycles.
CPI = 11/7 = 1.57
With the two stalls, 13 cycles are needed.
CPI = 13/7 = 1.86 ??
Performance ??
Data Hazards
 Example 5. Reducing stalls by instruction reordering
lw  $t1, 4($t0)      # loads B
lw  $t2, 12($t0)     # loads E
lw  $t4, 16($t0)     # loads F (moved up)
add $t3, $t1, $t2    # A = B + E
sw  $t3, 0($t0)      # stores A
add $t5, $t1, $t4    # C = B + F
sw  $t5, 8($t0)      # stores C
Moving this instruction fills the first stall and eliminates the second one!
Thus, 11 cycles are needed.
CPI = 11/7 = 1.57
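Both cycle counts for Example 5 can be checked with a simple model of a 5-stage pipeline with forwarding, in which the only stall left is one cycle when an instruction uses a register loaded by the instruction immediately before it. The function name and the `(op, dest, sources)` tuple format are illustrative; the model stalls on any source register of the following instruction, as the hazard detection condition in these notes does.

```python
def cycles_with_forwarding(program):
    """Total cycles, including pipeline fill, for a list of (op, dest, sources) tuples."""
    writer = {}        # register -> (decode cycle of latest writer, writer is a load)
    prev_decode = -1
    for op, dest, sources in program:
        decode = prev_decode + 1
        for src in sources:
            if src in writer:
                producer_decode, is_load = writer[src]
                if is_load:
                    # Loaded data is ready after MEM: the user must decode 2 cycles later.
                    decode = max(decode, producer_decode + 2)
        if dest is not None:
            writer[dest] = (decode, op == "lw")
        prev_decode = decode
    # Decode index 0 is clock cycle 2; the last instruction then needs EX, MEM, WB.
    return prev_decode + 5


original = [
    ("lw", "$t1", ["$t0"]), ("lw", "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]), ("sw", None, ["$t3", "$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]), ("sw", None, ["$t5", "$t0"]),
]
reordered = [
    ("lw", "$t1", ["$t0"]), ("lw", "$t2", ["$t0"]), ("lw", "$t4", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]), ("sw", None, ["$t3", "$t0"]),
    ("add", "$t5", ["$t1", "$t4"]), ("sw", None, ["$t5", "$t0"]),
]
print(cycles_with_forwarding(original), cycles_with_forwarding(reordered))  # 13 11
```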
Data Hazards
 Example 6. Assume that the pipelined MIPS processor without forwarding is used to run a program with the following instruction mix: 20% loads, 20% stores, and 60% ALU. Compute the average CPI given that
 10% of the ALU instructions result in load-use hazards.
 15% of the ALU instructions result in register-use (read-after-write) hazards.
 Solution
 Ideally, the average CPI is 1 for each instruction
 With no forwarding
 Load-use hazards add two cycles
 Register-use hazards add two cycles
 Average CPI = 0.2 x 1 + 0.2 x 1 + 0.75 x 0.60 x 1 + 0.1 x 0.60 x 3 + 0.15 x 0.60 x 3 = 1.30
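The average CPI is a weighted sum over the instruction classes, where the hazard-free ALU fraction (75%) costs 1 cycle and the hazard fractions cost 1 + 2 stall cycles. A short check of the arithmetic, with illustrative variable names:

```python
mix = {"load": 0.20, "store": 0.20, "alu": 0.60}
alu_load_use = 0.10        # fraction of ALU instructions with a load-use hazard
alu_register_use = 0.15    # fraction of ALU instructions with a register-use hazard
stall_cycles = 2           # no forwarding: either hazard adds two cycles

alu_hazard_free = 1 - alu_load_use - alu_register_use          # 0.75
alu_cpi = (alu_hazard_free * 1
           + (alu_load_use + alu_register_use) * (1 + stall_cycles))
average_cpi = mix["load"] * 1 + mix["store"] * 1 + mix["alu"] * alu_cpi

print(round(average_cpi, 2))  # 1.3
```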
Control Hazards
 For the pipelined datapath designed so far, the branch address and decision are known by the end of the MEM stage
 Instructions following the branch instruction in the pipeline are not correct if the branch evaluates to true!
 If the branch is true, then these instructions should be removed from the pipeline and execution should continue from the branch address
 Otherwise, no action is required!
 This is a dependency backward in time -> Control Hazard
Control Hazards
[Figure: a Branch followed by Inst1, Inst2, and Inst3 in the pipeline]
Solution!
Once it is known that the instruction is a branch, stall the pipeline for 3 cycles? Is it actually a stall?
Control Hazards
[Figure: pipeline diagram of beq followed by three stall cycles and then the next instructions]
Fetching from instruction memory is either from PC+4 or from the branch address, depending on the branch result
Are these actual stalls? Why not start the execution of the following instructions normally and, if the branch is true, then flush these instructions?!
Control Hazards
 Reducing the Cost of Branch Hazard
 Note that three cycles are lost if the branch evaluates to true, in order to remove the three instructions following the branch instruction!
 This could affect the performance significantly!
 Can we reduce this cost?
 Move the branch address computation to the decode stage
 Add additional hardware to compare the two registers in the ID stage!
 Whenever there is a branch instruction in the ID/EX register (ID/EX.Branch = 1), flush the instruction in the IF/ID register.
 The branch penalty in this case will be 1 cycle instead of 3 cycles!
Control Hazards
[Figure: pipeline diagram of beq followed by one stall cycle and then lw]
 Modifying the Hazard Detection Unit
IF (ID/EX.Branch) then Flush IF/ID register
 Note that we lose one cycle whenever a branch instruction is encountered!
 Can we do any better?
Control Hazards
 Approach I – Static Branch Prediction
 Always predict the branch as Not Taken and start fetching the
instruction following the branch
 If the branch evaluates to Not Taken, then the prediction is
correct and no further actions are required!
 If the branch evaluates to Taken, then the prediction is not
correct! Remove the fetched instruction and start fetching from
the branch address
 In this approach, we only lose one cycle if the prediction is not
correct
 Inside the hazard detection unit
IF (ID/EX.Branch) and (ID/EX.ZERO) Then Flush IF/ID register

Control Hazards
 Approach II – Dynamic Branch Prediction
 Prediction could be Taken or Not Taken
 If the branch is predicted as Not Taken
 Fetch the next instruction
 If prediction is false, flush the instruction. One cycle is lost!
 If branch is predicted as Taken
 Fetch the instruction from the branch address
 If prediction is false, flush and fetch from PC+4
 How to store branch prediction?
 Use Branch History Table or Branch Prediction Buffer
 The table is addressable by the lower bits of the branch instruction
address
 If branch is predicted as taken, we need to wait for the
branch address to be computed?
 Use Branch Target Buffer
Control Hazards
 1-bit Branch Predictor
 Basically we have two states (Taken and Not Taken)
 One bit is used to store the prediction
 Prediction state is changed when prediction is wrong
 Performance Issues
 Consider branching in loops? EXAMPLE?
Control Hazards
 2-bit Branch Predictor
 Basically we have four states
 two bits are used to store the prediction
 Prediction state is changed when prediction is wrong twice
Control Hazards
 Example 7. Consider a program that has a conditional branch instruction whose actual outcomes, in execution order, are given below.
T-T-N-T-T-N-T
List the predictions for the following branch prediction schemes and find the prediction accuracy.
1. Predict always taken
2. Predict always not taken
3. 1-bit predictor, initialized to predict taken
4. 2-bit predictor, initialized to weakly predict taken
Control Hazards
 Example 7.
 Actual branch actions : T-T-N-T-T-N-T
 Predict as always taken
 Predictions : T-T-T-T-T-T-T
 Accuracy = 5/7 = 71%
 Predict as always not taken
 Predictions : N-N-N-N-N-N-N
 Accuracy = 2/7 = 29%
 1-bit predictor initialized to predict taken
 Predictions: T-T-T-N-T-T-N
 Accuracy = 3/7 = 43%
 2-bit predictor initialized to weakly predict taken
 Predictions: T-T-T-T-T-T-T
 Accuracy = 5/7 = 71%
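The four prediction sequences in Example 7 can be reproduced with a short simulation. The 2-bit predictor is modeled here as a saturating counter (0 and 1 predict not taken, 2 and 3 predict taken, with 2 as "weakly taken"); that encoding is an assumption, since the slides only say that the prediction changes after two wrong predictions. All function names are illustrative.

```python
def static_predictor(outcomes, prediction):
    """Always predict the same direction."""
    return [prediction] * len(outcomes)


def one_bit_predictor(outcomes, initial="T"):
    """Predict whatever the branch did last time."""
    state, predictions = initial, []
    for actual in outcomes:
        predictions.append(state)
        state = actual
    return predictions


def two_bit_predictor(outcomes, counter=2):
    """Saturating counter: 0-1 predict N, 2-3 predict T (2 = weakly taken)."""
    predictions = []
    for actual in outcomes:
        predictions.append("T" if counter >= 2 else "N")
        counter = min(counter + 1, 3) if actual == "T" else max(counter - 1, 0)
    return predictions


def correct_count(predictions, outcomes):
    return sum(p == a for p, a in zip(predictions, outcomes))


outcomes = "T-T-N-T-T-N-T".split("-")
for name, predictions in [
    ("always taken", static_predictor(outcomes, "T")),
    ("always not taken", static_predictor(outcomes, "N")),
    ("1-bit", one_bit_predictor(outcomes)),
    ("2-bit", two_bit_predictor(outcomes)),
]:
    print(name, "-".join(predictions), f"{correct_count(predictions, outcomes)}/7")
```

The 1-bit predictor does worst here because every not-taken outcome causes two mispredictions, one when it happens and one on the next taken branch; the same effect shows up at loop exits.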
Pipelining Performance
 Example 8. Let’s compare the performance of the single-cycle, multi-cycle, and pipelined implementations of the MIPS processor given the operation times and instruction mix below.
For the pipelined implementation, assume that:
1) The branch decision is made in the MEM stage. Branch handling in the pipelined implementation is done by stalling the pipeline.
2) Half of the load instructions incur a load-use hazard.
3) Forwarding is implemented.
4) The jump instruction is completed in the ID stage

Unit              Time (ps)
Memory            200
ALU and adders    100
Register File     50

Instruction type  Percentage %
ALU               52
Load              25
Store             10
Branch            11
Jump              2

 Example 8.
 Clock cycle time
 Single-cycle = 200 + 50 + 100 + 50 + 200 = 600 ps
 Multi-cycle = 200 ps
 Pipeline = 200 ps
 CPI
 Single-cycle = 1
 Multi-cycle = 5 x 0.25 + 4 x 0.52 + 4 x 0.10 + 3 x 0.11 + 3 x 0.02 = 4.12
 Pipeline = 0.125 x 2 + 0.125 x 1 + 0.52 x 1 + 0.1 x 1 + 0.11 x 4 + 0.02 x 2 = 1.475
 Execution Time per instruction
 Single-cycle = 600 ps
 Multi-cycle = 4.12 x 200 ps = 824 ps
 Pipeline = 1.475 x 200 ps = 295 ps
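The Example 8 numbers can be recomputed from the instruction mix. In the sketch below the pipelined cost of each class is written as 1 cycle plus its average stall cycles: half of the loads stall 1 cycle, every branch stalls 3 cycles (decision in MEM), and every jump stalls 1 cycle (completed in ID). The dictionary names are illustrative.

```python
mix = {"alu": 0.52, "load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02}

# Multi-cycle: number of cycles each instruction class takes.
multi_cycles = {"alu": 4, "load": 5, "store": 4, "branch": 3, "jump": 3}
cpi_multi = sum(mix[k] * multi_cycles[k] for k in mix)

# Pipeline: 1 cycle plus the average number of stall cycles per class.
pipe_cycles = {"alu": 1, "load": 1 + 0.5 * 1, "store": 1, "branch": 1 + 3, "jump": 1 + 1}
cpi_pipe = sum(mix[k] * pipe_cycles[k] for k in mix)

clock_ps = {"single": 600, "multi": 200, "pipeline": 200}
time_per_instruction = {
    "single": 1 * clock_ps["single"],
    "multi": cpi_multi * clock_ps["multi"],
    "pipeline": cpi_pipe * clock_ps["pipeline"],
}

print(round(cpi_multi, 2), round(cpi_pipe, 3))                     # 4.12 1.475
print({k: round(v) for k, v in time_per_instruction.items()})
# {'single': 600, 'multi': 824, 'pipeline': 295}
```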
 Example 9. Redo Example 8 assuming that branch prediction is employed and 1/4 of the branch instructions are mispredicted.
Exceptions & Interrupts
 Exceptions and interrupts are unexpected events that require a change in the flow of control
 The two terms are often used interchangeably, depending on the ISA
 Intel x86 uses the term interrupt only
 In MIPS
 Exceptions: any internal unexpected change in the flow (undefined opcode, overflow, system calls)
 Interrupts: the event is external (I/O controller request)
 Dealing with them
 Is a challenging part of processor design
 Affects performance
Exceptions & Interrupts
 In MIPS, when an exception is generated, the following sequence of steps is taken
 The address of the offending instruction is saved into a special register called the Exception Program Counter (EPC).
 The cause of the exception is saved in a special register called the Cause Register.
 Control is transferred to the operating system by loading a special address (0x8000 0180) into the PC. The code starting at this address
 Determines what actions will be taken by the operating system in response to the exception based on the value found in the Cause Register. The operating system may terminate the program or resume the execution using the value found in the EPC
Overflow Exception
 Modifications to the Datapath
Fallacies
 Fallacy 1. Pipelining is easy!
 Not true! Hazards complicate the operation
 Fallacy 2. Pipelining is independent of technology!
 Why didn’t we have pipelined processors before?
 Advanced technology allowed more transistors and thus more operations!
Reading Assignment
 Read the following from the textbook
 Section 4.9 – Exceptions

 Section 4.10 – Parallelism and Advanced Instruction
Level Parallelism