ARM Processor Organization

ARM processor organization
P. Bakowski
bako@ieee.org
ARM register bank

The register bank, which stores the processor state. It has two read ports and one write port, each of which can be used to access any register, plus an additional read port and an additional write port that give special access to r15, the program counter.

[Figure: register bank, registers r0, r1, ..., r14, r15]
ARM barrel shifter

The barrel shifter, which can shift or rotate one operand by any number of bits.
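
The shift and rotate operations can be illustrated with a minimal sketch on 32-bit values. The function names (lsl, lsr, asr, ror) are chosen here for illustration, and the sketch assumes shift amounts between 0 and 31; it does not model how the hardware barrel shifter is built or how larger amounts are treated.

```python
MASK32 = 0xFFFFFFFF  # keep every result within 32 bits


def lsl(value, amount):
    """Logical shift left: zeros enter from the right."""
    return (value << amount) & MASK32


def lsr(value, amount):
    """Logical shift right: zeros enter from the left."""
    return (value & MASK32) >> amount


def asr(value, amount):
    """Arithmetic shift right: the sign bit (bit 31) is replicated."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32  # reinterpret the word as a negative Python int
    return (value >> amount) & MASK32


def ror(value, amount):
    """Rotate right: bits leaving on the right re-enter on the left."""
    amount %= 32
    value &= MASK32
    return ((value >> amount) | (value << (32 - amount))) & MASK32
```

For example, ror(0x0000000F, 4) moves the four low bits to the top of the word and returns 0xF0000000.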
ARM ALU
The ALU, which performs the arithmetic and logic functions required by the instruction set.
ARM 3-stage pipeline

The address register and incrementer, which select and hold addresses and generate sequential addresses when required.

The data registers (data in register, data out register), which hold data passing to and from memory over d[31:0].

The instruction decoder and associated control logic, which turn instructions into the control signals that drive the data path.
Three stage pipeline : ARM 1,2,3

FETCH: the instruction is fetched from memory and placed in the instruction pipeline.

DECODE: the instruction is decoded and the datapath control signals are prepared for the next cycle.

EXECUTE: the instruction controls the datapath; the register bank is read, an operand is shifted, and the ALU result is generated and written back into a destination register.

Instruction throughput: 1 instruction per clock cycle.
Instruction latency: 3 clock cycles.

clock cycle      1        2        3        4        5
instruction 1    fetch    decode   execute
instruction 2             fetch    decode   execute
instruction 3                      fetch    decode   execute
ARM 1,2,3 architecture

[Figure: ARM 1, 2, 3 datapath. Labels: address register, incrementer, a[31:0], PC, register bank, A bus, B bus, barrel shifter (BS), multiplier (M), ALU, ALU bus, data in register, data out register, d[31:0], instruction register, control path, control signals]
Multi cycle execution

The STR (store) instruction needs two execution cycles: an address calculation cycle and a data transfer cycle. The instruction therefore takes 4 clock cycles in total: fetch, decode, address calculation and data transfer.

Attention: only one memory transfer per clock cycle.

[Figure: pipeline timing diagram for STR and the following instructions; the execute stage of STR is split into an address cycle and a transfer cycle, which delays the instructions behind it by one cycle]
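
The effect of multi-cycle instructions on the 3-stage pipeline can be estimated with a short sketch. The function name total_cycles and the list-of-execute-cycles representation are assumptions made for this illustration; the model considers a straight-line sequence without branches and assumes that the execute stage is the only bottleneck.

```python
def total_cycles(execute_cycles):
    """Clock cycles needed to run a straight-line instruction sequence.

    execute_cycles[i] is the number of execute cycles of instruction i:
    1 for a simple data processing instruction, 2 for STR
    (address calculation cycle plus data transfer cycle).
    Fetch and decode of the first instruction fill the pipeline (2 cycles);
    after that the execute stage is busy in every cycle.
    """
    if not execute_cycles:
        return 0
    return 2 + sum(execute_cycles)


def average_cpi(execute_cycles):
    """Average execute cycles per instruction once the pipeline is full."""
    return sum(execute_cycles) / len(execute_cycles)
```

Three single-cycle instructions complete in total_cycles([1, 1, 1]) = 5 clock cycles; replacing the middle one by STR gives total_cycles([1, 2, 1]) = 6.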
Processor performance
The time T required to execute a given program is given by:

T = (Ninst * CPI) / fclk

Ninst - number of instructions in the program
CPI - average number of clock cycles per instruction (throughput)
fclk - clock frequency

Since Ninst is constant for a given program, there are only two ways to increase performance:
- increase the clock rate, fclk;
- reduce the average number of clock cycles per instruction, CPI.
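
The formula can be checked numerically with a small sketch. The function name execution_time and the figures used in the example (one million instructions, CPI of 1.5, 50 MHz clock) are hypothetical values chosen for illustration, not measurements of any ARM core.

```python
def execution_time(n_inst, cpi, f_clk_hz):
    """Execution time in seconds: T = (Ninst * CPI) / fclk."""
    return n_inst * cpi / f_clk_hz


# Hypothetical program: 1,000,000 instructions
base = execution_time(1_000_000, cpi=1.5, f_clk_hz=50e6)
faster_clock = execution_time(1_000_000, cpi=1.5, f_clk_hz=100e6)
lower_cpi = execution_time(1_000_000, cpi=1.0, f_clk_hz=50e6)
```

With these values the base time is 0.03 s; doubling fclk halves T, while reducing CPI from 1.5 to 1.0 reduces T by one third.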
Clock rate increase

Increase the clock rate, fclk. This requires the logic in each pipeline stage to be simplified and, therefore, the number of pipeline stages to be increased.

[Figure: a 3-stage pipeline (stage1 to stage3) compared with a 5-stage pipeline (stage1 to stage5); each stage of the 5-stage pipeline fits into a shorter clock cycle]

Note that the clock rate that can be used depends heavily on the implementation technology.

[Figure: the same 3-stage pipeline (stage1 to stage3) with a shorter clock cycle in a new implementation technology]
More hardware resources

Reduce the average number of clock cycles per instruction, CPI. This requires more parallelism, which means that more hardware resources are used in a given clock cycle.

[Figure: a single data/instruction memory (D/I M) allows a data read or an instruction fetch in one cycle; separate data and instruction memories (DM, IM) allow a data read and an instruction fetch in the same cycle]
5-stage pipeline organization

Higher performance ARM cores employ a 5-stage pipeline and have separate instruction and data memories. Breaking instruction execution down into five stages rather than three reduces the maximum work which must be completed in a clock cycle, and hence allows a higher clock frequency to be used. The separate instruction and data memories, which may be separate caches connected to a unified instruction and data main memory, allow a significant reduction in the core's CPI.
Fetch stage
Pipeline stages: fetch, decode, execute, buffer, write.

FETCH - the instruction is fetched from memory (through the instruction cache) and placed in the instruction pipeline.

[Figure: fetch stage; next PC, incrementer, I cache, output to decoder]
Decode stage
DECODE - the instruction is decoded and the register operands are read from the register file. There are three operand read ports in the register file, so most instructions can obtain all their operands in one cycle.

[Figure: decode stage; instruction decode, register file]
Execute stage
EXECUTE - an operand is shifted and the ALU result is generated. If the instruction is a load or store, the memory address is computed in the ALU.

[Figure: execute stage; register file, barrel shifter (BS), multiplier (M), ALU, +4, ALU bus]
Buffer stage
BUFFER/DATA - data memory is accessed if required. Otherwise the ALU result is simply buffered for one clock cycle to give the same pipeline flow for all instructions.

[Figure: buffer/data stage; byte replication, D cache, +4, rotation/sign extension]
Write back stage

WRITE-BACK - the results generated by the instruction are written back to the register file, including any data loaded from memory.

[Figure: write-back stage; ALU bus, ALU, D cache, rotation/sign extension, register file]
Data forwarding
In the 5-stage pipeline, instruction execution is spread across three pipeline stages. The only way to resolve data dependencies without stalling the pipeline is to introduce forwarding paths.

Data dependencies arise when an instruction needs to use the result of one of its predecessors before that result has returned to the register file.

Forwarding paths (by-passes) allow intermediate results to be passed between stages as soon as they are available. In the 5-stage ARM pipeline, each of the three source operands can be forwarded from any of three intermediate result registers.

[Figure: two overlapped instructions (fetch, decode, execute, buffer, write) with by-pass paths alongside the normal write path to the register file]
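
A simplified sketch of dependency detection illustrates when a forwarding path is needed. The function forwarding_needed, the (dest, sources) tuple representation and the window of three preceding instructions are assumptions made for this illustration; the window is chosen to match the three intermediate result registers mentioned above, and the sketch does not model exact cycle timing or stalls after loads.

```python
def forwarding_needed(program, window=3):
    """For each instruction, list the (register, distance) pairs to forward.

    program is a list of (dest, sources) tuples; dest may be None.
    A source operand must be forwarded when its most recent producer is one
    of the `window` preceding instructions, i.e. the result may not yet
    have returned to the register file (simplifying assumption).
    """
    result = []
    for i, (_, sources) in enumerate(program):
        forwards = []
        for reg in sources:
            # search backwards for the most recent writer of this register
            for distance in range(1, window + 1):
                j = i - distance
                if j < 0:
                    break
                if program[j][0] == reg:
                    forwards.append((reg, distance))
                    break
        result.append(forwards)
    return result


# Hypothetical sequence: r1 = r2 + r3; r4 = r1 + r5; r6 = r1 + r4
program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r1", "r5")),
    ("r6", ("r1", "r4")),
]
paths = forwarding_needed(program)
```

Here the second instruction needs r1 from the instruction just before it, and the third needs r1 from two instructions back and r4 from one instruction back.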
PC organization - compatibility
The programming behavior of the PC, implemented through r15, is based on the operational characteristics of the 3-stage ARM pipeline. The 5-stage pipeline reads the instruction operands one stage earlier, so it would naturally obtain a different r15 value (PC+4 rather than PC+8), which is incompatible with the 3-stage design.
PC organization - solution
This problem is resolved by using the incremented PC value from the fetch stage directly in the decode stage, bypassing the pipeline register between the two stages. PC+4 for the next instruction is equal to PC+8 for the current instruction (4 bytes farther), so the correct r15 value is obtained without additional hardware.

[Figure: fetch stage incrementer and I cache; the "next PC+4" value is passed directly to r15 in the register file, where it is PC+8 for the instruction being decoded]
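
The equality used above can be verified with a few lines of arithmetic. The function names r15_seen_by and next_pc_plus_4 are hypothetical, and the sketch assumes 32-bit (4-byte) ARM instructions.

```python
INSTRUCTION_SIZE = 4  # bytes per 32-bit ARM instruction


def r15_seen_by(instruction_address):
    """Value read from r15 by the instruction at instruction_address: PC+8."""
    return (instruction_address + 2 * INSTRUCTION_SIZE) & 0xFFFFFFFF


def next_pc_plus_4(instruction_address):
    """PC+4 produced by the fetch-stage incrementer for the next instruction."""
    next_instruction = instruction_address + INSTRUCTION_SIZE
    return (next_instruction + INSTRUCTION_SIZE) & 0xFFFFFFFF
```

For an instruction at address 0x8000 both functions return 0x8008, which is why the value from the fetch stage can be used directly as r15.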
ARM programming model

The Instruction Set Architecture (ISA) defines the operations that the programmer can use to change the state of the system incorporating the processor. This state usually comprises the values of the data items in the visible registers and the memory. Each instruction performs a defined transformation from the state before the instruction is executed to the state after it has completed.
ARM memory subsystem

ARM memory may be viewed as a linear array of bytes numbered from zero up to 2^32-1.

Data items may be 8-bit bytes, 16-bit half-words or 32-bit words.

Words are always aligned on 4-byte boundaries (the two least significant address bits are zero) and half-words are aligned on even byte boundaries.

[Figure: memory as a linear array of bytes from address 0 to 2^32-1, showing byte numbers and word addresses in little-endian organization]
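
The alignment rules and the little-endian organization can be illustrated with a short sketch. The helper names are hypothetical, and a Python bytes object stands in for a small region of memory.

```python
import struct


def is_word_aligned(address):
    """A word address has its two least significant bits equal to zero."""
    return (address & 0b11) == 0


def is_halfword_aligned(address):
    """A half-word address is even."""
    return (address & 0b1) == 0


def read_word_little_endian(memory, address):
    """Read a 32-bit word; the byte at the lowest address is least significant."""
    if not is_word_aligned(address):
        raise ValueError("word access must be 4-byte aligned")
    return struct.unpack_from("<I", memory, address)[0]


memory = bytes([0x78, 0x56, 0x34, 0x12, 0xEF, 0xBE, 0xAD, 0xDE])
first_word = read_word_little_endian(memory, 0)
second_word = read_word_little_endian(memory, 4)
```

The bytes 78 56 34 12 stored at addresses 0 to 3 form the word 0x12345678.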
ARM load-store architecture

The processing instructions (add, subtract, and so on) take their operand values from registers and always place their results in a register.

[Figure: register file, barrel shifter (BS), multiplier (M), ALU, +4, ALU bus]

The only instructions that access memory are those that copy memory values into registers (load instructions) or copy register values into memory (store instructions).

[Figure: register file, D cache, memory]
ARM instructions
In general the ARM instructions fall into one of the following three categories:
- data processing instructions
- data transfer instructions
- control flow instructions
ARM data processing

Data processing instructions use and change only register values. For example, an instruction can add two registers and place the result in a register.
ARM data transfer

Data transfer instructions copy memory values into registers (load instructions) or copy register values into memory (store instructions). An additional form, useful only in systems code, exchanges a memory value with a register value (e.g. a test-and-set instruction).
ARM control flow

Control flow instructions cause execution to switch to a different address:
- permanently (branch instructions);
- saving a return (link) address to resume the original sequence (branch and link instructions);
- trapping into system code (supervisor calls).
ARM supervisor mode

The ARM processor supports a protected supervisor mode. The protection mechanism ensures that user code cannot gain supervisor privileges without appropriate checks being carried out to ensure that the code is not attempting illegal operations.

The upshot of this for the user-level programmer is that system-level functions can only be accessed through specified supervisor calls.

[Figure: user code reaches system code, such as an I/O driver, only through a system/supervisor call]
ARM I/O programming

The ARM handles I/O (input/output) peripherals (such as disk controllers, network interfaces, and so on) as memory-mapped devices with interrupt support.

The internal registers in these devices appear as addressable locations within the ARM's memory map and may be read and written using the same (load-store) instructions as any other memory locations.
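
The idea of memory-mapped device registers can be sketched with a toy address space. The class MemoryMap, the UART_DATA address and the callback interface are hypothetical and serve only to show that the same load and store operations reach either ordinary memory or a device register, depending on the address.

```python
class MemoryMap:
    """Toy address space: RAM plus device registers at fixed addresses."""

    def __init__(self, ram_size):
        self.ram = bytearray(ram_size)
        self.devices = {}  # address -> (read_fn, write_fn)

    def map_register(self, address, read_fn, write_fn):
        """Attach a device register to an address in the memory map."""
        self.devices[address] = (read_fn, write_fn)

    def load_byte(self, address):
        """Load: served by the device if one is mapped, otherwise by RAM."""
        if address in self.devices:
            return self.devices[address][0]() & 0xFF
        return self.ram[address]

    def store_byte(self, address, value):
        """Store: delivered to the device if one is mapped, otherwise to RAM."""
        if address in self.devices:
            self.devices[address][1](value & 0xFF)
        else:
            self.ram[address] = value & 0xFF


UART_DATA = 0x1000_0000  # hypothetical device register address
sent = []                # bytes "transmitted" by the pretend device

bus = MemoryMap(ram_size=1024)
bus.map_register(UART_DATA, read_fn=lambda: 0, write_fn=sent.append)

bus.store_byte(0x10, 0x41)       # ordinary memory location
bus.store_byte(UART_DATA, 0x41)  # same operation, but reaches the device
```

After these two stores, the RAM location 0x10 holds 0x41 and the device has received one byte.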
ARM I/O interruptions

Peripherals may attract the processor's attention by making an interrupt request using either the normal interrupt (IRQ) or the fast interrupt (FIQ) input.

Both interrupt inputs are level-sensitive and maskable. Normally most interrupt sources share the IRQ input, with just one or two time-critical sources connected to the higher-priority FIQ input.

Some systems may include direct memory access (DMA) hardware external to the processor to handle high-bandwidth traffic.
ARM exceptions
The ARM architecture supports a range of interrupts, traps and supervisor calls, all grouped under the general heading of exceptions.

The general way of exception handling is the same in all cases:
- the current state is saved by copying the PC into r14_exc and the CPSR into SPSR_exc (where exc stands for the exception type);
- the processor operating mode is changed to the appropriate exception mode;
- the PC is forced to a value between 0x00 and 0x1C, the particular value depending on the type of exception.
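
The three steps of exception entry can be sketched as follows. The vector addresses and mode names in the table are those of the classic ARM architecture; the dictionary-based processor state and the function enter_exception are assumptions made for this illustration. The sketch keeps the mode in a separate field for simplicity and does not model interrupt masking or the exact offset applied to the saved PC.

```python
# exception type -> (vector address, exception mode)
VECTORS = {
    "reset": (0x00, "svc"),
    "undefined": (0x04, "und"),
    "swi": (0x08, "svc"),
    "prefetch_abort": (0x0C, "abt"),
    "data_abort": (0x10, "abt"),
    "irq": (0x18, "irq"),
    "fiq": (0x1C, "fiq"),
}


def enter_exception(state, kind):
    """Apply the three exception entry steps to a toy processor state."""
    vector, mode = VECTORS[kind]
    state["r14_" + mode] = state["pc"]     # save the PC in r14_exc
    state["spsr_" + mode] = state["cpsr"]  # save the CPSR in SPSR_exc
    state["mode"] = mode                   # switch to the exception mode
    state["pc"] = vector                   # force the PC to the vector address
    return state


state = {"pc": 0x8010, "cpsr": 0x10, "mode": "usr"}
enter_exception(state, "irq")
```

After the call, the PC holds the IRQ vector address 0x18 and the previous PC and CPSR are available in r14_irq and spsr_irq.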
ARM instruction execution

[Figure: ARM datapath. Labels: address register, incrementer, a[31:0], PC, register bank, A bus, B bus, barrel shifter (BS), multiplier (M), ALU, ALU bus, data in register, data out register, d[31:0], instruction register, control path, control signals]

Data processing instruction

[Figure: datapath activity for register-register operations]

[Figure: datapath activity for register-immediate operations; the immediate value comes from the instruction register (ir)]

Store instruction

[Figure: compute address operation; the new address is sent to the address register]

[Figure: store data and auto-index]

Branch instruction

[Figure: compute branch address; the branch address is sent to the address register]

[Figure: store return address in r14]
Summary
- ARM register bank
- ARM barrel shifter and ALU
- ARM 3-stage and 5-stage pipelines
- ARM programming model
- ARM instructions
