11/13/2013
Amit Kulkarni
ARM Ltd
Founded in November 1990
Spun out of Acorn Computers
Designs the ARM range of RISC processor cores
Licenses ARM core designs to semiconductor partners, who fabricate the chips and sell them to their customers
ARM does not fabricate silicon itself
Also develops technologies to assist with the design-in of the ARM architecture
Software tools, boards, debug hardware, application software, bus architectures, peripherals, etc.
ARM core
A key component of many successful 32-bit embedded systems. One of the most successful cores is the ARM7TDMI. Over 32 billion ARM processors had been shipped worldwide as of 2011. ARM is not a single core but a whole family of designs sharing similar design principles and a common instruction set.
What is a Microprocessor?
The word comes from the combination of micro and processor. A processor is a device that processes something. In this context, a processor is a device that processes numbers, specifically binary numbers, 0s and 1s.
To process means to manipulate. It is a general term that describes all manipulation. Again, in this context, it means to perform certain operations on the numbers, and those operations depend on the microprocessor's design.
Differences between:
Microcomputer: a computer with a microprocessor as its CPU. Includes memory, I/O, etc.
Microprocessor: a silicon chip which includes the ALU, register circuits, and control circuits.
Microcontroller: a silicon chip which includes a microprocessor, memory, and I/O in a single package.
In the early 1970s, all of the components that made up the processor were placed on a single piece of silicon for the first time. The size became several thousand times smaller and the speed became several hundred times faster. The microprocessor was born.
Definition (Contd.)
Programmable device: The microprocessor can perform different sets of operations on the data it receives depending on the sequence of instructions supplied in the given program. By changing the program, the microprocessor manipulates the data in different ways.
Instructions: Each microprocessor is designed to execute a specific group of operations. This group of operations is called an instruction set. This instruction set defines what the microprocessor can and cannot do.
Definition (Contd.)
Takes in: The data that the microprocessor manipulates must come from somewhere. It comes from input devices, which bring data into the system from the outside world. Examples include a keyboard, a mouse, switches, and the like.
Definition (Contd.)
Numbers: The microprocessor has a very narrow view of life. It only understands binary numbers. A binary digit is called a bit (a contraction of binary digit). The microprocessor recognizes and processes a group of bits together. This group of bits is called a word. The number of bits in a microprocessor's word is a measure of its abilities.
Definition (Contd.)
Words, Bytes, etc.
Early microprocessors (such as the Intel 8080 and Motorola's 6800) recognized 8-bit words. They processed information 8 bits at a time, which is why they are called 8-bit processors. They could handle large numbers, but in order to process these numbers they broke them into 8-bit pieces and processed each group of 8 bits separately.
Later microprocessors (the 8086 and 68000) were designed with 16-bit words. A group of 8 bits was referred to as a half-word or byte. A group of 4 bits is called a nibble. Also, 32-bit groups were given the name long word.
Today, most general-purpose processors manipulate at least 32 bits at a time, and there are microprocessors that can process 64, 80, or 128 bits.
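As an illustration of how an 8-bit processor handles a larger number, the following C sketch adds two 32-bit values one 8-bit piece at a time, passing the carry from each piece to the next. The function name add32_bytewise is hypothetical and chosen only for this example.

```c
#include <stdint.h>

/* Hypothetical helper: adds two 32-bit numbers the way an 8-bit
 * processor would, one byte at a time, least significant byte first. */
uint32_t add32_bytewise(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    unsigned carry = 0;

    for (int i = 0; i < 4; i++) {
        unsigned byte_a = (a >> (8 * i)) & 0xFFu;  /* i-th 8-bit piece of a */
        unsigned byte_b = (b >> (8 * i)) & 0xFFu;  /* i-th 8-bit piece of b */
        unsigned sum = byte_a + byte_b + carry;    /* at most 0x1FF */

        result |= (uint32_t)(sum & 0xFFu) << (8 * i);
        carry = sum >> 8;                          /* carry into the next piece */
    }
    return result;  /* the final carry is dropped, as in 32-bit wraparound */
}
```

For instance, add32_bytewise(0xFF, 1) produces a carry out of the lowest byte and yields 0x100.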
Definition (Contd.)
Arithmetic and Logic Operations:
Every microprocessor has arithmetic operations such as add and subtract as part of its instruction set. Most microprocessors will have operations such as multiply and divide. Some of the newer ones will have complex operations such as square root.
In addition, microprocessors have logic operations such as AND, OR, XOR, shift left, shift right, etc. Again, the number and types of operations define the microprocessor's instruction set and depend on the specific microprocessor.
Definition (Contd.)
Stored in memory :
First, what is memory? Memory is the location where information is kept while not in current use. Memory is a collection of storage devices. Usually, each storage device holds one bit. Also, in most kinds of memory, these storage devices are grouped into groups of 8. These 8 storage locations can only be accessed together, so one can only read or write in terms of bytes to and from memory. Memory is usually measured by the number of bytes it can hold. It is measured in Kilos, Megas, and lately Gigas. A Kilo in computer language is 2^10 = 1024. So, a KB (KiloByte) is 1024 bytes. Mega is 1024 Kilos and Giga is 1024 Megas.
Definition (Contd.)
Stored in memory: When a program is entered into a computer, it is stored in memory. Then as the microprocessor starts to execute the instructions, it brings the instructions from memory one at a time. Memory is also used to hold the data. The microprocessor reads (brings in) the data from memory when it needs it and writes (stores) the results into memory when it is done.
Definition (Contd.)
Produces:
For the user to see the result of the execution of the program, the results must be presented in a human readable form. The results must be presented on an output device. This can be the monitor, a paper from the printer, a simple LED or many other forms.
Overview
Instruction set architecture is distinguished from the microarchitecture, which is the set of processor design techniques used to implement the instruction set. Computers with different microarchitectures can share a common instruction set. For example, the Intel Pentium and the AMD Athlon implement nearly identical versions of the x86 instruction set.
A complex instruction set computer (CISC) has many specialized instructions, some of which may only rarely be used in practical programs. A reduced instruction set computer (RISC) simplifies the processor by implementing only instructions that are frequently used in programs.
MU - a simple processor
A simple form of processor can be built from a few basic components:
a program counter (PC) register that is used to hold the address of the current instruction;
a single register called an accumulator (ACC) that holds a data value while it is worked upon;
an arithmetic-logic unit (ALU) that can perform a number of operations on binary operands, such as add, subtract, increment, and so on;
an instruction register (IR) that holds the current instruction while it is executed;
instruction decode and control logic that employs the above components to achieve the desired results from each instruction.
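The components listed above can be sketched as a small simulator in C. The instruction format used here (a 4-bit opcode in the top bits of a 16-bit word, followed by a 12-bit address) and the opcode numbers are assumptions made for illustration, since the text does not list the instruction set; the names Cpu and cpu_run are likewise hypothetical.

```c
#include <stdint.h>

#define MEM_WORDS 4096   /* 12-bit address space (assumed) */

/* Assumed opcodes; the instruction set is not listed in the text. */
enum { LDA = 0, STO = 1, ADD = 2, SUB = 3, JMP = 4, JGE = 5, JNE = 6, STP = 7 };

typedef struct {
    uint16_t pc;              /* program counter */
    uint16_t acc;             /* accumulator */
    uint16_t ir;              /* instruction register */
    uint16_t mem[MEM_WORDS];  /* shared program and data memory */
} Cpu;

/* Runs until STP (or an unknown opcode) is executed or max_steps is
 * reached. Returns the number of instructions executed. */
int cpu_run(Cpu *cpu, int max_steps)
{
    int steps = 0;

    while (steps < max_steps) {
        cpu->ir = cpu->mem[cpu->pc & 0x0FFF];           /* fetch */
        cpu->pc = (uint16_t)((cpu->pc + 1) & 0x0FFF);
        unsigned op = cpu->ir >> 12;                    /* decode */
        uint16_t addr = cpu->ir & 0x0FFF;
        steps++;

        switch (op) {                                   /* execute */
        case LDA: cpu->acc = cpu->mem[addr]; break;
        case STO: cpu->mem[addr] = cpu->acc; break;
        case ADD: cpu->acc = (uint16_t)(cpu->acc + cpu->mem[addr]); break;
        case SUB: cpu->acc = (uint16_t)(cpu->acc - cpu->mem[addr]); break;
        case JMP: cpu->pc = addr; break;
        case JGE: if ((cpu->acc & 0x8000) == 0) cpu->pc = addr; break;
        case JNE: if (cpu->acc != 0) cpu->pc = addr; break;
        default:  return steps;                         /* STP or unknown: halt */
        }
    }
    return steps;
}
```

A four-instruction program (load, add, store, stop) placed at address 0 would leave the sum of two memory words in a third word and in the accumulator.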
MU logic design
To understand how this instruction set might be implemented we will go through the design process in a logical order. The approach taken here will be to separate the design into two components:
The datapath
The control logic
The datapath
All the components carrying, storing or processing many bits in parallel will be considered part of the datapath, including the accumulator, program counter, ALU and instruction register. For these components we will use a register transfer level (RTL) design style based on registers, multiplexers, and so on.
Datapath design
Each instruction takes exactly the number of clock cycles defined by the number of memory accesses it must make.
Contd.
Datapath operation
1. Access the memory operand and perform the desired operation.
2. Fetch the next instruction to be executed.
Instruction types
Data processing instructions such as add, subtract, and multiply.
Data movement instructions that copy data from one place in memory to another, or from memory to the processor's registers, and so on.
Control flow instructions that switch execution from one part of the program to another, possibly depending on data values.
Special instructions to control the processor's execution state, for instance to switch into a privileged mode to carry out an operating system function.
Addressing modes
1. Immediate addressing: the desired value is presented as a binary value in the instruction.
2. Absolute addressing: the instruction contains the full binary address of the desired value in memory.
3. Indirect addressing: the instruction contains the binary address of a memory location that contains the binary address of the desired value.
4. Register addressing: the desired value is in a register, and the instruction contains the register number.
5. Register indirect addressing: the instruction contains the number of a register which contains the address of the value in memory.
6. Base plus offset addressing: the instruction specifies a register (the base) and a binary offset to be added to the base to form the memory address.
7. Base plus index addressing: the instruction specifies a base register and another register (the index) which is added to the base to form the memory address.
8. Base plus scaled index addressing: as above, but the index is multiplied by a constant (usually the size of the data item, and usually a power of two) before being added to the base.
9. Stack addressing: an implicit or specified register (the stack pointer) points to an area of memory (the stack) where data items are written (pushed) or read (popped) on a last-in-first-out basis.
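The first eight addressing modes can be sketched in C by modelling a register file and a small memory. This is only an illustration: the Machine type and the op_* function names are hypothetical, and the memory is word-addressed (with addresses wrapped to the array size) to keep the example short. Stack addressing is left out here.

```c
#include <stdint.h>

#define MEM_WORDS 64

typedef struct {
    uint32_t reg[16];         /* register file */
    uint32_t mem[MEM_WORDS];  /* word-addressed memory, for simplicity */
} Machine;

/* 1. Immediate: the value is in the instruction itself. */
uint32_t op_immediate(uint32_t imm) { return imm; }

/* 2. Absolute: the instruction holds the address of the value. */
uint32_t op_absolute(const Machine *m, uint32_t addr)
{
    return m->mem[addr % MEM_WORDS];
}

/* 3. Indirect: the instruction holds the address of the address. */
uint32_t op_indirect(const Machine *m, uint32_t addr)
{
    return m->mem[m->mem[addr % MEM_WORDS] % MEM_WORDS];
}

/* 4. Register: the value is in the named register. */
uint32_t op_register(const Machine *m, unsigned rn)
{
    return m->reg[rn & 15];
}

/* 5. Register indirect: the named register holds the address. */
uint32_t op_reg_indirect(const Machine *m, unsigned rn)
{
    return m->mem[m->reg[rn & 15] % MEM_WORDS];
}

/* 6. Base plus offset: address = base register + constant offset. */
uint32_t op_base_offset(const Machine *m, unsigned rb, uint32_t offset)
{
    return m->mem[(m->reg[rb & 15] + offset) % MEM_WORDS];
}

/* 7 and 8. Base plus (scaled) index: address = base + index * scale.
 * A scale of 1 gives plain base plus index addressing. */
uint32_t op_base_index(const Machine *m, unsigned rb, unsigned ri, uint32_t scale)
{
    return m->mem[(m->reg[rb & 15] + m->reg[ri & 15] * scale) % MEM_WORDS];
}
```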
CISC
The principal trend in instruction set design was towards increasing complexity in an attempt to reduce the semantic gap that the compiler had to bridge. The origins of this trend were in the minicomputers developed during the 1970s. These computers had relatively slow main memories coupled to processors built using many simple integrated circuits. So it made sense to implement frequently used operations as microcode sequences rather than them requiring several instructions to be fetched from main memory.
CISC contd
In particular, the microcode ROM which was needed for all the complex routines absorbed an unreasonable proportion of the area of a single chip, leaving little room for other performance-enhancing features.
Compiler
RISC
Reducing the semantic gap between the processor instruction set and the high-level language is not the right way to make an efficient computer. What other options are open to the designer? What do processors actually do?
RISC contd
The ARM core uses a RISC architecture. RISC is a design philosophy aimed at delivering simple but powerful instructions that execute within a single cycle at a high clock speed. The RISC philosophy concentrates on reducing the complexity of instructions performed by the hardware, because it is easier to provide greater flexibility and intelligence in software than in hardware. As a result, a RISC design places greater demands on the compiler.
RISC contd
Contd
Pipelines: The processing of instructions is broken down into smaller units that can be executed in parallel by pipelines. Ideally the pipeline advances by one step on each cycle for maximum throughput. Instructions can be decoded in one pipeline stage. There is no need for an instruction to be executed by a miniprogram called microcode as on CISC processors.
Contd
Registers: RISC machines have a large general-purpose register set. Any register can contain either data or an address. Registers act as the fast local memory store for all data processing operations. In contrast, CISC processors have dedicated registers for specific purposes.
Contd
Load-store architecture: The processor operates on data held in registers. Separate load and store instructions transfer data between the register bank and external memory. Memory accesses are costly, so separating memory accesses from data processing provides an advantage because you can use data items held in the register bank multiple times without needing multiple memory accesses. In contrast, with a CISC design the data processing operations can act on memory directly.
Memory
Hierarchy Width Types
Peripherals
All ARM peripherals are memory mapped: the programming interface is a set of memory-addressed registers. The address of these registers is an offset from a specific peripheral base address.
Memory controllers
Interrupt controllers
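A memory-mapped register interface of this kind is usually accessed in C through a volatile pointer to the peripheral base address plus a register offset. The sketch below assumes an imaginary timer peripheral; the register names, offsets, and the meaning of bit 0 of the control register are all hypothetical and do not describe any real device.

```c
#include <stdint.h>

/* Hypothetical register offsets (in bytes) for an imaginary timer. */
#define TIMER_LOAD    0x00u
#define TIMER_VALUE   0x04u
#define TIMER_CONTROL 0x08u

/* Writes a 32-bit register located at base + offset. */
void reg_write(volatile uint32_t *base, uint32_t offset, uint32_t value)
{
    base[offset / 4] = value;
}

/* Reads a 32-bit register located at base + offset. */
uint32_t reg_read(volatile uint32_t *base, uint32_t offset)
{
    return base[offset / 4];
}

/* Hypothetical driver routine: loads the reload value, then sets the
 * (assumed) enable bit, bit 0 of the control register, while keeping
 * the other control bits unchanged. */
void timer_start(volatile uint32_t *base, uint32_t reload)
{
    reg_write(base, TIMER_LOAD, reload);
    reg_write(base, TIMER_CONTROL, reg_read(base, TIMER_CONTROL) | 1u);
}
```

On real hardware the base pointer would be a fixed address taken from the device documentation; on a host machine the same functions can be exercised by passing an ordinary array in place of the peripheral.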
Summary
RISC versus CISC
RISC features:
A fixed (32-bit) instruction size with few formats.
A load-store architecture where instructions that process data operate only on registers and are separate from instructions that access memory.
A large register bank of thirty-two 32-bit registers, all of which could be used for any purpose, to allow the load-store architecture to operate efficiently.
Hard-wired instruction decode logic; CISC processors used large microcode ROMs to decode their instructions.
Pipelined execution; CISC processors allowed little, if any, overlap between consecutive instructions (though they do now).
Single-cycle execution; CISC processors typically took many clock cycles to complete a single instruction.
RISC advantages: a smaller die size, a shorter development time, and higher performance.
RISC drawback: poor code density.
CISC advantage: greater code density.
Chapter-2
Registers
Current visible registers (shown for Abort mode): r0 to r12, r13 (sp), r14 (lr), r15 (pc), cpsr, and spsr.
Banked out registers: the other modes shown (IRQ, SVC, and Undef) each have their own r13 (sp), r14 (lr), and spsr.
T bit
Architecture xT only
T = 0: Processor in ARM state
T = 1: Processor in Thumb state
J bit
Architecture 5TEJ only
J = 1: Processor in Jazelle state
Mode bits
Specify the processor mode
Processor Modes
User: unprivileged mode under which most tasks run
FIQ: entered when a high priority (fast) interrupt is raised
IRQ: entered when a low priority (normal) interrupt is raised
Supervisor: entered on reset and when a Software Interrupt instruction is executed
Abort: used to handle memory access violations
Undef: used to handle undefined instructions
System: privileged mode using the same registers as user mode
Banked Registers
These states cannot be mixed: you cannot intermingle sequential ARM, Thumb, and Jazelle instructions. The Jazelle J and Thumb T bits in the cpsr reflect the state of the processor. When both J and T bits are 0, the processor is in ARM state and executes ARM instructions.
Jazelle
Jazelle functionality was specified in the ARMv5TEJ architecture, and the first processor with Jazelle technology was the ARM926EJ-S. Jazelle RCT (Runtime Compilation Target) is a different technology; it is based on ThumbEE mode and supports ahead-of-time (AOT) and just-in-time (JIT) compilation with Java and other execution environments. The most prominent use of Jazelle DBX is by manufacturers of mobile phones to increase the execution speed of Java ME games and applications.
Interrupt Masks
Interrupt masks are used to stop specific interrupt requests from interrupting the processor. There are two interrupt request levels available on the ARM processor core: interrupt request (IRQ) and fast interrupt request (FIQ).
The cpsr has two interrupt mask bits, 7 and 6 (or I and F), which control the masking of IRQ and FIQ, respectively. The I bit masks IRQ when set to binary 1, and similarly the F bit masks FIQ when set to binary 1.
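The cpsr fields described so far (mode bits, the T and J state bits, and the I and F interrupt masks) can be decoded with a few bit tests. The sketch below is illustrative only: the function and macro names are hypothetical, and the bit positions of T and J and the 5-bit mode encodings are taken from the ARM architecture rather than from the text above.

```c
#include <stdint.h>

#define CPSR_MODE_MASK 0x1Fu        /* bits 4..0: processor mode */
#define CPSR_T_BIT     (1u << 5)    /* Thumb state */
#define CPSR_F_BIT     (1u << 6)    /* FIQ masked when set */
#define CPSR_I_BIT     (1u << 7)    /* IRQ masked when set */
#define CPSR_J_BIT     (1u << 24)   /* Jazelle state */

typedef enum { STATE_ARM, STATE_THUMB, STATE_JAZELLE, STATE_OTHER } CoreState;

/* Returns a printable name for the processor mode held in the cpsr. */
const char *cpsr_mode_name(uint32_t cpsr)
{
    switch (cpsr & CPSR_MODE_MASK) {
    case 0x10: return "User";
    case 0x11: return "FIQ";
    case 0x12: return "IRQ";
    case 0x13: return "Supervisor";
    case 0x17: return "Abort";
    case 0x1B: return "Undef";
    case 0x1F: return "System";
    default:   return "Unknown";
    }
}

/* Works out the instruction-set state from the J and T bits. */
CoreState cpsr_state(uint32_t cpsr)
{
    int t = (cpsr & CPSR_T_BIT) != 0;
    int j = (cpsr & CPSR_J_BIT) != 0;

    if (!j && !t) return STATE_ARM;
    if (!j && t)  return STATE_THUMB;
    if (j && !t)  return STATE_JAZELLE;
    return STATE_OTHER;   /* J = 1 and T = 1 is not covered in the text */
}

int cpsr_irq_masked(uint32_t cpsr) { return (cpsr & CPSR_I_BIT) != 0; }
int cpsr_fiq_masked(uint32_t cpsr) { return (cpsr & CPSR_F_BIT) != 0; }
```

For example, a cpsr value whose low byte is 0xD3 decodes as Supervisor mode, ARM state, with both IRQ and FIQ masked.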
Condition Flags
Condition mnemonics
Pipeline
A pipeline is the mechanism a RISC processor uses to execute instructions. Using a pipeline speeds up execution by fetching the next instruction while other instructions are being decoded and executed.
Exception Handling
When an exception occurs, the ARM:
Copies CPSR into SPSR_<mode>
Sets appropriate CPSR bits:
  Change to ARM state
  Change to exception mode
  Disable interrupts (if appropriate)
Stores the return address in LR_<mode>
Sets PC to vector address
To return, exception handler needs to:
Restore CPSR from SPSR_<mode>
Restore PC from LR_<mode>
Vector table: eight entries at addresses 0x00 to 0x1C, four bytes apart; the Reset vector is at 0x00.
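The exception entry sequence, together with the matching return (restoring the CPSR from the SPSR and the PC from the LR), can be modelled in a few lines of C. This is a simplified sketch under stated assumptions: the Core structure and function names are hypothetical, only one banked SPSR/LR pair is modelled, and the adjustment of the return address that some exception types need is left to the caller.

```c
#include <stdint.h>

typedef struct {
    uint32_t cpsr;
    uint32_t pc;
    uint32_t spsr_mode;  /* SPSR of the exception mode being entered */
    uint32_t lr_mode;    /* banked LR of that mode */
} Core;

/* Models exception entry. mode_bits is the 5-bit mode encoding,
 * vector is the vector address, return_addr is the address to resume at. */
void exception_enter(Core *c, uint32_t mode_bits, uint32_t vector,
                     uint32_t return_addr, int mask_fiq)
{
    c->spsr_mode = c->cpsr;             /* copy CPSR into SPSR_<mode> */
    c->cpsr &= ~(0x1Fu | (1u << 5));    /* clear mode bits and T: ARM state */
    c->cpsr |= mode_bits & 0x1Fu;       /* change to exception mode */
    c->cpsr |= 1u << 7;                 /* disable IRQ */
    if (mask_fiq)
        c->cpsr |= 1u << 6;             /* disable FIQ where appropriate */
    c->lr_mode = return_addr;           /* store return address in LR_<mode> */
    c->pc = vector;                     /* set PC to the vector address */
}

/* Models the return: restore CPSR from SPSR_<mode> and PC from LR_<mode>. */
void exception_return(Core *c)
{
    c->cpsr = c->spsr_mode;
    c->pc = c->lr_mode;
}
```

Entering Supervisor mode (mode bits 0x13) through the vector at 0x08 from User mode leaves the original cpsr in spsr_mode, so exception_return puts the core back exactly where it was.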
Nomenclature
An Introduction Chapter-3
Data types
ARM processors support six data types:
8-bit signed and unsigned bytes.
16-bit signed and unsigned half-words; these are aligned on 2-byte boundaries.
32-bit signed and unsigned words; these are aligned on 4-byte boundaries.
If you use the S suffix on a data processing instruction, then it updates the flags in the cpsr.
Barrel Shifter
Data processing instructions are processed within the arithmetic logic unit (ALU). A unique and powerful feature of the ARM processor is the ability to shift the 32-bit binary pattern in one of the source registers left or right by a specific number of positions before it enters the ALU. This shift increases the power and flexibility of many data processing operations. Pre-processing or shift occurs within the cycle time of the instruction. This is particularly useful for loading constants into a register and achieving fast multiplies or division by a power of 2.
Barrel Shifter
x represents the register being shifted and y represents the shift amount.
Shift operations
LSL #n Logical shift left immediate
Shift operations
ASL #n Arithmetic shift left immediate: This is a synonym for LSL #n and has an identical effect.
LSR #n Logical shift right immediate: n is the number of bit positions by which the value is shifted. It has the value 1..32. An LSR by one bit is shown below:
Shift operations
ASR #n Arithmetic shift right immediate: n is the number of bit positions by which the value is shifted. It has the value 1..32. An ASR by one bit is shown below:
Shift operations
ROR #n Rotate right immediate: n is the number of bit positions to rotate in the range 1..31. A rotate right by one bit is shown below:
Shift operations
RRX Rotate right one bit with extend: This special case of rotate right has a slightly different effect from the usual rotates. There is no count; it always rotates by one bit only. The pictorial representation of RRX is:
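The five shift operations can be expressed in portable C as a rough model of what the barrel shifter computes. The function names are hypothetical. The ranges for LSR, ASR, and ROR follow the text; the 0..31 range for LSL is an assumption based on the ARM immediate shift encoding. The carry flag is modelled only for RRX, where it is part of the result.

```c
#include <assert.h>
#include <stdint.h>

/* LSL #n: zeros are shifted in at the bottom. */
uint32_t lsl(uint32_t x, unsigned n)
{
    assert(n <= 31);
    return x << n;
}

/* LSR #n: zeros are shifted in at the top; n = 32 clears the value. */
uint32_t lsr(uint32_t x, unsigned n)
{
    assert(n >= 1 && n <= 32);
    return (n == 32) ? 0u : (x >> n);
}

/* ASR #n: copies of the sign bit (bit 31) are shifted in at the top. */
uint32_t asr(uint32_t x, unsigned n)
{
    assert(n >= 1 && n <= 32);
    uint32_t sign_fill = (x & 0x80000000u) ? 0xFFFFFFFFu : 0u;

    if (n == 32)
        return sign_fill;
    return (x >> n) | (sign_fill << (32 - n));
}

/* ROR #n: bits leaving the bottom re-enter at the top. */
uint32_t ror(uint32_t x, unsigned n)
{
    assert(n >= 1 && n <= 31);
    return (x >> n) | (x << (32 - n));
}

/* RRX: a 33-bit rotate by one through the carry flag. The old carry
 * enters at bit 31 and the old bit 0 becomes the new carry. */
uint32_t rrx(uint32_t x, unsigned carry_in, unsigned *carry_out)
{
    *carry_out = x & 1u;
    return (x >> 1) | ((uint32_t)(carry_in & 1u) << 31);
}
```

Here lsl(x, 2) multiplies x by 4, and asr(x, 1) halves a two's complement value (rounding towards minus infinity), which is how shifts provide fast multiplies and divisions by a power of 2.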
Arithmetic Instructions
Logical Instructions
Comparison Instructions
The comparison instructions are used to compare or test a register with a 32-bit value. They update the cpsr flag bits according to the result, but do not affect other registers. Comparison instructions always update the flags, so there is no need to apply the S suffix to them.
Multiply Instructions
The multiply instructions multiply the contents of a pair of registers and, depending upon the instruction, accumulate the result with another register. The long multiplies accumulate onto a pair of registers representing a 64-bit value. The final result is placed in a destination register or a pair of registers.
Multiply Instructions
Branch Instructions
A branch instruction changes the flow of execution or is used to call a routine. This type of instruction allows programs to have subroutines, if-then-else structures, and loops.
Condition codes
To make an instruction conditional, a two-letter suffix is added to the mnemonic.
AL Always: An instruction with this suffix is always executed. ADDAL and ADD mean the same thing: add unconditionally.
NV Never: Such instructions might be used for 'padding' or perhaps to use up a (very) small amount of time in a program.
EQ Equal: This condition is true if the result flag Z (zero) is set. This might arise after a compare instruction where the operands were equal, or in any data instruction which received a zero result into the destination.
Condition codes
NE Not equal: This is the opposite of EQ, and is true if the Z flag is cleared. If Z is set, an instruction with the NE condition will not be executed.
VS Overflow set: This condition is true if the result flag V (overflow) is set. Add, subtract, and compare instructions affect the V flag.
VC Overflow clear: The opposite of VS.
MI Minus: Instructions with this condition only execute if the N (negative) flag is set.
Condition codes
PL Plus: This is the opposite of the MI condition; instructions with the PL condition will only execute if the N flag is cleared.
CS Carry set: This condition is true if the result flag C (carry) is set. The carry flag is affected by arithmetic instructions such as ADD, SUB, and CMP.
CC Carry clear: This is the inverse condition to CS.
HI Higher: This condition is true if the C flag is set and the Z flag is cleared.
LS Lower or same: This condition is true if the C flag is cleared or the Z flag is set.
Condition codes
GE Greater than or equal: This is true if N is cleared and V is cleared, or N is set and V is set.
LT Less than: This is the opposite of GE; instructions with this condition are executed if N is set and V is cleared, or N is cleared and V is set.
GT Greater than: This is the same as GE, with the addition that the Z flag must be cleared too.
LE Less than or equal: This is the same as LT, and is also true whenever the Z flag is set.
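The condition codes listed above map directly onto tests of the N, Z, C, and V flags. The following C sketch evaluates each condition and also models how a compare of two 32-bit values sets the flags. The type and function names (Flags, Cond, cond_passes, cmp_flags) are hypothetical.

```c
#include <stdint.h>

typedef struct { unsigned n, z, c, v; } Flags;   /* each field is 0 or 1 */

typedef enum { EQ, NE, CS, CC, MI, PL, VS, VC,
               HI, LS, GE, LT, GT, LE, AL, NV } Cond;

/* Returns nonzero if an instruction with this condition would execute. */
int cond_passes(Cond cond, Flags f)
{
    switch (cond) {
    case EQ: return f.z;
    case NE: return !f.z;
    case CS: return f.c;
    case CC: return !f.c;
    case MI: return f.n;
    case PL: return !f.n;
    case VS: return f.v;
    case VC: return !f.v;
    case HI: return f.c && !f.z;
    case LS: return !f.c || f.z;
    case GE: return f.n == f.v;
    case LT: return f.n != f.v;
    case GT: return !f.z && f.n == f.v;
    case LE: return f.z || f.n != f.v;
    case AL: return 1;
    default: return 0;   /* NV: never */
    }
}

/* Flags as a compare of a with b would set them (it computes a - b). */
Flags cmp_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    Flags f;

    f.n = r >> 31;                       /* result is negative */
    f.z = (r == 0);                      /* result is zero */
    f.c = (a >= b);                      /* no borrow occurred */
    f.v = ((a ^ b) & (a ^ r)) >> 31;     /* signed overflow */
    return f;
}
```

Comparing 0xFFFFFFFF with 1 shows why both signed and unsigned conditions exist: HI passes (unsigned, the first value is larger) and LT also passes (signed, the first value is -1).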
Load-Store Instructions
Load-store instructions transfer data between memory and processor registers. There are three types of load-store instructions:
single-register transfer, multiple-register transfer, and swap.
Single-Register Transfer
These instructions are used for moving a single data item in and out of a register. The data types supported are signed and unsigned words (32-bit), halfwords (16-bit), and bytes.
Single-Register Transfer
Multiple-Register Transfer
Load-store multiple instructions can transfer multiple registers between memory and the processor in a single instruction. The transfer occurs from a base address register Rn pointing into memory. Multiple-register transfer instructions are more efficient than single-register transfers for moving blocks of data around memory and for saving and restoring context and stacks.
Multiple-Register Transfer
Load-store multiple instructions can increase interrupt latency. ARM implementations do not usually interrupt instructions while they are executing. For example, on an ARM7 a load multiple instruction takes 2 + Nt cycles, where N is the number of registers to load and t is the number of cycles required for each sequential access to memory.
Stack Operations
The ARM architecture uses the load-store multiple instructions to carry out stack operations. The pop operation (removing data from a stack) uses a load multiple instruction; similarly, the push operation (placing data onto the stack) uses a store multiple instruction. When using a stack you have to decide whether the stack will grow up or down in memory. A stack is either ascending (A) or descending (D). Ascending stacks grow towards higher memory addresses; in contrast, descending stacks grow towards lower memory addresses.
Stack Operations
When you use a full stack (F), the stack pointer sp points to an address that is the last used or full location (i.e., sp points to the last item on the stack). In contrast, if you use an empty stack (E) the sp points to an address that is the first unused or empty location (i.e., it points after the last item on the stack).
Stack Operations
When handling a checked stack there are three attributes that need to be preserved: the stack base, the stack pointer, and the stack limit.
The stack base is the starting address of the stack in memory. The stack pointer initially points to the stack base. If the stack pointer passes the stack limit, then a stack overflow error has occurred; if it goes back past the stack base, then a stack underflow error has occurred.
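A checked full descending stack can be sketched in C as follows. The names FdStack, fd_init, fd_push, and fd_pop are hypothetical; the stack base is the top of the array, the stack limit is index 0, and the assertions stand in for the overflow and underflow checks.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16

typedef struct {
    uint32_t mem[STACK_WORDS];
    unsigned sp;   /* index of the last used word; STACK_WORDS when empty */
} FdStack;

/* The stack pointer starts at the stack base (one past the top word). */
void fd_init(FdStack *s)
{
    s->sp = STACK_WORDS;
}

/* Push: decrement sp first, then store, so sp always points at the
 * last item (full) and moves towards lower addresses (descending). */
void fd_push(FdStack *s, uint32_t value)
{
    assert(s->sp > 0);              /* passing the limit would be overflow */
    s->sp--;
    s->mem[s->sp] = value;
}

/* Pop: load the item sp points at, then increment sp. */
uint32_t fd_pop(FdStack *s)
{
    assert(s->sp < STACK_WORDS);    /* going back past the base is underflow */
    uint32_t value = s->mem[s->sp];
    s->sp++;
    return value;
}
```

Pushing 1 and then 2 and popping twice returns 2 and then 1, and leaves the stack pointer back at the stack base.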
Swap Instruction
The swap instruction is a special case of a load-store instruction. It swaps the contents of memory with the contents of a register. This instruction is an atomic operation: it reads and writes a location in the same bus operation, preventing any other instruction from reading or writing to that location until it completes.
Swap Instruction
The swap instruction loads a word from memory into register r0 and overwrites the memory with register r1.
This instruction is particularly useful when implementing semaphores and mutual exclusion in an operating system.
Swap Instruction
This example shows a simple data guard that can be used to protect data from being written by another task. The SWP instruction holds the bus until the transaction is complete.
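The same idea can be sketched in portable C, using the C11 atomic_exchange operation in place of the SWP instruction: the guard is taken by atomically swapping a 1 into memory and inspecting the value that was there before. The Guard type and the guard_* function names are hypothetical, and this is a model of the technique rather than the code from the slide.

```c
#include <stdatomic.h>

typedef atomic_uint Guard;   /* 0 = free, 1 = taken */

/* Tries to take the guard. Atomically swaps 1 into memory and returns
 * nonzero if the previous value was 0, that is, the guard was free. */
int guard_try_acquire(Guard *g)
{
    return atomic_exchange(g, 1u) == 0u;
}

/* Spins until the guard has been taken. */
void guard_acquire(Guard *g)
{
    while (!guard_try_acquire(g)) {
        /* busy-wait: another task holds the guard */
    }
}

/* Releases the guard so that another task can take it. */
void guard_release(Guard *g)
{
    atomic_store(g, 0u);
}
```

A task would call guard_acquire before writing the protected data and guard_release afterwards; a second attempt to take a guard that is already held fails until it is released.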
When the processor executes an SWI instruction, it sets the program counter pc to the offset 0x8 in the vector table. The instruction also forces the processor mode to SVC, which allows an operating system routine to be called in a privileged mode.
In the syntax you can see a label called fields. This can be any combination of control (c), extension (x), status (s), and flags (f).
Coprocessor Instructions
Coprocessor instructions are used to extend the instruction set. A coprocessor can either provide additional computation capability or be used to control the memory subsystem including caches and memory management. The coprocessor instructions include data processing, register transfer, and memory transfer instructions.
Code Density
In early computers, memory was expensive, so minimizing the size of a program to make sure it would fit in the limited memory was often central. Thus the combined size of all the instructions needed to perform a particular task, the code density, was an important characteristic of any instruction set. Computers with high code density often have complex instructions for procedure entry, parameterized returns, loops, etc. (and were therefore retroactively named Complex Instruction Set Computers, CISC).
Overview
The Thumb instruction set addresses the issue of code density. Thumb encodes a subset of the 32-bit ARM instructions into a 16-bit instruction set space. The Thumb programmer's model maps onto the ARM programmer's model. Implementations of Thumb use dynamic decompression in an ARM instruction pipeline, and the instructions then execute as standard ARM instructions within the processor. Thumb has higher code density than ARM, meaning that an executable program takes up less space in memory. On average, a Thumb implementation of the same code takes up around 30% less memory than the equivalent ARM implementation.
Overview
Thumb is not a complete architecture. It is not anticipated that a processor would execute Thumb instructions without also supporting the ARM instruction set. A Thumb implementation may use more instructions, but the overall memory footprint is reduced.
An Introduction Chapter-3
1. No direct access to the cpsr or spsr.
2. To alter the cpsr or spsr, you must switch into ARM state to use MSR and MRS.
3. No coprocessor instructions in Thumb state.
Branch Instruction
Data Pro
Code optimization
Optimizing code takes time and reduces source code readability, so optimize only functions that are frequently executed and important for performance. C compilers have to translate your C function literally into assembler so that it works for all possible inputs. In practice, many of the input combinations are not possible or won't occur.
Example
1. The compiler does not know whether N can be 0 on input or not.
2. The compiler doesn't know whether the data array pointer is four-byte aligned or not.
3. Nor does it know whether N is a multiple of four or not.
4. The compiler must be conservative and assume all possible values for N and all possible alignments for data.
Efficient C code
To write efficient C code, you must be aware of:
1. Areas where the C compiler has to be conservative,
2. The limits of the processor architecture the C compiler is mapping to, and
3. The limits of a specific C compiler.
Example
C Looping Structures
Loops with a fixed number of iterations.
Loops with a variable number of iterations.
Loop unrolling.
Is it an efficient loop?
This is not efficient. On the ARM, a loop should only use two instructions:
A subtract to decrement the loop counter, which also sets the condition code flags on the result
A conditional branch instruction
The key point is that the loop counter should count down to zero rather than counting up to some arbitrary limit. Then the comparison with zero is free since the result is stored in the condition flags. Since we are no longer using i as an array index, there is no problem in counting down rather than up.
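The difference between the two loop styles can be sketched in C with a pair of checksum functions. The function names checksum_count_up and checksum_count_down are hypothetical; both compute the same sum, but the second counts down to zero so that a compiler for ARM can end the loop with a flag-setting subtract followed by a conditional branch.

```c
/* Counts up: the loop needs an add, a compare against n, and a branch. */
int checksum_count_up(const int *data, unsigned int n)
{
    int sum = 0;

    for (unsigned int i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Counts down to zero: the decrement itself can set the flags, so no
 * separate compare is needed before the conditional branch. */
int checksum_count_down(const int *data, unsigned int n)
{
    int sum = 0;

    for (; n != 0; n--)
        sum += *data++;
    return sum;
}
```

Whether a given compiler actually emits the two-instruction loop ending depends on the compiler and its optimization settings.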
Loop Unrolling
Each loop iteration costs two instructions in addition to the body of the loop: 1. a subtract to decrement the loop count and 2. a conditional branch. These instructions are the loop overhead. On ARM7 or ARM9 processors the subtract takes one cycle and the branch three cycles, giving an overhead of four cycles per loop.
Loop Unrolling
Some of these cycles can be saved by unrolling a loop, that is, repeating the loop body several times and reducing the number of loop iterations by the same proportion. There are two questions you need to ask when unrolling a loop:
How many times should I unroll the loop?
What if the number of loop iterations is not a multiple of the unroll amount?
Example
int checksum_v9(int *data, unsigned int N)
{
    int sum = 0;

    /* Loop body unrolled four times; assumes N is a nonzero multiple of 4. */
    do {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;
    } while (N != 0);
    return sum;
}
Loop unrolling
1. Only unroll loops that are important for the overall performance of the application. Otherwise unrolling will increase the code size with little performance benefit. Unrolling may even reduce performance by evicting more important code from the cache.
2. Try to arrange it so that array sizes are multiples of your unroll amount. If this isn't possible, then you must add extra code to take care of the leftover cases. This increases the code size a little but keeps the performance high.
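The extra code for the leftover cases mentioned in point 2 could look like the sketch below. The function name checksum_unrolled is hypothetical; the main loop is unrolled four times and a short second loop handles the one to three elements that remain when the count is not a multiple of four.

```c
int checksum_unrolled(const int *data, unsigned int n)
{
    int sum = 0;
    unsigned int blocks = n / 4;   /* number of full groups of four */
    unsigned int rest = n % 4;     /* leftover elements, 0 to 3 */

    /* Main loop: four elements per iteration, counting down to zero. */
    for (; blocks != 0; blocks--) {
        sum += data[0];
        sum += data[1];
        sum += data[2];
        sum += data[3];
        data += 4;
    }

    /* Leftover loop: runs at most three times. */
    for (; rest != 0; rest--)
        sum += *data++;

    return sum;
}
```

Unlike a do-while version that requires a nonzero multiple of four, this form also accepts a count of zero.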
Register Allocation
The compiler attempts to allocate a processor register to each local variable you use in a C function. It will try to use the same register for different local variables if the uses of the variables do not overlap. When there are more local variables than available registers, the compiler stores the excess variables on the processor stack. These variables are called spilled or swapped out variables since they are written out to memory. Spilled variables are slow to access compared to variables allocated to registers.
Register Allocation
To implement a function efficiently, you need to:
Minimize the number of spilled variables
Ensure that the most important and frequently accessed variables are stored in registers
Function Calls
The ARM Procedure Call Standard (APCS) defines how to pass function arguments and return values in ARM registers.
The more recent ARM-Thumb Procedure Call Standard (ATPCS) covers ARM and Thumb interworking.
The first four integer arguments are passed in the first four ARM registers: r0, r1, r2, and r3.
Subsequent integer arguments are placed on the full descending stack, ascending in memory.
Function Calls
Example
Function with 5 arguments
To reduce overheads
The caller function need not preserve registers that it can see the callee doesn't corrupt. Therefore the caller function need not save all the ATPCS corruptible registers. If the callee function is very small, then the compiler can inline the code in the caller function. This removes the function call overhead completely.
Pointer Aliasing
Two pointers are said to alias when they point to the same address. If you write to one pointer, it will affect the value you read from the other pointer. In a function, the compiler often doesn't know which pointers can alias and which pointers can't.
How to avoid?
Do not rely on the compiler to eliminate common subexpressions involving memory accesses. Instead create new local variables to hold the expression. This ensures the expression is evaluated only once. Avoid taking the address of local variables. The variable may be inefficient to access from then on.
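The first piece of advice can be illustrated with a pair of hypothetical functions that each add a step value to two timers. In the first version the compiler must reload *step after the first store, because step might alias timer1; in the second version the value is copied into a local variable and read from memory only once.

```c
/* The compiler cannot assume that step differs from timer1, so *step
 * has to be read from memory twice. */
void timers_reload(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}

/* The local variable holds the value of *step, so memory is read once
 * and the store to *timer1 cannot change what is added to *timer2. */
void timers_local(int *timer1, int *timer2, int *step)
{
    int s = *step;

    *timer1 += s;
    *timer2 += s;
}
```

The two functions give identical results when the pointers are distinct. They differ when step and timer1 point to the same variable, which is exactly the case the compiler has to allow for in the first version.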
Structure Arrangement
Place all 8-bit elements at the start of the structure. Place all 16-bit elements next, then 32-bit, then 64-bit. Place all arrays and larger elements at the end of the structure. If the structure is too big for a single instruction to access all the elements, then group the elements into substructures. The compiler can maintain pointers to the individual substructures.
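The effect of ordering elements by size can be seen by comparing two structures that hold the same four members. The structure and function names below are hypothetical. The padding amounts depend on the compiler and target; on a typical 32-bit ARM ABI the first layout occupies 12 bytes and the second 8, but that figure is an assumption, not a guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* Members in an arbitrary order: the compiler may insert padding after
 * a and after d so that b and the structure as a whole stay aligned. */
struct unordered {
    uint8_t  a;
    uint32_t b;
    uint8_t  c;
    uint16_t d;
};

/* The same members sorted by size, smallest first. */
struct ordered {
    uint8_t  a;
    uint8_t  c;
    uint16_t d;
    uint32_t b;
};

/* The members themselves need 1 + 4 + 1 + 2 = 8 bytes; anything beyond
 * that is padding added by the compiler. */
size_t unordered_padding(void) { return sizeof(struct unordered) - 8; }
size_t ordered_padding(void)   { return sizeof(struct ordered) - 8; }
```

Calling the two functions on a given target shows how many bytes of padding each layout costs there.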
#include <stdio.h>

extern void thumb_function(void);

int main(void)
{
    printf("Hello from ARM\n");
    thumb_function();
    printf("And goodbye from ARM\n");
    return (0);
}
        AREA    AddReg,CODE,READONLY    ; Name this block of code.
        ENTRY                           ; Mark first instruction to call.
Main
        ADR     r0, ThumbProg + 1       ; Generate branch target address
                                        ; and set bit 0, hence arrive
                                        ; at target in Thumb state.
        BX      r0                      ; Branch exchange to ThumbProg.
        CODE16                          ; Subsequent instructions are Thumb.
ThumbProg
        MOV     r2, #2
        MOV     r3, #3
        ADD     r2, r2, r3
        ADR     r0, ARMProg
        BX      r0
        CODE32                          ; Subsequent instructions are ARM.
ARMProg
        MOV     r4, #4
        MOV     r5, #5
        ADD     r4, r4, r5
Stop
        MOV     r0, #0x18
        LDR     r1, =0x20026
        SWI     0x123456
        END