
Advanced Computer Architecture
Part II: Embedded Computing
Digital Signal Processors

Paolo.Ienne@epfl.ch
EPFL – I&C – LAP

Embedded Systems

• Application-Specific:
  - Application fixed in advance
  - Not or very moderately programmable by the user
• Reactive:
  - Reacts to events coming from the environment
  - Has real-time constraints
• Efficient:
  - Cost reduction must profit from specialisation
  - Low power, small size, …
2 AdvCompArch — Digital Signal Processors © Ienne 2003-05

Embedded Processors

[Figure. Source: Hennessy & Patterson, © MK 2005]

• Until recently, embedded processors were almost always simple lowest-cost devices (8-bit microcontrollers, etc.)

But it is changing!…

Processor Sales Per Architecture

[Figure. Source: Hennessy & Patterson, © MK 2005]

High-end Embedded Processors

• Networking, wireless communication, printer and disk controllers, DVDs and video, digital photography, medical devices…
• Computing power ever growing
  - Sometimes products with 5-10 Digital Signal Processors on a single die (e.g., xDSL, VoIP)
  - Multimedia, cryptographic capabilities, adaptive signal processing, etc.

Specificities of Embedded Processors

• Cost used to be the only concern; now performance/cost is at a premium, and still not performance alone as in PCs (Intel model); performance is often a constraint
• Binary compatibility is less of an issue for embedded systems
• Systems-on-Chip make processor volume irrelevant (moderate motivation toward a single processor for all products)

General Purpose Processors: Cost and Pricing Policy

[Two figure slides, one of them showing Intel list prices vs. time. Source: Microprocessor Report, © MPR 2002]

Cost and Performance

• Performance is a design constraint
  - General Purpose: new high-end processor must be tangibly faster than best in class
  - Embedded: adequately programmed, must just satisfy some minimal performance constraints
• Cost is the optimisation criterion
  - General Purpose: must be within some typical range (e.g., 50-120 USD) → profit margin can be as high as some factor (2-3x)
  - Embedded: must be minimal → economic margin on the whole product can be as low as a few percentage points

Cost Is Not Just the Processor…

• Processors have tangible induced costs
• Some could require:
  - Larger memories
  - More expensive memories (e.g., dual-port)
  - Caches (I and/or D)
  - Peripherals and accelerators
  - Faster clock rate
  - …
• System cost can be extremely influenced by the architecture of the processor

Types of Embedded Processors

• Microcontrollers
  - Relatively slow, microprogrammed, CISC processors
  - Typically derivatives of old or very old general purpose processor families (68k, 8051, 6502, etc.)
• RISC Processors
  - Pipelined, relatively simple RISCs, often with special architectural features for the embedded market
  - Typical representatives: ARM7 and ARM9
• Digital Signal Processors
  - Special family of processors with peculiar architectures for arithmetic intensive, signal processing applications
  - Typical representatives: TI C620, DSP56k, etc.
• Multimedia Processors…

Completely Different Benchmarks

• General Purpose (Word, Powerpoint, gcc,…)
  - SPEC → Commercial
    * Scientific computing
    * Regular and irregular typical user applications…
• DSPs (IIR, FIR, FFT,…)
  - DSPstone (Aachen Uni) → Academic
  - EEMBC (pronounced “embassy”) → Commercial
    * IIR, FIR, IDCT, FFT, IFFT, PWM,…
    * Matrix arithmetic, bit manipulation, table lookup, interpolation,…
    * Jpeg, RGB-to-CMYK, RGB-to-YIQ, Bezier curves, rotations,…
    * Viterbi, autocorrelation, convolutional encoders,…

Trends in Computing?

[Figure: architectures (VLIW, OOO SSC, RISC, SMT, CISC, DSP) plotted on two axes, processor regularity and processor complexity]

Pressure on the Compilers

• Performance
  - Squeeze out every possible MIPS of performance from irregular architectures
• Code Size
  - Memory is a key cost factor in embedded systems, much more than in general purpose systems
• Power Consumption
  - Important metric in embedded systems, hardly of any relevance in general purpose computing (i.e., not even considered by compilers)

Typical Features of DSPs

• Arithmetic and Datapath
  - Fixed-point arithmetic support
  - MAC = multiply-accumulate instruction
  - Special registers, not directly accessible
• Memory Architecture
  - Harvard architecture
  - Multiple data memories
• Addressing Modes
  - Special address generators
  - Bit-reversed addressing
  - Circular buffers
• Optimised Control
  - Zero-overhead loops
• Special-purpose peripherals…

DSP Arithmetic: Fixed-Point Vs. Floating-Point

• Typical example of embedded processor economics: much more complexity in designing the algorithm (NRE cost) and in programming to get much less complexity in the hardware (mfg. cost)
• Floating-point DSP ~2-4x cost of Fixed-point DSP and much slower…
• Very poor support in automatic tools yet ⇒ decisions taken by algorithm analysis, simulation, and compliance tests (e.g., accumulated error over a test set below some value)

Fixed Point

• In principle, if one adds a fractional point in a fixed position, hardware for integers works just as well and there are no additional ISA needs

  Integer (bit weights 2^4 2^3 2^2 2^1 2^0):

      0 1 0 0 1 (binary)  →   9 (decimal)
    + 0 0 0 1 1 (binary)  →   3 (decimal)
      0 1 1 0 0 (binary)  →  12 (decimal)

  Fixed point (bit weights 2^1 2^0 2^-1 2^-2 2^-3):

      0 1. 0 0 1 (binary)  →  1.125 (decimal)
    + 0 0. 0 1 1 (binary)  →  0.375 (decimal)
      0 1. 1 0 0 (binary)  →  1.500 (decimal)

• It’s just a matter of representation! (I.e., implicit constant multiplicative coefficient)

Fixed Point Multiplication

• Multiplication typically introduces the need of arithmetic rescaling with shifts to the right (multiplicative constant cannot be implicit anymore) ⇒ Choice of accuracy depending on how many bits one can keep…

      0. 1 0 1 0 (binary)          →  0.625 (decimal)
    × 0. 0 1 1 0 (binary)          →  0.375 (decimal)
      0. 0 0 1 1 1 1 0 0 (binary)  →  0.234375 (decimal)

  Dropping low-order bits one at a time:

    keep 7 fractional bits:  0. 0 0 1 1 1 1 0  →  0.234375
    keep 6 fractional bits:  0. 0 0 1 1 1 1    →  0.234375
    keep 5 fractional bits:  0. 0 0 1 1 1      →  0.21875
    keep 4 fractional bits:  0. 0 0 1 1        →  0.1875
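
The two fixed-point slides can be replayed in a few lines of C. This is a minimal sketch under stated assumptions, not code from the lecture: the names `FRAC_BITS`, `fx_add` and `fx_mul_keep` are hypothetical, the operands carry 4 fractional bits as in the multiplication example, and only non-negative values are considered (right-shifting a negative signed value is implementation-defined in C).

```c
#include <stdint.h>

/* Assumed format: 4 fractional bits, so the real value is raw / 16.
   The factor 16 is the "implicit constant multiplicative coefficient". */
#define FRAC_BITS 4

/* Addition needs nothing special: the plain integer adder is enough. */
static int16_t fx_add(int16_t a, int16_t b)
{
    return (int16_t)(a + b);
}

/* The raw product carries 2 * FRAC_BITS fractional bits. 'keep' says how
   many of them survive (0 <= keep <= 2 * FRAC_BITS); the others are
   dropped by a truncating right shift. Non-negative operands assumed. */
static int32_t fx_mul_keep(int16_t a, int16_t b, int keep)
{
    int32_t wide = (int32_t)a * (int32_t)b;
    return wide >> (2 * FRAC_BITS - keep);
}
```

With the slide's operands, 0.1010 (raw 10) and 0.0110 (raw 6), the full product is raw 60 over 256, i.e. 0.234375; keeping 5 bits gives 7/32 = 0.21875 and keeping 4 bits gives 3/16 = 0.1875, matching the table.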

Different Approximation Choices

• Truncate: Discard bits → Large bias
  - 00.011 ⇒ 00 and 01.011 ⇒ 01
  - 00.100 ⇒ 00 and 01.100 ⇒ 01
  - 00.101 ⇒ 00 and 01.101 ⇒ 01
• Round: <.5 round down, >=.5 round up → Small bias
  - 00.011 ⇒ 00 and 01.011 ⇒ 01
  - 00.100 ⇒ 01 and 01.100 ⇒ 10
  - 00.101 ⇒ 01 and 01.101 ⇒ 10
• Convergent Round: <.5 round down, >.5 round up, =.5 round to nearest even → No bias
  - 00.011 ⇒ 00 and 01.011 ⇒ 01
  - 00.100 ⇒ 00 and 01.100 ⇒ 10
  - 00.101 ⇒ 01 and 01.101 ⇒ 10

Fixed-Point Programming Example

    /* an excerpt from adpcm.c */
    /* adpcm_coder, mediabench */

    /* Step 2 - Divide and clamp */
    /*
    ** This code *approximately* computes:
    **    delta = diff*4/step;
    **    vpdiff = (delta+0.5)*step/4;
    ** but in shift step bits are dropped. The net result of this is
    ** that even if you have fast mul/div hardware you cannot put it
    ** into good use since the fixup would be too expensive.
    */
    delta = 0; vpdiff = (step >> 3);

    if ( diff >= step ) { delta = 4; diff -= step; vpdiff += step; }
    step >>= 1;
    if ( diff >= step ) { delta |= 2; diff -= step; vpdiff += step; }
    step >>= 1;
    if ( diff >= step ) { delta |= 1; vpdiff += step; }

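The three approximation choices can be written as small C helpers. This is only an illustrative sketch: the function names are hypothetical, values are assumed non-negative, and `n` (the number of fractional bits dropped) is assumed to be at least 1.

```c
#include <stdint.h>

/* Truncate: simply discard the n low-order bits. */
static uint32_t fx_truncate(uint32_t v, unsigned n)
{
    return v >> n;
}

/* Round: add one half of the last kept bit, then discard. */
static uint32_t fx_round(uint32_t v, unsigned n)
{
    return (v + (1u << (n - 1))) >> n;
}

/* Convergent round: like fx_round, but an exact tie (.5) goes to the
   nearest even result, so ties do not always move in one direction. */
static uint32_t fx_round_convergent(uint32_t v, unsigned n)
{
    uint32_t half = 1u << (n - 1);
    uint32_t frac = v & ((1u << n) - 1u);   /* the bits being dropped */
    uint32_t q = v >> n;                    /* the bits being kept    */
    if (frac > half || (frac == half && (q & 1u)))
        q += 1u;
    return q;
}
```

With three fractional bits, 00.100 is raw 4 and 01.100 is raw 12: truncation gives 0 and 1, rounding gives 1 and 2, convergent rounding gives 0 and 2, as in the list above.
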
Fixed-Point Programming Example

    /* an excerpt from adpcm.c */
    /* adpcm_coder, mediabench */

    int index, delta;

    index += indexTable[delta];
    if (index < 0) index = 0;
    if (index > 88) index = 88;

• Other classic DSP-type of operation: accumulation with saturation
• Also, as in previous example, post multiplication shift

DSP Arithmetic Needs

• Rather than having full floating-point support (expensive and slow), one wants in a DSP some simple and fast ad-hoc operations:
  - MUL + ADD in a single cycle (MAC)
  - Accumulation register after MAC (precision?)
  - Approximation mechanisms
• Nonuniform precision in the whole architecture (e.g., 24bit x 24bit + 56bit)
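
Accumulation with saturation, which the adpcm excerpt spells out with two comparisons, might look as follows when written as reusable helpers. The names `sat_add32` and `clamp_int` are hypothetical, and the 32-bit accumulator width is an assumption chosen for illustration; a DSP does this in hardware as part of the accumulate.

```c
#include <stdint.h>

/* Add x to a 32-bit accumulator, clamping at the limits instead of
   wrapping around. The sum is formed in 64 bits so it cannot overflow. */
static int32_t sat_add32(int32_t acc, int32_t x)
{
    int64_t sum = (int64_t)acc + (int64_t)x;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Clamp v into [lo, hi]: the same pattern as the index update in the
   excerpt, where the bounds are 0 and 88. */
static int clamp_int(int v, int lo, int hi)
{
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}
```

On a processor without saturating arithmetic, each such clamp costs compare-and-branch instructions in the inner loop.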

Slower Clock Speed but 1-Cycle Multiply-Accumulate Instruction

• MAC operations tend to dominate DSP code (maybe 50% of critical code) ⇒ highly optimised MAC instruction

RISC:
• Typ. 2 cycles: MUL
• 1 cycle: ADD
• 1 cycle: SHR
• Some more cycles: Saturation, Rounding, etc.

DSP:
• 1 cycle: “rich” MAC

Example of Pipelined MAC Datapath

[Figure: datapath with Data Bus, TR, MUL, PR, shifter (>>), ALU, ACC/ACCU]

• Chained operations:
  - “Pipelined” MAC
• Many special registers:
  - Dedicated pipelining
  - Reduced pressure on general-purpose register file
  - Shorter instruction length (implicit operand addressing)
• Architecturally visible pipeline!

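To make the RISC cycle count concrete, here is one multiply-accumulate step written as the separate operations listed above. It is a sketch under assumptions not taken from the slides: operands are in a hypothetical Q15 format (15 fractional bits), the accumulator is 32 bits wide, and for negative products the `>>` is assumed to be an arithmetic shift, which is implementation-defined in C although mainstream compilers behave that way.

```c
#include <stdint.h>

/* One MAC step: acc + a * b, with rescaling and saturation. A "rich"
   DSP MAC instruction would fold all of these steps into one cycle. */
static int32_t mac_q15(int32_t acc, int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;   /* MUL: Q15 x Q15 -> Q30   */
    int32_t scaled = prod >> 15;              /* SHR: rescale to Q15     */
    int64_t sum = (int64_t)acc + scaled;      /* ADD: accumulate         */
    if (sum > INT32_MAX) sum = INT32_MAX;     /* saturation, upper bound */
    if (sum < INT32_MIN) sum = INT32_MIN;     /* saturation, lower bound */
    return (int32_t)sum;
}
```

For example, 0.5 times 0.5 in Q15 is 16384 times 16384, which rescales to 8192, i.e. 0.25.
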
Classic FIR Example

• Convolution:

$$ y(z) = \sum_{i=0}^{N-1} C_i \cdot x(z - i) $$

[Figure: tapped delay line. The input x(t) passes through a chain of Z^-1 delays; the taps are multiplied by the coefficients C0 to C3 and summed to give y(t)]

Memory Bandwidth

• The MAC instruction/unit is not enough…
• Goal: 1 tap (= MAC) per cycle

[Figure: a coefficient memory C: and a sample memory X: (holding X(t-2), X(t-1), X(t)) feed a multiplier and an adder with accumulator A]
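
The convolution sum maps directly onto a loop of multiply-accumulates. The sketch below assumes plain integer samples and a history array in which element 0 is the newest sample; the name `fir_tap_sum` and this layout are illustrative choices, not the lecture's.

```c
#include <stddef.h>

/* y = sum over i of c[i] * x_hist[i], where x_hist[i] holds x(t - i).
   Every iteration is one tap: one MAC plus two data reads (a coefficient
   and a sample), which is why memory bandwidth becomes the bottleneck. */
static long fir_tap_sum(const int *c, const int *x_hist, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += (long)c[i] * x_hist[i];
    return acc;
}
```

Sustaining one tap per cycle therefore needs two operand fetches per cycle on top of the instruction fetch.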

Multiple Memory Ports

• Harvard architecture:
  - Separate instruction memory
  - I-Memory at times accessible as another D-Memory (e.g., TI C2000) to spare memory ports
• Multiple data memories:
  - X-Memory
  - Y-Memory
  - Sometimes more…
• Multiple buses

[Figure. © TI]

RISC vs. DSP Organisation

RISC:
• Von Neumann (Harvard but hidden from the user)
• ~1 access/cycle
• Heavily relies on caches to achieve performance
• Complex blend of on-chip SRAM/SRAM/DRAM

DSP:
• Harvard (architecturally visible)
• 1-4 memory accesses per cycle
• No caches
• SRAM

Caches and DSPs

• Importance of real-time constraints: no data caches…
• Sometimes caches on the instruction memory, but determinism is key in DSPs:
  - Caches under programmer control to “lock-in” some critical instructions
  - Turn caches into fast program memory

Once again, one is not after highest performance but just the guaranteed minimal performance one needs

DSP vs. General Purpose Memory Systems

[Figure: DSP memory system with I-Mem, X-Mem and Y-Mem. Labels: Fast and small SRAM; Multiple D-Memories, Multiported D-Memory,…; No Virtual Memory (direct access to peripherals); Only I-Cache (if at all…)]

Example: Motorola DSP56600

[Figure. © Motorola]

Baseband Chip

[Figure. Courtesy of Motorola, © Motorola 2000]

Addressing Modes

• To keep MAC busy all the time, with new data from memory, one needs to generate memory addresses
• Forget about Load/Store architectures
• Complex addressing is now fully welcome if
  - Allows automatic next address calculation
  - Does not require usage of the datapath (MAC is busy…)

Explicit parallelism/pipelining:

       MPYF3 *AR0++%, *AR1++%, R0
    || ADDF3 R0, R2, R2

Address Generation Units

• Dedicated simple datapaths to generate meaningful sequences of addresses, usually 2-4 per DSP

[Figure: an Address Reg. Pointer selects one of AR0 to AR3 and a Modifier Reg. Pointer selects one of MR0 to MR3; the selected modifier, an Immediate Constant, or a small constant (1, 2, 3…) is added to or subtracted from (+/-) the address register, whose value goes To Memory]

Typical Addressing Modes

AR can be loaded with:
• Immediate load: constant from the instruction field loaded into the pointed AR
• Immediate modify: constant from the instruction field added to the pointed AR
• Autoincrement: small constant (typ. 1 and/or 2) added to the pointed AR
• Automodify: value of the pointed MR added to the pointed AR
• Bit Reversing: value of the pointed AR bit-reversed and loaded into the pointed AR
• Modulo/Circular: autoincrement/automodify with modulo
• Also decrement/subtract
• Sometimes pre- and/or post-modification

Radix-2 FFT

Bit-reversed sequence of 4-bit addresses:

    0000  1000  0100  1100
    0010  1010  0110  1110
    0001  1001  0101  1101
    0011  1011  0111  1111

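The bit-reversed sequence listed for the radix-2 FFT can be reproduced in software with a short loop, which gives an idea of the work that a bit-reversing address generator saves. The function name `bit_reverse` is hypothetical and the sketch assumes `bits` is at most 32.

```c
#include <stdint.h>

/* Reverse the lowest 'bits' bits of v; higher bits are ignored. */
static uint32_t bit_reverse(uint32_t v, unsigned bits)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < bits; ++i) {
        r = (r << 1) | (v & 1u);   /* move the lowest bit of v into r */
        v >>= 1;
    }
    return r;
}
```

Calling it with indices 0, 1, 2, 3 and `bits` = 4 yields 0000, 1000, 0100, 1100, the start of the sequence above.
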
Circular Buffers

• DSPs deal with continuous I/O flows, often organised in circular buffers

[Figure: circular buffer with Buffer Start, Buffer Size, Pointer and Increment]

• All DSPs generate “modulo” or “circular” addresses

Remove Control Bottlenecks

• Remember typical goal: FIR with MAC busy 100% of the time…
• DSP code made essentially of tight loops, often with a statically determined number of iterations (coefficients of a filter, etc.)
• How can one make the branches “cost nothing”?
  - Repeat instructions
  - Zero-overhead loops
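
In software, a circular address update is a modulo operation on the buffer index. The sketch below is an assumption-laden illustration (the name `circ_advance` is hypothetical, and `size` must be non-zero); on a DSP the address generation unit performs the wrap-around for free alongside the MAC.

```c
#include <stddef.h>

/* Advance a buffer index by 'step' positions, wrapping at 'size'. */
static size_t circ_advance(size_t idx, size_t step, size_t size)
{
    return (idx + step) % size;
}
```

With a buffer of 8 entries, advancing index 7 by 1 wraps to 0, so a stream of incoming samples can overwrite the oldest entry without any data being moved.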

Repeat/Loop Instructions

• For loops made of a single instruction:

       RPTS  N-1                     ; repeat next
       MPYF3 *AR0++%, *AR1++%, R0
    || ADDF3 R0, R2, R2

• Zero-overhead Loop instruction:
  - Configures the Program Control Unit to generate the appropriate next address depending on a condition (e.g., autodecrement of an AR)

DSP World Is Slowly Changing

• Need of a fast development turnaround
  ⇒ Compilers!
• In a sense DSPs have already the main features of VLIWs: explicit parallelism, static scheduling, no “dynamic” low predictability behaviour…
  ⇒ Convergence?

TI TMS320C64x

[Figure. Source: Microprocessor Report, © MPR 2000]

Infineon Carmel

[Figure. © Infineon 2000]

Sort of VLIW but not all possible instructions are available: only 2048 via Configurable Long Instruction Words with compact coding

Direct Carmel Translation for G.723.1 DC Filter

    repeat(Frame) block
    {
        a4 = *r0++ * *r1;
        a5 = (unsigned)a0l * *r1;
        a5 = (a5 >> 16) + a0h * *r1--;
        a0 = a4 + a5;
        *r4++ = round(a0);
    }

Annotations: zero-overhead loop; 5 cycles

© Infineon 2000

Optimal Carmel Code for G.723.1 DC Filter

    repeat(Frame) block
    {
        cliw dcf1(r0++)
        {
            a4 = *ma1 * ff1
            || a0 = a4 + a5
            || a5 = (unsigned)a0l * ff2
            || a1h = a0h;
        }
        cliw dcf2(r0, r4++)
        {
            a4 -= *ma1 * ff1
            || *ma2 = round(a0)
            || a5 = (a5 >> 16) + a1h * ff2;
        }
    }

Annotations: each cliw block is 1 CLIW™ instruction; 2 cycles

© Infineon 2000

Summary

• DSPs are very different from general-purpose computers
  - Dedicated to embedded applications
  - Cost and power consumption come into the picture (and cost is fundamental)
• Relatively narrow variety of applications
  - More application specialisation possible
• Development cost (programming) relatively irrelevant when compared to per-unit cost
  - The most awkward and hard-to-program solutions are ok if they bring enough savings
  - Compilers? Useful for 90% of the code, but the rest…
