Digital Signal Processors

Advanced Computer Architecture
Part II: Embedded Computing
Digital Signal Processors
Paolo.Ienne@epfl.ch
EPFL – I&C – LAP

Embedded Systems
Application-Specific:
  Application fixed in advance
  Not, or only very moderately, programmable by the user
Reactive:
  Reacts to events coming from the environment
  Has real-time constraints
Efficient:
  Cost reduction must profit from specialisation
  Low power, small size,…
2 AdvCompArch — Digital Signal Processors © Ienne 2003-05

Embedded Processors
[Figure. Source: Hennessy & Patterson, © MK 2005]
Until recently, embedded processors were almost always simple
lowest-cost devices (8-bit microcontrollers, etc.)
But it is changing!…

Processor Sales Per Architecture
[Figure: processor sales per architecture. Source: Hennessy & Patterson, © MK 2005]

High-end Embedded Processors
Networking, wireless communication, printer and disk controllers, DVDs and video, digital photography, medical devices…
Computing power ever growing
  Sometimes products with 5-10 Digital Signal Processors on a single die (e.g., xDSL, VoIP)
  Multimedia, cryptographic capabilities, adaptive signal processing, etc.

Specificities of Embedded Processors
Cost used to be the only concern; now performance/cost is at a premium, but still not performance alone as in PCs (Intel model); performance is often a constraint
Binary compatibility is less of an issue for embedded systems
Systems-on-Chip make processor volume irrelevant (moderate motivation toward a single processor for all products)

General Purpose Processors: Cost and Pricing Policy
[Figure: Intel list prices vs. time. Source: Microprocessor Report, © MPR 2002]

Cost and Performance
Performance is a design constraint
  General Purpose: new high-end processor must be tangibly faster than best in class
  Embedded: adequately programmed, must just satisfy some minimal performance constraints
Cost is the optimisation criterion
  General Purpose: must be within some typical range (e.g., 50-120 USD) → profit margin can be as high as some factor (2-3x)
  Embedded: must be minimal → economic margin on the whole product can be as low as a few percentage points

Cost Is Not Just the Processor…
Processors have tangible induced costs
Some could require:
  Larger memories
  More expensive memories (e.g., dual-port)
  Caches (I and/or D)
  Peripherals and accelerators
  Faster clock rate
  …
System cost can be extremely influenced by the architecture of the processor

Types of Embedded Processors
Microcontrollers
  Relatively slow, microprogrammed, CISC processors
  Typically derivatives of old or very old general purpose processor families (68k, 8051, 6502, etc.)
RISC Processors
  Pipelined, relatively simple RISCs, often with special architectural features for the embedded market
  Typical representatives: ARM7 and ARM9
Digital Signal Processors
  Special family of processors with peculiar architectures for arithmetic intensive, signal processing applications
  Typical representatives: TI C620, DSP56k, etc.
Multimedia Processors…

Completely Different Benchmarks
General Purpose (Word, Powerpoint, gcc,…)
  SPEC → Commercial
    Scientific computing
    Regular and irregular typical user applications…
DSPs (IIR, FIR, FFT,…)
  DSPstone (Aachen Uni) → Academic
  EEMBC (pronounced “embassy”) → Commercial
    IIR, FIR, IDCT, FFT, IFFT, PWM,…
    Matrix arithmetic, bit manipulation, table lookup, interpolation,…
    Jpeg, RGB-to-CMYK, RGB-to-YIQ, Bezier curves, rotations,…
    Viterbi, autocorrelation, convolutional encoders,…

Trends in Computing?
[Figure: processor regularity vs. processor complexity, locating CISC, RISC, OOO SSC, SMT, VLIW, and DSP]

Pressure on the Compilers
Performance
  Squeeze out every possible MIPS of performance from irregular architectures
Code Size
  Memory is a key cost factor in embedded systems, much more than in general purpose systems
Power Consumption
  Important metric in embedded systems, hardly of any relevance in general purpose computing (i.e., not even considered by compilers)

Typical Features of DSPs
Arithmetic and Datapath
  Fixed-point arithmetic support
  MAC = multiply-accumulate instruction
  Special registers, not directly accessible
Memory Architecture
  Harvard architecture
  Multiple data memories
Addressing Modes
  Special address generators
  Bit-reversed addressing
  Circular buffers
Optimised Control
  Zero-overhead loops
Special-purpose peripherals…

DSP Arithmetic: Fixed-Point Vs. Floating-Point
Typical example of embedded processor economics: much more complexity in designing the algorithm (NRE cost) and in programming to get much less complexity in the hardware (mfg. cost)
Floating-point DSP ~2-4x cost of Fixed-point DSP and much slower…
Very poor support in automatic tools yet ⇒ decisions taken by algorithm analysis, simulation, and compliance tests (e.g., accumulated error over a test set below some value)

Fixed Point
In principle, if one adds a fractional point in a fixed position, hardware for integers works just as well and there are no additional ISA needs

  Integer view (bit weights 2^4 2^3 2^2 2^1 2^0):
      01001₂ →  9₁₀
    + 00011₂ →  3₁₀
    = 01100₂ → 12₁₀

  Fixed-point view (bit weights 2^1 2^0 2^-1 2^-2 2^-3):
      01.001₂ → 1.125₁₀
    + 00.011₂ → 0.375₁₀
    = 01.100₂ → 1.500₁₀

It’s just a matter of representation! (I.e., implicit constant multiplicative coefficient)

Fixed Point Multiplication
Multiplication typically introduces the need of arithmetic rescaling with shifts to the right (multiplicative constant cannot be implicit anymore) ⇒ Choice of accuracy depending on how many bits one can keep…

          0.1010₂     → 0.625₁₀
        × 0.0110₂     → 0.375₁₀
        = 0.00111100₂ → 0.234375₁₀
  →       0.0011110₂  → 0.234375₁₀
  → →     0.001111₂   → 0.234375₁₀
  → → →   0.00111₂    → 0.21875₁₀
  → → → → 0.0011₂     → 0.1875₁₀
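
The multiplication example above can be reproduced with plain integer hardware. The following is a minimal sketch, not taken from the slides: the helper names `fx_mul_trunc` and `fx_to_double` are hypothetical, and the values are assumed to be unsigned with the same number of fractional bits in both operands.

```c
#include <stdint.h>

/* Hypothetical helper: multiply two unsigned fixed-point values that each
 * carry `frac_bits` fractional bits, then keep only `keep_bits` fractional
 * bits of the product by truncation (keep_bits <= 2 * frac_bits). */
uint32_t fx_mul_trunc(uint32_t a, uint32_t b,
                      unsigned frac_bits, unsigned keep_bits)
{
    uint32_t wide = a * b;  /* product carries 2 * frac_bits fractional bits */
    return wide >> (2u * frac_bits - keep_bits);  /* rescale with a right shift */
}

/* Hypothetical helper: convert a raw fixed-point value to a double,
 * for inspection only (a DSP would never do this at run time). */
double fx_to_double(uint32_t raw, unsigned frac_bits)
{
    return (double)raw / (double)(1u << frac_bits);
}
```

With the slide's operands, 0.1010₂ is the raw integer 10 and 0.0110₂ is the raw integer 6, both with 4 fractional bits. `fx_mul_trunc(10, 6, 4, 8)` keeps the full product, while `fx_mul_trunc(10, 6, 4, 4)` keeps only 4 fractional bits and loses accuracy, as in the last row of the example.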

Different Approximation Choices
Truncate: Discard bits → Large bias
  00.011 ⇒ 00 and 01.011 ⇒ 01
  00.100 ⇒ 00 and 01.100 ⇒ 01
  00.101 ⇒ 00 and 01.101 ⇒ 01
Round: <.5 round down, >=.5 round up → Small bias
  00.011 ⇒ 00 and 01.011 ⇒ 01
  00.100 ⇒ 01 and 01.100 ⇒ 10
  00.101 ⇒ 01 and 01.101 ⇒ 10
Convergent Round: <.5 round down, >.5 round up, =.5 round to nearest even → No bias
  00.011 ⇒ 00 and 01.011 ⇒ 01
  00.100 ⇒ 00 and 01.100 ⇒ 10
  00.101 ⇒ 01 and 01.101 ⇒ 10

Fixed-Point Programming Example
  /* an excerpt from adpcm.c */
  /* adpcm_coder, mediabench */
  /* Step 2 - Divide and clamp */
  /*
  ** This code *approximately* computes:
  **   delta = diff*4/step;
  **   vpdiff = (delta+0.5)*step/4;
  ** but in shift step bits are dropped. The net result of this is
  ** that even if you have fast mul/div hardware you cannot put it
  ** into good use since the fixup would be too expensive.
  */
  delta = 0; vpdiff = (step >> 3);
  if ( diff >= step ) { delta = 4; diff -= step; vpdiff += step; }
  step >>= 1;
  if ( diff >= step ) { delta |= 2; diff -= step; vpdiff += step; }
  step >>= 1;
  if ( diff >= step ) { delta |= 1; vpdiff += step; }
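
The three approximation choices listed above (truncate, round, convergent round) can be sketched in C as follows. This is an illustrative sketch only: the function names are hypothetical, values are assumed unsigned, and `n` (the number of fractional bits dropped) is assumed to be at least 1.

```c
#include <stdint.h>

/* Truncate: simply discard the n low-order (fractional) bits. */
uint32_t fx_truncate(uint32_t v, unsigned n)
{
    return v >> n;
}

/* Round: add one half (in units of the dropped bits), then truncate,
 * so that fractions >= .5 round up. */
uint32_t fx_round(uint32_t v, unsigned n)
{
    return (v + (1u << (n - 1))) >> n;
}

/* Convergent round: like round, but an exact .5 goes to the nearest
 * even result, which removes the systematic upward bias. */
uint32_t fx_round_convergent(uint32_t v, unsigned n)
{
    uint32_t half = 1u << (n - 1);
    uint32_t frac = v & ((1u << n) - 1u);  /* the bits about to be dropped */
    uint32_t q = v >> n;                   /* truncated result */
    if (frac > half || (frac == half && (q & 1u)))
        q++;
    return q;
}
```

Using 3 fractional bits as in the tables, 00.100₂ is the raw value 4 and 01.100₂ is the raw value 12: truncation gives 0 and 1, rounding gives 1 and 2, and convergent rounding gives 0 and 2.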

Fixed-Point Programming Example
  /* an excerpt from adpcm.c */
  /* adpcm_coder, mediabench */
  int index, delta;
  …
  index += indexTable[delta];
  if (index < 0) index = 0;
  if (index > 88) index = 88;
Other classic DSP-type of operation: accumulation with saturation
Also, as in previous example, post multiplication shift

DSP Arithmetic Needs
Rather than having full floating-point support (expensive and slow), one wants in a DSP some simple and fast ad-hoc operations:
  MUL + ADD in a single cycle (MAC)
  Accumulation register after MAC (precision?)
  Approximation mechanisms
  Nonuniform precision in the whole architecture (e.g., 24bit x 24bit + 56bit)
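
The index clamp in the ADPCM excerpt above is a software form of accumulation with saturation. As a rough sketch of what a DSP does in hardware, the hypothetical helper below saturates a 16-bit addition instead of letting it wrap around; the name `sat_add16` and the 16-bit width are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: 16-bit addition that clamps to the representable
 * range instead of wrapping around on overflow. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;  /* compute in a wider type */
    if (sum > INT16_MAX) return INT16_MAX;  /* clamp positive overflow */
    if (sum < INT16_MIN) return INT16_MIN;  /* clamp negative overflow */
    return (int16_t)sum;
}
```

On a general-purpose processor this costs an add plus two compare-and-branch steps; a DSP typically folds the clamp into the add itself.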

Slower Clock Speed but 1-Cycle Multiply-Accumulate Instruction
MAC operations tend to dominate DSP code (maybe 50% of critical code) ⇒ highly optimised MAC instruction
RISC:
  Typ. 2 cycles: MUL
  1 cycle: ADD
  1 cycle: SHR
  Some more cycles: Saturation, Rounding, etc.
DSP:
  1 cycle: “rich” MAC

Example of Pipelined MAC Datapath
[Figure: pipelined MAC datapath. Labels: Data Bus, TR, MUL, PR, >>, ALU, ACC/ACCU]
Chained operations: “Pipelined” MAC
Many special registers:
  Dedicated pipelining
  Reduced pressure on general-purpose register file
  Shorter instruction length (implicit operand addressing)
Architecturally visible pipeline!
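
To make the RISC vs. DSP cycle count above concrete, here is a sketch of the work that a single "rich" MAC folds into one instruction. The Q15 operand format, the 32-bit accumulator, and the name `rich_mac_q15` are assumptions for illustration, not details from the slides; the right shift of a negative product relies on arithmetic shifting, which is implementation-defined in C.

```c
#include <stdint.h>

/* One "rich" MAC step written out as the separate operations a simple
 * RISC would issue. Operands are assumed to be Q15 (15 fractional bits). */
int32_t rich_mac_q15(int32_t acc, int16_t x, int16_t c)
{
    int32_t prod = (int32_t)x * (int32_t)c;  /* MUL: Q15 x Q15 gives Q30 */
    int32_t scaled = prod >> 15;             /* SHR: post-multiplication shift
                                                (assumes arithmetic shift) */
    int64_t sum = (int64_t)acc + scaled;     /* ADD: accumulate */
    if (sum > INT32_MAX) sum = INT32_MAX;    /* saturation, upper bound */
    if (sum < INT32_MIN) sum = INT32_MIN;    /* saturation, lower bound */
    return (int32_t)sum;
}
```

For example, 0.5 in Q15 is the raw value 16384, so `rich_mac_q15(0, 16384, 16384)` accumulates 0.25 as the raw value 8192.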
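
Classic FIR Example
Convolution:
  $y(z) = \sum_{i=0}^{N-1} C_i \cdot x(z - i)$
[Figure: tapped delay line; x(t) passes through three Z^{-1} delays, the taps are multiplied by C0, C1, C2, C3 and summed to give y(t)]

Memory Bandwidth
The MAC instruction/unit is not enough…
[Figure: coefficient memory C: and sample memory X: (holding X(t-2), X(t-1), X(t)) feeding a multiplier and an adder into accumulator A]
Goal: 1 tap (= MAC) per cycle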
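
The convolution above maps directly onto a loop with one multiply-accumulate per tap, and each tap needs two memory reads (one coefficient, one sample). A minimal sketch follows, assuming integer samples and a caller that guarantees `t >= n_taps - 1`; the function name `fir_output` is hypothetical.

```c
#include <stddef.h>

/* Compute one FIR output sample: y[t] = sum over i of c[i] * x[t - i].
 * The caller must ensure t >= n_taps - 1 so that x[t - i] stays in range. */
long fir_output(const int *c, size_t n_taps, const int *x, size_t t)
{
    long acc = 0;
    for (size_t i = 0; i < n_taps; i++) {
        /* One tap: fetch c[i] and x[t - i], then multiply-accumulate. */
        acc += (long)c[i] * (long)x[t - i];
    }
    return acc;
}
```

The two loads per iteration are why the MAC unit alone is not enough: to sustain one tap per cycle, the memory system has to deliver a coefficient and a sample in the same cycle.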

Multiple Memory Ports
Harvard architecture:
  Separate instruction memory
  I-Memory at times accessible as another D-Memory (e.g., TI C2000) to spare memory ports
Multiple data memories:
  X-Memory
  Y-Memory
  Sometimes more…
Multiple buses
[Figure. © TI]

RISC vs. DSP Organisation
RISC:
  Von Neumann (Harvard but hidden from the user)
  ~1 access/cycle
  Heavily relies on caches to achieve performance
  Complex blend of on-chip SRAM/SRAM/DRAM
DSP:
  Harvard (architecturally visible)
  1-4 memory accesses per cycle
  No caches
  SRAM

Caches and DSPs
Importance of real-time constraints: no data caches…
Sometimes caches on the instruction memory, but determinism is key in DSPs:
  Caches under programmer control to “lock-in” some critical instructions
  Turn caches into fast program memory
Once again, one is not after highest performance but just the guaranteed minimal performance one needs

DSP vs. General Purpose Memory Systems
[Figure: DSP memory system with I-Mem, X-Mem, and Y-Mem]
Fast and small SRAM, Multiple D-Memories, Multiported D-Memory,…
No Virtual Memory (direct access to peripherals)
Only I-Cache (if at all…)

Example: Motorola DSP56600 Baseband Chip
[Figure. Courtesy of Motorola, © Motorola 2000]

Addressing Modes
To keep MAC busy all the time, with new data from memory, one needs to generate memory addresses
Forget about Load/Store architectures
Complex addressing is now fully welcome if
  Allows automatic next address calculation
  Does not require usage of the datapath (MAC is busy…)
Explicit parallelism/pipelining
     MPYF3 *AR0++%, *AR1++%, R0
  || ADDF3 R0, R2, R2

Address Generation Units
Dedicated simple datapaths to generate meaningful sequences of addresses, usually 2-4 per DSP
[Figure: address generation unit. Labels: Address Reg. Pointer, Immediate Constant (1, 2, 3…), Modifier Reg. Pointer, AR0-AR3, MR0-MR3, +/-, To Memory]

Typical Addressing Modes
AR can be loaded with:
  Immediate load: constant from the instruction field loaded into the pointed AR
  Immediate modify: constant from the instruction field added to the pointed AR
  Autoincrement: small constant (typ. 1 and/or 2) added to the pointed AR
  Automodify: value of the pointed MR added to the pointed AR
  Bit Reversing: value of the pointed AR bit-reversed and loaded into the pointed AR
  Modulo/Circular: autoincrement/automodify with modulo
Also decrement/subtract
Sometimes pre- and/or post-modification

Radix-2 FFT
[Figure: radix-2 FFT with 4-bit indices in bit-reversed order]
0000, 1000, 0100, 1100, 0010, 1010, 0110, 1110, 0001, 1001, 0101, 1101, 0011, 1011, 0111, 1111
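
The index sequence listed for the radix-2 FFT is what bit-reversed addressing produces when a counter runs 0, 1, 2, … and its bits are mirrored. In software this takes a loop per address, which is the cost the dedicated addressing mode removes. A minimal sketch, with the hypothetical helper name `bit_reverse`:

```c
/* Reverse the low `bits` bits of `index`
 * (e.g. with bits = 4, 0001 becomes 1000). */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    unsigned reversed = 0;
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u);  /* move the lowest bit over */
        index >>= 1;
    }
    return reversed;
}
```

Calling `bit_reverse(k, 4)` for k = 0, 1, 2, 3 gives 0, 8, 4, 12, which are the first four entries of the sequence above (0000, 1000, 0100, 1100).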

Circular Buffers
DSPs deal with continuous I/O flows, often organised in circular buffers
[Figure: circular buffer. Labels: Buffer Start, Buffer Size, Pointer, Increment]
All DSPs generate “modulo” or “circular” addresses

Remove Control Bottlenecks
Remember typical goal: FIR with MAC busy 100% of the time…
DSP code made essentially of tight loops, often with a statically determined number of iterations (coefficients of a filter, etc.)
How can one make the branches “cost nothing”?
  Repeat instructions
  Zero-overhead loops
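
The "modulo" address update that a DSP address generator performs for free looks like this when written out in software. This is a sketch under the assumption that the pointer is kept as an offset from the buffer start; the name `circ_next` is hypothetical.

```c
#include <stddef.h>

/* Advance an offset inside a circular buffer of `size` elements,
 * wrapping back to the buffer start when the end is passed. */
size_t circ_next(size_t offset, size_t increment, size_t size)
{
    return (offset + increment) % size;
}
```

With a buffer of 8 samples, `circ_next(7, 1, 8)` wraps to 0, so new input samples overwrite the oldest ones without any explicit end-of-buffer test in the loop body.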

Repeat/Loop Instructions
For loops made of a single instruction:
     RPTS N-1 ; repeat next
     MPYF3 *AR0++%, *AR1++%, R0
  || ADDF3 R0, R2, R2
Zero-overhead Loop instruction:
  Configures the Program Control Unit to generate the appropriate next address depending on a condition (e.g., autodecrement of an AR)

DSP World Is Slowly Changing
Need of a fast development turnaround
  ⇒ Compilers!
In a sense DSPs have already the main features of VLIWs: explicit parallelism, static scheduling, no “dynamic” low predictability behaviour…
  ⇒ Convergence?

TI TMS320C64x
[Figure. Source: Microprocessor Report, © MPR 2000]

Infineon Carmel
[Figure. © Infineon 2000]
Sort of VLIW but not all possible instructions are available: only 2048 via Configurable Long Instruction Words with compact coding

Direct Carmel Translation for G.723.1 DC Filter (zero-overhead loop, 5 cycles; © Infineon 2000)
  repeat(Frame) block
  {
    a4 = *r0++ * *r1;
    a5 = (unsigned)a0l * *r1;
    a5 = (a5 >> 16) + a0h * *r1--;
    a0 = a4 + a5;
    *r4++ = round(a0);
  }

Optimal Carmel Code for G.723.1 DC Filter (two CLIW™ instructions, 2 cycles; © Infineon 2000)
  repeat(Frame) block
  {
    cliw dcf1(r0++)
    {
         a4 = *ma1 * ff1
      || a0 = a4 + a5
      || a5 = (unsigned)a0l * ff2
      || a1h = a0h;
    }
    cliw dcf2(r0, r4++)
    {
         a4 -= *ma1 * ff1
      || *ma2 = round(a0)
      || a5 = (a5 >> 16) + a1h * ff2;
    }
  }

Summary
DSPs are very different from general-purpose computers
  Dedicated to embedded applications
    Cost and power consumption come into the picture (and cost is fundamental)
  Relatively narrow variety of applications
    More application specialisation possible
  Development cost (programming) relatively irrelevant when compared to per-unit cost
    The most awkward and hard-to-program solutions are ok if they bring enough savings
  Compilers? Useful for 90% of the code, but the rest…
