Paolo.Ienne@epfl.ch
EPFL – I&C – LAP

Efficient:
  Cost-reduction must profit from specialisation
  Low power, small size,…
But it is changing!…
3 AdvCompArch — Digital Signal Processors © Ienne 2003-05 4 AdvCompArch — Digital Signal Processors © Ienne 2003-05
High-end Embedded Processors
Networking, wireless communication, printer and disk controllers, DVDs and video, digital photography, medical devices…
Computing power ever growing
  Sometimes products with 5-10 Digital Signal Processors on a single die (e.g., xDSL, VoIP)
  Multimedia, cryptographic capabilities, adaptive signal processing, etc.

Specificities of Embedded Processors
Cost used to be the only concern; now performance/cost is at premium and still not performance alone as in PCs (Intel model); performance is often a constraint
Binary compatibility is less of an issue for embedded systems
Systems-on-Chip make processor volume irrelevant (moderate motivation toward single processor for all products)
Cost and Performance
Cost Is Not Just the Processor…
Trends in Computing?
[Figure: chart with axes "Processor regularity" and "Processor complexity", locating VLIW, OOO SSC, RISC, SMT, CISC, and DSP]

Pressure on the Compilers
Performance
  Squeeze out every possible MIPS of performance from irregular architectures
Code Size
  Memory is a key cost factor in embedded systems, much more than in general purpose systems
Power Consumption
  Important metric in embedded systems, hardly of any relevance in general purpose computing (i.e., not even considered by compilers)
Typical Features of DSPs
Arithmetic and Datapath
  Fixed-point arithmetic support
  MAC = multiply-accumulate instruction
  Special registers, not directly accessible
Memory Architecture
  Harvard architecture
  Multiple data memories
Addressing Modes
  Special address generators
  Bit-reversed addressing
  Circular buffers
Optimised Control
  Zero-overhead loops
Special-purpose peripherals…

DSP Arithmetic: Fixed-Point Vs. Floating-Point
Typical example of embedded processor economics: much more complexity in designing the algorithm (NRE cost) and in programming to get much less complexity in the hardware (mfg. cost)
Floating-point DSP ~2-4x cost of Fixed-point DSP and much slower…
Very poor support in automatic tools yet ⇒ decisions taken by algorithm analysis, simulation, and compliance tests (e.g., accumulated error over a test set below some value)
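Bit-reversed addressing, listed among the addressing modes in Typical Features of DSPs, visits the elements of a power-of-two-sized array in the order obtained by reversing the bits of the index, which is the reordering a radix-2 FFT needs. A DSP address generator produces this sequence in hardware; the C sketch below only models the index calculation, and the function name `bit_reverse` is a hypothetical choice for illustration.

```c
#include <stdint.h>

/* Reverse the lowest `bits` bits of `index`.
 * Hypothetical helper: models in software what a bit-reversed
 * address generator would produce for a buffer of 2^bits elements. */
static uint32_t bit_reverse(uint32_t index, unsigned bits)
{
    uint32_t reversed = 0;
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u); /* move LSB of index in */
        index >>= 1;
    }
    return reversed;
}
```

For an 8-element buffer (3 index bits), indices 0 to 7 map to 0, 4, 2, 6, 1, 5, 3, 7.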
Fixed Point
In principle, if one adds a fractional point in a fixed position, hardware for integers works just as well and there are no additional ISA needs

  Integer view (bit weights 2^4 2^3 2^2 2^1 2^0):
      0 1 0 0 1₂ → 9₁₀
    + 0 0 0 1 1₂ → 3₁₀
      0 1 1 0 0₂ → 12₁₀

  Fixed-point view (bit weights 2^1 2^0 2^-1 2^-2 2^-3):
      0 1. 0 0 1₂ → 1.125₁₀
    + 0 0. 0 1 1₂ → 0.375₁₀
      0 1. 1 0 0₂ → 1.500₁₀

It’s just a matter of representation! (I.e., implicit constant multiplicative coefficient)

Fixed Point Multiplication
Multiplication typically introduces the need of arithmetic rescaling with shifts to the right (multiplicative constant cannot be implicit anymore) ⇒ Choice of accuracy depending on how many bits one can keep…

      0. 1 0 1 0₂ → 0.625₁₀
    × 0. 0 1 1 0₂ → 0.375₁₀
      0. 0 0 1 1 1 1 0 0₂ → 0.234375₁₀
  →         0. 0 0 1 1 1 1 0₂ → 0.234375₁₀
  → →       0. 0 0 1 1 1 1₂ → 0.234375₁₀
  → → →     0. 0 0 1 1 1₂ → 0.21875₁₀
  → → → →   0. 0 0 1 1₂ → 0.1875₁₀
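The rescaling in the multiplication example can be sketched in C. This is a minimal illustration, assuming unsigned operands with 4 fractional bits as in 0.1010₂ × 0.0110₂; the names `FRAC_BITS`, `fx_mul_truncate`, and `fx_to_double` are hypothetical and not taken from any DSP toolchain.

```c
#include <stdint.h>

#define FRAC_BITS 4  /* assumed operand format: 4 fractional bits */

/* Multiply two unsigned fixed-point values that each carry FRAC_BITS
 * fractional bits. The raw product carries 2*FRAC_BITS fractional bits;
 * `drop` low-order bits are then discarded (truncated) by a right shift,
 * leaving 2*FRAC_BITS - drop fractional bits in the result. */
static uint32_t fx_mul_truncate(uint32_t a, uint32_t b, unsigned drop)
{
    uint32_t product = a * b;  /* integer multiply, scale factor is squared */
    return product >> drop;    /* rescale: keep fewer fractional bits */
}

/* Interpret a raw value with `frac_bits` fractional bits as a real number. */
static double fx_to_double(uint32_t raw, unsigned frac_bits)
{
    return (double)raw / (double)(1u << frac_bits);
}
```

With a = 1010₂ (10) and b = 0110₂ (6), the raw product is 60, that is 0.234375 with 8 fractional bits. Dropping 3 bits leaves 7 with 5 fractional bits (0.21875), and dropping 4 bits leaves 3 with 4 fractional bits (0.1875), matching the last two rows of the example.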
Different Approximation Choices
Fraction = .5: round to nearest even → No bias
  00.011 ⇒ 00 and 01.011 ⇒ 01
  00.100 ⇒ 00 and 01.100 ⇒ 10
  00.101 ⇒ 01 and 01.101 ⇒ 10

Fixed-Point Programming Example
  if ( diff >= step ) { delta = 4; diff -= step; vpdiff += step; }
  step >>= 1;
  if ( diff >= step ) { delta |= 2; diff -= step; vpdiff += step; }
  step >>= 1;
  if ( diff >= step ) { delta |= 1; vpdiff += step; }
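The round-to-nearest-even choice listed under Different Approximation Choices can be sketched in C as follows. This is an illustrative model for unsigned values only; the function name `round_nearest_even` is hypothetical, and `frac_bits` is assumed to be smaller than 32.

```c
#include <stdint.h>

/* Drop the `frac_bits` low-order (fractional) bits of an unsigned
 * fixed-point value, rounding to the nearest integer. An exact .5
 * fraction is rounded toward the even neighbour, so ties go up half
 * of the time and down the other half (no systematic bias).
 * Assumes frac_bits < 32. */
static uint32_t round_nearest_even(uint32_t value, unsigned frac_bits)
{
    if (frac_bits == 0)
        return value;                        /* nothing to drop */

    uint32_t half     = 1u << (frac_bits - 1);   /* bit pattern of .5 */
    uint32_t mask     = (1u << frac_bits) - 1u;  /* selects the fraction */
    uint32_t integer  = value >> frac_bits;
    uint32_t fraction = value & mask;

    if (fraction > half || (fraction == half && (integer & 1u)))
        integer += 1;                        /* round up */
    return integer;
}
```

With 3 fractional bits, the six cases listed on the slide give 00.011 to 0, 01.011 to 1, 00.100 to 0, 01.100 to 2, 00.101 to 1, and 01.101 to 2.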
Fixed-Point Programming Example
DSP Arithmetic Needs
MAC operations tend to dominate DSP code (maybe 50% of critical code) ⇒ highly optimised MAC instruction
RISC:
  Typ. 2 cycles: MUL
  1 cycle: ADD
  1 cycle: SHR
  Some more cycles: Saturation, Rounding, etc.
DSP:
  1 cycle: “rich” MAC

Chained operations: “Pipelined” MAC
Many special registers:
  Dedicated pipelining
  Reduced pressure on general-purpose register file
  Shorter instruction length (implicit operand addressing)
Architecturally visible pipeline!
[Figure: MAC datapath; surviving labels: Data Bus, TR, MUL, PR, >>, ALU, ACC]
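The work that a single “rich” MAC folds together (multiply, shift, add, saturate) can be spelled out in C as a rough sketch. The Q15 operand format, the 32-bit accumulator width, and the names `saturate32` and `mac_q15` are assumptions made for illustration; real DSPs differ in accumulator width and rounding options.

```c
#include <stdint.h>

/* Clamp a wide intermediate to the 32-bit accumulator range instead of
 * letting it wrap around (saturation arithmetic). */
static int32_t saturate32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* One multiply-accumulate step on assumed Q15 operands:
 *   acc <- saturate(acc + ((a * b) >> 15))
 * Each statement corresponds to one of the separate RISC operations. */
static int32_t mac_q15(int32_t acc, int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;  /* MUL: Q30 product */
    /* SHR: back to Q15. Right-shifting a negative value is
     * implementation-defined in C; an arithmetic shift is assumed. */
    int32_t scaled = product >> 15;
    return saturate32((int64_t)acc + scaled);   /* ADD + saturation */
}
```

For example, `mac_q15(0, 16384, 16384)` multiplies 0.5 by 0.5 and returns 8192, which is 0.25 in Q15, while accumulating onto `INT32_MAX` stays clamped at `INT32_MAX`.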
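Classic FIR Example
[Figure: 4-tap FIR filter; x(t) feeds a chain of three Z^-1 delays, the delayed samples are multiplied by coefficients C0 to C3 and summed into y(t). Of the accompanying summation formula only the lower bound i = 0 survives.]

Memory Bandwidth
Goal: 1 tap (= MAC) per cycle
[Figure: surviving labels: X(t), X(t-1), X(t-2), x, +, A]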
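The FIR structure in the figure corresponds to one multiply-accumulate per tap. The sketch below assumes the standard FIR definition y(t) = c[0]·x(t) + c[1]·x(t-1) + … over `ntaps` coefficients; the function name `fir_step` and the array layout (`history[i]` holds x(t-i)) are hypothetical choices for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Compute one FIR output sample.
 * coeff[i]   : filter coefficient C_i
 * history[i] : input sample x(t - i), so history[0] is the newest sample
 * Each loop iteration is one tap, i.e. one MAC needing two operands
 * from memory (a coefficient and a sample). */
static int32_t fir_step(const int16_t *coeff, const int16_t *history,
                        size_t ntaps)
{
    int32_t acc = 0;
    for (size_t i = 0; i < ntaps; i++)
        acc += (int32_t)coeff[i] * history[i];
    return acc;
}
```

Reaching the goal of one tap per cycle means both operand fetches of an iteration have to complete in the same cycle as the MAC itself, which is where the memory bandwidth requirement comes from.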
[Figure: DSP memory organisation; surviving label: Y-Memory]
Sometimes more…
  Multiple buses
DSP:
  Harvard (architecturally visible)
  1-4 memory accesses per cycle
  No caches
  SRAM
Caches and DSPs
Importance of real-time constraints: no data caches…
Sometimes caches on the instruction memory, but determinism is key in DSPs:
  Caches under programmer control to “lock-in” some critical instructions
  Turn caches into fast program memory
Once again, one is not after highest performance but just the guaranteed minimal performance one needs

DSP vs. General Purpose Memory Systems
Fast and small SRAM, No Virtual Memory (direct access to peripherals)
Multiple D-Memories, Multiported D-Memory,…
Only I-Cache (if at all…)
[Figure: surviving labels: I-Mem, X-Mem, Y-Mem]
Example
Motorola DSP56600
Baseband Chip
Addressing Modes
To keep MAC busy all the time, with new data from memory, one needs to generate memory addresses
Forget about Load/Store architectures
Complex addressing is now fully welcome if
  Allows automatic next address calculation
  Does not require usage of the datapath (MAC is busy…)

Address Generation Units
Dedicated simple datapaths to generate meaningful sequences of addresses, usually 2-4 per DSP
[Figure: address generation unit; surviving labels: Address Reg. (Pointer) AR0 to AR3, Immediate (Constant) 1,2,3…, Modifier Reg. (Pointer) MR0 to MR3, +/-]

Explicit parallelism/pipelining
  MPYF3 *AR0++%, *AR1++%, R0
  || ADDF3 R0, R2, R2
DSPs deal with continuous I/O flows, often organised in circular buffers
[Figure: circular buffer; surviving labels: Buffer Start, Buffer Size, Pointer, Increment]
All DSPs generate “modulo” or “circular” addresses

Remember typical goal: FIR with MAC busy 100% of the time…
DSP code made essentially of tight loops, often with a statically determined number of iterations (coefficients of a filter, etc.)
How can one make the branches “cost nothing”?
  Repeat instructions
  Zero-overhead loops
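The “modulo” or “circular” address generation mentioned for circular buffers can be modelled in a few lines of C. This is a software sketch of what an address generation unit does in hardware; the struct `circ_addr` and function `circ_next` are hypothetical names, with fields chosen to mirror the Buffer Start, Buffer Size, Pointer, and Increment labels.

```c
#include <stddef.h>

/* Hypothetical model of a circular address register. */
typedef struct {
    size_t start;    /* Buffer Start: first address of the buffer */
    size_t size;     /* Buffer Size: number of elements, must be > 0 */
    size_t pointer;  /* Pointer: current address, within the buffer */
} circ_addr;

/* Return the current address, then post-modify the pointer by
 * `increment` (which may be negative), wrapping around inside
 * [start, start + size). */
static size_t circ_next(circ_addr *c, long increment)
{
    size_t current = c->pointer;
    long size   = (long)c->size;
    long offset = (long)(c->pointer - c->start) + increment;

    offset %= size;          /* C's % keeps the sign of the dividend */
    if (offset < 0)
        offset += size;      /* wrap negative offsets back into range */

    c->pointer = c->start + (size_t)offset;
    return current;
}
```

With a 4-element buffer starting at address 100 and the pointer at 102, successive `circ_next(&c, 1)` calls return 102 and 103, after which the pointer wraps back to 100 without any compare-and-branch in the program.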
Zero-overhead Loop instruction:
  Configures the Program Control Unit to generate the appropriate next address depending on a condition (e.g., autodecrement of an AR)

In a sense DSPs already have the main features of VLIWs: explicit parallelism, static scheduling, no “dynamic” low predictability behaviour…
⇒ Convergence?
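The zero-overhead loop mechanism described above can be illustrated with a small C model of next-address generation. It is only a sketch under assumed register names (`loop_start`, `loop_end`, `loop_count`) and an assumed one-address-per-instruction encoding; actual DSPs expose this through their own repeat or loop instructions.

```c
#include <stdint.h>

/* Hypothetical loop registers, set up once by a loop instruction. */
typedef struct {
    uint32_t loop_start;  /* address of the first instruction of the body */
    uint32_t loop_end;    /* address of the last instruction of the body */
    uint32_t loop_count;  /* iterations left, including the current one */
} loop_regs;

/* Model of the Program Control Unit choosing the next fetch address.
 * The comparison and counter update happen alongside the fetch, so the
 * program contains no decrement, compare, or branch instructions. */
static uint32_t next_pc(loop_regs *lr, uint32_t pc)
{
    if (pc == lr->loop_end && lr->loop_count > 1) {
        lr->loop_count--;       /* autodecrement the iteration counter */
        return lr->loop_start;  /* wrap back to the top of the body */
    }
    if (pc == lr->loop_end && lr->loop_count == 1)
        lr->loop_count = 0;     /* last iteration done: loop is finished */
    return pc + 1;              /* ordinary sequential fetch */
}
```

With a two-instruction body at addresses 10 and 11 and a count of 3, repeatedly applying `next_pc` from address 10 executes the body three times and then falls through to address 12.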
TI TMS320C64x

Infineon Carmel
© Infineon 2000
Sort of VLIW but not all possible instructions are available: only 2048 via Configurable Long Instruction Words with compact coding
© Infineon 2000
[Code figure: every token was duplicated in extraction and two columns were interleaved; the recoverable statements are listed below in their original order.]

Sequential statements:
  a5 = (unsigned)a0l * *r1;
  a5 = (a5 >> 16) + a0h * *r1--;
  a0 = a4 + a5;
  *r4++ = round(a0);

Parallel fragments (operands partly lost):
  ff1
  || a0 = a4 + a5
  || a5 = (unsigned)a0l * ff2
  || a1h = a0h;

CLIW definition:
  cliw dcf2(r0, r4++)
  {
      a4 -= *ma1 * ff1
      || *ma2 = round(a0)
      || a5 = (a5 >> 16) + a1h * ff2;
  }

Cycle counts given on the slides: 5 cycles; 1 CLIW™ instruction, 2 cycles
Summary