
Digital Filtering in Hardware
Adnan Aziz

Slide 1
Introduction
• Digital filtering vs. analog filtering
– More robust (process variations, temperature), more flexible (bit precision, programmability), can store & recover
– Lower performance (especially at high frequencies), more area/power, cannot sense, needs data converters
• Digital filtering can be performed in hardware or software
– Software (DSP/generic microprocessors): flexible, less up-front cost
– Hardware (ASIC/FPGA): customized, cheaper in volume, lower area/power

Slide 2
Applications
• Applications: noise filtering, equalization, image processing, seismology, radar, ECC, audio/image compression
• Focus: implementing difference equations
– No feedback: FIR; feedback: IIR
– Assume coefficient synthesis is already done
– Operate almost exclusively in the time domain (FFT already covered)

Slide 3
Evolution

Slide 4
Various Representations
• 3-tap FIR: y(n) = a·x(n) + b·x(n−1) + c·x(n−2)
• Non-terminating: repeatedly execute the same code
– Iteration: execution of all operations once; iteration period: time to perform one iteration; iteration rate: inverse of the iteration period
– Sampling rate (aka throughput): number of samples per second; critical path: maximum combinational delay (no wave pipelining!)
• Block diagram
– Close to actual hardware: interconnected functional blocks, potentially with delay elements between blocks
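The 3-tap FIR difference equation above maps directly to code. A minimal sketch, with hypothetical coefficient and input values and zero initial conditions assumed:

```python
# 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2), zero initial state.
def fir3(x, a, b, c):
    y = []
    for n in range(len(x)):
        xn1 = x[n - 1] if n >= 1 else 0  # x(n-1), zero before time 0
        xn2 = x[n - 2] if n >= 2 else 0  # x(n-2)
        y.append(a * x[n] + b * xn1 + c * xn2)
    return y

print(fir3([1, 0, 0, 0], 2, 3, 4))  # impulse response: [2, 3, 4, 0]
```

Applying an impulse recovers the coefficients, as expected for an FIR filter.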

Slide 5
Block Diagram

Slide 6
Block Diagram

Slide 7
Signal Flow Graph
• Unique source and sink (input and output)
– Edges represent constant multipliers and delays
– Nodes represent I/O, adders, multipliers
– Useful for wordlength effects, less so for architecture design

Slide 8
SFG

Slide 9
Dataflow Graph
• DFG
– Nodes: computations (functions, subtasks)
– Edges: datapaths
• Captures the data-driven nature of DSP: intra-iteration and inter-iteration constraints
• Very general: nonlinear, multirate, asynchronous, synchronous
• Difference from block diagram:
– Hardware is not allocated or scheduled in a DFG

Slide 10
DFG

Slide 11
DFG

Slide 12
Multirate DFG

Slide 13
Iteration Bound
• In a DFG, execute each node once per iteration
– All nodes executed once: one iteration
• Critical path: combinational path with maximum total execution time (note: we reserve the term delay for sequential delay)
• Loop (= cycle): path beginning and ending at the same node
– Loop bound for loop L: T_L / W_L (total execution time on the loop / number of delays on it)
• Iteration bound: maximum of all loop bounds
– Lower bound on the iteration period for the DFG (assuming only pipelining, retiming, unfolding)
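As a concrete sketch (with hypothetical execution times: adds take 2 time units, multiplies 4), the iteration bound is the maximum loop bound over an explicit list of loops:

```python
# Iteration bound = max over loops L of T_L / W_L
# (T_L = total node execution time on the loop, W_L = delays on it).
exec_time = {"A1": 2, "M1": 4, "A2": 2}  # hypothetical node execution times

# Each loop given as (nodes on the loop, number of delays on the loop).
loops = [
    (["A1", "M1"], 1),        # T = 6, W = 1 -> loop bound 6
    (["A1", "M1", "A2"], 2),  # T = 8, W = 2 -> loop bound 4
]

def loop_bound(nodes, delays):
    return sum(exec_time[v] for v in nodes) / delays

iteration_bound = max(loop_bound(nodes, w) for nodes, w in loops)
print(iteration_bound)  # 6.0
```

A real implementation would enumerate the simple cycles of the DFG rather than list them by hand.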
Slide 14
Iteration Bound

Slide 15
Iteration Bound

Slide 16
Iteration Bound

Slide 17
2.3

Slide 18
2.4

Slide 19
2.5

Slide 20
2.6

Slide 21
2.7

Slide 22
Pipeline and Parallelize
• Pipelining: insert delay elements to reduce critical path length
– Faster (more throughput), lower power
– Added latency, latches/clocking
• Parallelism: compute multiple outputs in a single clock cycle
– Faster, lower power
– Added hardware, sequencing logic

Slide 23
Pipelining
• General: applicable to microprocessor architectures, logic circuits, DFGs
– Delays (= flops) have to be placed carefully
– On "feed-forward" cutsets
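For the 3-tap FIR, placing flops on the feed-forward cutset between the multipliers and the adders can be sketched as below (hypothetical coefficients); the pipelined version computes the same outputs, one cycle later:

```python
def fir3_direct(x, a, b, c):
    return [a * x[n]
            + b * (x[n - 1] if n >= 1 else 0)
            + c * (x[n - 2] if n >= 2 else 0)
            for n in range(len(x))]

def fir3_pipelined(x, a, b, c):
    p = [0, 0, 0]  # pipeline registers on the cutset, reset to 0
    y = []
    for n in range(len(x)):
        y.append(sum(p))  # adders consume last cycle's registered products
        xn1 = x[n - 1] if n >= 1 else 0
        xn2 = x[n - 2] if n >= 2 else 0
        p = [a * x[n], b * xn1, c * xn2]  # products latched for next cycle
    return y

x = [1, 2, 3, 4, 5]
# Same input-output function, shifted by one cycle of latency:
assert fir3_pipelined(x, 2, 3, 4)[1:] == fir3_direct(x, 2, 3, 4)[:-1]
```

Because the cutset crosses no feedback edge, inserting the registers changes only latency, not functionality.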

Slide 24
Pipelining

Slide 25
Pipelining & Parallel

Slide 26
Pipelining

Slide 27
Feed-forward Cutset

Slide 28
Transposition

Slide 29
Transposition

Slide 30
Data Broadcast

Slide 31
Fine-grain Pipelining

Slide 32
Parallel Processing
• Process blocks of L samples at a time
– Clock period = L × sample period (clock rate = sample rate / L)
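A quick numeric illustration of the clock-rate relation, with assumed numbers:

```python
# L-parallel (block) processing: L samples are produced per clock cycle,
# so T_clk = L * T_sample, i.e. clock rate = sample rate / L.
sample_rate = 100e6           # 100 Msamples/s (assumed)
L = 4                         # block size (assumed)
t_sample = 1.0 / sample_rate  # 10 ns
t_clk = L * t_sample          # 40 ns clock period
print(t_clk)
```

The hardware thus runs at a quarter of the sample rate while sustaining full throughput.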

Slide 33
Parallelism

Slide 34
Parallelism

Slide 35
Components

Slide 36
Need for Parallelism

Slide 37
Parallelism
• Why not just use pipelining?
– May have a single large delay element that cannot be subdivided (e.g., communication between chips)
• Can be used in conjunction with pipelining
– Relatively less efficient than pipelining (in area cost and power savings)
• Note that we've skirted the issue of parallelizing general DFGs
– Loops make life hard

Slide 38
Parallelize + Pipeline

Slide 39
Area Efficiency

Slide 40
Pipelining Processors
• Classic DLX processor
– ISA: load/store (memory accessed only via loads and stores)
– 5 stages: IF, ID, EX, MEM, WB
• Pipelining processors is hard
– Data hazards:
• ADD r1, r2, r3; SUB r4, r5, r1
• Solution: use bypass logic
• LD r1, [r2]; ADD r4, r1, r2
• Solution?
– Branch hazards
• PC not changed until end of ID
• Solution: redo IF (only) if the branch is taken
• Pipelining DFGs is easy (no control flow!)

Slide 41
Pipelining Processors

Slide 42
Retiming
• Basic idea (for logic circuits)
– Move flops back and forth across gates
– Used for clock period reduction, flop minimization, power minimization, resynthesis
• The same idea holds for DFGs
– Examples
– Algorithm
– C-slow retiming

Slide 43
Retiming

Slide 44
Retiming

Slide 45
Cutset Retiming

Slide 46
Cutset Retiming

Slide 47
C-Slow Retiming

Slide 48
Min Delay Retiming
• Formalize using the notion of a "retiming function" on the nodes
– The amount of delay pushed backward across a node (can be negative; think of it as a retardation function)
• Want to know if a cycle time T_C is feasible
– Set up constraints:
• Long paths have to be broken
• No negative delays on edges
– Solve using a custom ILP
• Uses efficient graph algorithms
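The "efficient graph algorithms" are shortest-path methods: the retiming constraints all have the form r(u) − r(v) ≤ c, a system of difference constraints that Bellman-Ford-style relaxation solves. A minimal sketch with hypothetical constraint values:

```python
def solve_difference_constraints(n, constraints):
    """constraints: list of (u, v, c) meaning r(u) - r(v) <= c.
    Returns a feasible r, or None if the system is infeasible."""
    r = [0] * n  # implicit source node gives every variable distance 0
    for _ in range(n):
        changed = False
        for u, v, c in constraints:
            if r[v] + c < r[u]:  # relax: shorter path to u via v
                r[u] = r[v] + c
                changed = True
        if not changed:
            return r
    return None  # negative cycle -> no retiming meets this clock period

# Feasible system: r0 - r1 <= 2 and r1 - r0 <= -1
print(solve_difference_constraints(2, [(0, 1, 2), (1, 0, -1)]))  # [0, -1]
# Infeasible system (negative cycle): r0 - r1 <= -1 and r1 - r0 <= -1
print(solve_difference_constraints(2, [(0, 1, -1), (1, 0, -1)]))  # None
```

Binary search over candidate T_C values, with this feasibility check inside, yields the minimum clock period.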

Slide 49
Unfolding
• Analogous to loop unrolling for programs
– for (i = 1; i < 5; i++) { a[i] = b[i] + c[i]; }
– Many benefits, at the price of a potential increase in code size
• Look at the 2-unfolding of
– y(n) = x(n) + a·y(n−9)
• General algorithm for J-unfolding a DFG
– Uses J copies of each original node, with new delay values
– Nontrivial fact: the algorithm works (preserves functionality)
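The edge-transformation step of J-unfolding can be sketched as follows: each node U becomes copies U0..U(J−1), and an edge U → V with w delays becomes J edges Ui → V((i+w) mod J) carrying ⌊(i+w)/J⌋ delays (node duplication bookkeeping is elided here):

```python
def unfold(edges, J):
    """edges: list of (u, v, w) where w is the delay count on u -> v."""
    out = []
    for u, v, w in edges:
        for i in range(J):
            # U_i -> V_{(i+w) mod J} with floor((i+w)/J) delays
            out.append((f"{u}{i}", f"{v}{(i + w) % J}", (i + w) // J))
    return out

# 2-unfolding of the 9-delay loop edge from y(n) = x(n) + a*y(n-9)
# (modeled here as a self-loop on the add/multiply node A):
print(unfold([("A", "A", 9)], 2))  # [('A0', 'A1', 4), ('A1', 'A0', 5)]
```

Note that the total delay count is preserved (4 + 5 = 9), which is why the iteration bound per sample is unchanged by unfolding.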

Slide 50
Unfolding

Slide 51
Unfolding

Slide 52
Applications
• Meeting the iteration bound
– When a single node has a large execution time
– When the IB is nonintegral

Slide 53
Applications: IB

Slide 54
Application: fractional IB

Slide 55
Applications: Parallelize
• Recall that in Chapter 3 we never gave a systematic way of generating parallel circuits
– Loop unfolding gives a way

Slide 56
Applications: Bit-Serial to Digit-Serial
• Convert a bit-serial architecture to a digit-serial architecture

Slide 57
Folding
• Trade area for time
– Use the same hardware unit for multiple nodes in the DFG
• Example: y(n) = a(n) + b(n) + c(n)
• Need a general, systematic approach to folding
– Mathematical formulation: folding orders, folding sets, folding factors
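The mathematical formulation can be sketched with the standard folding equation: for an edge U → V with w(e) delays, the folded arc needs D_F = N·w(e) − P_U + v − u delays, where N is the folding factor, P_U the pipeline depth of the unit executing U, and u, v the folding orders of U and V (all example values below are hypothetical):

```python
def folded_delay(N, w, P_u, order_u, order_v):
    """Delays needed on the folded arc for edge U -> V:
    D_F = N*w - P_u + order_v - order_u."""
    return N * w - P_u + order_v - order_u

# Example: N = 4, one delay on the edge, a 1-stage unit,
# U scheduled in time partition 3 and V in partition 0:
print(folded_delay(4, 1, 1, 3, 0))  # 0
```

A negative result means the folding sets as given are infeasible, and the DFG must first be retimed or pipelined to make every folded delay nonnegative.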

Slide 58
Folding

Slide 59
Folding

Slide 60
Folding

Slide 61
Folding

Slide 62
Folding

Slide 63
Folding

Slide 64
Register Minimization
• Consider a DSP program that produces 3 variables, live over these time steps:
– a: {1,2,3,4}
– b: {2,3,4,5,6,7}
– c: {5,6,7}
• Number of live variables per step: {1,2,2,2,2,2,2}
– Intuitively, we should be able to get by with 2 registers
• However, DSP programs are periodic
– May have variables live across iterations
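The live-variable count above can be reproduced with a tiny lifetime sketch (single iteration only; the periodic, cross-iteration case needs the linear lifetime chart that follows):

```python
lifetimes = {"a": (1, 4), "b": (2, 7), "c": (5, 7)}  # (birth, death) steps

def live_counts(lifetimes):
    start = min(s for s, _ in lifetimes.values())
    end = max(e for _, e in lifetimes.values())
    # Count variables whose lifetime covers each time step.
    return [sum(s <= t <= e for s, e in lifetimes.values())
            for t in range(start, end + 1)]

counts = live_counts(lifetimes)
print(counts)       # [1, 2, 2, 2, 2, 2, 2]
print(max(counts))  # 2 registers suffice, ignoring periodicity
```

The maximum simultaneous live count is the register lower bound that the register allocation table then tries to achieve.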

Slide 65
Linear Lifetime Chart

Slide 66
Lifetime Analysis: Matrix

Slide 67
Lifetime Chart: Matrix

Slide 68
Register Allocation Table

Slide 69
Reg Assignment: Matrix

Slide 70
Reg Assignment: Biquad

Slide 71
Reg Assignment: Biquad

Slide 72
Pipelined & Parallel IIR
• Feedback loops make pipelining and parallelism very hard
– Impossible to beat the iteration bound without rewriting the difference equation
• Example
– Pipeline interleaving of y(n+1) = a·y(n) + b·u(n)
– Note that the IB goes up, but multiple streams can be run in parallel

Slide 73
Pipeline Interleaved IIR

Slide 74
Pipeline Interleaved IIR

Slide 75
Pipeline Interleaved IIR

Slide 76
Pipelining 1st-Order IIR
• y(n+1) = a·y(n) + u(n)
– Sample rate is set by the multiply-and-add time
• Can do better with "look-ahead pipelining"
– Basically, changing the difference equation to get more delays in the loop
• Key: functionality is unchanged
– Best understood in terms of Z-transforms
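One look-ahead step substitutes the recurrence into itself: y(n+1) = a·y(n) + u(n) becomes y(n+1) = a²·y(n−1) + a·u(n−1) + u(n), putting two delays in the loop. A numeric check with a hypothetical coefficient and input (zero initial state):

```python
a, u = 0.5, [1.0, 2.0, -1.0, 3.0, 0.5, 4.0]  # assumed values

def direct(u, a):
    y = [0.0]                      # y(0) = 0
    for n in range(len(u) - 1):
        y.append(a * y[n] + u[n])  # y(n+1) = a*y(n) + u(n)
    return y

def lookahead(u, a):
    y = [0.0, u[0]]                # y(1) = a*y(0) + u(0)
    for n in range(1, len(u) - 1):
        # y(n+1) = a^2*y(n-1) + a*u(n-1) + u(n): two delays in the loop
        y.append(a * a * y[n - 1] + a * u[n - 1] + u[n])
    return y

assert direct(u, a) == lookahead(u, a)  # functionality unchanged
```

The rewritten loop can now be pipelined into two stages, doubling the achievable sample rate at the cost of the extra a·u(n−1) multiply-add outside the loop.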

Slide 77
Pipelining 1st-Order IIR

Slide 78
Pipelining 1st-Order IIR

Slide 79
Pipelining High Order IIR
• Three basic approaches
– Clustered look-ahead
– Scattered look-ahead
– Direct synthesis with constraints

Slide 80
Pipelining High Order IIR

Slide 81
Pipelining High Order IIR

Slide 82
Pipelining High Order IIR

Slide 83
Pipelining High Order IIR

Slide 84
Pipelining High Order IIR

Slide 85
Pipelining High Order IIR

Slide 86
Pipelining High Order IIR

