
A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN

Xiaoying Li (1), Fuming Sun (2), Enhua Wu (1, 3)
(1) University of Macau, Macao, China
(2) University of Science and Technology Beijing, Beijing, China
(3) Institute of Software, Chinese Academy of Sciences, Beijing, China

ABSTRACT
In this paper, a hierarchical pipeline FIR filter structure is
proposed and implemented on FPGA hardware. It is a flexible
multi-rate structure. By running the computation clock several
times faster than the sampling rate, the multiplications and
additions can be performed by shared components, which reduces
the logic area. Only a few additional delay units are needed to
separate the basic FIR filter structure into two levels: in-group
and between-group. As the number of filter taps increases, the
structure can be easily extended without increasing the delay of
the critical path. A Simulink-to-FPGA flow is applied to the
multi-rate FIR filter structure with mixed HDL and Simulink
blockset design entry.

KEY WORDS
FIR filters, pipeline, FPGA, Simulink

1. INTRODUCTION
Finite Impulse Response (FIR) filters are one of the primary
types of digital filters used in Digital Signal Processing (DSP)
applications such as audio signal processing, video convolution
functions and telecommunications, by virtue of their stability
and easy implementation. A standard FIR filter design contains a
great number of multiplications, which require large silicon
area, increase the power consumption, and limit the maximum
sampling rate. Earlier work has addressed replacing
multiplications by decomposing them into simple operations such
as addition, subtraction, shift and sharing of common
sub-expressions [1], minimizing the delay and the number of
adders [2], and the tradeoffs between truncated multipliers and
the accuracy of computation [3]. Various application-specific FIR
filters are frequently implemented on FPGAs [4].
In this paper, a new FIR filter structure is proposed and
implemented on FPGA hardware. It is a flexible two-level
architecture with two clock rates. By adopting a computation
clock several times faster than the sampling rate, the
multiplying and adding component can be heavily shared, which
greatly reduces the number of multipliers and realizes high
throughput without lengthening the critical path. According to
the relation between N (the number of taps) and M (the ratio of
the two clock rates), $\lceil N/M \rceil - 1$ additional delay
units are inserted into the delay line, which separates the
computations into $\lceil N/M \rceil$ groups.
The remainder of the paper is organized as follows: a review of
the general structures of FIR filters is given in Section 2. In
Section 3, the new FIR structure and its multi-rate timing are
explained in detail. The FPGA design in Simulink is described in
Section 4 together with experimental results. Finally, some
discussion is presented in Section 5.

2. OVERVIEW OF FIR STRUCTURES
An FIR filter is essentially a discrete convolution of the input
signal with a set of coefficients. Mathematically, the
input-output relation of an FIR filter with N taps (or of order
N-1) in the time domain is defined by Eq. 1:

$$y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k] \tag{1}$$

where x is the input data stream, h[k] is the k-th tap
coefficient, and y is the output data stream. In general, such an
FIR filter requires N multipliers and N-1 two-input adders.
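As an illustration of Eq. 1, the sum can be written out for N=5, the tap count used in Fig. 1. The expansion below is only a worked instance of the definition, not an additional result:

```latex
y[n] = h[0]\,x[n] + h[1]\,x[n-1] + h[2]\,x[n-2] + h[3]\,x[n-3] + h[4]\,x[n-4]
```

The five products correspond to N=5 multipliers, and combining them takes N-1=4 two-input adders.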
The general FIR structures include the direct form and the
transposed form. The direct-form realization of an FIR filter can
be readily derived from the convolution sum (Eq. 1), as shown in
Fig. 1(a) for N=5. In the direct form, the delay units are placed
along the input line, between the multipliers. At any time, the
current filter input x[n] and the previous N-1 input samples are
each applied to one multiplier. The filter output y[n] is the sum
of all multiplier products, accumulated by N-1 adders. In the
transposed form shown in Fig. 1(b), by contrast, the delay units
are placed between the adders, so that all multipliers are fed
with the same input sample simultaneously. Generally, the direct
form is potentially better for high-frequency operation, but
suffers from higher latency than the transposed form. In
addition, the input of each multiplier changes at every clock
cycle as each new data sample propagates through the chain of
taps. This causes relatively high switching activity within the
multipliers and therefore higher overall power consumption. In
the transposed form, since the data input remains unchanged for a
substantial number of multiplications, corresponding to the order
of the filter, switching activity is reduced and less power is
consumed. However, the input signal has to be multiplied and
added to the accumulated value in a single pipeline stage, which
limits the clock frequency. Moreover, the transposed form places
additional pressure on the implementation through the high
fan-out required of the input signal.

Figure 1. Two Basic Forms of FIR Filter (N=5): (a) Direct, (b) Transposed
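The two forms can be compared with a small behavioral model. The following Python sketch is not taken from the paper; the function names and the test data are illustrative assumptions. It only shows that placing the delays on the input line (direct) or between the adders (transposed) yields the same output sequence.

```python
def fir_direct_form(x, h):
    """Direct form: delays on the input line, products summed every sample."""
    delay = [0] * len(h)                  # delay[k] holds x[n-k]
    y = []
    for sample in x:
        delay = [sample] + delay[:-1]     # shift the new sample into the line
        y.append(sum(hk * dk for hk, dk in zip(h, delay)))
    return y


def fir_transposed_form(x, h):
    """Transposed form: input broadcast to all multipliers, delays between adders."""
    n_taps = len(h)
    state = [0] * (n_taps - 1)            # registers between the adders
    y = []
    for sample in x:
        products = [hk * sample for hk in h]   # every multiplier sees x[n]
        y.append(products[0] + (state[0] if state else 0))
        # Each register stores its product plus the next register's old value.
        state = [products[k + 1] + (state[k + 1] if k + 1 < n_taps - 1 else 0)
                 for k in range(n_taps - 1)]
    return y


x = [3, -1, 4, 1, -5, 9, 2, -6]           # arbitrary test input (assumption)
h = [1, 2, 3, 2, 1]                       # N = 5, as in Fig. 1
assert fir_direct_form(x, h) == fir_transposed_form(x, h)
```

In the transposed model every product depends only on the current sample, which reflects the fan-out of the input signal noted above.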
The symmetry property of a linear-phase FIR filter can be
exploited to reduce the number of multipliers by almost half in
direct-form implementations. Symmetric FIR structures for both
odd and even orders are illustrated in Fig. 2. Other forms, such
as cascade, lattice and polyphase structures, can also be used to
build more complex FIR filter structures.
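The saving from coefficient symmetry can be sketched as follows. This is a minimal Python model under the assumption h[k] = h[N-1-k]; the function names are hypothetical. Samples that share a coefficient are added first, so only about half of the multiplications remain.

```python
def fir_reference(x, h):
    """Convolution sum of Eq. 1 with zero initial conditions."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_symmetric(x, h):
    """Direct form with pre-adders for symmetric coefficients."""
    n_taps = len(h)
    if any(h[k] != h[n_taps - 1 - k] for k in range(n_taps)):
        raise ValueError("coefficients are not symmetric")
    half = n_taps // 2
    delay = [0] * n_taps
    y = []
    for sample in x:
        delay = [sample] + delay[:-1]
        # One multiplication per coefficient pair, after the pre-addition.
        acc = sum(h[k] * (delay[k] + delay[n_taps - 1 - k]) for k in range(half))
        if n_taps % 2:                    # odd tap count: unpaired middle tap
            acc += h[half] * delay[half]
        y.append(acc)
    return y


def multipliers_needed(n_taps):
    """Multiplier count of the folded structure: ceil(N / 2)."""
    return (n_taps + 1) // 2


x = [2, 0, -3, 5, 1, 4, -2, 7]
assert fir_symmetric(x, [1, 2, 3, 2, 1]) == fir_reference(x, [1, 2, 3, 2, 1])
```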

Figure 2. Symmetric Coefficients FIR Filter Structure

3. HIERARCHICAL FIR FILTER DESIGN

Based on, but different from, the direct form, a two-level
pipeline FIR filter structure is designed by inserting extra
intermediate delay units among the N-1 delay units. These
additional delay units separate the delay line of the filter into
two levels: in-group and between-group. We adopt two different
clock rates: the sampling rate for the delay units, and a faster
clock for the multiply and add computations. Suppose the
frequency of the computation clock is set to M times the sampling
rate. Then an N-tap FIR filter in the direct form can be
separated into $\lceil N/M \rceil$ groups by adding
$\lceil N/M \rceil - 1$ delay units, one after every M taps. In
each group, the multiplications and additions are then controlled
by the faster clock. In other words, the M MAC
(Multiply-Accumulate) operations of one group can be finished in
M cycles of the faster clock, that is, in one cycle of the
sampling clock. Each group needs only one shared multiplier and
accumulator. In total, the number of groups,
$\lceil N/M \rceil$, determines the number of MAC components. By
comparison, N multipliers and N-1 adders are needed for the
standard direct and transposed forms of an N-tap FIR filter. By
inserting the additional delay units, the structure is changed
into a hierarchical pipeline in which each group is a pipeline
stage. The accumulation result of the current group in the last
fast-clock cycle of one sampling period is fed into the next
group in the first fast-clock cycle of the next sampling period
for further accumulation. After $\lceil N/M \rceil$ stages, the
final output y[n] corresponding to the input x[n] is obtained.
In Fig. 3, the ratio M of the two clock rates is set to eight, so
a 32-tap FIR filter is separated into four groups by inserting
three additional delay units. There are two levels of pipelines.
The first level is in-group with eight stages, and the second is
between-group with four stages. Each group shares a single
component for its MAC computations.

Figure 3. Two-level Pipeline FIR Filter Structure (N=32)

The timing diagram of the two-level pipeline FIR structure is
illustrated in Fig. 4. In this example, N=8 and M=4. One
additional delay unit is inserted to separate the FIR filter into
two groups. The control signal En can be generated by a counter
using one-hot coding. Driven by En, the multiplexer in each group
selects, in successive fast-clock cycles (In-group1 and
In-group2), the input sample from each delay unit as the input of
the multiplication. In Fig. 4, Xi refers to the i-th input
sample, and i in the In-group1 and In-group2 rows refers to the
multiplexing of the i-th sample by the fast clock. Only one
shared MAC component is used to perform the M computations in
each group. Due to the inserted delay unit, the delay relation
between the two groups can be seen from Delay Group1 and Delay
Group2 in Fig. 4. Therefore, each group can be organized as a
pipeline stage. After two stages, the 8-tap MAC results from
In-group1 and In-group2 are summed together as the output.
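The behavior described above can be checked with a short software model. The Python sketch below is one interpretation of the structure, not the authors' HDL: the function names, the zero-padding of the last group and the way the one-hot En drives the multiplexer are assumptions made for illustration. Each group owns one shared MAC that runs M fast-clock cycles per sampling period, and the partial sum of a group is handed to the next group one sampling period later.

```python
from math import ceil


def fir_reference(x, h):
    """Convolution sum of Eq. 1 with zero initial conditions."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_two_level(x, h, M):
    """Behavioral model of the two-level pipeline (assumed interpretation).

    Returns (outputs, latency), where latency is the number of sampling
    periods by which the outputs lag the reference filter.
    """
    N = len(h)
    G = ceil(N / M)                        # number of groups = shared MACs
    coeff = list(h) + [0] * (G * M - N)    # zero-pad the last group
    # M tapped positions per group plus one extra delay unit between groups.
    line = [0] * (G * (M + 1) - 1)
    carry = [0] * G                        # partial sum registered per group
    y = []
    for sample in x:                       # one iteration = one sampling period
        line = [sample] + line[:-1]
        new_carry = []
        for g in range(G):
            acc = carry[g - 1] if g > 0 else 0    # result of the previous stage
            taps = line[g * (M + 1): g * (M + 1) + M]
            for cycle in range(M):                # M fast-clock cycles
                en = [1 if j == cycle else 0 for j in range(M)]  # one-hot En
                selected = sum(e * t for e, t in zip(en, taps))  # multiplexer
                acc += coeff[g * M + cycle] * selected           # shared MAC
            new_carry.append(acc)
        carry = new_carry
        y.append(carry[-1])
    return y, G - 1


x = list(range(1, 17))
h = [1, -2, 3, 4, 5, -6, 7, 8]             # N = 8
y, latency = fir_two_level(x, h, M=4)      # two groups, as in Fig. 4
assert y[latency:] == fir_reference(x, h)[:len(x) - latency]
```

In this model the delay line holds G*(M+1)-1 positions, so exactly G-1 positions are untapped; these play the role of the inserted delay units, and the output lags the plain convolution by G-1 sampling periods.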
Since the large number of multiplications in FIR filters is
excessively area- and power-consuming, previous works concentrate
on how to simplify them. If the coefficients of an FIR filter are
constant, decomposition is more efficient than employing
multipliers. To minimize the number of additions/subtractions
required in each coefficient multiplication, the coefficients can
be restricted to powers of two, or expressed in CSD (Canonical
Signed-Digit) or graph representation [5]. In our method, one
contribution is the reduction of the number of MAC components. To
further improve the performance, the design of the MAC itself
should be considered. For a high-performance ASIC structure, an
optimized multiplying and adding component can be explored using
partial-product reduction by the Booth algorithm. For a flexible
FPGA design, the coefficients can be preset in embedded RAMs, and
multipliers and accumulators can be used directly for simplicity.
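To make the CSD idea concrete, the following Python sketch converts an integer coefficient into signed digits and multiplies by shifts and adds. It uses the standard non-adjacent-form recoding and is not taken from the paper or from [5]; the function names are hypothetical.

```python
def to_csd(value):
    """Return CSD digits of an integer, least significant first, each in {-1, 0, 1}."""
    digits = []
    while value != 0:
        if value % 2 == 0:
            digits.append(0)
        else:
            digit = 2 - (value % 4)        # +1 if value = 1 (mod 4), -1 if 3 (mod 4)
            digits.append(digit)
            value -= digit                 # remainder is now divisible by 4
        value //= 2
    return digits


def csd_multiply(x, coeff):
    """Multiply x by a constant using only shifts, additions and subtractions."""
    return sum(d * (x << i) for i, d in enumerate(to_csd(coeff)) if d)


def nonzero_digits(coeff):
    """Number of nonzero CSD digits, i.e. shifted terms to add or subtract."""
    return sum(1 for d in to_csd(coeff) if d)


assert to_csd(7) == [-1, 0, 0, 1]          # 7 = 8 - 1
assert csd_multiply(13, 7) == 91
```

For the coefficient 7, plain binary has three nonzero bits, whereas the CSD form 8 - 1 has two nonzero digits and so needs a single subtraction.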

Figure 4. Timing Diagram (N=8)

4. MULTI-RATE DESIGN IN SIMULINK


With the continued growth in the complexity of FPGA-based
designs, more flexible, efficient and higher-level design
methodologies are emerging to replace the traditional HDL-centric
flows. MATLAB/Simulink is a well-known tool that allows designers
to model a system at a high level and is suited to diverse
applications, such as digital signal processing, automotive
control, image processing and communication. To exploit the
modeling and simulation functionality of Simulink, major FPGA
manufacturers have released products that are integrated into
Simulink as dedicated blocksets. Xilinx System Generator for DSP
[6] and Altera DSP Builder [7] are popular examples. AccelChip
[8] also provides a DSP synthesis tool for FPGAs. These blocksets
and tools can implement a full FPGA design flow from Simulink
modeling to simulation to hardware [9, 10]. They can transform a
Simulink model into synthesizable HDL code with a test bench.
In this paper, we use the Xilinx System Generator tool to
implement the hierarchical FIR filter on FPGA hardware. For FIR
filter design, various filters are already available in the
Xilinx Reference Blockset in Simulink, and can be easily
customized and mapped to FPGA hardware by System Generator. For
the newly proposed structure, we explore mixed HDL and Simulink
block modeling of this multi-rate design. System Generator
provides a means to bring VHDL, Verilog, and EDIF into designs.
It also provides HDL co-simulation interfaces to simulate the
mixed-module system.

Figure 5. Simulink-to-FPGA FIR Filter Structure


The model of the multi-rate hierarchical FIR filter is shown in
Fig. 5, corresponding to the timing diagram in Fig. 4 (N=8). Two
clocks control the delay units and the MAC components
respectively. The shaded delay unit is the additionally inserted
one. The MAC component is in the Mux-Mul-Acc black box described
in HDL. The rate relation between the sampling and computing
clocks is declared in the configuration M-function of the HDL
module. The experimental results are shown in Tab. 1 and Tab. 2.
The target FPGA chip is a Xilinx Virtex-II xc2v2000. Tab. 1 lists
the resources and performance of the FIR filter blocks provided
by System Generator in Simulink, which can be parameterized and
used directly. As the number of taps increases, area consumption
increases almost linearly while the delay remains constant. Tab.
2 gives the FPGA logic area and speed of the proposed two-level
FIR filter for comparison with Tab. 1 (the Simulink FIR filter
blocks). Due to the reduction of MAC components, the hardware
logic resources are reduced substantially: for every tap count
tested, the slice count is less than half of that in Tab. 1.
Since the MAC components are described directly in HDL in our
method, the maximum frequency is lower than that of the optimized
Simulink blockset. Further improvement might be achieved if the
MAC components are optimized.
Table 1. Statistics of FPGA Resource Consumption and Speed
(Customized FIR Filter Block in Simulink)

Resource and Speed      No. of Taps
                        8       16      20      24      32
SLICES                  165     326     462     501     656
FLIP FLOPS              297     584     826     906     1192
LUTS                    159     352     524     580     793
Delay (ns)              4.2     4.2     4.2     4.2     4.2
Max. Frequency (MHz)    238     238     238     238     238
Other Info.             x is 8-bit, h is 10-bit, both are signed.

Table 2. Statistics of FPGA Resource Consumption and Speed
(Hierarchical FIR Filter)

Resource and Speed      No. of Taps
                        8       16      20      24      32
SLICES                  72      154     163     203     284
FLIP FLOPS              101     205     213     264     364
LUTS                    68      144     162     196     269
Delay (ns)              8.13    8.16    8.2     8.2     8.3
Max. Frequency (MHz)    123     122.6   121.9   121.9   120.5
Other Info.             x is 8-bit, h is 10-bit, both are signed.
                        M = Computing Clk / Sampling Rate = 4
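The comparison between the two tables can be quantified with a few lines of arithmetic. The Python sketch below copies the slice counts and maximum frequencies from Tab. 1 and Tab. 2; the variable names are illustrative. The sampling-rate bound is an inference from M = Computing Clk / Sampling Rate, assuming the reported maximum frequency of the hierarchical filter applies to the computing clock; it is not a figure reported in the paper.

```python
TAPS = [8, 16, 20, 24, 32]
SLICES_BLOCKSET = [165, 326, 462, 501, 656]       # Tab. 1
SLICES_HIERARCHICAL = [72, 154, 163, 203, 284]    # Tab. 2
FMAX_HIERARCHICAL_MHZ = [123, 122.6, 121.9, 121.9, 120.5]  # Tab. 2
M = 4                                             # clock ratio used in Tab. 2

# Fraction of slices saved by the hierarchical structure for each tap count.
slice_saving = [1 - hier / block
                for hier, block in zip(SLICES_HIERARCHICAL, SLICES_BLOCKSET)]

# Assumed bound: the sampling rate is the computing clock divided by M.
max_sampling_rate_mhz = [fmax / M for fmax in FMAX_HIERARCHICAL_MHZ]

for taps, saving, rate in zip(TAPS, slice_saving, max_sampling_rate_mhz):
    print(f"{taps:2d} taps: {saving:.1%} fewer slices, "
          f"sampling rate up to about {rate:.1f} MHz")
```

Under these assumptions the slice saving lies between roughly 53% and 65% across the five filter sizes.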

5. DISCUSSION
With the development of FPGA technology, design methodology
drives the continual updating of EDA tools. Building on the
standard development flow (Fig. 6), effort has shifted toward
high-level design and synthesis. There are many conversion tools,
such as C-to-FPGA, Stateflow diagram to VHDL (SF2VHD), and
MATLAB-to-FPGA (MATCH). The features of the Simulink-to-FPGA flow
are discussed as follows.
- Friendly graphical interface. Although schematic entry is also
  graphical, Simulink makes it easier to organize input data and
  much more convenient to observe the output in many ways.
- Easy number format conversion. Double to fixed-point number
  conversion is parameterized in the functional blocks, but care
  must be taken to keep data types consistent along the data
  flow.
- Flexible modeling and simulation. The design can be well
  organized into hierarchical modules, is easy to combine with
  other entry methods as design decisions require, and is
  convenient to debug and simulate.
- Fast time-to-market for DSP development. With the assistance of
  dedicated DSP blocks for FPGA, the Simulink-to-FPGA flow can
  greatly shorten the development cycle from algorithm to
  hardware. The arithmetic blocksets might be further reinforced.
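The double to fixed-point conversion mentioned in the list above can be sketched as follows. This Python example is an illustration only: the 10-bit signed word matches the coefficient width reported in Tab. 1 and Tab. 2, but the choice of 9 fractional bits, the saturation behavior and the function names are assumptions, not the settings of any particular blockset.

```python
def quantize(value, total_bits, frac_bits):
    """Convert a real value to a signed fixed-point integer with saturation."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    raw = round(value * scale)            # Python rounds exact halves to even
    return max(lowest, min(highest, raw))


def to_real(raw, frac_bits):
    """Real value represented by a fixed-point integer."""
    return raw / (1 << frac_bits)


# Hypothetical double-precision coefficients, quantized to 10-bit signed words.
coefficients = [0.1, 0.5, -0.25, 1.0]
fixed = [quantize(c, total_bits=10, frac_bits=9) for c in coefficients]
errors = [c - to_real(q, 9) for c, q in zip(coefficients, fixed)]
assert fixed == [51, 256, -128, 511]      # 1.0 saturates at the largest code
```

The nonzero entries of errors show why data types must be kept consistent along the data flow: every conversion point can add quantization or saturation error.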
In this paper, a new FIR filter structure is presented and
implemented by different methods. The basic direct form of the
FIR filter is rebuilt as a hierarchical structure by inserting
only a few additional delay units. This structure is flexible
enough to meet different system requirements. Owing to the
sharing of MAC components, area consumption is greatly reduced.
For high-level hardware design, Simulink-to-FPGA modeling and
simulation takes advantage of a good graphical interface and
flexible design choices. For many DSP applications such as image
processing and communication, more functional blocks will be
encapsulated into FPGA-mapped blocks in Simulink, and their
performance will continue to improve in the future.

ACKNOWLEDGMENT
The research is supported by the Research Grant of
University of Macau.

REFERENCES
[1] Y. C. Lim, J. B. Evans, and B. Liu, Decomposition of binary
integers into signed power-of-two terms, IEEE Trans. Circuits
Syst., vol. 38, 1991, 667-672.
[2] H.-J. Kang and I.-C. Park, FIR filter synthesis algorithms
for minimizing the delay and the number of adders, IEEE Trans.
Circuits Syst., vol. 42, 2001, 770-777.
[3] E. G. Walters III, Design tradeoffs using truncated
multipliers in FIR filter implementations, Master's Thesis,
Lehigh University, May 2002.
[4] L. Mintzer, FIR Filters with FPGA, Journal of VLSI Signal
Processing, vol. 6, 1993, 119-127.
[5] H. Samueli, An improved search algorithm for the design of
multiplierless FIR filters with powers-of-two coefficients, IEEE
Trans. Circuits Syst., vol. 36, no. 7, 1989, 1044-1047.
[6] Xilinx, Xilinx System Generator, Version 6.2, Xilinx Inc.,
USA.
[7] Altera, Altera DSP Builder, Version 5.1, Altera Inc., USA.
[8] AccelChip, Integrating MATLAB Algorithms into FPGA Designs,
Xcell Journal, 2005, 73-75.
[9] M. A. Shanblatt and B. Foulds, A Simulink-to-FPGA
Implementation Tool for Enhanced Design Flow, Proceedings of the
2005 IEEE International Conference on Microelectronic Systems
Education (MSE'05), 2005, 89-90.
[10] M. Haldar, A. Nayak, A. Choudhary, and P. Banerjee, A System
for Synthesizing Optimized FPGA Hardware from MATLAB, Proceedings
of the 2001 IEEE/ACM International Conference on Computer-Aided
Design, 2001, 314-319.
