Escolar Documentos
Profissional Documentos
Cultura Documentos
IMPLEMENTATION ON AN FPGA
Daniel J. Allred, Heejong Yoo, Venkatesh Krishnan, Walter Huang, and David V. Anderson
ABSTRACT implement adaptive filters using DA [3] [4], but the approxima-
In this paper, an FIR adaptive filter implementation using a tions made to standard adaptation algorithms may be unsuitable
multiplier-free architecture is presented. The implementation is for practical applications.
based on distributed arithmetic (DA) which substitutes multiply- In this paper, we develop and present an implementation of a
and-accumulate operations with a series of look-up-table (LUT) discrete-time linear FIR adaptive filter based on the well-known
accesses. This can be achieved at the cost of a moderate increase LMS algorithm using the DA concept. A novel approach for up-
in memory usage. The proposed design performs an LMS-type dating the LUT tables of the DA filter is presented. The details of
adaptation on a sample-by-sample basis. This is accomplished by the proposed design and a description of the constituent modules
an innovative LUT update using a matched auxiliary LUT. The of the DA-based LMS adaptive filter is provided in Sec. 3. The
system is implemented on an FPGA that enables rapid prototyping implementation results and a brief discussion of design trade-offs
of digital circuits. Implementation results are provided to demon- are presented in Sec. 4.
strate that a high-speed, low logic complexity LMS adaptive filter
can be realized employing the proposed architecture. 2. BACKGROUND CONCEPTS
K−1 x[n − k] = −bk0 + bkl 2−l (2)
y[n] = wk x[n − k]. (1) l=1
k=0 and Eq. (1) can be rewritten as
A typical digital implementation will require K multiply-and- K−1
B−1
K−1
accumulate (MAC) operations, which are expensive to compute y[n] = − wk bk0 + wk bkl 2−l . (3)
in hardware due to logic complexity, area usage, and throughput. k=0 l=1 k=0
Alternatively, the MAC operations may be replaced by a series
of look-up-table (LUT) accesses and summations. Such an im- It is noted that the terms in the square brackets may take only one
plementation of the filter, known as distributed arithmetic (DA), of 2K possible values; these values, which are all the possible
achieves higher throughput and lower logic complexity at the cost combination sums of the filter coefficents, are stored in an LUT,
of increased memory usage. Recent advances in memory design denoted as the DA filtering LUT (DA-F-LUT). The filtering oper-
technology have resulted in shrinking memory sizes, making this ation may then be implemented, according to Eq. 3, by B look-up,
tradeoff an attractive option. A good tutorial review of DA linear shift, and accumulate operations.
filters is given in [1]. The block diagram of a typical DA implementation of a four
Many DSP applications require linear filters that can adapt to tap (K = 4) filter is shown in Fig. 1. The bank of shift registers in
changes in the input signals. The implementation of an adaptive Fig. 1 stores four consecutive input samples. The concatenation of
filter [2] based on the DA concept poses several challenges. Since the rightmost bits of the shift registers becomes the address of the
the DA filtering operation is based on an LUT, changes to the filter LUT. The shift registers are shifted right at every clock cycle. The
can require extensive changes to the LUT. This can be impracti- corresponding LUT entries are also shifted and accumulated B
cal for large LUT sizes. Several past attempts have been made to consecutive times where B is the precision of the input data. The
It must be noted that as the filter size K increases, the mem- x[n] y[n]
would be m×2k memory words. The total number of clock cycles sample_clk system_clk
16
18
tion, we can choose k = 4 and m = 32 which would only require 4 Table Module Update
Controller
w_in
sample_table_out
512 memory words. The number of clock cycles required for this 18 Module
18
wr_en
implementation would be 21 clock cycles as compared with the 16 sample_update_done
term µe[n] is quantized to a power of 2. Implementation results wl [n + 1] = wl [n] + µe[n] x[n − l]. (5)
demonstrate that this quantized error LMS (QE-LMS) outperforms l=0 l=0 l=0
the SE-LMS and is comparable to the LMS algorithm in terms of The DA-F-LUT update then consists of reading the same
the convergence speed. memory location in both the DA-F-LUT and DA-A-LUT, multi-
plying the output of the DA-A-LUT by µe[n], adding this quantity
3. DA ADAPTIVE FILTER to the output of the DA-F-LUT, and finally storing the result back
in the same memory location of the DA-F-LUT. This process is re-
The LMS adaptation algorithm requires the filter weights wk [n] to peated for addresses 1 to 2k − 1 (we can ignore memory location
be updated according to Eq. (4) every sample that is filtered. After 0 as it will always have a zero value).
9
Since it is area inefficient to implement µe[n]x[n − k] using Time n Time n + 1
a hardware multiplier, the following two schemes are considered. Internal Address External Address External Address
The first scheme is the SE-LMS with µ set as a power of two. The 0000 0000 0000
multiplication is replaced by a right shift and the result is added or 0001 0001 0010
subtracted to the DA-F-LUT output depending on the sign of the 0010 0010 0100
error. The second scheme is the QE-LMS, which was mentioned .. .. ..
. . .
in Sec. 2.2. In this case mu is fixed to some power of two and
1101 1101 1011
the error value is quantized to the closest power of two, essentially
1110 1110 1101
quantizing the product µe[n] to a power of two. Now the magni-
1111 1111 1111
tude of the error, and not just the sign, plays a role in the rate of
adaptation. When the error is large, a large update is applied to
Table 1. Rotation of address lines from time n to time n + 1.
the filter weights and faster convergence occurs. When the error
is small, a small update is applied thus minimizing the jitter of the
weights as they near the optimum values.
and whose select lines are the bits of a counter. This counter is in-
3.2. DA Auxiliary Table Module cremented when a new sample arrives, instantaneously remapping
the table. The DA auxiliary table module then begins the update
In this section, the update process of the DA-A-LUT is described. of the DA-A-LUT, as described above. When the update is done
The brute-force method would be to recalculate all possible com- the sample update done signal (see Fig. 2) goes high, indicating
bination sums of the k most recent samples (after each new sample to the DA filter update controller that the DA-A-LUT is ready to
arrives) and then store the results in the DA-A-LUT. A more effi- be used in updating the DA-F-LUT.
cient update method can be devised by considering how the DA-
A-LUT changes as a new sample arrives and the oldest sample is
discarded. 3.3. DA Filter Module
Fig. 3 shows the change in the DA-A-LUT from time n to time
n+1 for k = 4. The high address locations in the table (addressed The DA FIR filter module performs the filtering operation on the
by 1000 to 1111) contains sums involving the oldest sample x[n − incoming data samples with the current values of the weights. The
3], which is the sample being discarded at time n + 1. The sums DA filtering operation has been explained in detail in Sec. 2.1.
in the low address locations of the DA-A-LUT (addressed by 0000 When filtering is complete, the filtering done signal (see Fig. 2)
to 0111 at time n) can be reused by mapping these values to the goes high.
even address locations of the table at time n + 1, as shown by the
arrows in Fig. 3.
DA Auxillary Table
3.4. DA Filter Update Controller Module
External External
Address LUT Value LUT Value Address
0000 0 0 0000
0001 x[n] x[n+1] 0001 The DA filter update controller provides the control logic for the
0010 x[n-1] x[n] 0010
0011 x[n-1]+x[n] x[n]+x[n+1] 0011 system after the DA-A-LUT is updated and filtering is complete.
0100
0101
x[n-2]
x[n-2]+x[n]
x[n-1]
x[n-1]+x[n+1]
0100
0101 Once the filtering is done, the error, e[n] = d[n] − y[n], is calcu-
0110
0111
x[n-2]+x[n-1]
x[n-2]+x[n-1]+x[n]
x[n-1]+x[n]
x[n-1]+x[n]+x[n+1]
0110
0111
lated. The term µe[n] is then quantized to the appropriate power of
1000
1001
x[n-3]
x[n-3]+x[n]
x[n-2]
x[n-2]+x[n+1]
1000
1001
two (see Sec. 3.1). Upon completion of the DA-A-LUT update as
1010
1011
x[n-3]+x[n-1]
x[n-3]+x[n-1]+x[n]
x[n-2]+x[n]
x[n-2]+x[n]+x[n+1]
1010
1011
described in Sec. 3.2, the update controller accesses the memory
not be
1100
1101
x[n-3]+x[n-2]
x[n-3]+x[n-2]+x[n] reused
x[n-2]+x[n-1]
x[n-2]+x[n-1]+x[n+1]
1100
1101
locations of the two LUTs and stores the new weight combination
at time ‘n+1’
1110
1111
x[n-3]+x[n-2]+x[n-1]
x[n-3]+x[n-2]+x[n-1]+x[n]
x[n-2]+x[n-1]+x[n]
x[n-2]+x[n-1]+x[n]+x[n+1]
1110
1111
sums in the DA-F-LUT.
At time n At time n+1
Fig. 3. Change in the content of the DA-A-LUT from time n to time n+1. 3.5. Timing Analysis
The values in the odd address locations can then be formed by For a K tap FIR adaptive filter implemented using m base DA
simply reading the value in the preceding even address location, filtering units, each of size k, the update of the DA-A-LUT can be
adding the newest sample x[n + 1] and then storing the result at done in 2k−1 clock cycles as described in Section 3.2. This may
the odd address. This can be represented by be done at the same time (in parallel) as the filtering and adder tree
operation, which take a total of B + log2 (m) clock cycles. Thus
DA-A-LUT(2l + 1) = DA-A-LUT(2l) + x[n + 1], the total number of clock cycles for filtering
and updating the DA-
l = 0, . . . , 2k−1 − 1. (6) A-LUT is max B + log2 (m), 2k−1 . The updated DA-A-LUT
is then used to update the DA-F-LUT; this operation requires 2k
k
The low address locations of the DA-A-LUT at time n can be clockcycles. Thus the overall
k−1
K tap adaptive filter requires 2 +
mapped to even address locations at time n+1 by a simple rotation max B + log2 (m), 2 clock cycles. The number of clock
of the k address lines. This allows the physical contents of the LUT cycles required for a 128 tap adaptive filter as a function of the size
to remain the same, even as external logic sees the table as shifted. of the base unit k is shown in Fig. 4. It can be observed that the
The address rotation from time n to n + 1 is shown in Table 1. proposed implementation can achieve a much higher throughput
The address line rotation is accomplished via k, k-to-1 multiplex- than a brute-force implementation if k and m are appropriately
ers, whose outputs connect to the DA-A-LUT’s internal address chosen.
9
4 0
10 QEŦLMS
SEŦLMS
Ŧ5
Ŧ10
Ŧ15
MSE(dB)
3
10
Ŧ20
Clock cyles
Ŧ25
Ŧ30
Ŧ35
2
10
0 500 1000 1500 2000 2500 3000 3500 4000
Iteration
(a)
0
QEŦLMS
SEŦLMS
Ŧ5
1
10
2 3 4 5 6 7 8 9 10
Ŧ10
Number of taps in base DA filter unit
Ŧ15
MSE(dB)
Fig. 4. Number of clock cycles required to filter one time sample as a Ŧ20
function of the base DA filter unit size. (K = 128)
Ŧ25
Ŧ30
9