


Joel J. Fúster and Karl S. Gugel

Dept. of Electrical and Computer Engineering, University of Florida, Gainesville, FL

ABSTRACT

This paper describes the design and implementation of a fully pipelined 64-point Fast Fourier Transform (FFT) in programmable logic. The FFT takes 20-bit fixed-point complex numbers as input and, after a known pipeline latency, produces 20-bit complex values representing the FFT of the input. It is designed to allow continuous input of samples and is therefore suitable for use in real-time systems. The modular design allows it to be used together with other 64-point FFTs to create larger sizes, much as this design is built using smaller 8-point FFTs. Such a design has many applications in high-speed real-time systems such as wireless networking, spectral analysis, recognition systems, and more.

FFTs have use in innumerable signal processing applications and are often an important building block in such systems. Many of these applications require real-time operation in order to be useful. While Digital Signal Processors (DSPs) are available that can perform an FFT fast enough to keep up with many real-time applications, some systems require additional computation or have speed requirements that exceed the capabilities of a DSP alone. It is in these situations that dedicated logic for computing an FFT can be useful. Described in this paper is the interface, design, implementation, and testing of a 64-point FFT implementation that takes advantage of pipelining, memory bank switching, and smaller FFTs to create a design capable of continuous real-time operation at high speeds.

The FFT is designed to take complex values at the input, where the real and imaginary components each have 20 bits of precision. A two's-complement fixed-point format is used, with all numbers scaled to between -1.0 and 1.0. Getting the most accuracy out of the design requires that the inputs be scaled relative to the largest value that will occur in the input data. Additional logic or processing may be required to achieve this condition. The mapping between input numbers and their corresponding hexadecimal values is as follows:

    Input value       Hexadecimal value
    1                 0x7FFFF
    …                 …
    1.907x10^-6       0x00001
    0                 0x00000
    -1.907x10^-6      0xFFFFF
    …                 …
    -1                0x80000

This mapping is more practically represented by

\[
n = \begin{cases}
\lfloor 2^{19} s \rfloor, & 1 > s \ge 0 \\
\lfloor 2^{19} (2 + s) \rfloor, & 0 \ge s > -1
\end{cases}
\tag{2.1}
\]

where s is the input number scaled to the range -1 to 1 and n is the decimal value of the binary number to be fed into the input of the FFT.

Output values follow the same format as described above. However, down-scaling is done as a byproduct of some of the internal stages, as described later in this paper. As a result, the output must be multiplied by 256 to correct for this. The mapping from the output of the FFT to the equivalent values is given by

\[
s = \begin{cases}
\dfrac{n}{2^{19}}, & \mathtt{0x00000} \le n \le \mathtt{0x7FFFF} \\[6pt]
\dfrac{n}{2^{19}} - 2, & \mathtt{0x80000} \le n \le \mathtt{0xFFFFF}
\end{cases}
\tag{2.2}
\]

where n is the hexadecimal equivalent of the binary output of the FFT and s is the corresponding fractional decimal value.
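As a rough illustration of (2.1) and (2.2), the following Python sketch converts between a scaled value s and the 20-bit word n. The function names to_fixed and from_fixed are hypothetical and not part of the design. The sketch assumes that s = -1 is accepted (it maps to 0x80000, as in the table) and that s = 1 is rejected rather than saturated to 0x7FFFF. The factor-of-256 output correction is assumed to be applied separately.

```python
import math

SCALE = 1 << 19  # 2^19, the weight of the sign bit in a 20-bit word


def to_fixed(s: float) -> int:
    """Map a scaled value s in [-1, 1) to a 20-bit word, following (2.1)."""
    if not -1.0 <= s < 1.0:
        raise ValueError("s must satisfy -1 <= s < 1")
    if s >= 0:
        return math.floor(SCALE * s)
    # Negative values wrap into the upper half of the 20-bit range.
    return math.floor(SCALE * (2 + s))


def from_fixed(n: int) -> float:
    """Map a 20-bit word n back to its fractional value, following (2.2)."""
    if 0 <= n <= 0x7FFFF:
        return n / SCALE
    if 0x80000 <= n <= 0xFFFFF:
        return n / SCALE - 2
    raise ValueError("n must be a 20-bit value")
```

For example, to_fixed(-1.0) gives 0x80000 and to_fixed(2**-19) gives 0x00001, matching the table rows for -1 and 1.907x10^-6.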
[Block diagram: 20-bit x 2 complex input -> bank switched memory A -> 8-point FFT unit A -> bank switched memory B -> twiddle factor multiplier -> 8-point FFT unit B -> bank switched memory C -> 20-bit x 2 complex output; an address generator block is also shown.]

Figure 1: Pipeline structure of FFT Implementation

Both the input and output busses to the FFT are synchronized to the rising edge of the clock, so that an input value is captured and an output value is available on the rising edge. The maximum clock rate for the design is determined by the speed and design of the programmable logic device (PLD) that will be used. The simulated timing information for this design on an Altera Apex FPGA device is described later in this paper.

3. DESIGN

The core FFT algorithm chosen for this design is the Winograd 8-point FFT. This algorithm significantly reduces the number of multiplications needed versus other algorithms at the expense of an increase in the number of additions and memory needed [1-3]. For PLDs, multiplication is more expensive to implement than addition in terms of computation time and number of gates, and therefore the Winograd algorithm was chosen. The equations that describe how it is computed are shown in Figure 2.

The pipeline layout of the 64-point FFT is shown in Figure 1. There are 6 stages, including two 8-point Winograd FFTs, one twiddle factor multiplier, and three bank switched memories. The 8-point FFT blocks have clocked shift registers at the input and output, but the FFT itself is computed with purely combinatorial logic.

\begin{align*}
t(1) &= x(1) + x(5) & q(1) &= t(1) + t(3) & s(1) &= q(1) + q(2) & y(1) &= s(1) \\
t(2) &= x(2) + x(6) & q(2) &= t(2) + t(4) & s(2) &= q(1) - q(2) & y(2) &= s(5) + s(7) \\
t(3) &= x(3) + x(7) & q(3) &= t(1) - t(3) & s(3) &= q(3) - j\,q(4) & y(3) &= s(3) \\
t(4) &= x(4) + x(8) & q(4) &= t(2) - t(4) & s(4) &= q(3) + j\,q(4) & y(4) &= s(5) - s(7) \\
t(5) &= x(1) - x(5) & q(5) &= t(5) & s(5) &= q(5) - j\,(1/\sqrt{2})\,q(6) & y(5) &= s(2) \\
t(6) &= x(2) - x(6) & q(6) &= t(6) + t(8) & s(6) &= q(5) + j\,(1/\sqrt{2})\,q(6) & y(6) &= s(6) - s(8) \\
t(7) &= x(3) - x(7) & q(7) &= t(7) & s(7) &= (1/\sqrt{2})\,q(8) - j\,q(7) & y(7) &= s(4) \\
t(8) &= x(4) - x(8) & q(8) &= t(6) - t(8) & s(8) &= (1/\sqrt{2})\,q(8) + j\,q(7) & y(8) &= s(6) + s(8)
\end{align*}

Figure 2: 8-point Winograd FFT
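To make the equations of Figure 2 and the stage order of Figure 1 concrete, here is a minimal floating-point sketch in Python. It is not the VHDL design. The names winograd_fft8, fft64 and naive_dft are hypothetical, q(8) is read as t(6) - t(8) (which is what a direct DFT requires), and the index mapping between the two passes is assumed to be the standard one, consistent with the modulo-8 grouping shown in Figure 3. The 24-bit internal precision, the 0.7071 approximation and the divide-by-16 scaling of the hardware are not modeled.

```python
import cmath
import math

INV_SQRT2 = 1 / math.sqrt(2)  # exact here; the hardware uses a shift-add approximation


def winograd_fft8(x):
    """8-point FFT following the t/q/s/y equations of Figure 2 (1-based names)."""
    x1, x2, x3, x4, x5, x6, x7, x8 = x
    t1, t2, t3, t4 = x1 + x5, x2 + x6, x3 + x7, x4 + x8
    t5, t6, t7, t8 = x1 - x5, x2 - x6, x3 - x7, x4 - x8
    q1, q2, q3, q4 = t1 + t3, t2 + t4, t1 - t3, t2 - t4
    q5, q6, q7, q8 = t5, t6 + t8, t7, t6 - t8
    s1, s2 = q1 + q2, q1 - q2
    s3, s4 = q3 - 1j * q4, q3 + 1j * q4
    s5 = q5 - 1j * INV_SQRT2 * q6
    s6 = q5 + 1j * INV_SQRT2 * q6
    s7 = INV_SQRT2 * q8 - 1j * q7
    s8 = INV_SQRT2 * q8 + 1j * q7
    # y(1)..y(8) in the order given in Figure 2
    return [s1, s5 + s7, s3, s5 - s7, s2, s6 - s8, s4, s6 + s8]


def fft64(x):
    """64-point FFT built from two passes of 8-point FFTs with twiddles between."""
    # Pass 1: group b holds the samples whose index is b modulo 8.
    s = [0j] * 64
    for b in range(8):
        s[8 * b:8 * b + 8] = winograd_fft8([x[8 * a + b] for a in range(8)])
    # Twiddle stage: multiply s(8b + k1) by W64^(b * k1).
    t = [s[8 * b + k1] * cmath.exp(-2j * math.pi * b * k1 / 64)
         for b in range(8) for k1 in range(8)]
    # Pass 2: group k1 holds t(n) with n mod 8 == k1; output k2 is X(k1 + 8*k2).
    X = [0j] * 64
    for k1 in range(8):
        X[k1::8] = winograd_fft8([t[8 * b + k1] for b in range(8)])
    return X


def naive_dft(x):
    """Direct O(N^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * m * k / n) for m in range(n))
            for k in range(n)]
```

Comparing fft64(x) against naive_dft(x) for an arbitrary 64-sample complex input is a quick way to confirm that the reordering and twiddle indices are consistent.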
The only multiplications needed within this stage are a
few multiplies by 1/√2, which are also built using
straight combinatorial logic units. These units perform
the multiplication by using shift-add techniques. The
actual multiplication value used is 0.7071. It should be noted that the internal precision of these multiplication units and the rest of the 8-point FFT blocks is 24-bit, and that the output is scaled down by a divisor of 16 as a result of how the algorithm is implemented.

The 8-point FFT units have shift registers on their inputs and outputs, each one with eight positions for a 20-bit by 2 complex number. Each time the input register is loaded with a new group of eight values, it is copied to a latch from where the actual FFT is computed. In this way, the shift register can continue to load itself with new values while the FFT is running. The output register operates in a similar fashion, copying the output of the FFT from a parallel output latch and shifting the values out one at a time.

Combined with reordering and a twiddle factor multiplication stage, the two 8-point FFT units are used to produce the 64-point FFT. The data flow diagram for the algorithm is shown in Figure 3. It should be noted that the implementation of sixteen 8-point FFT units was avoided by instead multiplexing the use of two units at the expense of increased latency.

[Data flow diagram: input x(0..63) is split into groups x0..x7; each group passes through an 8-pt FFT to give s(0..7), s(8..15), ..., s(56..63); s(0..63) passes through the twiddle factor multiplication to give t(0..63); groups t0..t7 each pass through an 8-pt FFT to give X0..X7, forming the output X(0..63). Here {x, t, X}k = {x, t, X}(n) where (n mod 8) = k.]

Figure 3: Data flow for 64-point FFT

The three bank switched memory blocks are used to realize the multiplexing as well as to facilitate the sample reordering that is done at three different times in the data flow. Specifically, each memory block consists of two banks, each of which can store 64 20-bit x 2 complex numbers. While one bank is being written with the data from the previous stage, the other bank can be read from separately. When the bank being written to is filled with 64 new samples, the pipeline stages following are timed to be finished reading the 64 samples from the other bank. The banks are then switched, allowing the new data to be read out and the old bank to be loaded with more samples. In this way, continuous operation is possible.

Data reordering is accomplished by controlling the memory access pattern when reading the data out of the memory banks. The controller unit for the pipeline generates the memory read addresses for each block, creating the modulo-8 reordering system. The controller is also responsible for timing the start sequences between each pipeline stage, and generating the proper indices for the ROM in the twiddle factor unit. The controller is implemented simply as a 128-state state machine, using a counter as an address generator for a ROM that stores the values for all the control signals and addresses at each state.

The twiddle factor multiplier is simply a ROM coupled with a complex multiplier. The ROM stores the 64 pre-computed twiddle factors. The complex multiplication is accomplished by breaking the operation down into three multiplies and five additions, as shown in (3.3). The total latency for this stage is 7 cycles.

\begin{align}
& (x_r + j x_i)(t_r + j t_i) \tag{3.1} \\
& x_r t_r - t_i x_i + j t_i x_r + j t_r x_i \tag{3.2} \\
& t_r (x_r - x_i) + x_i (t_r - t_i) + j \left( x_r (t_r + t_i) - t_r (x_r - x_i) \right) \tag{3.3}
\end{align}

VHDL was chosen as the hardware description language with which to build the FFT. The choice was made based mostly on the ready availability of tools to compile and simulate VHDL designs. The Quartus II design system from Altera was used to compile and simulate the system. The target device used for performing the timing simulation was the Apex EP20K600E [4]. This FPGA contains 24320 logic elements (LEs), 7326 of which are used by the FFT. A timing analysis of the worst case propagation time shows that the maximum speed for the FFT design in this FPGA
is 33.54 MHz. This means that the design can perform a 64-point FFT in 1.908 µs. In contrast, a 64-point FFT on a common DSP chip, the TI TMS320C3X at 75 MHz, takes 19.75 µs [5]. Thus the dedicated hardware FFT is faster by more than a factor of ten. It should be noted, however, that the C3X family of DSPs is 32-bit floating-point, which would mean greater dynamic range than the fixed-point implementation. On the other hand, the quoted speed of the C3X does not include the time required to convert the samples to the TMS320-specific floating point format, which may be a concern in actual system implementations and would slow the algorithm down further.

While this design is capable of processing up to 33.54 Msamples/s in at least one type of FPGA, the cost of the pipelined architecture is in the latency of each stage. Each bank switched memory stage adds 66 cycles of latency, due to the number of cycles it takes to load 64 samples in before a bank switch. The 8-point Winograd FFT stages each add 16 cycles. These cycles are the time that it takes to shift in 8 samples as well as another 8 cycles to allow the FFT to complete. It should be noted that the FFT itself, without the attached shift registers, is not pipelined. It is given the maximum amount of time possible, 8 cycles, to compute its result. After 8 cycles, the input shift register would begin to lose data, making this the upper limit. This "wait state" allows the rest of the system to be pipelined while keeping the 8-point FFT …

The twiddle factor multiplier stage contributes 7 cycles of latency. This latency arises due to the delays associated with the twiddle factor ROM, as well as the pipelined multipliers used to form the complex multiplication unit. With three memory stages (3 x 66), two 8-point FFT stages (2 x 16), and the twiddle factor stage (7), the total latency for the entire 64-point FFT is 237 cycles.

To test the FFT, it was given a set of artificially generated complex input values, and the outputs were compared with a software FFT implementation. The first eight points of the results are shown in Table 1. It is noted that the output values of the hardware implementation sometimes differ from the software version by one. This is likely due to the rounding caused by the shift-add implementation of the multiplication by 1/√2 or simply the fact that the software implementation has much higher internal precision, particularly in the multiplication units.

    FFT Output    Software FFT          Hardware FFT
    Sample #      Real      Imag        Real      Imag
    1             0x07D62   0x07D62     0x07D61   0x07D61
    2             0xF7222   0xF6115     0xF7221   0xF6114
    3             0x02E15   0xF7099     0x02E14   0xF7098
    4             0x062ED   0xFDB5E     0x062EC   0xFDB5E
    5             0x03302   0x023AA     0x03301   0x023A9
    6             0xFF3D3   0x01C12     0xFF3D2   0x01C12
    7             0xFE819   0xFED87     0xFE818   0xFED86
    8             0x00A2E   0xFD8DD     0x00A2E   0xFD8DC

Table 1: First 8 points of simulation results

5. CONCLUSIONS

This paper presented an architecture for a pipelined 64-point FFT for implementation in a PLD. It is suitable for relatively high-speed applications where the typical DSP is not sufficiently fast to process the data, and particularly for real-time designs. Further work might include a simple extension of the radix-8 algorithm to the next step, a 512-point design, further investigation of the off-by-one errors, or perhaps further optimizations of the FFT design.

ACKNOWLEDGEMENT

This work was supported in part by the University Scholars Program at the University of Florida.

7. REFERENCES

[1] Oppenheim, A.V., Schafer, R.W., Discrete-time Signal Processing, 2nd ed., Prentice Hall, New Jersey, 1999.

[2] Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T., Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, January 1993.

[3] Smith, Steven, The Scientist and Engineer's Guide to Digital Signal Processing, California Technical Publishing, 1997.

[4] "APEX 20K Programmable Logic Device Family", Product Data Sheet, ver. 4.0, Altera Corporation, August 2001.

[5] "TMS320C3x General-Purpose Applications User Guide", Texas Instruments, January 1998.