Escolar Documentos
Profissional Documentos
Cultura Documentos
AbstractThe Fast Fourier Transform (FFT) and its inverse We applied both methods to a simple 8-point FFT and we
transform (IFFT) processor are key components in many compared them to the conventional FFT and to the R2MDC
communication systems. An optimized implementation of the 8- processor in order to a comparative evaluation.
point FFT processor with radix-2 algorithm in R2MDC
architecture is presented in this paper. The butterfly- Processing The paper is organized as follows: Section II discusses the
Element (PE) used in the 8-FFT processor reduces the FFT algorithm implementation (Cooley-Tukey) and complex
multiplicative complexity by using a real constant multiplication multiplication used inside the butterfly-processing element.
in one method and eliminates the multiplicative complexity by Section III devoted for an architectural description of the FFT
using add and shift operations in other proposed method. The used module. Section IV shows the resulting implementation
pipeline architecture R2MDC has been implemented with the 8- and finally a conclusion is given in section V.
point module and simulation results show that this module
significantly achieves a better performance with lower resource II. FFT ALGORITHM
usage. In this section, a brief overview of IFFT and FFT
algorithms is provided to be effectively used in OFDM
Keywords- Cooley-Tukey, FFT/IFFT, OFDM, R2MDC,VHDL.
applications.
I. INTRODUCTION The N-point discrete Fast Fourier Transform (DFT) is
The FFT/IFFT is one of the most commonly used digital defined as:
signal processing algorithm. Recently, FFT processor has been N 1
widely used in digital signal processing field applied for X [ k ] = x[ n ].W N
nk
communication systems. FFT/IFFT processors are key n =0
components for an Orthogonal Frequency Division 2 nk
j
Multiplexing (OFDM) based wireless broadband nk
Where WN =e N
0 k N 1
communication system; it is one of the most complex and
intensive computation module of various wireless standards
PHY layer (OFDM-802.11a, MIMO-OFDM 802.11n) [1]. X(k) is the k-th harmonic and x(n) is the n-th input sample.
However, the main constraints nowadays for FFT processors Direct DFT calculation requires a computational complexity of
used in wireless communication systems are execution time O(N2). By using The CooleyTukey FFT algorithm, the
and lower power consumption [2]. The main issue in FFT/IFFT complexity can be reduced to O (N.logr N) [2] [5].
processors is complex multiplication, which is the most A. The Cooley-Tukey FFT Algorithm
prominent arithmetic operation used in FFT/IFFT blocks. It is
an expensive operation and consumes a large chip area and The Cooley-Tukey FFT is the most universal of all FFT
power especially when it comes to a large FFT point [3]. To algorithms, because of any factorization of N is possible [5]
reduce the complexity of the multiplication, one of the two [6]. The most popular Cooley-Tukey FFTs are those were the
proposed methods in this article replaces the expensive transform length is a power of a basis r, i.e., N = rS. These
complex multiplication with real and constant multiplications algorithms are referred to as radix-r algorithms. The most
[4]. The other method optimizes further the processing, by commonly used are those of basis r = 2 and r = 4.
wiping out the non-trivial complex multiplication with the For r = 2 and S stages, for instance, the following index
twiddle factors and fulfills the processing with no complex mapping of the The CooleyTukey algorithm gives:
multiplication.
A=a+b
+
S 1 S 2
k = 2 kS + 2 k S 1 + ... + 2 k 2 + k 1 WN
k S , k S 1 ,..., k 2 , k 1 = 0 ,1
Figure 2. Basic Butterfly computation.
The CooleyTukey algorithm is based on a divide-and-
conquer approach in the frequency domain and therefore is Radix-2 butterfly processor consists of a complex adder and
referred to as decimation-in-frequency (DIF) FFT. The DFT complex subtraction. Beside that, an additional complex
formula is split into two summations: multiplier for the twiddle factors WN is implemented. The
N
1
N
1
N
1 complex multiplication with the twiddle factor requires four
N 1
2
N ( n+ N ) k 2 2
real multiplications and two add/subtract operations [1] [3] [7].
X (k) = x(n).WN + x(n).WN = x(n).WN + x(n + ).WN 2
nk nk nk
n=0
n=
N n=0 n=0 2 C. Complex Multiplication
2
N N Since complex multiplication is an expensive operation, we
1 1
2
N 2 tend to reduce the multiplicative complexity of the twiddle
= x(n).WN + x(n + ).WN .WN 2
N N
nk nk .k .k
and WN 2 = (1)k factor inside the butterfly processor by calculating only three
n=0 n=0 2 real multiplications and three add/subtract operations as in (5)
N
1 and (6).
2
N nk
= x(n) + (1)k .x(n + ).WN (1) The twiddle factor multiplication:
n=0 2
X(k) can be decimated into even-and odd-indexed R + jI = (X + jY ).(C + jS) (4)
frequency samples: However the complex multiplication can be simplified:
N N
1 1 R = (C - S) .Y + Z (5)
2
N 2.n.k N n.k 2
X(2k) = x(n) +(n+ ).WN = x(n) +(n+ ).WN (2)
n=0 2 n=0 2 2 I = (C + S) .X - Z (6)
N N
1 1 With: Z = C. (X - Y) (7)
2
N 2.n.k N n.k 2
X(2k +1) = x(n) (n+ ).WN = x(n) (n+ ).WN (3)
n=0 2 n=0 2 2 C and S are pre-computed and stored in a memory table.
Therefore it is necessary to store the following three
The computational procedure can be repeated through coefficients C, C + S, and C - S.
decimation of the N/2-point DFTs X(2k) and DFTs X(2K+1). The implemented algorithm of complex multiplication used
The entire algorithm involves log2N stages, where each stage in this work uses three multiplications, one addition and two
involves N/2 operation units (butterflies). The computation of subtractions as shown in Fig. 3. This is done at the cost of an
the N point DFT via the decimation-in-frequency FFT, as in the additional third memory table which is given in (7). In the
decimation-in-time algorithm requires (N/2).log2N complex hardware description language (VHDL) program, the twiddle
multiplication and N.log2N complex addition [5]. factor multiplier was implemented using component
B. 8-point FFT Module instantiations of three lpm-mult and three lpm-add-sub modules
from Altera library. Worth to note that lpm modules are
The flow graph of complete DIF decomposition of 8-point supported by most of EDA vendors and LPM provides an
DFT computation is represented in Fig. 1. The basic operation architecture-independent library of logic functions or modules
in the signal flow graph is the butterfly operation; its a 2-point that are parameterized to achieve scalability and adaptability.
DFT computation as shown in Fig. 2.
x(0) X(0) C+S
x(1) X(4)
-1 X {in}
W80
x(2) -1
-1 X(2)
I {Out}
W82 C-S
x(3)
-1
-1 -1 X(6)
Y {in}
W80
x(4) X(1) C
-1 W81 R {Out}
x(5) X(5)
-1 W82 W80 -1
x(6) X(3)
-1 W83 -1 W82
x(7)
-1
X(7)
-1 -1 -1 Figure 3. Implementation of complex multiplication.
Figure 1. 8-point decimation-in-frequency FFT algorithm.
In the 8-point FFT with radix-2 algorithm, the Third butterfly calculation:
multiplication with W82=-j and W80 factors is trivial, 1
multiplication with W82 simply can be done by swapping from (13)
x 3 ( n 0 , n 1, n 2 ) = x ' 2 ( n 0 , n 1 , k 0 ). W 4n2k0
sign [7] [8]. The number of complex multiplication in this B. R2MDC Architecture
scheme of FFT is two: W81, W83. This non-trivial complex
multiplication was implemented with two multiplications for Simplicity, modularity, high throughput and real-time
W81 and three multiplications for W83. Therefore, the number of applications are required for FFT/IFFT processors in
real multiplications is 5. However, this solution was not communication systems. The pipeline architecture is suitable
suitable in practice; because the first stage has to process four for those ends [7] [9].
different twiddle factors (trivial and non-trivial multiplications) The sequential input stream in pipeline architecture
in pipeline architecture with one complex multiplier. Therefore, unfortunately doesnt match the FFT/IFFT algorithm since the
it will require more elements in the structure to implement the bloc FFT/IFFT requires temporal separation of data. In this
two complex multiplications in addition to the two trivial case the data memory is required in the pipeline processor to be
multiplications in one block. Eventually, we used four complex rearranged according to FFT/IFFT algorithm as shown in Fig.1.
multiplications with one complex multiplier in the first
proposed architecture of the 8-point FFT processor as shown in One of the most straightforward approaches for pipeline
TABLE I. implementation of radix-2 FFT algorithm is Radix-2 Multi-path
Delay Commutator (R2MDC) architecture. Its the simplest
Another architecture solution was proposed in order to wipe way to rearrange data for the FFT/IFFT algorithm [2] [9]. The
out completely the complex multiplication inside the butterfly input data sequence are broken into two parallel data stream
processor. The implementation complexity of non-trivial flowing forward, with correct distance between data elements
twiddle factors W81, W83 was replaced by basic operation of entering the butterfly scheduled by proper delays. 8-point FFT
shift-and-add which is shown in Fig. 5. in R2MDC architecture is shown in Fig. 4.
III. PIPELINE 8-POINT FFT/IFFT PROCESSOR ARCHITECTURE At each stage of this architecture half of the data flow is
delayed via the memory (Reg) and processed with the second
A. 8-point FFT Calculation Method half data stream. The delay for each stage is 4, 2, and 1
The application of the FFT algorithm for computation of respectively. The total number of delay elements is 4 + 2 + 2 +
the 8-point DFT required calculation of three of 2-point DFT 1 +1 = 10.
(radix-2) [9].
In this R2MDC architecture, both Butterflies (BF) and
n = 4n2 + 2n1 + n0 n2=0,1 n1=0,1 and n0=0,1 multipliers are idle half the time waiting for the new inputs.
k = 4k2 + 2k1 + n0 k2=0,1 k1=0,1 and k0=0,1 The 8-point FFT/IFFT processor has one multiplier, 3 of radix-
2 butterflies, 10 registers (R) (delay elements) and 2 switches
The 8-point FFT can be expressed in the following (S).
sequence:
IV. IMPLEMENTATION AND RESULTS
7
X ( n 2 , n1, n 0 ) = x ( k 2 , k 1, k 0 )W
k =0
( 4 n 2 + 2 n 1 + n 0 ).( 4 k 2 + 2 k 1 + k 0 )
(8 ) The 8-point FFT computation with radix-2 in R2MDC
architecture was first coded in VHDL using Quartus software
Where: tool from Altera and then simulated and synthesized on Altera
Cyclone 2 EP2C35F672C6 device. The purpose is to determine
( 4 n 2 + 2 n 1+ n 0 ).( 4 k 2 + 2 k 1+ k 0 )
W = the resource usage of the two proposed designs. The first
4n0k 2 2n0k1 4 n1 k 1 ( 2 n 1+ n 0 ) k 0 4n2k 0
proposed structure combines two resource reductions, the
W .W .W .W .W complex multiplication reduction inside the butterfly, and the
The process of computation of the 8-point FFT is done as fact of using the pipeline architecture. The second structure
follow: attempts to eliminate the complex multiplication, hence avoid
this expensive operation of multiplication and consumes less
First butterfly calculation: chip area.
1
(9)
x1( n 0 , k 1, k 0 ) =
k 2= 0
x ( k 2 , k 1 , k 0 ). W 4 n 0 k 2
R
x3x2x1x0 R R X3X2X1X0
Twiddle factor multiplication: R
e B S B S B
2n0k1
x '1 ( n 0 , k 1 , k 0 ) = x 1 ( n 0 , k 1 , k 0 )W (10) g F F F
R R R
x7x6x5x4
Second butterfly calculation: X7X6X5X4
*-j
1
(11)
x 2 ( n 0 , n1, k 0 ) =
k 1= 0
x '1 ( n 0 , k 1 , k 0 ). W 4 n1k1
Twiddle
Factor
Twiddle factor multiplication:
( 2 n 1+ n 0 ) k 0
Figure 4. 8-point FFT implemented in R2MDC architecture.
x ' 2 ( n 0 , n1, k 0 ) = x 2 ( n 0 , n1, k 0 )W (12)
TABLE I. COMPARISON OF THE FIRST PROPOSED ARCHITECTURE WITH The number of Embedded 9-bit multiplier elements used in
COOLEY-TUKEY ALGORITHM AND CONVENTIONAL R2MDC ARCHITECTURE IN
TERMS OF NUMBER OF ARITHMETIC OPERATIONS AND MULTIPLIERS.
the proposed architecture was reduced from 12 elements to 3
by using the complex multiplication reduction in the FFT
Algorithm Complex Real Real Complex Real R2MDC pipelined architecture as shown in TABLE I. Hence,
Multipli- Multipli- add/ multipliers multipliers
cations cations sub
this solution is more efficient than the conventional R2MDC.
Cooley- 12 48 72 12 48 However, the minimum size of data memory remains the same.
Tukey
In order to further optimize the processor, another method
R2MDC 12 48 72 3 12
proposed eliminates the non-trivial complex multiplication with
1stProposed 4 12 60 1 3 the twiddle factors (W81, W83) and implements the processor
architecture without complex multiplication. The proposed butterfly
processor performs the multiplication with the trivial factor
W82=-j by switching from real to imaginary part and imaginary
In the first structure, the twiddle factors of first stage of the to real part, with the factor W80 by a simple cable. With the
8-point FFT are computed and stored in a register in order to j j 3
synchronize the multiplication of those coefficients with the non-trivial factors W 8 1 = e 4 , W 8 3 = e 4 , the processor
two radix-2 FFT points entered in the pipeline architecture as realize the multiplication by the factor 1/2 using hardwired
shown in Fig. 4. shift-and-add operation as shown in Fig. 5.
From the algorithmic perspective, this proposed
+
architecture demands less number of arithmetic operations x(3)x(2)x(1)x(0) S(3)S(2)S(1)S(0)
resource usage, which leads to a high speed operation [10]. Figure 5. Butterfly processor with no complex multiplication.
Add0
var[1..0] Mux4a1_bus2:Mux
A[1..0]
PRE
2' h1 --
B[1..0] + D Q sel[1..0]
a0_r[7..0]
registr:Regi5
registr:Regi4
mult_W1:Behav2
clk1
clk1 Q0_r[7..0]
clk Q0_r[7..0] D0_r[7..0]
Dre_out[7..0] D0_r[7..0] Q0_i[7..0]
Are_in[7..0] Q0_i[7..0] D0_i[7..0]
Dim_out[7..0] D0_i[7..0]
Aim_in[7..0]
mult_W2:Behav4
clk
Dre_out[7..0]
Are_in[7..0]
Dim_out[7..0]
Aim_in[7..0]
mult_W3:Behav3
clk
Dre_out[7..0]
Are_in[7..0]
Dim_out[7..0]
Aim_in[7..0]