Impact of DPU 2017

International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS-2017)
Impact of Datapath Unit for an Efficient Implementation of FFT Processor

Kazi Nikhat Parvin, Asst. Professor, ECED, Bhoj Reddy Engineering College for Women, Hyderabad, INDIA, iamnikhatparvin@gmail.com
Md. Zakir Hussain, Asst. Professor, ECED, Muffakham Jah College of Engineering and Technology, Hyderabad, INDIA, zakirhussainsm@gmail.com
Md. Ali Ghazi Islam, M.E. Student, ECED, Muffakham Jah College of Engineering and Technology, Hyderabad, INDIA, mdalighazi123@gmail.com

Abstract: The fast Fourier transform (FFT) has long been an active research topic for a wide range of applications in digital systems. The computational unit of the FFT consists of butterfly units, which perform complex multiplications and complex additions. Because multipliers are the basic computational element and also the slowest and most power-consuming element in the processor, special multipliers are used to overcome this drawback. In this study, we propose a computationally efficient FFT based on the radix-2 and radix-4 decimation-in-frequency algorithms, in which different multipliers are used and the resulting parameters are compared. The entire work is performed in Xilinx ISE 14.7 and implemented on a Xilinx Virtex-6 xc6vlx760-2ff1760 FPGA.

Keywords: Fast Fourier transform (FFT), Butterfly unit, Computational unit, Multipliers.

I. INTRODUCTION

The fast Fourier transform (FFT) is a standard algorithm because it is computationally more efficient than the direct discrete Fourier transform (DFT) [1], which is widely used in digital signal processing, communication systems, image processing, biomedical applications, etc. Because the FFT has large computational requirements, it occupies a large area and consumes considerable power when implemented in hardware. Therefore, an efficient and accurate implementation of the FFT processor is always preferred.

The FFT algorithm was proposed by Cooley and Tukey [2]; it uses a divide-and-conquer approach to reduce the amount of computation. This algorithm led to two main variants, radix-2 decimation in time (DIT) and radix-2 decimation in frequency (DIF). Later, to reduce the computation further, higher-radix algorithms [3] (radix-4, radix-8) were introduced; they reduce the number of complex multiplications, but the butterfly structure becomes more complex as the radix increases. The mixed-radix algorithm [4] was therefore introduced, which combines the benefits of the radix-2 and radix-4 algorithms. The FFT algorithm evolved further with the radix-2^k algorithm [5], which offers higher computational efficiency. Pipelined FFT architectures [6], a special class in which the FFT is computed sequentially, were then introduced; they are classified into single-delay feedback (SDF) and multipath delay commutator (MDC) architectures for efficient hardware utilization.

The FFT algorithm has evolved in order to achieve an efficient processor with low power, small area, high speed and high accuracy. The FFT algorithms differ from one another mainly in the number of complex multipliers in their architectures [7-8]. Multipliers consume the most power and area and are slow to compute; thus, most FFT algorithms focus on reducing the number of multipliers, which results in a complex architecture. Using an efficient complex multiplier in the architecture can therefore reduce the overall power and area consumption, resulting in an efficient FFT processor. FFT algorithms use either fixed-point or floating-point multipliers [9]: fixed-point multipliers are preferred when the application focuses on speed, power and area, whereas floating-point multipliers are preferred, at the expense of higher cost, when a large dynamic range is required.

Implementation of FFT processors is an old field, but advances are still to be made. This paper presents a low-complexity, area-efficient datapath unit for the FFT obtained by combining algorithms, arithmetic and architecture [10]. The radix-2 and radix-4 FFT algorithms are implemented using different types of multipliers, namely the array multiplier, Wallace tree multiplier, Vedic multiplier, radix-4 Booth multiplier and constant multiplier, all represented in fixed point (Q-format) [11] and synthesized on a Xilinx Virtex-6 xc6vlx760-2ff1760. Post-place-and-route simulation is performed using Verilog code to estimate delay, area and power in Xilinx ISE 14.7.

II. REVIEW OF FFT ALGORITHM

The N-point discrete Fourier transform (DFT) [1] is defined by

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \tag{1}$$

where $W_N = e^{-j2\pi/N}$ is the twiddle factor. The FFT uses the symmetry and periodicity of the complex twiddle factor
978-1-5386-1887-5/17/$31.00 ©2017 IEEE
to compute the DFT. The Cooley-Tukey FFT algorithm is a divide-and-conquer algorithm that recursively splits the DFT computation into even and odd halves.

A. Radix-2 DIF FFT algorithm:

When N is a power of r = 2, the algorithm is called radix-2, and the "divide and conquer approach" is to split the sequence into two sequences of length N/2. The radix-2 FFT is derived by rewriting eq. (1) as:

$$X(2m) = \sum_{n=0}^{N/2-1} \left[ x(n) + x(n+N/2) \right] W_{N/2}^{mn}$$

$$X(2m+1) = \sum_{n=0}^{N/2-1} \left[ x(n) - x(n+N/2) \right] W_N^{n}\, W_{N/2}^{mn}$$

for $m = 0, 1, \ldots, N/2-1$.

The FFT is calculated by butterfly operations as shown in figure 1, where a basic radix-2 butterfly unit requires four real multiplications and two real additions for its complex twiddle-factor multiplication. The N-point FFT is computed in $\log_2 N$ stages, with N/2 butterfly operations in each stage, and therefore requires $(N/2)\log_2 N$ butterfly operations in total.

Fig. 1. Basic Radix-2 DIF butterfly operation (inputs x(n) and x(n+N/2); outputs x(n) + x(n+N/2) and (x(n) - x(n+N/2)) W_N^n).

Fig. 2. Signal flow graph of 16-point Radix-2 DIF butterfly operation.

The signal flow graph of the 16-point radix-2 DIF FFT is shown in figure 2. It can be observed that the inputs x(n) are in natural order and the outputs X(k) are obtained in bit-reversed order. Each of the numbers between two stages represents the twiddle factor exponent of the FFT algorithm for the length N of the input sequence.

B. Radix-4 DIF FFT algorithm:

In the radix-4 algorithm, the length N of the input sequence should be a power of 4 (i.e., N = 4^v), where "v" indicates the number of stages involved in the butterfly operation. The main idea of the radix-4 FFT is to divide the original input sequence into four smaller sub-sequences. The relation is not an N/4-point DFT because the twiddle factor depends on N and not on N/4. To convert it into an N/4-point DFT, we subdivide the DFT sequence into four N/4-point subsequences, X(4m), X(4m+1), X(4m+2) and X(4m+3), m = 0, 1, ..., N/4 - 1, as shown in figure 3.

Fig. 3. Signal flow graph of basic Radix-4 DIF butterfly operation.

Thus, we obtain the radix-4 decimation-in-frequency DFT as,

$$X(4m) = \sum_{n=0}^{N/4-1} \left[ x(n) + x(n+N/4) + x(n+N/2) + x(n+3N/4) \right] W_N^{0}\, W_{N/4}^{mn}$$

$$X(4m+1) = \sum_{n=0}^{N/4-1} \left[ x(n) - j\,x(n+N/4) - x(n+N/2) + j\,x(n+3N/4) \right] W_N^{n}\, W_{N/4}^{mn}$$

$$X(4m+2) = \sum_{n=0}^{N/4-1} \left[ x(n) - x(n+N/4) + x(n+N/2) - x(n+3N/4) \right] W_N^{2n}\, W_{N/4}^{mn}$$

$$X(4m+3) = \sum_{n=0}^{N/4-1} \left[ x(n) + j\,x(n+N/4) - x(n+N/2) - j\,x(n+3N/4) \right] W_N^{3n}\, W_{N/4}^{mn}$$
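The radix-2 and radix-4 decimation-in-frequency splits of Section II can be checked numerically. The following is a minimal Python sketch, not the hardware design evaluated in this paper: it forms the butterfly outputs of one DIF stage and compares the resulting half-length or quarter-length DFTs with a direct evaluation of the DFT definition. The function names (`dft`, `radix2_dif_split`, `radix4_dif_split`) are illustrative assumptions, and floating-point arithmetic is used instead of the fixed-point Q-format of the implementation.

```python
import cmath


def dft(x):
    """Direct N-point DFT: X(k) = sum_n x(n) * W_N**(n*k), W_N = exp(-j*2*pi/N)."""
    n_pts = len(x)
    w = cmath.exp(-2j * cmath.pi / n_pts)  # twiddle factor W_N
    return [sum(x[n] * w ** (n * k) for n in range(n_pts)) for k in range(n_pts)]


def radix2_dif_split(x):
    """One radix-2 DIF stage; returns (X(2m), X(2m+1)) as two N/2-point DFTs."""
    n_pts = len(x)
    half = n_pts // 2
    w = cmath.exp(-2j * cmath.pi / n_pts)
    # Butterfly outputs: sum branch, and difference branch scaled by W_N**n.
    upper = [x[n] + x[n + half] for n in range(half)]
    lower = [(x[n] - x[n + half]) * w ** n for n in range(half)]
    return dft(upper), dft(lower)


def radix4_dif_split(x):
    """One radix-4 DIF stage; returns [X(4m), X(4m+1), X(4m+2), X(4m+3)]."""
    n_pts = len(x)
    quarter = n_pts // 4
    w = cmath.exp(-2j * cmath.pi / n_pts)
    branches = [[], [], [], []]
    for n in range(quarter):
        a = x[n]
        b = x[n + quarter]
        c = x[n + 2 * quarter]
        d = x[n + 3 * quarter]
        # Trivial multiplications by 1, -1, j, -j, then one twiddle per branch.
        branches[0].append(a + b + c + d)
        branches[1].append((a - 1j * b - c + 1j * d) * w ** n)
        branches[2].append((a - b + c - d) * w ** (2 * n))
        branches[3].append((a + 1j * b - c - 1j * d) * w ** (3 * n))
    return [dft(branch) for branch in branches]


if __name__ == "__main__":
    samples = [complex(n, -n) for n in range(16)]
    reference = dft(samples)
    even, odd = radix2_dif_split(samples)
    print(max(abs(even[m] - reference[2 * m]) for m in range(8)))
```

Applying the same split recursively to each sub-sequence yields the full FFT; the sketch stops after one stage so that each output group can be compared directly with X(2m), X(2m+1) or X(4m+r).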
Fig. 4. Signal flow graph of 16-point Radix-4 DIF butterfly operation.

Figure 4 shows a 16-point radix-4 FFT, where it is clear that the non-trivial complex multiplications of radix-4 FFT algorithms only appear after every two butterfly stages. As such, it provides better spatial regularity than radix-2 FFT algorithms, which is beneficial for hardware implementation.

III. TYPES OF MULTIPLIERS

A. MODIFIED BOOTH ALGORITHM

MacSorley [12] proposed a modification to the Booth algorithm. The advantage of this algorithm is that it halves the number of partial products regardless of the input bits. Because the partial products generated in the second stage of multiplication are reduced, it is known as one of the fastest multiplication algorithms.

TABLE I. BOOTH BITS

Qi   Qi-1   No. of partial products
0    0      Zero partial products
0    1      One partial product
1    0      One partial product
1    1      Two partial products

From Table I we can infer that the last case (1 1) generates two partial products. In order to reduce the partial products, the (1 1) case should be replaced according to the Booth recoding table. Three bits are taken at a time and are encoded into one of {-2, -1, 0, 1, 2}. Grouping starts from the LSB; the first block uses only two bits of the multiplier and assumes a zero for the third bit, as shown in figure 5:

110101010

Fig. 5. Grouping of numbers by overlapping technique.

TABLE II. BOOTH RECODED TABLE

Multiplier bits      Encoded multiplier   Partial    Operation
Qi+1  Qi  Qi-1       value                product
0     0   0           0                    0M        No action
0     0   1           1                    1M        Add
0     1   0           1                    1M        Add
0     1   1           2                    2M        Shift left and add
1     0   0          -2                   -2M        Shift left and subtract
1     0   1          -1                   -1M        Subtract
1     1   0          -1                   -1M        Subtract
1     1   1           0                    0M        No action

Table II describes the Booth recoding table. Three bits of the multiplier Q are inspected at a time to find the encoded value. The encoded value, which is to be multiplied with the multiplicand 'M', is given by Qi-1 + Qi - 2 * Qi+1.

B. VEDIC MULTIPLIER

The Vedic multiplier is based on the Urdhva Tiryakbhyam Sutra, a multiplication algorithm [13-14] that is one of the easiest and fastest approaches to performing this mathematical operation. Due to their coherence and symmetry, these algorithms consume less area and less power when deployed in digital processors. As the number of bits increases, the gate delay and area increase very slowly compared with other multipliers.

Fig. 6. Illustration of Urdhva Tiryakbhyam Sutra for 4 bits (operands a3 a2 a1 a0 and b3 b2 b1 b0).

Figure 6 shows the multiplication of two binary numbers; the vertical and crosswise combinations of the binary bits generate the partial products concurrently. The partial products of the 4x4 multiplier (P7, P6, P5, P4, P3, P2, P1, P0) are given by the following equations:

P0 = a0b0
P1C1 = a1b0 + a0b1
P2C2 = a2b0 + a1b1 + a0b2 + C1
P3C3 = a3b0 + a2b1 + a1b2 + a0b3 + C2
P4C4 = a3b1 + a2b2 + a1b3 + C3
P5C5 = a3b2 + a2b3 + C4
P6C6 = a3b3 + C5
P7C7 = C6

In the 4x4 Vedic multiplier, multiplication is done in a single line using the Urdhva Tiryakbhyam Sutra, and the partial products are generated to obtain the resultant product. Initially the carry is taken as zero; if more than one line is encountered in any step, all the results are added to the previous carry. In each step, the LSB acts as the result bit and all other bits act as the carry for the next step. Hence the number of steps required for the multiplication is reduced and the speed of the multiplier increases.

C. CONSTANT MULTIPLIER

Constant multiplication [15] is one of the important operations in digital signal processing (DSP), image processing and linear transforms such as the DCT and FFT. As the FFT architecture multiplies constant coefficients with a variable set of data, constant multiplication can be performed using shift-and-add operations corresponding to the nonzero bit positions. Representing the constant coefficient in canonical signed digit (CSD) [16] form reduces the number of nonzero bits; hence, it uses less area and power than other multipliers.

Example: consider the binary number x = 0.011000011. Representing this number in CSD, where $\bar{1}$ denotes the digit -1, gives

$$x = 0.10\bar{1}00010\bar{1} = 2^{-1} - 2^{-3} + 2^{-7} - 2^{-9}$$

Now, consider the multiplication z = x * y. This can be implemented as

z = (y >> 1) - (y >> 3) + (y >> 7) - (y >> 9)

Fig. 7. Representation of the result in CSD format.

Figure 7 shows the constant multiplier, in which a shift-and-add approach is used together with the CSD representation to reduce the number of nonzero bits.

D. ARRAY MULTIPLIER

The array multiplier [17] is known for its regular structure. The multiplier circuit performs the add-and-shift algorithm. Each bit of the multiplier is multiplied with the multiplicand to generate the partial products. The partial products are shifted according to their bit order and then added. The multiplication of two binary numbers is performed by employing an array of half-adders and full-adders. Figure 8 shows the 4x4 array multiplier operation.

(A x B)                  A3    A2    A1    A0
                         B3    B2    B1    B0
                        A3B0  A2B0  A1B0  A0B0
                  A3B1  A2B1  A1B1  A0B1
            A3B2  A2B2  A1B2  A0B2
      A3B3  A2B3  A1B3  A0B3
S7    S6    S5    S4    S3    S2    S1    S0

Fig. 8. 4x4 array multiplier

For example, consider the multiplication of two unsigned n-bit numbers, where A = An-1, An-2, ..., A0 is the multiplicand and B = Bn-1, Bn-2, ..., B0 is the multiplier. For n = 4, the product of these two numbers can be written as S7, S6, S5, S4, S3, S2, S1, S0. Carry-save adders can also be used in the array multiplier to perform the multiplication.

E. WALLACE-TREE MULTIPLIER

The Wallace tree multiplier [18] is a parallel multiplier and offers faster performance for larger operands because its height is logarithmic in word size, not linear. It uses the carry-save addition algorithm to reduce the propagation delay.

Fig. 9. (a) Wallace tree structure for 8x8 multiplication. (b) Reorganized matrix of the Wallace tree.

In the Wallace tree multiplication, every possible bit in every column is covered by 3:2 (full adder) or 2:2 (half adder) compressors repeatedly until the partial product matrix is left with a depth of only 2. Thus, to compress the partial products, a Wallace tree multiplier uses more hardware in order to obtain the final product as quickly as possible. Fig. 9 shows the logic used for 8x8-bit Wallace tree multiplication and the tree structure organised according to the additions performed on the partial products.
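The modified Booth recoding of Section III-A can be sketched in software as follows. This is a minimal Python illustration under stated assumptions, not the Verilog design evaluated in the paper: the function names (`booth_radix4_digits`, `booth_multiply`) are hypothetical, the multiplier is assumed to be a signed two's-complement value with an even number of bits, and each digit is computed with the rule Qi-1 + Qi - 2 * Qi+1 from Table II.

```python
def booth_radix4_digits(q, bits):
    """Recode a signed `bits`-wide multiplier into radix-4 Booth digits.

    Digits are returned least-significant group first, each in {-2, -1, 0, 1, 2}.
    """
    if bits % 2:
        raise ValueError("bits must be even for radix-4 grouping")
    pattern = q & ((1 << bits) - 1)  # two's-complement bit pattern of q
    pattern <<= 1                    # the first group assumes a zero below the LSB
    digits = []
    for i in range(0, bits, 2):
        q_im1 = (pattern >> i) & 1        # Qi-1
        q_i = (pattern >> (i + 1)) & 1    # Qi
        q_ip1 = (pattern >> (i + 2)) & 1  # Qi+1
        digits.append(q_im1 + q_i - 2 * q_ip1)
    return digits


def booth_multiply(m, q, bits):
    """Multiply multiplicand m by multiplier q using bits/2 partial products."""
    product = 0
    for k, digit in enumerate(booth_radix4_digits(q, bits)):
        # Each partial product is 0, +-M or +-2M, weighted by 4**k.
        product += (digit * m) << (2 * k)
    return product


if __name__ == "__main__":
    print(booth_radix4_digits(6, 4))
    print(booth_multiply(-5, 6, 4))
```

A 4-bit multiplier produces two recoded digits and an 8-bit multiplier produces four, which is the halving of partial products described for the modified Booth algorithm.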
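The CSD-based constant multiplication of Section III-C can likewise be sketched in Python. This is an illustrative sketch only: the helper names (`csd_digits`, `csd_shift_terms`, `csd_multiply`) are assumptions, the constant is passed as an integer scaled by 2**frac_bits, and the data input is an integer, so each right shift truncates as it would in a simple fixed-point datapath.

```python
def csd_digits(c):
    """Canonical signed-digit form of a non-negative integer, LSB first.

    Each digit is -1, 0 or 1 and no two adjacent digits are both nonzero.
    """
    digits = []
    while c:
        if c & 1:
            digit = 2 - (c & 3)  # +1 if c mod 4 == 1, -1 if c mod 4 == 3
            c -= digit
        else:
            digit = 0
        digits.append(digit)
        c >>= 1
    return digits


def csd_shift_terms(c, frac_bits):
    """List of (sign, right_shift) pairs for the constant c / 2**frac_bits."""
    if not 0 <= c < (1 << frac_bits):
        raise ValueError("constant must satisfy 0 <= c < 2**frac_bits")
    return [(d, frac_bits - i) for i, d in enumerate(csd_digits(c)) if d]


def csd_multiply(y, c, frac_bits):
    """Shift-and-add product of integer y with the constant c / 2**frac_bits."""
    return sum(sign * (y >> shift) for sign, shift in csd_shift_terms(c, frac_bits))


if __name__ == "__main__":
    # 0.011000011 in binary is 195 / 512, i.e. c = 0b011000011 with 9 fraction bits.
    print(sorted(csd_shift_terms(0b011000011, 9), key=lambda term: term[1]))
    print(csd_multiply(1024, 0b011000011, 9))
```

For the constant 0.011000011 the sketch uses four shifted terms (shifts 1, 3, 7 and 9) instead of one term per nonzero bit of the plain binary form, which is the saving the CSD representation is meant to provide.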
IV. PROPOSED WORK

FFT processors are among the most important digital devices and play a valuable role in communication, signal processing, biomedical and other applications. There is always a trade-off between speed, area, power and precision, depending on the algorithm, architecture and computational elements used. In this work, a detailed study of FFT architectures and multipliers is carried out to obtain an efficient balance of speed, area and delay for FFT processors in various applications.

According to Dr. Oscar Gustafsson [10], ideally "one should select FFT algorithms based on the architecture, which in turn should be selected based on the processing applications". This work concentrates on the datapath components, i.e. the computational elements, of the radix-2 and radix-4 FFT algorithms. The main computational elements are multipliers and adders, of which the multipliers are the most important for implementing an area-, power- and delay-optimized FFT processor.

The presented work focuses on efficient, high-speed multipliers for better performance of the FFT processor. The FFT architecture is implemented using the multipliers described in Section III, and their effect on the area and delay of the FFT processor is compared.

V. IMPLEMENTATION AND RESULT ANALYSIS

The radix-2 and radix-4 FFT algorithms are implemented using different multipliers in Q-format on a Xilinx Virtex-6 xc6vlx760-2ff1760. Post-place-and-route simulation and synthesis analysis are performed using Verilog code in Xilinx ISE 14.7.

TABLE III. COMPARISON TABLE OF RADIX-2 DIF FFT WITH DIFFERENT MULTIPLIERS.

Parameter                    Array        Wallace tree   Vedic        Modified Booth   Constant
                             multiplier   multiplier     multiplier   multiplier       multiplier
Area: No. of slice LUTs      12,010       12,954         13,234       15,033           13,096
Delay: Logic                 6.36 ns      5.97 ns        6.67 ns      10.66 ns         6.92 ns
Delay: Routing               32.1 ns      29.55 ns       29.07 ns     38.42 ns         20.06 ns
Delay: Total                 38.4 ns      35.52 ns       35.74 ns     49.08 ns         26.98 ns

Table III and Table IV show the comparison of area and delay for the radix-2 DIF FFT and the radix-4 DIF FFT with different types of multiplier. There is a trade-off between the array multiplier, Wallace tree multiplier and constant multiplier: the array multiplier consumes very little area, but its delay is much higher than that of the constant multiplier. Hence, using a constant multiplier in the FFT architecture provides high speed with comparably low area utilization.

TABLE IV. COMPARISON TABLE OF RADIX-4 DIF FFT WITH DIFFERENT MULTIPLIERS.

Parameter                    Array        Wallace tree   Vedic        Modified Booth   Constant
                             multiplier   multiplier     multiplier   multiplier       multiplier
Area: No. of slice LUTs      14,237       16,207         14,198       16,273           15,080
Delay: Logic                 4.5 ns       4.2 ns         4.38 ns      6.54 ns          5.02 ns
Delay: Routing               19 ns        17.9 ns        17.52 ns     24.07 ns         14.37 ns
Delay: Total                 23.6 ns      22.18 ns       21.9 ns      30.62 ns         19.39 ns

From the comparison table above, it is observed that using a constant multiplier reduces the delay drastically, by up to about 35%, and the area by up to about 7%. These largest gaps are relative to the modified Booth multiplier (30.62 ns versus 19.39 ns total delay, and 16,273 versus 15,080 slice LUTs). Hence, using a constant multiplier in the FFT architecture provides high speed with comparably low area utilization.

VI. CONCLUSION

This work has presented an efficient computational unit for the FFT processor. The different multipliers described in the paper were used in the radix-2 and radix-4 FFT algorithms to analyze how area and delay depend on the efficiency of the multiplier. It was observed that using a constant multiplier in CSD format provides high speed and low area utilization. Post-place-and-route simulation and synthesis were performed using Verilog code in Xilinx ISE 14.7 and implemented on a Xilinx Virtex-6 xc6vlx760-2ff1760 FPGA.

References
[1] Ngoc-Hung Nguyen, Sheraz Ali Khan, Cheol-Hong Kim and Jong-Myon Kim, "An FPGA based implementation of pipelined FFT processor for high speed signal processing application," Springer International Publishing, LNCS 10216, pp. 81-89, 2017.
[2] J. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, pp. 297-301, Apr. 1965.
[3] W. M. Gentleman and G. Sande, "Fast Fourier transforms for fun and profit," in Proc. Joint Comput. Conf., vol. 29, Nov. 1966, pp. 563-578.
[4] J. Cooley, P. Lewis, and P. Welch, "Historical notes on the fast Fourier transform," Proc. IEEE, vol. 55, no. 10, pp. 1675-1677, Oct. 1967.
[5] A. Cortés, I. Vélez, and J. F. Sevillano, "Radix r^k FFTs: Matricial representation and SDC/SDF pipeline implementation," IEEE Trans. Signal Process., vol. 57, no. 7, pp. 2824-2839, Jul. 2009.
[6] E. Wold and A. Despain, "Pipeline and parallel-pipeline FFT processors for VLSI implementations," IEEE Transactions on Computers, vol. C-33, no. 5, pp. 414-426, 1984.
[7] Z.-G. Ma, X.-B. Yin, and F. Yu, "A novel memory-based FFT architecture for real-valued signals based on a Radix-2 decimation-in-frequency algorithm," IEEE Trans. Circ. Syst. II: Exp. Briefs, vol. 62, pp. 876-880, 2015.
[8] H.-F. Luo, Y.-J. Liu, and M.-D. Shieh, "Efficient memory-addressing algorithms for FFT processor design," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, pp. 2162-2172, 2015.
[9] Amir Kaivani and Seokbum Ko, "Floating point butterfly architecture based on binary signed digit representation," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, doi: 10.1109/TVLSI.2015.2437999.
[10] Oscar Gustafsson, Electronics Letters, vol. 49, no. 5, 28th February 2013, doi: 10.1049/el.2013.0549.
[11] Sandesh S. Saokar, R. M. Banakar, Saroja Siddamal, "High Speed Signed Multiplier for Digital Signal Processing Applications," 978-1-4673-1318-6/12/$31.00 ©2012 IEEE.
[12] David Villeger, Vojin G. Oklobdzija, "Evaluation of Booth Encoding Techniques for Parallel Multiplier Implementation," vol. 29, issue 23, 11 November 1993, ISSN 0013-5194.
[13] Aniruddha Kanhe, Shishir Kumar Das, and Ankit Kumar Singh, "Design and implementation of low power multiplier using vedic multiplication technique," International Journal of Computer Science and Communication, vol. 3, no. 1, pp. 131-132, 2012.
[14] Sree Nivas A, Kayalvizhi N, "Implementation of Power Efficient Vedic Multiplier," International Journal of Computer Applications (IJCA), ISSN 0975-8887, vol. 43, pp. 21-24, April 2012.
[15] Kyung-Ju Cho, Suhyun Jo, Yong-Eun Kim, Yi-Nan Xu, Jin-Gyun Chung, "Constant Multiplier Design using Specialized Bit Pattern Adders," 978-1-4244-2182-4/08/$25.00 ©2008 IEEE.
[16] R. Hartley, "Subexpression sharing in filters using canonic signed digit multipliers," IEEE Trans. Circuits & Syst. II, vol. 43, Oct. 1996.
[17] S. D. Pezaris, "A 40-ns 17-Bit by 17-Bit Array Multiplier," IEEE Trans. on Computers, pp. 442-447, Apr. 1971.
[18] Ron S. Waters, Earl E. Swartzlander, "A Reduced Complexity Wallace Multiplier Reduction," IEEE Transactions on Computers, vol. 59, no. 8, pp. 1134-1137, August 2010.
