
DESIGN OF FPGA BASED FAST MULTIPLIERS WITH OPTIMUM PLACEMENT & ROUTING USING FINITE FIELD ARCHITECTURE


Abstract:
As the characteristic dimension shrinks to the nanometer scale, the multiplication unit in modern
processors accounts for a growing share of combinational path delay and area. Existing logic
optimizations for the placement and routing of multiplication units primarily focus on reducing
the overhead, memory usage and delay of FPGAs.
In this work, a low-latency Montgomery multiplier over GF(2^m) based on general irreducible
polynomials is presented. An efficient algorithm decomposes the multiplication into a number of
independent units to facilitate parallel computation.
The proposed techniques are experimentally analyzed and validated on high-end Spartan-6
low-power FPGAs in 45 nm technology. Placement and routing results show that the proposed
approach outperforms the existing approach in terms of area, peak memory usage and delay. The
proposed scheme reduces the area overhead by 80-90%, the memory usage by 80-90% and the
delay by 60-80%.
The proposed method is analyzed on the ISCAS-85 circuit c6288 and is a promising option for
designing the multiplication unit of high-end processors.
Key Words: - FPGA, Finite field multiplier, logical optimization, polynomials.
Introduction:
A finite field, or Galois field, is a field that contains only finitely many elements. Finite fields
are classified by size, which is called the order of the field. A finite field is written GF(p^m),
where the letters GF stand for Galois field. The order (the cardinality, or number of elements) of
a finite field is always of the form p^m, where p is a prime number called the characteristic of
the field and m is a positive integer called the dimension of the field. Finite field arithmetic
operations in GF(2^m) are frequently needed in coding theory, cryptography and digital signal
processing, and hence in computers, cellular telephones, and digital audio/video equipment. In fact,
essentially any digital device used to handle speech, stereo, image, graphics, and multimedia
content contains one or more multiplier circuits. The multiplier circuits are usually integrated
within microprocessor, media co-processor, and digital signal processor chips. These multipliers
are used to perform a wide range of functions such as address generation, Discrete Cosine
Transformations (DCT), Fast Fourier Transforms (FFT), multiply-accumulate, and finite impulse
response (FIR) filters. As such, multipliers play a critical role in processing audio,
graphics, video, and multimedia data. A multiplying circuit can perform an n-bit x n-bit
multiplication at high speed by accelerating the formation of the partial products, so that the
delay does not grow excessively for large n and the chip size stays small. Multiplication is more
complicated than addition, being implemented by shifting as well as addition. Because of the
partial products involved in most multiplication algorithms, more time and more circuit area are
required to compute, allocate, and sum the partial products to obtain the multiplication result.
The multiplier is a part of the processor that is widely used in digital devices such as computers,
laptops, and mobile phones. It plays an important role in digital signal processing (DSP) and
image processing. Thus, multipliers with high speed and low power consumption are important.
For the last several decades, researchers have tried to develop multiplier design algorithms that
achieve both high speed and low power consumption. However, meeting both specifications
simultaneously has been difficult. Some algorithms are suited to high-speed multipliers, while
others are suited to reducing area usage.
The latter use less area in ASIC or FPGA implementations and consume little power.
On the other hand, Wallace, CSE (common sub-expression elimination), Karatsuba-based
Montgomery multipliers over GF(2^m) and Mastrovito-based Montgomery multipliers over
GF(2^m) [3] are high-speed multipliers that spend extra area on coding hardware. These multipliers
are often used in applications requiring high speed. In the present work, an attempt is made to
design a multiplier based on a finite field structure. In that structure, multiplication of binary
polynomials is implemented with a simple bit-shift and XOR based approach that exploits the
6-input LUTs of Xilinx FPGAs, so that the three specifications (high speed, low memory usage and
low area overhead) are met simultaneously.
The remainder of this paper is organized as follows. The new multiplication formula is presented
in Section II. In Section III, we present the architecture of the proposed multiplier using the new
multiplication formula. We finish our paper with a conclusion in Section IV.
Existing Method:
There are three popular types of bases over finite fields: polynomial basis (PB), normal basis
(NB) and dual basis (DB). A basis is a set of vectors that, in linear combination, can represent
every vector in a given vector space. A polynomial basis consists of the powers 1, α, α^2, ...,
α^(m-1) of a root α of the irreducible polynomial that defines the field. A normal basis in field
theory is a special kind of basis for Galois extensions of finite degree, characterized as forming
a single orbit under the Galois group. A dual basis is a set of vectors that forms a basis for the
dual space of a vector space. One advantage of the normal basis is that the squaring of an element
is computed by a cyclic shift of the binary representation. The dual basis multipliers require less
chip area than the other two types. The polynomial basis multipliers are widely used and lead to
efficient implementations of multipliers. Compared to multipliers in the other two bases, the
polynomial basis multipliers have low design complexity, and their sizes are easier to extend to
meet various applications due to their simplicity, regularity, and modularity in architecture. It
appears that polynomial multipliers for classes of trinomials still achieve the lowest circuit
complexity.

[Flow chart: Start -> Input: multiplicand, multiplier, mod value, R-integer value (A*B*R mod M)
-> Multiplication of binary polynomials based on the Wallace multiplier, common sub-expression
elimination, or the Mastrovito multiplier -> Montgomery multipliers over GF(2^m)
-> Parameters analyzed: area, memory and delay -> Stop]

Fig.1. Flow chart of conventional flow.

Addition and multiplication are the two basic arithmetic operations in the finite field GF(2^m).
Addition in GF(2^m) is easily realized using m two-input XOR gates, while multiplication is costly
in terms of gate count and time delay. The other operations of finite fields, such as
exponentiation, division and inversion, can be performed by repeated multiplications. As a result,
there is a need for a fast multiplication architecture with low complexity. The hardware/software
implementation efficiency of finite field arithmetic is measured in terms of the associated space
and time complexities. The space complexity is defined as the number of XOR and AND gates
needed to implement the circuit. The space and time complexities of a multiplier depend heavily
on how the field elements are represented. Finite field multipliers with different bases of
representation have been realized for various applications. The polynomial basis multipliers are
more efficient and more widely used than multipliers in the other bases of representation.

Wallace, CSE (common sub-expression elimination) and Mastrovito-based Montgomery multipliers
over GF(2^m) [3] are high-speed multipliers that spend extra area on coding hardware. These
multipliers are often used in applications requiring high speed. In the proposed work, an attempt
is made to design a multiplier based on a finite field structure. In that structure, multiplication
of binary polynomials is implemented with a simple bit-shift and XOR based approach and realized
as Montgomery multipliers over GF(2^m). The conventional flow is shown in Fig.1.

Wallace approach:
For real-time signal processing, a high-speed, high-throughput multiplier-accumulator (MAC) is
always key to achieving high performance in a digital signal processing system. The main
consideration in MAC design is to enhance its speed, and that speed is achieved through the
well-known Wallace tree multiplier. Wallace introduced a parallel multiplier architecture [9], [10]
to achieve high speed. The Wallace tree algorithm reduces the number of sequential adding
stages. The speed advantage becomes more pronounced for multipliers with operands wider than
16 bits. The Wallace tree is constructed from carry-save adders that reduce an N-row bit-product
matrix to an equivalent two-row matrix, which is then fed into a carry-propagating adder that
sums the two rows and produces the product. The carry-save adders are conventional full adders
[11] whose carries are not connected, so that three input bits are taken in and two bits are given
as output. Instead of carry-save adders, 4:2 and 3:2 compressors built from full adders and half
adders can be used in the reduction phase, as shown in Fig.2.

Fig.2. Wallace multiplier.

The Wallace tree has three steps:

1. Multiply (that is, AND) each bit of one of the arguments by each bit of the other,
yielding n^2 results. Depending on the position of the multiplied bits, the wires carry different
weights.
2. Reduce the number of partial products to two by layers of full and half adders.
3. Group the wires into two numbers, and add them with a conventional adder.[2]

The second phase works as follows. As long as there are three or more wires with the same
weight, add a following layer:

Take any three wires with the same weight and input them into a full adder. The result is
an output wire of the same weight and an output wire with a higher weight for each three
input wires.
If there are two wires of the same weight left, input them into a half adder.
If there is just one wire left, connect it to the next layer.
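
The three steps above can be sketched in software as a bit-level model. This is only an illustrative Python sketch, not the hardware description used in this work; the function name wallace_multiply and the column-list data structure are assumptions made for the example.

```python
def wallace_multiply(a: int, b: int, n: int) -> int:
    """Bit-level model of an n-bit x n-bit Wallace tree (illustrative only)."""
    # Step 1: AND every bit pair; columns[w] holds the wires of weight 2**w.
    columns = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            columns[i + j].append((a >> i) & (b >> j) & 1)

    # Step 2: add layers until no weight has more than two wires.
    while max(len(c) for c in columns) > 2:
        nxt = [[] for _ in range(len(columns) + 1)]
        for w, wires in enumerate(columns):
            k = 0
            while len(wires) - k >= 3:            # full adder on three wires
                x, y, z = wires[k:k + 3]
                nxt[w].append(x ^ y ^ z)                           # sum, same weight
                nxt[w + 1].append((x & y) | (x & z) | (y & z))     # carry, higher weight
                k += 3
            rest = wires[k:]
            if len(rest) == 2:                    # half adder on two wires
                nxt[w].append(rest[0] ^ rest[1])
                nxt[w + 1].append(rest[0] & rest[1])
            else:                                 # zero or one wire: pass through
                nxt[w].extend(rest)
        columns = nxt

    # Step 3: group the wires into two numbers and add them conventionally.
    row0 = sum(c[0] << w for w, c in enumerate(columns) if len(c) >= 1)
    row1 = sum(c[1] << w for w, c in enumerate(columns) if len(c) == 2)
    return row0 + row1
```

For instance, wallace_multiply(13, 11, 4) models a 4-bit by 4-bit multiplication and should agree with the ordinary integer product.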

Common Sub-Expression Elimination:


The design of multiplierless implementations (which use only adders, subtracters and binary
shifts) of fixed-point matrix multipliers is considered, and a new common sub-expression
elimination method is described that recursively extracts signed two-term common
sub-expressions. Examples are given that show that the resulting adder cost is significantly lower
than for existing algorithms.
E.g.: common digit patterns
F1 = 7*X = (0111)*X = X + (X<<1) + (X<<2)
F2 = 13*X = (1101)*X = X + (X<<2) + (X<<3)
D1 = X + (X<<2)
F1 = D1 + (X<<1)
F2 = D1 + (X<<3)
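
The example above can be checked with a few lines of Python. This is a minimal sketch; the function name times_7_and_13 is hypothetical and integer arithmetic stands in for the fixed-point adders.

```python
def times_7_and_13(x: int):
    """Compute 7*x and 13*x with shifts and three additions (illustrative)."""
    d1 = x + (x << 2)      # shared sub-expression: digit pattern 101, i.e. 5*x
    f1 = d1 + (x << 1)     # 5x + 2x = 7x
    f2 = d1 + (x << 3)     # 5x + 8x = 13x
    return f1, f2
```

Without sharing, F1 and F2 each need two additions (four in total); with the shared term D1 the pair needs three.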

Mastrovito Approach:
The Mastrovito multiplier is a class of parallel multipliers over Galois fields based on polynomial
basis representations. The polynomial basis of GF(p^m) is given as a vector
(α^(m-1), α^(m-2), ..., α^0). As an example, consider the Galois field GF(2^8) obtained from the
irreducible polynomial IP(x) = x^8+x^4+x^3+x+1. When IP(α) = 0, the vector (α^7, α^6, ..., α^0) is
a polynomial basis. Fig.3 shows the architecture of a Mastrovito multiplier over GF, which consists
of a matrix generator and a matrix operator. The matrix generator first generates a matrix
determined by the remainders obtained when the multiplicand, multiplied by each basis element,
is divided by the irreducible polynomial. The matrix operator then multiplies the generated
matrix by the multiplier (i.e., the other input) and produces the product. Such a Mastrovito
multiplier is known as a GF parallel multiplier with minimal area cost.

Fig.3. Mastrovito multiplier.
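
The matrix generator / matrix operator split can be illustrated with a small Python model. This is a sketch under stated assumptions, not the circuit of Fig.3: field elements are packed into integers (bit i is the coefficient of x^i), and the names mastrovito_matrix and mastrovito_multiply are hypothetical.

```python
def mastrovito_matrix(a: int, m: int, poly: int):
    """Matrix generator: column i is a * x^i mod poly, packed as an integer."""
    cols = []
    v = a
    for _ in range(m):
        cols.append(v)
        v <<= 1                    # multiply by x
        if (v >> m) & 1:           # degree reached m: subtract (XOR) the modulus
            v ^= poly
    return cols

def mastrovito_multiply(a: int, b: int, m: int, poly: int) -> int:
    """Matrix operator: XOR together the columns selected by the bits of b."""
    cols = mastrovito_matrix(a, m, poly)
    c = 0
    for i in range(m):
        if (b >> i) & 1:
            c ^= cols[i]
    return c
```

With IP(x) = x^8+x^4+x^3+x+1 written as 0x11B, mastrovito_multiply(0x57, 0x83, 8, 0x11B) gives 0xC1, the well-known worked example for this field.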

Proposed Method:
In this paper, we consider the polynomial basis (PB), the most widely used, and irreducible
trinomials for the design of bit-parallel finite field multipliers. In bit-parallel designs, a
complete operand word is processed in every cycle: the bits of the input multiplicands are fed in
parallel and the bits of the output product word are also obtained in parallel. Multiplication in
the PB is often accomplished by a two-step algorithm consisting of polynomial multiplication and
modular reduction. The finite field GF based Montgomery approach yields a bit-parallel multiplier
that combines these two steps to reduce time complexity [5], [6]. The speed of the multiply
operation is of great importance in digital signal processing as well as in general-purpose
processors. In this paper, we present a new multiplication formula which is a variant of the
Karatsuba algorithm and a straightforward architecture of a non-pipelined bit-parallel multiplier
using the proposed formula. The proposed formula combines the bit-shift and XOR based approach
with Montgomery multiplication over GF(2^m), which makes the peak memory usage, area overhead
and time complexity of the multiplier lower than those of the existing multipliers.

[Flow chart: Start -> Input: multiplicand, multiplier, mod value, R-integer value (A*B*R mod M)
-> Multiplication of binary polynomials based on the bit-shift and XOR based approach
-> Montgomery multipliers over GF(2^m) -> Parameters analyzed: area, memory and delay -> Stop]

Fig.4. Flow chart of proposed flow.
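
To make the combination of bit-shift, XOR and Montgomery reduction concrete, the following Python sketch shows the textbook bit-serial Montgomery multiplication over GF(2^m). It is not the Karatsuba-variant formula or the bit-parallel architecture proposed in this paper; the function names and the integer packing of polynomials (bit i is the coefficient of x^i) are assumptions for illustration.

```python
def gf2m_montgomery(a: int, b: int, g: int, m: int) -> int:
    """Return a * b * x^(-m) mod g over GF(2), using only shifts and XORs.

    g is the irreducible polynomial of degree m; a and b have degree < m.
    """
    c = 0
    for i in range(m):
        if (a >> i) & 1:
            c ^= b          # add the partial product a_i * b
        if c & 1:
            c ^= g          # add g so that the constant term becomes 0
        c >>= 1             # exact division by x
    return c

def x_power_2m(g: int, m: int) -> int:
    """x^(2m) mod g, the constant used to enter Montgomery form."""
    r = 1
    for _ in range(2 * m):
        r <<= 1
        if (r >> m) & 1:
            r ^= g
    return r

def gf2m_multiply(a: int, b: int, g: int, m: int) -> int:
    """Ordinary product a * b mod g, computed through Montgomery form."""
    r2 = x_power_2m(g, m)
    a_bar = gf2m_montgomery(a, r2, g, m)      # a * x^m mod g
    b_bar = gf2m_montgomery(b, r2, g, m)      # b * x^m mod g
    p_bar = gf2m_montgomery(a_bar, b_bar, g, m)
    return gf2m_montgomery(p_bar, 1, g, m)    # leave Montgomery form
```

The reduction never needs a polynomial division: each iteration clears the lowest bit with one conditional XOR and then shifts right by one position.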

If the modulus g(x) is an irreducible polynomial of degree m over GF(p), then the finite field
GF(p^m) can be constructed as the set of polynomials over GF(p) whose degree is at most m-1,
where addition and multiplication are done modulo g(x). For example, the finite field GF(3^2) can
be constructed as the set of polynomials whose degrees are at most 1, with addition and
multiplication done modulo the irreducible polynomial x^2+1 (you can also choose another
modulus, as long as it is irreducible and has degree 2). Finite fields of order 2^m are called binary
fields or characteristic-two finite fields. They are of special interest because they are particularly
efficient for implementation in hardware, or on a binary computer.
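
The GF(3^2) construction can be checked numerically. In this illustrative Python sketch an element c0 + c1*x is stored as the pair (c0, c1); the names add, mul and elements are assumptions made for the example.

```python
P = 3  # characteristic of the base field GF(3)

def add(u, v):
    """Coefficient-wise addition modulo 3."""
    return ((u[0] + v[0]) % P, (u[1] + v[1]) % P)

def mul(u, v):
    """Multiplication modulo x^2 + 1, i.e. using x^2 = -1."""
    a0, a1 = u
    b0, b1 = v
    return ((a0 * b0 - a1 * b1) % P, (a0 * b1 + a1 * b0) % P)

# All polynomials of degree at most 1 over GF(3): 3^2 = 9 elements.
elements = [(c0, c1) for c0 in range(P) for c1 in range(P)]
```

Because x^2+1 is irreducible over GF(3), every non-zero pair has a multiplicative inverse, which is what makes the set a field.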
Formulae-1 + Formulae-2 = Proposed finite field GF based Montgomery.
Formulae 1: The elements of GF(2^m) are binary polynomials, i.e. polynomials whose coefficients
are either 0 or 1. There are 2^m such polynomials in the field and the degree of each polynomial is
no more than m-1. Therefore the elements can be represented as m-bit strings. Each bit in the bit
string corresponds to the coefficient in the polynomial at the same position. For example, GF(2^3)
contains 8 elements {0, 1, x, x+1, x^2, x^2+1, x^2+x, x^2+x+1}. x+1 is actually 0x^2+1x+1, so it
can be represented as the bit string 011. Similarly, x^2+x = 1x^2+1x+0, so it can be represented
as 110.
In modulo 2 arithmetic, 1+1 ≡ 0 mod 2, 1+0 ≡ 1 mod 2 and 0+0 ≡ 0 mod 2, which coincide with
bit-XOR, i.e. 1 XOR 1 = 0, 1 XOR 0 = 1, 0 XOR 0 = 0. Therefore for binary polynomials, addition is
simply bit-by-bit XOR. Also, in modulo 2 arithmetic, -1 ≡ 1 mod 2, so the result of subtraction
of elements is the same as addition. For example:

(x^2+x+1) + (x+1) = x^2+2x+2; since 2 ≡ 0 mod 2, the final result is x^2. It can also be
computed as 111 XOR 011 = 100. 100 is the bit string representation of x^2.
(x^2+x+1) - (x+1) = x^2
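
The addition example maps directly onto integer XOR. A minimal Python sketch, assuming polynomials are packed into integers with bit i as the coefficient of x^i (the helper names gf2_add and poly_str are hypothetical):

```python
def gf2_add(u: int, v: int) -> int:
    """Add (or subtract) two binary polynomials: bit-by-bit XOR."""
    return u ^ v

def poly_str(bits: int) -> str:
    """Render a packed binary polynomial, e.g. 0b110 -> 'x^2+x'."""
    terms = []
    for d in range(bits.bit_length() - 1, -1, -1):
        if (bits >> d) & 1:
            terms.append("1" if d == 0 else "x" if d == 1 else f"x^{d}")
    return "+".join(terms) or "0"
```

Here gf2_add(0b111, 0b011) returns 0b100, which poly_str renders as x^2, matching the example above.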

Multiplication of binary polynomials can be implemented as simple bit-shift and XOR. For
example:

(x^2+x+1)*(x^2+1) = x^4+x^3+2x^2+x+1. The final result is x^4+x^3+x+1 after reduction modulo 2.
It can also be computed as 111*101 = 11011, which is exactly the bit string representation
of x^4+x^3+x+1.
In GF(2^m), when the degree of the result is more than m-1, it needs to be reduced modulo
an irreducible polynomial. This can be implemented as bit-shift and XOR. For example,
x^3+x+1 is an irreducible polynomial and x^4+x^3+x+1 ≡ x^2+x mod (x^3+x+1). The bit-string
representation of x^4+x^3+x+1 is 11011 and the bit-string representation of x^3+x+1 is
1011. The degree of 11011 is 4 and the degree of the irreducible polynomial is 3, so the
reduction starts by shifting the irreducible polynomial 1011 one bit left, which gives 10110;
then 11011 XOR 10110 = 1101. The degree of 1101 is 3, which is still greater than m-1 = 2,
so another XOR is needed, this time without shifting the irreducible polynomial:
1101 XOR 1011 = 0110, which is the bit-string representation of x^2+x.
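
The multiplication and reduction steps just described can be sketched in Python as follows. The names clmul and poly_reduce are assumptions for this illustration, and polynomials are again packed into integers.

```python
def clmul(u: int, v: int) -> int:
    """Carry-less (GF(2)) polynomial product using shifts and XORs."""
    r = 0
    while v:
        if v & 1:
            r ^= u      # add the current shifted copy of u
        u <<= 1
        v >>= 1
    return r

def poly_reduce(value: int, modulus: int) -> int:
    """Reduce value modulo the given polynomial by aligned XORs."""
    deg_m = modulus.bit_length() - 1
    while value.bit_length() - 1 >= deg_m:
        shift = (value.bit_length() - 1) - deg_m
        value ^= modulus << shift   # cancel the current leading term
    return value
```

With these helpers, clmul(0b111, 0b101) gives 0b11011 and poly_reduce(0b11011, 0b1011) gives 0b110, reproducing the two XOR steps of the worked example.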

Formulae 2: Multiplications use Montgomery reduction

Pick some R = 2^k.
To compute x*y mod q, convert x and y into their Montgomery forms xR mod q
and yR mod q.
Compute (xR * yR) * R^{-1} = zR mod q, where z = x*y.
Multiplication by R^{-1} can be done very efficiently.
Problem: Given A, B, M, compute AB mod M.
Idea: Work in an isomorphic ring:
A → AR mod M and B → BR mod M
Need a way to compute ABR mod M.
Solution: define the Montgomery product (x, y) →_M (xy)/R mod M.
T = (AR mod M)(BR mod M)
A multiple of M can be added to T, since the result is taken mod M.
Choose x so that T + xM ≡ 0 mod R; therefore x = -M^{-1}T mod R.
(AR, BR) →_M (T + (-M^{-1}T mod R)M)/R = ABR mod M
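
The reduction steps above translate into a short integer routine. The following Python sketch assumes an odd modulus M smaller than R = 2^k; the function names are hypothetical, and pow(M, -1, R) requires Python 3.8 or later.

```python
def montgomery_setup(M: int, k: int) -> int:
    """Return -M^{-1} mod R for R = 2**k (M must be odd and less than R)."""
    R = 1 << k
    if M % 2 == 0 or M >= R:
        raise ValueError("M must be odd and smaller than R")
    return (-pow(M, -1, R)) % R

def redc(T: int, M: int, k: int, m_neg_inv: int) -> int:
    """Montgomery reduction: return T / R mod M for 0 <= T < R*M."""
    R_mask = (1 << k) - 1
    x = (T * m_neg_inv) & R_mask      # x = -M^{-1} T mod R
    t = (T + x * M) >> k              # T + xM is divisible by R, so the shift is exact
    return t - M if t >= M else t

def mont_mul(a_bar: int, b_bar: int, M: int, k: int, m_neg_inv: int) -> int:
    """(AR, BR) -> ABR mod M."""
    return redc(a_bar * b_bar, M, k, m_neg_inv)
```

As a usage example with M = 13 and k = 4 (R = 16): A = 7 and B = 5 have Montgomery forms 8 and 2, mont_mul returns 1 (which is 35*16 mod 13), and a final redc of that value returns 9 = 35 mod 13.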

A. Xilinx/ISE Simulations
The proposed multiplier and its corresponding blocks are described in structural VHDL and
synthesized with the Xilinx Synthesis Tool (XST), WebPACK version 14.2. The implementation
targets a Xilinx Spartan-6 low-power device (selected device: 6slx9ltqg144-1l).
The logical routing can be observed in the place-and-route result shown by the FPGA Editor
option of the Xilinx synthesizer. It is observed that about 40% of the area of the targeted FPGA
is covered by the implementation of this system. The CLBs are connected in cascade to obtain the
functionality of the designed system. To ensure that the hardware implementation works
properly, a simulation test was performed using ISim (O.76.xd).
B. Impact of the Proposed Flow on peak memory usage, Timing and Area
In this paper, the conventional approach and the proposed method are analyzed based on the cost
function of the placer and router. As shown in Tables I, II and III, the number of LUTs and
CLBs is reduced in the proposed flow because the design uses fewer adder circuits.
As Table I shows:
In the conventional approach, routing accounts for 42.5% of the critical path delay (in the case
of the minimum route channel width) and 38.3% (in the case of the route channel width being
30% larger than the minimum), while in the proposed method routing accounts for 31.1% of the
critical path delay. In terms of overhead, the conventional approach and the proposed method
only change the placement and routing of the design; the usage of the CLBs (configurable logic
blocks) varies, which gives the proposed method lower overhead and delay than the existing
approach. In addition, no unreachable CLBs are reported by the original method or the
proposed method, which helps to overcome the limitation of the original approach. Hence, both
the conventional approach and the proposed method sustain the CLB overhead.
As Table II shows:
The conventional approach needs 29.979 ns of logic delay and 22.170 ns of route delay (57.5%
logic, 42.5% route) when using the minimum route channel width, and 13.343 ns logic and
8.289 ns route (61.7% logic, 38.3% route) when using a route channel width 30% larger
than the minimum. In contrast, the proposed flow needs 6.723 ns logic and 3.035 ns route (68.9%
logic, 31.1% route) when using the minimum route channel width. The lower delay
comes from the smaller number of glitches when the carry propagates more quickly through the
logic.
In this work the main target is to use the GF algorithm together with Montgomery multiplication
in conjunction with combinational logic, which results in less area, memory and delay (as shown
in Tables I and II).
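
The logic/route percentages quoted with the delay figures can be reproduced from the nanosecond values. A small Python sketch, in which the helper name delay_split is an assumption:

```python
def delay_split(logic_ns: float, route_ns: float):
    """Return (total delay in ns, logic share in %, route share in %)."""
    total = logic_ns + route_ns
    return total, 100.0 * logic_ns / total, 100.0 * route_ns / total
```

For example, delay_split(6.723, 3.035) gives a total of about 9.758 ns with a route share that rounds to 31.1%, in agreement with the figures reported for the proposed flow.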
C. Implementation in FIR Filters:
To show the performance further, the conventional and proposed architectures are
implemented in finite impulse response (FIR) filters. FIR filters are of great importance in digital
signal processing (DSP) systems, since their linear-phase and feed-forward characteristics make
them very useful for building stable high-performance filters. The direct-form and
transposed-form FIR filter implementations are illustrated in Fig.5 (a) and (b), respectively.

Although both architectures have similar hardware complexity, the transposed form is
generally preferred because of its higher performance and power efficiency [1]. In FIR filters, the
logical routing can be observed in the place-and-route result shown by the FPGA Editor
option of the Xilinx synthesizer. It is observed that about 40% of the area of the targeted FPGA
is covered by the implementation of this system. The CLBs are connected in cascade to obtain
the functionality of the designed system. To ensure that the hardware implementation works
properly, a simulation test was performed using ISim (O.76.xd).

Fig.5. (a) Direct form (b) Transposed form
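
The two structures of Fig.5 compute the same output and differ only in where the delay registers sit. The following Python sketch models both behaviourally; it is an illustration rather than the implemented hardware, and the names fir_direct, fir_transposed, h (coefficients) and x (input samples) are assumptions.

```python
def fir_direct(x, h):
    """Direct form: delay line on the input, then multiply and sum all taps."""
    delay = [0] * len(h)
    out = []
    for s in x:
        delay = [s] + delay[:-1]                  # shift the new sample in
        out.append(sum(c * d for c, d in zip(h, delay)))
    return out

def fir_transposed(x, h):
    """Transposed form: every tap multiplies the current sample; registers
    sit between the adders and carry partial sums toward the output."""
    n = len(h)
    state = [0] * (n - 1)                         # partial sums from the previous cycle
    out = []
    for s in x:
        out.append(h[0] * s + (state[0] if state else 0))
        state = [h[k] * s + (state[k] if k < n - 1 else 0) for k in range(1, n)]
    return out
```

In the transposed form each multiplier sees the current input sample directly, so the longest combinational path is one multiplier and one adder regardless of the number of taps.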

Conclusion:
Finite field multipliers play a very important role in digital communication, especially in
cryptography, error control coding and digital signal processing.
This paper presents a new approach for the design and hardware realization of multipliers based
upon the concept of finite field Montgomery multiplication over GF(2^m). The technique is applied
to sub-multiplication blocks, which are the fundamental blocks of the Montgomery architecture.
The sub-multiplication blocks are optimized by employing 6-input LUTs and multiplexers within
the same slices or CLBs. By setting placement and design goal strategies, the proposed
algorithm bridges the gap between the placement and routing guidance metric and the
reliability evaluation metric, which in turn reduces the area overhead, memory and delay of the
circuit. Finally, it may be mentioned that in the present work the hardware realization is based on
FPGAs. The circuits were experimentally analyzed and synthesized using standard library cells
and implemented on a Spartan-6 low-power board with 45 nm technology. It is suggested that the
proposed multiplier technique may be considered for transistor-level optimization using
full-custom implementation in order to get the best performance and to allow the multiplier to be
scaled easily.
