
Chapter 1

INTRODUCTION
1.1 INTRODUCTION
Multiplication is a fundamental function in arithmetic operations. Multiplication-based operations such as Multiply and Accumulate (MAC) and inner product are among the frequently used Computation Intensive Arithmetic Functions (CIAF) currently implemented in many Digital Signal Processing (DSP) applications such as convolution, Fast Fourier Transform (FFT) and filtering, and in the arithmetic and logic units of microprocessors [1]. The demand for high speed multipliers is increasing with the need for high speed processors. Higher throughput arithmetic operations are important to achieve the desired performance in many real time signal and image processing applications [2]. One of the key arithmetic operations in such applications is multiplication, and the development of fast multiplier circuits has been a subject of interest for decades. Multiplication based on Vedic mathematics yields one of the fast and low power multipliers. Vedic mathematics is the name given to the ancient system of mathematics which was rediscovered from the ancient Indian scriptures between 1911 and 1918 by Jagadguru Swami Sri Bharati Krisna Tirthaji (1884-1960), a scholar of Sanskrit, mathematics, history and philosophy. Vedic mathematics is mainly based on sixteen principles termed Sutras, which are word formulae describing natural ways of solving a whole range of mathematical problems [3]. A simple digital multiplier (referred to henceforth as the Vedic multiplier) architecture based on the Urdhva Tiryakbhyam (Vertically and Crosswise) Sutra is presented [4]. It is equally likely that many similar technical applications might emerge from the storehouse of knowledge, the Veda, if investigated properly.

1.2 OBJECTIVES
Digital multipliers are among the most commonly used components in digital circuit design. Depending upon the arrangement of their components, different types of multipliers are available, and a particular multiplier architecture is chosen based on the application [5]. In many DSP algorithms the multiplier lies in the critical delay path and ultimately determines the performance of the algorithm. The speed of the multiplication operation is of great importance in DSP as well as in general purpose processors. In the past, multiplication was generally implemented with a sequence of addition, subtraction and shift operations. Many multiplication algorithms have been proposed in the literature, each offering different advantages and trade-offs in terms of speed, circuit complexity, area and power consumption. The multiplier is a fairly large block of a computing system: the amount of circuitry involved is roughly proportional to the square of its resolution, i.e. a multiplier of size n bits has on the order of n^2 gates. For multiplication algorithms in DSP applications, latency and throughput are the two major concerns from a delay perspective. Latency is the real delay of computing a function: a measure of how long after the inputs to a device are stable the final result is available on the outputs. Throughput is a measure of how many multiplications can be performed in a given period of time. The multiplier is not only a high delay block but also a major source of power dissipation. That is why, if one also aims to minimize power consumption, it is of great interest to reduce the delay through various delay optimizations. So, the main objective is to reduce delay, area and power consumption. To achieve these goals, an 8x8 Vedic multiplier is implemented.

1.3 PROBLEM SPECIFICATION
In digital hardware, the two multiplication algorithms commonly followed are the array multiplication algorithm and the Booth multiplication algorithm. The computation time taken by the array multiplier is comparatively less because the partial products are calculated independently in parallel; the delay associated with the array multiplier is the time taken by the signals to propagate through the gates that form the multiplication array. In Booth multiplication, large Booth arrays are required for high speed multiplication and exponential operations, which in turn require large partial sum and partial carry registers. Multiplication of two n-bit operands using a radix-4 Booth recoding multiplier requires approximately n / (2m) clock cycles to generate the least significant half of the final product, where m is the number of Booth recoder adder stages. Thus, a large propagation delay is associated with this case. So, in order to achieve a reduced delay multiplier, the Urdhva Tiryakbhyam Sutra is first applied to the binary number system and is used to develop a digital multiplier architecture. This architecture is shown to be very similar to the popular array multiplier architecture. The Sutra also shows how an NxN multiplier structure can be effectively reduced to efficient 4x4 multiplier structures.

1.4 METHODOLOGIES
To design a fast and low power multiplier, the following steps are to be performed:
1. To perform a literature survey of different multiplier techniques and to identify the disadvantages of each of these multipliers.
2. To determine the factors involved in the speed of the different multipliers using the XILINX ISE 10.1 simulation software.
3. To carry out a detailed study in order to increase the speed of the proposed multiplier.
4. To acquire basic knowledge of EDA tools and Field Programmable Gate Arrays (FPGAs).

1.5 LAYOUT OF THE PROJECT


This thesis is organized into seven subsequent chapters as follows. Chapter II describes the importance of Vedic mathematics and the methodology of conventional multipliers. Chapter III gives the basic methodology of the Vedic multiplication technique. Chapter IV gives basic knowledge of the software and the hardware description language used. Chapter V gives a basic description of the design methodology and design flow of an FPGA, and how the design is implemented on an FPGA (Spartan 3E). Chapter VI describes the design and implementation of the Vedic multiplier module in Xilinx ISE 10.1. Chapter VII comprises the results and discussion, in which the device utilization summary and the computational path delay obtained for the proposed Vedic multiplier (after synthesis) are discussed. Finally, Chapter VIII presents the conclusion.

Chapter 2

LITERATURE SURVEY
2.1 VEDIC MATHEMATICS
Many Indian secondary school students consider Mathematics a very difficult subject. Some students encounter difficulty with basic arithmetical operations; some find it difficult to manipulate symbols and balance equations. In other words, abstract and logical reasoning is their hurdle. An experienced teacher of Mathematics could prepare a long list of such learning difficulties, and volumes have been written on the diagnosis of learning difficulties related to Mathematics and on remedial techniques. Learning Mathematics is an unpleasant experience for some students mainly because it involves mental exercise. Of late, a few teachers and scholars have revived interest in Vedic Mathematics, a system derived from Vedic principles. Vedic mathematics is the name given to the ancient system of mathematics which was rediscovered from the Vedas. To be more specific, it originated from the Atharva Veda, the fourth Veda. The Atharva Veda deals with branches such as engineering, mathematics, sculpture, medicine and the other sciences we are aware of today. It is a unique technique of calculation based on simple principles and rules, with which any mathematical problem, be it arithmetic, algebra, geometry or trigonometry, can be solved mentally [6]. Swami Bharati Krishna Tirthaji Maharaj, Shankaracharya of Goverdhan Peath, collected the lost formulae from the Atharva Veda and wrote them in the form of sixteen mathematical formulae or Sutras, and thirteen sub-Sutras, with corollaries derived from the Vedas. Vedic mathematics introduces wonderful applications to arithmetic computations, theory of numbers, compound multiplications, algebraic operations, factorizations, simple quadratic and higher order equations, simultaneous quadratic equations, partial fractions, squaring, cubing, coordinate geometry and the wonderful Vedic numerical code.
Uses of Vedic mathematics:
1. It helps a person solve mathematical problems 10-15 times faster
2. It helps in intelligent guessing
3. It reduces burden (one needs to learn tables only up to nine)
4. It provides one-line answers and helps in reducing silly mistakes
5. It is a magical tool to reduce scratch work and finger counting
6. It increases and improves concentration
7. It enhances the logical thinking process
Vedic mathematics Sutras:
1. Anurupye Shunyamanyat - If one is in ratio, the other is zero.
2. Chalana-Kalanabyham - Differences and similarities.
3. Ekadhikina Purvena - By one more than the previous one.
4. Ekanyunena Purvena - By one less than the previous one.
5. Gunakasamuchyah - The factors of the sum are equal to the sum of the factors.
6. Gunitasamuchyah - The product of the sum is equal to the sum of the products.
7. Nikhilam Navatashcaramam Dashatah - All from 9 and the last from 10.
8. Paraavartya Yojayet - Transpose and adjust.
9. Puranapuranabyham - By the completion or non-completion.
10. Sankalana-vyavakalanabhyam - By addition and by subtraction.
11. Shesanyankena Charamena - The remainders by the last digit.
12. Shunyam Saamyasamuccaye - When the sum is the same, that sum is zero.
13. Sopaantyadvayamantyam - The ultimate and twice the penultimate.
14. Urdhva Tiryakbhyam - Vertically and crosswise.
15. Vyashtisamanstih - Part and whole.
16. Yaavadunam - Whatever the extent of its deficiency.

Using one of these principles of Vedic mathematics, the Vedic multiplier is implemented.

2.2 MULTIPLIERS
Multipliers play an important role in today's digital signal processing and various other applications. With advances in technology, many researchers have tried and are trying to design multipliers which offer one or more of the following design targets: high speed, low power consumption, regularity of layout (and hence small area), or a combination of these in one multiplier, thus making them suitable for various high speed, low power and compact VLSI implementations.
The common multiplication method is the add-and-shift algorithm. In parallel multipliers, the number of partial products to be added is the main parameter that determines the performance of the multiplier. To reduce the number of partial products to be added, the Modified Booth algorithm is one of the most popular algorithms. To achieve speed improvements, the Wallace Tree algorithm can be used to reduce the number of sequential adding stages. Further, by combining the Modified Booth algorithm and the Wallace Tree technique, we can obtain the advantages of both algorithms in one multiplier. However, with increasing parallelism, the amount of shifting between the partial products and intermediate sums to be added increases, which may result in reduced speed, an increase in silicon area due to irregularity of structure, and increased power consumption due to the extra interconnect resulting from complex routing. On the other hand, serial-parallel multipliers compromise speed to achieve better area and power consumption. The selection of a parallel or serial multiplier actually depends on the nature of the application.
Multiplication algorithm [7] (a code sketch follows the steps):
1. If the LSB of the multiplier is 1, then add the multiplicand into an accumulator.
2. Shift the multiplier one bit to the right and the multiplicand one bit to the left.
3. Stop when all bits of the multiplier are zero.
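As an illustration, here is a minimal behavioral Verilog sketch of the three steps above. The module and signal names are our own, and the loop runs a fixed N iterations, which is equivalent to stopping once the remaining multiplier bits are all zero.

    module shift_add_mult #(parameter N = 8) (
        input  [N-1:0]       a,   // multiplicand
        input  [N-1:0]       b,   // multiplier
        output reg [2*N-1:0] p    // product (accumulator)
    );
        integer i;
        reg [2*N-1:0] mcand;      // multiplicand, widened so left shifts do not overflow
        reg [N-1:0]   mplier;
        always @* begin
            p      = 0;
            mcand  = a;
            mplier = b;
            for (i = 0; i < N; i = i + 1) begin
                if (mplier[0])            // step 1: LSB of the multiplier is 1,
                    p = p + mcand;        //         so add the multiplicand into the accumulator
                mplier = mplier >> 1;     // step 2: shift the multiplier right
                mcand  = mcand  << 1;     //         and the multiplicand left
            end
        end
    endmodule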
Types of multipliers:
1. Serial multiplier:

Where area and power are of utmost importance and delay can be tolerated, the serial multiplier is used. This circuit uses one adder to add the M x N partial products. The multiplicand and multiplier inputs have to be arranged in a special manner, synchronized with the circuit behavior. The inputs could be presented at different rates depending on the lengths of the multiplicand and multiplier. This approach is not suitable for large values of M or N.

Fig 2.1 Serial multiplier


2. Serial/Parallel Multiplier:

Fig 2.2 Serial/parallel multiplier

The general architecture of this multiplier is shown in figure 2.2. To overcome the disadvantages of the serial and parallel multipliers, the serial-parallel multiplier was devised; it combines the advantages of both. One operand is fed to the circuit in parallel while the other is fed serially. N partial products are formed each cycle, and on successive cycles each cycle performs the addition of one column of the multiplication table of M x N partial products. The final result is stored in the output register after N+M cycles, while the area required is N-1 for M=N.
3. Shift and Add Multiplier:
Depending on the value of the multiplier LSB, the multiplicand is added and accumulated. At each clock cycle the multiplier is shifted one bit to the right and its value is tested. If it is a 0, then only a shift operation is performed. If the value is a 1, then the multiplicand is added to the accumulator and the result is shifted one bit to the right.

Fig 2.3 Shift and add multiplier


After all the multiplier bits have been tested, the product is in the accumulator. The accumulator is 2N (M+N) bits in size and initially its N LSBs contain the multiplier. The delay is N cycles maximum. This circuit has several advantages in asynchronous circuits.
4. Array Multiplier:
The array multiplier is well known for its regular structure. The multiplier circuit is based on the add-and-shift algorithm. Each partial product is generated by the multiplication of the multiplicand with one multiplier bit. The partial products are shifted according to their bit orders and then added. The addition can be performed with a normal carry propagate adder; N-1 adders are required, where N is the multiplier length. Although the method is simple, the addition proceeds serially as well as in parallel.

Fig 2.4 Array multiplier


To improve on delay and area, the CRAs are replaced with carry save adders, in which every carry and sum signal is passed to the adders of the next stage. The final product is obtained in a final fast adder (usually a carry ripple adder). In array multiplication we need to add as many partial products as there are multiplier bits. Total area = (N-1) x M x (area of one FA); delay = 2(M-1) x (delay of one FA).
5. Booth Multiplier:

The Booth algorithm is a powerful algorithm for signed-number multiplication which treats both positive and negative numbers uniformly. In the standard add-shift operation, each multiplier bit generates one multiple of the multiplicand to be added to the partial product. If the multiplier is very large, a large number of multiplicand multiples have to be added, and the delay of the multiplier is then determined mainly by the number of additions to be performed. If there is a way to reduce the number of additions, the performance will improve. The Booth algorithm [8] is a method that reduces the number of multiplicand multiples. For a given range of numbers to be represented, a higher representation radix leads to fewer digits. Since a k-bit binary number can be interpreted as a k/2-digit radix-4 number, a k/3-digit radix-8 number, and so on, high radix multiplication can deal with more than one bit of the multiplier in each cycle.

Fig 2.5 Booth multiplier
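As an aside, a minimal sketch of radix-4 Booth recoding in Verilog is given below; the module and port names are our own. Each overlapping group of three multiplier bits selects one of the five multiples {-2, -1, 0, +1, +2} of the multiplicand, so only about k/2 partial products are generated for a k-bit multiplier.

    module booth_pp #(parameter N = 8) (
        input  signed [N-1:0]   m,    // multiplicand
        input         [2:0]     grp,  // multiplier bits {b(i+1), b(i), b(i-1)}
        output reg signed [N:0] pp    // selected multiple, one bit wider for +/-2*m
    );
        always @* begin
            case (grp)
                3'b000, 3'b111: pp = 0;          //  0 * m
                3'b001, 3'b010: pp = m;          // +1 * m
                3'b011:         pp = m <<< 1;    // +2 * m
                3'b100:         pp = -(m <<< 1); // -2 * m
                default:        pp = -m;         // -1 * m (cases 101, 110)
            endcase
        end
    endmodule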


6. Wallace Tree Multiplier:
Several popular and well-known schemes with the objective of improving the speed of the parallel multiplier have been developed in the past. Wallace introduced a very important iterative realization of the parallel multiplier [9]. Its advantage becomes more pronounced for multipliers bigger than 16 bits. In the Wallace tree architecture, all the bits of all the partial products in each column are added together by a set of counters in parallel, without propagating any carries. Another set of counters then reduces this new matrix, and so on, until a two-row matrix is generated. The most common counter used is the 3:2 counter, which is a full adder. The final results are added using a carry propagate adder. The advantage of the Wallace tree is speed, because the addition of partial products is now O(log N). The result of these additions is the final product bits plus sum and carry bits which are added in the final fast adder (CRA). A block diagram of the Wallace tree multiplier is shown in figure 2.6.

Fig 2.6 Wallace tree multiplier


7. Combinational multiplier:

Fig 2.7 Combinational multiplier


Each bit of the multiplier is multiplied against the multiplicand, the product is aligned according to the position of the bit within the multiplier, and the resulting products are then added to form the final result. The main advantage of binary multiplication is that the generation of intermediate products is easy: if the multiplier bit is a 1, the product is a correctly shifted copy of the multiplicand; if the multiplier bit is a 0, the product is simply 0. The architecture of the combinational multiplier is shown in figure 2.7. In most systems combinational multipliers are slow and take a lot of area.
Generally, it is not possible to say that one particular multiplier yields greater cost-effectiveness, since the trade-off is design and technology dependent. These basic array multipliers consume low power and exhibit good performance; however, their use is limited to sixteen bits. Due to the regular structure of the Vedic multiplier, by contrast, power consumption and delay grow only slowly as the order of the multiplier increases.

2.3 ADDERS
An adder or summer is a digital circuit that performs addition of numbers. In
many computers and other kinds of processors, adders are used not only in the
arithmetic logic units, but also in other parts of the processor, where they are used to
calculate addresses, table indices, and similar operations.
Although adders can be constructed for many numerical representations, such as binary-coded decimal or excess-3, the most common adders operate on binary numbers. In cases where two's complement or ones' complement is being used to represent negative numbers, it is trivial to modify an adder into an adder-subtractor. Other signed number representations require a more complex adder. Adder circuits are of two types: half adder and full adder.
Half adder
The half adder is a combinational arithmetic circuit that adds two bits and produces a sum bit (S) and a carry bit (C) as the output. If A and B are the input bits, then the sum bit (S) is the XOR of A and B, and the carry bit (C) is the AND of A and B. From this it is clear that a half adder circuit can be easily constructed using one XOR gate and one AND gate. The half adder is the simplest of all adder circuits, but it has a major disadvantage: it can add only the two input bits (A and B) and cannot handle an incoming carry. If the input to a half adder includes a carry, that carry is neglected and only the A and B bits are added. Because the binary addition process is thus incomplete, the circuit is called a half adder.

Fig 2.8 Half adder


In order to add the carry, we need an adder which can add 3 bits at a time. This is done by the full adder; its schematic and working are shown below.
Full adder
The full adder is a logic circuit that adds two input operand bits plus a carry-in bit and outputs a carry-out bit and a sum bit. The sum out (Sout) of a full adder is the XOR of the input operand bits A, B and the carry-in (Cin) bit.

Fig 2.9 Full adder


There is a simple trick to find the results of a full adder. Consider the second last row of the truth table, where the operands (A, B, Cin) are 1, 1, 0. Add them together: 1+1+0 = 10. In the binary system the numbers run 0, 1, 10, 11, and so on, so the result of 1+1+0 is 10, just as 1+1+0 = 2 in the decimal system; 2 in decimal corresponds to 10 in binary. Splitting the result 10 gives S = 0 and Cout = 1, which justifies the second last row. This check can be applied to any row in the table.
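The Boolean expressions above translate directly into Verilog. The following is a minimal sketch (module names are our own); these two modules are also reused as building blocks in the later sketches in this chapter.

    module half_adder (input a, b, output s, c);
        assign s = a ^ b;   // sum   = XOR of the inputs
        assign c = a & b;   // carry = AND of the inputs
    endmodule

    module full_adder (input a, b, cin, output s, cout);
        assign s    = a ^ b ^ cin;                // sum bit
        assign cout = (a & b) | (cin & (a ^ b));  // carry-out bit
    endmodule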

Other adders
1. Ripple Carry Adder (RCA)
The ripple carry adder is constructed by cascading full adder (FA) blocks in series. One full adder is responsible for the addition of two binary digits at any stage of the ripple carry. The carry-out of one stage is fed directly to the carry-in of the next stage. Even though this is a simple adder and can be used to add numbers of unrestricted bit length, it is not very efficient for large bit widths. One of the most serious drawbacks of this adder is that the delay increases linearly with the bit length. The worst-case delay of the RCA occurs when a carry signal transition ripples through all stages of the adder chain from the least significant bit to the most significant bit, and is approximated by:

t = (n + 1) tc + ts    Eq. 2.1

where tc is the delay through the carry stage of a full adder and ts is the delay to compute the sum of the last stage. The delay of the ripple carry adder is linearly proportional to n, the number of bits; therefore the performance of the RCA is limited as n grows. The advantages of the RCA are lower power consumption and a compact layout giving a smaller chip area.

Fig 2.10 Schematic of RCA
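A parameterized Verilog sketch of the RCA, reusing the full_adder module given earlier (the generate loop simply chains the carry wires from stage to stage):

    module rca #(parameter N = 4) (
        input  [N-1:0] a, b,
        input          cin,
        output [N-1:0] sum,
        output         cout
    );
        wire [N:0] c;            // carry chain; c[0] is the carry-in
        assign c[0] = cin;
        genvar i;
        generate
            for (i = 0; i < N; i = i + 1) begin : stage
                full_adder fa (.a(a[i]), .b(b[i]), .cin(c[i]),
                               .s(sum[i]), .cout(c[i+1]));
            end
        endgenerate
        assign cout = c[N];
    endmodule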

2. Carry Save Adder (CSA)


The carry-save adder reduces the addition of 3 numbers to the addition of 2 numbers. The propagation delay is 3 gates regardless of the number of bits. The carry-save unit consists of n full adders, each of which computes a single sum and carry bit based solely on the corresponding bits of the three input numbers. The entire sum can then be computed by shifting the carry sequence left by one place, appending a 0 to the front (most significant bit) of the partial sum sequence, and adding the two sequences with an RCA to produce the resulting (n+1)-bit value. This process can be continued indefinitely, adding an input for each stage of full adders, without any intermediate carry propagation. These stages can be arranged in a binary tree structure, with cumulative delay logarithmic in the number of inputs to be added and invariant in the number of bits per input. The main application of the carry save algorithm is in multiplier architectures, where it enables efficient CMOS implementation of a wide variety of high speed digital signal processing algorithms. A CSA applied in the partial product lines of array multipliers speeds up the carry propagation in the array. A block diagram of the carry save adder is shown in figure 2.11.

Fig 2.11 Carry save adder
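A one-stage 3:2 reduction as described above, sketched in Verilog (names are our own). The outputs satisfy x + y + z = s + 2c, so the final result is obtained with one conventional adder as s + (c << 1).

    module csa #(parameter N = 8) (
        input  [N-1:0] x, y, z,
        output [N-1:0] s,   // bitwise partial sums
        output [N-1:0] c    // carries; shift left by one before the final add
    );
        assign s = x ^ y ^ z;                    // per-bit sum of the three inputs
        assign c = (x & y) | (y & z) | (x & z);  // per-bit majority = carry
    endmodule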


3. Carry Look-Ahead Adder
The carry look-ahead adder is designed to overcome the latency introduced by the rippling effect of the carry bits. The propagation delay incurred in parallel adders can be eliminated by the carry look-ahead adder. This adder is based on the principle of examining the lower order bits of the augend and addend to determine whether a higher order carry will be generated. It reduces the carry delay by reducing the number of gates through which a carry signal must propagate. Carry look-ahead depends on two things: calculating, for each digit position, whether that position will propagate a carry if one comes in from the right, and combining these calculated values so as to deduce quickly, for each group of digits, whether that group will propagate a carry that comes in from the right. The net effect is that the carries start by propagating slowly through each 4-bit group, just as in a ripple-carry system, but then move 4 times faster, leaping from one look-ahead carry unit to the next. Finally, within each group that receives a carry, the carry propagates slowly within the digits of that group.

Fig 2.12 Schematic of Carry Look-Ahead Adder


This adder consists of three stages: a propagate/generate block, a sum generator and a carry generator. The generate block can be realized using the expression

Gi = Ai & Bi, for i = 0, 1, 2, 3    Eq. 2.2

Similarly, the propagate block can be realized using the expression

Pi = Ai ^ Bi, for i = 0, 1, 2, 3    Eq. 2.3

The carry output of stage i is obtained from

Ci = Gi + Pi Ci-1, for i = 0, 1, 2, 3    Eq. 2.4

The sum output can be obtained using

Si = Ai ^ Bi ^ Ci-1, for i = 0, 1, 2, 3    Eq. 2.5
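A 4-bit Verilog sketch of Eq. 2.2 to Eq. 2.5 (the module name is our own) with the look-ahead carries fully expanded, so that no carry passes through more than two gate levels. This is the adder type used in the proposed design.

    module cla4 (
        input  [3:0] a, b,
        input        cin,
        output [3:0] sum,
        output       cout
    );
        wire [3:0] g = a & b;   // generate,  Eq. 2.2
        wire [3:0] p = a ^ b;   // propagate, Eq. 2.3
        wire [4:0] c;
        assign c[0] = cin;
        // Eq. 2.4 unrolled: each carry depends only on g, p and cin
        assign c[1] = g[0] | (p[0] & c[0]);
        assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
        assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                           | (p[2] & p[1] & p[0] & c[0]);
        assign c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                           | (p[3] & p[2] & p[1] & g[0])
                           | (p[3] & p[2] & p[1] & p[0] & c[0]);
        assign sum  = p ^ c[3:0];   // Eq. 2.5
        assign cout = c[4];
    endmodule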

4. Carry Increment Adder (CIA)


An 8-bit carry increment adder includes two 4-bit RCAs (ripple carry adders). The first ripple carry adder adds the first 4-bit inputs, generating a partitioned sum and a partitioned carry. The carry-out of the first RCA block is given to the CIN of the conditional increment block, and the first four sum bits are taken directly from the ripple carry output. The second RCA block, regardless of the first RCA's output, carries out its addition and feeds its results to the conditional increment block. The input CIN to the first RCA block is always tied low. The conditional increment block consists of half adders: based on the value of the cout of the first RCA block, the half adders in the carry increment block perform the increment operation. Hence the upper sum bits of the second RCA are taken through the carry increment block. The design schematic of the carry increment adder is shown in figure 2.13.

Fig 2.13 Schematic of Carry Increment Adder


5. Carry Skip Adder (CSkA)
A carry-skip adder consists of a simple ripple carry adder with a special speed-up carry chain called a skip chain; this chain defines the distribution of the ripple carry blocks which compose the skip adder. The carry skip adder is fast compared to the ripple carry adder when a large number of bits are added: it has O(n) delay and provides a good compromise in terms of delay, along with a simple and regular layout. A carry-skip adder is designed to speed up a wide adder by aiding the propagation of a carry bit around a portion of the entire adder. The ripple carry adder is actually faster for small values of N. However, present industrial demands, where most desktop computers and multimedia processors use word lengths of 32 bits, make the carry skip structure more interesting. The crossover point between the ripple-carry adder and the carry skip adder depends on technology considerations and is normally situated between 4 and 8 bits. The carry-skip circuitry consists of two logic gates. The AND gate accepts the carry-in bit and compares it to the group propagate signal formed from the individual propagate values:
p[i, i+3] = p(i+3) p(i+2) p(i+1) p(i)    Eq. 2.6

The output from the AND gate is ORed with the cout of the RCA to produce the stage output carry:

carry = c(i+4) + p[i, i+3] c(i)    Eq. 2.7

If p[i, i+3] = 0, then the carry-out of the group is determined by the value of c(i+4). However, if p[i, i+3] = 1 when the carry-in bit is ci = 1, then the group carry-in is automatically sent to the next group of adders. The design schematic of the carry skip adder is shown in figure 2.14.

Fig 2.14 Schematic of Carry Skip adder


6. Carry Bypass Adder (CByA)
As in a ripple-carry adder, every full adder cell has to wait for the incoming carry before an outgoing carry can be generated. This dependency can be eliminated by introducing an additional bypass (skip) path to speed up the operation of the adder.
An incoming carry Ci,0 = 1 propagates through the complete adder chain and causes an outgoing carry Co,7 = 1 under the condition that all propagate signals are 1. This information can be used to speed up the operation of the adder, as shown in figure 2.15. When BP = P0 P1 P2 P3 P4 P5 P6 P7 = 1, the incoming carry is forwarded immediately to the next block through the bypass; if that is not the case, the carry is obtained via the normal route. If P0 P1 P2 P3 P4 P5 P6 P7 = 1 then Co,7 = Ci,0, else either a Delete or a Generate occurred. Hence, in a carry bypass adder the full adders are divided into groups, each of which is bypassed by a multiplexer when all of its full adders are in propagate mode.

Fig 2.15 Schematic of Carry Bypass Adder


7. Carry Select Adder (CSelA)
A carry-select adder is divided into sectors, each of which, except for the least significant, performs two additions in parallel: one assuming a carry-in of zero, the other a carry-in of one.

Fig 2.16 Carry Select adder

A four bit carry select adder generally consists of two ripple carry adders and a multiplexer. The carry-select adder is simple but rather fast. Adding two n-bit numbers with a carry select adder is done with two adders (two ripple carry adders), performing the calculation twice: once assuming the carry-in is zero and once assuming it is one. After the two results are calculated, the correct sum, as well as the correct carry, is selected with the multiplexer once the correct carry is known. A carry-select adder is 40% to 90% faster than an RCA, since it performs additions in parallel and reduces the maximum carry path.
Each of the adders described above has both advantages and disadvantages. As delay is the major factor in this application, the carry look-ahead adder is used in the proposed design to reduce the delay.


Chapter 3

PROBLEM SPECIFICATION
3.1 VEDIC MULTIPLIER ALGORITHM


Multiplication methods are extensively discussed in Vedic mathematics, and various tricks and shortcuts are suggested to optimize the process. These methods are based on the concepts of:
1. Multiplication using deficits and excesses
2. Changing the base to simplify the operation.
Various methods of multiplication are proposed in Vedic mathematics. Among the sixteen Sutras, mainly three are used for multiplication:
1. Urdhva Tiryagbhyam: Vertically and crosswise
2. Nikhilam Navatashcharamam Dashatah: All from nine and the last from ten
3. Anurupyena: Proportionately (vinculum method)
URDHVA TIRYAGBHYAM
Urdhva Tiryagbhyam is the general formula applicable to all cases of multiplication, and also to the division of a large number by another large number. The method can be applied to decimal numbers as well as binary numbers [10-11]. Here, the partial products and their sums are calculated in parallel, so the multiplier is independent of the clock frequency of the processor. Due to its regular structure, it can be easily laid out in microprocessors, and designers can easily circumvent problems that would otherwise lead to catastrophic device failures.
The processing power of the multiplier can easily be increased by increasing the input and output data bus widths, since it has quite a regular structure. Due to this regularity, it can be easily laid out on a silicon chip. A multiplier based on this Sutra has the advantage that, as the number of bits increases, gate delay and area increase very slowly compared to other conventional multipliers.
The Urdhva Tiryagbhyam Sutra used in the proposed multiplier is illustrated with the help of a simple example. Although this Sutra can be used for both binary and decimal numbers, the example shown uses binary numbers, as binary multiplication is what is required in any processor.
Multiplication of two 2-digit binary numbers:
Example 1: Find the product 3(11) X 3(11)
Step 1: The rightmost digit of the multiplicand, the first number (3), i.e. 1, is multiplied by the rightmost digit of the multiplier, the second number (3), i.e. 1. The product 1 X 1 = 1 forms the rightmost part of the answer.
Step 2: Now, diagonally multiply the first digit of the multiplicand (3), i.e. 1, and the second digit of the multiplier (3), i.e. 1 (answer 1 X 1 = 1); then multiply the second digit of the multiplicand, i.e. 1, and the first digit of the multiplier, i.e. 1 (answer 1 X 1 = 1); add these two: 1 + 1 = 10. This gives the next, i.e. second, digit of the answer. Hence the second digit of the answer is 0 and the carry for the next digit is 1.
Step 3: Now, multiply the second digit of the multiplicand, i.e. 1, and the second digit of the multiplier, i.e. 1, vertically: 1 X 1 = 1. Then add this 1 to the previous carry, which is 1 (1 + 1 = 10). This gives the leftmost part of the answer.
Thus the product obtained is 1001.
Symbolically we can represent the Vedic multiplication process for two bit numbers as follows:

Fig 3.1 Example for multiplication of 2digit binary numbers

3.2 SYSTEM DESIGN, BLOCK DIAGRAMS


The hardware architectures of the 2x2, 4x4 and 8x8 bit Vedic multiplier modules are presented in the sections below. Here, the Urdhva Tiryagbhyam (Vertically and Crosswise) Sutra is used to derive the architecture for the multiplication of two binary numbers. The beauty of the Vedic multiplier is that partial product generation and additions are done concurrently; hence, it is well adapted to parallel processing. This feature makes it attractive for binary multiplication and, in turn, reduces delay, which is the primary motivation behind this work.
Vedic Multiplier for 2x2 bit Module
The method is explained below for two 2-bit binary numbers A and B, where A = a1a0 and B = b1b0, as shown in the previous example. First, the least significant bits are multiplied, which gives the least significant bit of the final product (vertical). Then, the LSB of the multiplicand is multiplied with the next higher bit of the multiplier and added to the product of the LSB of the multiplier and the next higher bit of the multiplicand (crosswise). The sum gives the second bit of the final product, and the carry is added to the partial product obtained by multiplying the most significant bits, yielding a sum and a carry. The sum is the third bit and the carry becomes the fourth bit of the final product.
The 2x2 Vedic multiplier module is implemented using four two-input AND gates and two half adders, as displayed in its block diagram in figure 3.2. It is found that the hardware architecture of the 2x2 bit Vedic multiplier is the same as the hardware architecture of the 2x2 bit conventional array multiplier. Hence it is concluded that multiplication of 2-bit binary numbers by the Vedic method does not have a significant effect on the multiplier's efficiency. More precisely, the total delay is only 2 half-adder delays after the final bit products are generated, which is very similar to the array multiplier. So we switch over to the implementation of the 4x4 bit Vedic multiplier, which uses the 2x2 bit multiplier as a basic building block. The same method can be extended to input widths of 4 and 8 bits, but for a higher number of input bits a little modification is required: in a 2x2 multiplier, half adders alone are enough to implement the design, whereas higher order multipliers require higher order adders such as carry look-ahead adders, ripple carry adders, etc.

Block diagram of 2x2 multiplier:

Fig 3.2 2x2 Vedic multiplier
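A minimal structural sketch of figure 3.2 in Verilog (the module name is our own), reusing the half_adder module from Chapter 2: four AND gates form the bit products and two half adders combine them exactly as described above.

    module vedic2x2 (
        input  [1:0] a, b,
        output [3:0] s
    );
        wire c1;
        assign s[0] = a[0] & b[0];          // vertical: LSBs
        half_adder ha1 (.a(a[1] & b[0]),    // crosswise products
                        .b(a[0] & b[1]),
                        .s(s[1]), .c(c1));
        half_adder ha2 (.a(a[1] & b[1]),    // vertical: MSBs, plus the carry
                        .b(c1),
                        .s(s[2]), .c(s[3]));
    endmodule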


Vedic Multiplier for 4x4 bit Module
The 4x4 bit Vedic multiplier module is implemented using four 2x2 bit Vedic multiplier modules of the kind discussed in figure 3.2. Let's analyze a 4x4 multiplication, say A = A3A2A1A0 and B = B3B2B1B0. The output line for the multiplication result is S7S6S5S4S3S2S1S0. Divide A and B into two parts each: A3A2 and A1A0 for A, and B3B2 and B1B0 for B. Using the fundamentals of Vedic multiplication, taking two bits at a time and using the 2-bit multiplier block, we can have the structure for multiplication shown in Fig 3.3.

Fig 3.3 Sample representation for 4x4 bit Vedic multiplier

Each block shown above is a 2x2 bit Vedic multiplier. The first 2x2 bit multiplier has inputs A1A0 and B1B0. The last block is a 2x2 bit multiplier with inputs A3A2 and B3B2. The middle one shows two 2x2 bit multipliers with inputs A3A2 and B1B0, and A1A0 and B3B2. The final result of the multiplication is the 8-bit value S7S6S5S4S3S2S1S0. To understand the concept, the block diagram of the 4x4 bit Vedic multiplier is shown in figure 3.4. To get the final product (S7S6S5S4S3S2S1S0), four 2x2 bit Vedic multipliers, two 4-bit carry look-ahead (CLA) adders, a 2-bit OR gate and two half adders are required. The proposed Vedic multiplier can be used to reduce delay. Early literature speaks of Vedic multipliers based on array multiplier structures; here, on the other hand, we propose a new architecture which is efficient in terms of speed.
Block diagram of 4x4 multiplier:

Fig 3.4 4x4 multiplier
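A behavioral sketch of figure 3.4 (names are our own). The four vedic2x2 instances follow the decomposition above; the '+' operators stand in for the CLA adders and half adders of the block diagram, since a behavioral add synthesizes to whatever adder structure the tools select.

    module vedic4x4 (
        input  [3:0] a, b,
        output [7:0] s
    );
        wire [3:0] p0, p1, p2, p3;
        vedic2x2 m0 (.a(a[1:0]), .b(b[1:0]), .s(p0));  // AL * BL
        vedic2x2 m1 (.a(a[3:2]), .b(b[1:0]), .s(p1));  // AH * BL
        vedic2x2 m2 (.a(a[1:0]), .b(b[3:2]), .s(p2));  // AL * BH
        vedic2x2 m3 (.a(a[3:2]), .b(b[3:2]), .s(p3));  // AH * BH
        // cross terms plus the upper half of AL*BL, all weighted by 2^2
        wire [5:0] mid = p1 + p2 + p0[3:2];
        assign s[1:0] = p0[1:0];
        assign s[7:2] = {p3, 2'b00} + mid;             // AH*BH weighted by 2^4
    endmodule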


Vedic Multiplier for 8x8 bit Module
The 8x8 Vedic multiplier module, shown in the block diagram in figure 3.5, can be easily implemented by using four 4x4 bit Vedic multiplier modules as discussed in the previous section. Let's analyze an 8x8 multiplication, say A = A7A6A5A4A3A2A1A0 and B = B7B6B5B4B3B2B1B0. The output of the multiplication will be the 16-bit product S15S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0. Divide A and B into two parts: the 8-bit multiplicand A is decomposed into a pair of 4-bit halves AH-AL, and similarly the multiplier B is decomposed into BH-BL. Using the fundamentals of Vedic multiplication, taking four bits at a time and using the 4-bit multiplier block as discussed, we can perform the multiplication. The outputs of the 4x4 bit multipliers are added accordingly to obtain the final product. In total, two 8-bit carry look-ahead adders are required, as shown in figure 3.5.

Fig 3.5 8X8 Multiplier
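The same decomposition one level up, as a behavioral sketch (names are our own; the '+' operators again stand in for the two 8-bit CLA adders of figure 3.5):

    module vedic8x8 (
        input  [7:0]  a, b,
        output [15:0] s
    );
        wire [7:0] p0, p1, p2, p3;
        vedic4x4 m0 (.a(a[3:0]), .b(b[3:0]), .s(p0));  // AL * BL
        vedic4x4 m1 (.a(a[7:4]), .b(b[3:0]), .s(p1));  // AH * BL
        vedic4x4 m2 (.a(a[3:0]), .b(b[7:4]), .s(p2));  // AL * BH
        vedic4x4 m3 (.a(a[7:4]), .b(b[7:4]), .s(p3));  // AH * BH
        // cross terms plus the upper half of AL*BL, all weighted by 2^4
        wire [9:0] mid = p1 + p2 + p0[7:4];
        assign s[3:0]  = p0[3:0];
        assign s[15:4] = {p3, 4'b0000} + mid;          // AH*BH weighted by 2^8
    endmodule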

Chapter 4

SOFTWARE IMPLEMENTATION
4.1 EDA TOOLS
Digital circuit design has evolved rapidly over the last 25 years. The earliest
digital circuits were designed with vacuum tubes and transistors. Integrated circuits
were then invented where logic gates were placed on a single chip. The first
integrated circuit chips were SSI (small scale integration) chips where the gate count
was very small. As technologies became sophisticated, designers were able to place
circuits with hundreds of gates on a chip. These chips are called MSI (medium scale
integration) chips. With the advent of LSI (large scale integration), designers could
put thousands of gates on a single chip. At this point, design processes started getting
very complicated, and designers felt the need to automate these processes. Electronic
design automation (EDA) techniques began to evolve. Chip designers began to use
circuit and logic simulation techniques to verify the functionality of building blocks of
the order of about 100 transistors. The circuits were still tested on the breadboard, and
the layout was done on paper or by hand on a graphic computer terminal.
With the advent of VLSI (very large scale integration) technology, designers
could design single chips with more than 100,000 transistors. Because of the
complexity of these circuits, it was not possible to verify these circuits on a
breadboard. Computer aided techniques became critical for verification and design of
VLSI digital circuits. Computer programs to do automatic placement and routing of
circuit layouts also became popular. The designers were now building gate level digital circuits manually on graphic terminals. They would build small blocks and then derive higher level blocks from them. A design flow consists of several steps, and a toolset is needed at each step of the process. Modern FPGA/ASIC projects require a complete set of CAD (Computer Aided Design) tools. The following are the most common tools available.
Design Capture Tools
A design entry tool encapsulates a circuit description. These tools capture a design and prepare it for simulation. The design requirements dictate the type of design capture tool as well as the options needed.

Some of the options would be:
1. Manual netlist entry
2. Schematic capture
3. Hardware Description Language (HDL) capture (VHDL, Verilog, etc.)
4. State diagram entry
Simulation and Verification Tools
A functional verification tool confirms that the functionality of a circuit model conforms to the intended or specified behavior, by simulation or by formal verification methods. These are must-have tools. There are two major tool sets for simulation: functional (logic) simulation tools and timing simulation tools. Functional simulators verify the logical behavior of a design based on the design entry; the design primitives used at this stage must be completely characterized. Timing simulators, on the other hand, perform timing verification at multiple stages of the design. In this simulation the real behavior of the system is verified, taking into account the circuit delays and circuit elements of the actual device. In general, the simulation information reflects the actual length of the device interconnects. This information is back-annotated to the corresponding design entry for final logic simulation; that is why this simulation process is sometimes called back annotation simulation.
Layout Tools
ASIC designers usually use these tools, which transform a logic representation of an ASIC into a physical representation that allows the ASIC to be manufactured. The transistor layout tools take a cell-level ASIC representation and, for a given technology, create a set of layers representing the transistors of each cell. Physical design tools work in conjunction with floor planning tools that show where the various cells should go on the ASIC die.
Synthesis and Optimization Tools

Synthesis tools translate abstract descriptions of functionality, such as HDL, into optimal physical realizations, creating netlists that can be passed to a place and route tool. The designer then maps the gate level description or netlist to the target design library and optimizes for speed, area or power consumption. The objective is to provide a tool set for FPGA/ASIC design such that the number of vendors needed to design and build the ASIC/FPGA is minimized: each additional vendor brings a new set of learning curves, maintenance tasks and interfaces that eventually consume more time and money. Tool maturity is another important factor; new tools almost always come with a certain number of bugs and issues.
Design Hierarchy
Hierarchical systems in general are systems organized in the shape of a pyramid, with multiple rows of objects where each object in a row may be linked to the objects beneath it. Hierarchical systems are as popular in computer design as they are in everyday life. A good example of a hierarchical system in everyday life is a monarchy, with the King on top, the Prime Minister at the next level down, and the people forming the base of the pyramid. An obvious example in the computer world is a file system, with the root directory on top and directories containing files and subdirectories underneath. Generally speaking, hierarchical systems have stronger connectivity inside the modules than between them.
Design hierarchy trees do not have a regular pattern in size and number of nodes, and it is really up to the designer to decide how the tree should look. In a top-down design methodology, the partitioning procedure is called recursively until the design of all sub-components is feasible by the hardware mapping procedure.
Design Methodology
Digital circuits are becoming more complex while time-to-market windows are shrinking. Designers cannot keep up with the advancements in engineering unless they adopt a more methodical process for the design and verification of digital circuits, i.e. a design methodology. This involves more than simply coming up with the block diagram of a design; rather, it requires developing and following a formal verification plan and an incremental approach for transforming the design from an abstract block diagram to a detailed transistor level implementation. High-level ASIC or FPGA designs start off by capturing the design idea in a hardware description language (HDL) at the behavioral or register-transfer (RTL) level. The design is then verified with an HDL simulator and synthesized to gates. Gate level design methods such as schematic capture were the typical design approach until a few years ago, but when the average gate count passed the 10,000-gate threshold, they started to break down. At the same time the pressure to reduce the design cycle increased and, as a result, high-level design became imperative for digital design engineers. Industry experts agree nowadays that most FPGA/ASIC designers will turn to high-level design methodologies in the near future, primarily because of the technology improvements that have taken place in EDA (Electronic Design Automation) tools, hardware and software. HDL allows the
designer to organize and integrate complex functions and verify the individual blocks
and eventually the entire design with tools like HDL simulators. Designers making
the switch to a high level design methodology leverage some obvious benefits. First,
individual designers are able to handle increased complexity by working at higher
levels of abstraction and delaying the design details. Second, designers can shorten
cycles and improve quality by verifying functionality earlier in the design cycle, when
design changes are easier and less expensive to make. Some designers may think that
FPGAs require significantly less functionality and features compared to ASICs. The
truth is that the improvements in FPGA technology including hardware and software
tools have enabled the FPGA manufacturers to come up with a high level of
integration in FPGAs. Today, features such as processors, transceivers and even debugging tools are available inside the FPGA. This makes the design cycle faster and simpler than it was before. FPGA designers are using as many features and as much functionality as is made available to them. Economic requirements, on the other hand, also play a role in the design methodology arena.
Functionality and flexibility are two key features of design tool sets. FPGA design tools are more cost effective than ASIC design tools; however, they lack some of the functionality and features available in ASIC design tools. An ASIC design tool seat may cost over $100,000, while an FPGA design tool seat barely runs over $10,000. There is always a compromise between different factors in deciding which way to go: time to market, NRE (Non Recurring Engineering) cost, ease of use, programmability, flexibility, etc.

A number of FPGA designs are more complex and have higher density than some ASIC designs being implemented, and high end FPGA design methodology nowadays mirrors ASIC design methodology. To meet this emerging requirement, tool vendors must provide solutions that enable FPGA designers to deal with complex designs and, at the same time, integrate a variety of functions and features at a high level of abstraction. HDL based design represents a new paradigm for some designers. The design flow is straightforward but, like anything else, it requires going through a learning curve.
Regardless of which tools you use, the bottom line is that advanced HDL
design tools are a must for FPGA and ASIC designers. Price is not the only factor in
making the decision for the appropriate tools. Rather, designers must consider the
features required to meet their design goals and find out which tools are leaders in
their class.
Typical HDL Design Flow
Figure 4.1 shows a typical HDL design flow. The figure does not take into account whether an ASIC or an FPGA is being designed; it is a high level view of HDL design, and the detailed view for each technology is covered in later chapters. As can be seen, once the design is created, it must be verified prior to synthesis to make sure that its functionality is as intended. A test bench should be created to verify the functionality of the design; it will be used throughout the design flow to verify the functionality at the functional, RTL and timing levels. A test bench is a separate VHDL or Verilog piece of code that is connected to the design's inputs and outputs; the design itself is treated as a black box with a set of inputs and outputs. The input test stimulus is applied to the design inputs and the outputs are observed to ensure correct functionality. Since the test vectors stay constant throughout synthesis and place and route, they can be used at each stage to verify functionality.
The main purpose of the test bench is to provide the stimulus and response information, such as clocks, reset, and input data, that the design will encounter when implemented in an ASIC or FPGA and installed into the final system.
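As a concrete illustration, here is a minimal test bench sketch for the 8x8 Vedic multiplier of Chapter 3 (module and signal names are our own): stimulus is applied to the inputs and the outputs are observed, exactly as described above.

    module tb_vedic8x8;
        reg  [7:0]  a, b;
        wire [15:0] s;

        vedic8x8 dut (.a(a), .b(b), .s(s));   // design under test, treated as a black box

        initial begin
            a = 8'd11;  b = 8'd3;   #10;      // apply stimulus, wait, observe
            $display("%d * %d = %d", a, b, s);
            a = 8'd255; b = 8'd255; #10;      // worst-case operands
            if (s !== 16'd65025) $display("Mismatch: got %d", s);
            $finish;
        end
    endmodule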


(Design flow stages: Design Specification, Coding, Synthesis, Optimization, Implementation.)

Fig 4.1 EDA Design Flow


The complexity of digital systems has been increasing rapidly over the past years. Fortunately, fully automated EDA tools support design synthesis, but there is still a great need for efficient techniques to accelerate the design process, and this essentially means choosing the right methodology. The first step in designing a product is to choose the design methodology. For the vast majority of applications it is hard to prefer one approach to the other, bottom-up or top-down: there is always a certain amount of manipulation involved, and choosing between two correct implementations is usually based on the designer's preferences. One may prefer one method for developing a certain type of application while another person prefers the other. The key idea of both methodologies is the hierarchical propagation of the design units based on behavioral modeling and optimization at each level.
A Bottom-up design methodology starts with individual blocks, which are
then combined to form the system. Design of each block starts with a set of
specifications and ends with a transistor level implementation, but each block is
verified individually and eventually all the blocks will be combined and verified
together.
On the other hand a top-down design approach defines the architecture of the
whole design as a single unit. The whole design is simulated and optimized
afterwards. The requirements for lower level blocks are derived based on the results
obtained in the previous steps. Each level is completely designed before proceeding to
the next step.
Circuits can be designed individually to meet the specifications and, finally, the entire design is laid out and verified against the original requirements. Top-down design refers to the partitioning of a system into its sub-components until all sub-components become manageable design parts. The design of a component may be available as part of a library, it may be implemented by modifying an already available part, or it may be described for a synthesis program or an automatic hardware generator. Partitioning the design into smaller modules allows the designers to work as a team and be more productive. It also reduces the total time required to complete the design, because it reduces the effect of late changes in the design process: a change confined to one module does not force an update of the rest of the system. Generally speaking, the bottom-up design methodology is effective for small designs. As the size and complexity of digital designs continue to increase, this approach runs into problems. A design is likely to contain a number of blocks, and once they are combined, simulation takes a long time and verification becomes difficult. Also, performance, cost and functionality are typically determined at the architectural level, which makes design modification difficult, since any change at the higher levels of abstraction must propagate to the lower level modules and requires redesigning the lower level blocks. The other challenge is that in the bottom-up design methodology several steps have to be done sequentially, which lengthens the design process, especially when it comes to modifying the design. To address all these challenges, many designers prefer an alternative method, namely the top-down design methodology.
The idea is to break the design down into smaller pieces so that each piece can be designed one at a time. Of course, all the pieces have to be put together again, and this should reproduce the original design; the assembly of the smaller blocks is referred to as bottom-up implementation. This approach has been applied to complex engineering projects and is now finding its way into digital design. The top-level design hierarchy specifies the partitioning of the system into manageable blocks as well as each block's interface. One of the strengths of this method is that once the top-level schematic is specified, the design of all the blocks can be started concurrently. It is imperative to avoid unnecessarily complex models in a top-down design methodology, since they complicate the design verification process. Each block can be specified by its behavior, and the design can be expanded gradually. Mapping to hardware depends on the target technology, available libraries, and available tools.
Generally, the unavailability of good tools and/or libraries can be compensated for by further partitioning of the system into simpler components. After the completion of this top-down design process, the bottom-up implementation phase begins. In this phase, hardware components corresponding to the terminals of the partition tree are recursively wired to form the hierarchical wiring of the complete system. For example, the original design is initially described at the behavioral level; in the first level of partitioning, one of its sub-components is mapped to hardware, while further partitioning is required for hardware implementation of the other components. This procedure goes on until hardware implementations of all components are available.
At each step of a top-down design process, a multilevel simulation tool plays an important role in the correct implementation of the design. Initially, a behavioral description of the system under design (SUD) must be simulated to verify the designer's understanding of the problem. After the first level of partitioning, a behavioral description of each of the sub-components must be developed, and these descriptions must be wired together to form a structural hardware model of the SUD. Simulating this new model and comparing the results with those of the original SUD description verifies the correctness of the first level of partitioning. After that, the hardware implementation of each sub-component must be verified. For this purpose, another simulation run is performed in which the behavioral models of the sub-components are replaced by more detailed hardware level models.
The process of partitioning and verification stated above continues throughout the design process. At the end, a simulation model consisting of the interconnection of the hardware-level models of the terminals of the partition tree will be run. Simulating this model and comparing the results with those of the original behavioral description of the SUD verifies the correctness of the complete design. In a large design, where simulation of a complete hardware-level model is too time consuming, subsections of the partition tree are verified independently. Verified behavioral models of such subsections are then used in forming the simulation model for final design verification.

4.2 HARDWARE DESCRIPTION LANGUAGES (HDLs)


With only gate-level design tools available, it soon became clear that better, more structured design methods and tools would be needed. At the same time, demand for electronic components was increasing with the continual growth of personal computers, cellular phones and high-speed data communications. Electronic vendors provide devices with increasingly greater functionality, higher performance, lower cost, smaller packaging and lower power consumption, and the trend in electronic design is to deliver complex designs and systems with more capability using fewer devices that take up less area on Printed Circuit Boards (PCBs). Problems were encountered as increasingly complex electronic systems met the existing Electronic Design Automation (EDA) tools and accelerated time-to-market schedules. ASICs, High-Density Programmable Logic Devices (HDPLDs) and the Very High Speed Integrated Circuit (VHSIC) Hardware Description Language (VHDL) became the key elements of methodologies for handling complex electronic design. To meet these challenges, a team of engineers from three companies, IBM, Texas Instruments and Intermetrics, was contracted by the DoD to complete the specification and implementation of a new, language-based design description method. The first publicly available version of VHDL, version 7.2, was released in 1985. In 1986, the IEEE (Institute of Electrical and Electronics Engineers) was presented with a proposal to standardize the language. The resulting standard, IEEE 1076-1987, is the basis for virtually every simulation and synthesis product sold today. An enhanced and updated version of the language, IEEE 1076-1993, was released in 1994, and VHDL tool vendors have been responding by adding the new language features to their products.
There are currently two main Hardware Description Languages, VHDL and
Verilog. Verilog syntax is less complicated and less verbose than VHDL, although it
lacks some features and capabilities that VHDL provides. Verilog is easier to grasp
and understand, with constructs that draw partly on the C programming language and
partly on Ada. EDA environments support both languages with documentation,
simulation and synthesis capabilities.

4.3 VERILOG CODE OVERVIEW


Hardware description languages such as Verilog differ from software
programming languages because they include ways of describing propagation delays
and signal strengths. There are two types of assignment operators: a blocking
assignment (=) and a non-blocking assignment (<=). The non-blocking
assignment allows designers to describe a state-machine update without needing to
declare and use temporary storage variables. Since these concepts are part of Verilog's
language semantics, designers could quickly write descriptions of large circuits in a
relatively compact and concise form. At the time of Verilog's introduction (1984),
Verilog represented a tremendous productivity improvement for circuit designers who
were already using graphical schematic capture software and specially written
software programs to document and simulate electronic circuits.
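
The difference between the two operators is easiest to see in a register swap. The
following fragment is an illustrative sketch only (the module and signal names are
not taken from this project's design): with non-blocking assignments, both right-hand
sides are sampled before either register is updated, so no temporary variable is
needed.

// Hypothetical example of blocking (=) versus non-blocking (<=) assignment
module swap_demo(input clk, output reg [3:0] x, output reg [3:0] y);
initial begin
x = 4'd3; // blocking assignments are conventional for initialization
y = 4'd7;
end
always @(posedge clk) begin
x <= y; // non-blocking: x receives the old value of y
y <= x; // non-blocking: y receives the old value of x
end
endmodule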
The designers of Verilog wanted a language with syntax similar to the C
programming language, which was already widely used in engineering software
development. Like C, Verilog is case-sensitive and has a basic preprocessor (though
less sophisticated than that of ANSI C/C++). Its control flow keywords (if/else, for,
while, case, etc.) are equivalent, and its operator precedence is compatible with C.

Syntactic differences include required bit-widths for variable declarations,
demarcation of procedural blocks (Verilog uses begin/end instead of curly braces {}),
and many other minor differences. Verilog requires that variables be given a definite
size; in C these sizes are assumed from the 'type' of the variable (for instance, an
integer type may be 8 bits).
A Verilog design consists of a hierarchy of modules.
Modules encapsulate design hierarchy, and communicate with other modules through
a set of declared input, output, and bidirectional ports. Internally, a module can
contain any combination of the following: net/variable declarations (wire, reg, integer,
etc.), concurrent and sequential statement blocks, and instances of other modules
(sub-hierarchies). Sequential statements are placed inside a begin/end block and
executed in sequential order within the block. However, the blocks themselves are
executed concurrently, making Verilog a dataflow language.
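The sketch below gathers these ingredients into one small module; it is illustrative
only (the instance reuses the halfadder module listed in Appendix I, and the other
names are arbitrary): a net declaration, a continuous assignment, a sequential always
block, and a sub-module instance.

// Hypothetical module combining the constructs described above
module top_sketch(input clk, input a, input b, output reg q, output s, output c);
wire ab; // net declaration
assign ab = a & b; // concurrent continuous assignment
always @(posedge clk) // sequential block; statements inside run in order
q <= ab;
halfadder u1(s, c, a, b); // instance of another module (sub-hierarchy)
endmodule
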
Verilog's concept of 'wire' consists of both signal values (4-state: "1, 0,
floating, undefined") and signal strengths (strong, weak, etc.). This system allows
abstract modeling of shared signal lines, where multiple sources drive a common net.
When a wire has multiple drivers, the wire's (readable) value is resolved by a function
of the source drivers and their strengths. A subset of statements in the Verilog
language is synthesizable.
Verilog modules that conform to a synthesizable coding style, known as RTL
(register-transfer level), can be physically realized by synthesis software. Synthesis
software algorithmically transforms the (abstract) Verilog source into a netlist, a
logically equivalent description consisting only of elementary logic primitives (AND,
OR, NOT, flip-flops, etc.) that are available in a specific FPGA or VLSI technology.
Further manipulations to the netlist ultimately lead to a circuit fabrication blueprint
(such as a photo mask set for an ASIC or a bit stream file for an FPGA).
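
As a small illustration of the synthesizable RTL style (a generic sketch, not a block
taken from this design), the register below maps directly to D flip-flops plus an
enable multiplexer in the synthesized netlist:

// Hypothetical RTL sketch: an 8 bit register with synchronous enable.
// Synthesis infers flip-flops for q and a multiplexer for the enable.
module en_reg(input clk, input en, input [7:0] d, output reg [7:0] q);
always @(posedge clk)
if (en)
q <= d; // load new data when enabled; otherwise q holds its value
endmodule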

4.4 XILINX
Xilinx ISE (Integrated Synthesis Environment) [12] is a software tool
produced by Xilinx for synthesis and analysis of HDL designs, enabling the developer
to synthesize their designs, perform timing analysis, examine RTL diagrams, simulate
a design's reaction to different stimuli, and configure the target device with the
programmer.
The Xilinx ISE is a design environment for FPGA products from Xilinx; it is
tightly coupled to the architecture of such chips and cannot be used with FPGA
products from other vendors. The Xilinx ISE is primarily used for circuit synthesis
and design, while the ModelSim logic simulator is used for system-level testing.
Other components shipped with the Xilinx ISE include the Embedded Development
Kit (EDK), a Software Development Kit (SDK) and ChipScope Pro.
User Interface
The primary user interface of the ISE is the Project Navigator, which includes
the design hierarchy (Sources), a source code editor (Workplace), an output console
(Transcript), and a processes tree (Processes). The Design hierarchy consists of design
files (modules), whose dependencies are interpreted by the ISE and displayed as a tree
structure. For single-chip designs there may be one main module, with other modules
instantiated by the main module, similar to the main() function in C programs.
Design constraints, which include pin configuration and mapping, are specified
alongside the design modules.
The Processes hierarchy describes the operations that the ISE will perform on
the currently active module. The hierarchy includes compilation functions, their
dependency functions, and other utilities. The window also denotes issues or errors
that arise with each function. The Transcript window provides status of currently
running operations, and informs engineers on design issues. Such issues may be
filtered to show Warnings, Errors, or both.
Simulation
System-level testing may be performed with the ModelSim logic simulator,
and such test programs must also be written in HDL languages. Test bench programs
may include simulated input signal waveforms, or monitors which observe and verify
the outputs of the device under test. ModelSim may be used to perform the following
types of simulations:
1. Logical verification, to ensure the module produces expected results

2. Behavioral verification, to verify logical and timing issues


3. Post-place & route simulation, to verify behavior after placement of the
module within the reconfigurable logic of the FPGA

4.5 SYNTHESIS
Xilinx's patented algorithms for synthesis allow designs to run up to 30%
faster than competing programs, and allow greater logic density which reduces project
costs. Also, due to the increasing complexity of FPGA fabric, including memory
blocks and I/O blocks, more complex synthesis algorithms were developed that
separate unrelated modules into slices, reducing post-placement errors. IP Cores are
offered by Xilinx and other third-party vendors, to implement system-level functions
such as digital signal processing (DSP), bus interfaces, networking protocols, image
processing, embedded processors, and peripherals. Xilinx has been instrumental in
shifting designs from ASIC-based implementation to FPGA-based implementation.

Chapter 5

HARDWARE IMPLEMENTATION
After simulating the Verilog code using Xilinx ISE, the design is downloaded
into a Field Programmable Gate Array (FPGA) [12]. An FPGA is a device that
contains a matrix of reconfigurable gate array logic circuitry. When an FPGA is
configured, the internal
circuitry is connected in a way that creates a hardware implementation of the software
application. Unlike processors, FPGAs use dedicated hardware for processing logic
and do not have an operating system. FPGAs are truly parallel in nature so different
processing operations do not have to compete for the same resources. As a result, the
performance of one part of the application is not affected when additional processing
is added. Also, multiple control loops can run on a single FPGA device at different
rates. FPGA-based control systems can enforce critical interlock logic and can be
designed to prevent I/O forcing by an operator. However, unlike hard-wired printed
circuit board (PCB) designs which have fixed hardware resources, FPGA-based
systems can literally rewire their internal circuitry to allow reconfiguration after the
control system is deployed to the field. FPGA devices deliver the performance and
reliability of dedicated hardware circuitry. A single FPGA can replace thousands of
discrete components by incorporating millions of logic gates in a single integrated
circuit (IC) chip. FPGAs are constructed of three basic elements: logic blocks, I/O
cells, and interconnection resources.

5.1 INTERNAL STRUCTURE OF FPGA


In an FPGA, logic blocks are implemented using multiple levels of low fan-in
gates, which gives a more compact design compared to an implementation with
two-level AND-OR logic. An FPGA lets its user configure:
1. The interconnection between the logic blocks and
2. The function of each logic block.
A logic block of an FPGA can be configured in such a way that it provides
functionality as simple as that of a transistor or as complex as that of a
microprocessor. It can be used to implement different combinations of combinational
and sequential logic functions. Logic blocks of an FPGA can be implemented by any
of the following:

1. Transistor pairs
2. Combinational gates like basic NAND gates or XOR gates
3. n-input Lookup tables
4. Multiplexers
5. Wide fan-in AND-OR structures.

Fig 5.1 Internal structure of FPGA


Routing in FPGAs consists of wire segments of varying lengths which can be
interconnected via electrically programmable switches. The density of logic blocks in
an FPGA depends on the length and number of wire segments used for routing. The
number of segments used for interconnection is typically a tradeoff between the
density of logic blocks and the amount of area used up for routing. A simplified
version of the FPGA internal architecture with routing is shown in figure 5.1.
I/O blocks provide for interaction with the outside world. An I/O pin can be
used for input or output. I/O blocks can contain logic functionality, although high

logic utilization decreases pin placement flexibility, as I/O blocks utilized in logic
cannot be reassigned mid-design.

5.2 FPGA STRUCTURAL CLASSIFICATION


Basic structure of an FPGA includes logic elements, programmable
interconnects and memory. Arrangement of these blocks is specific to particular
manufacturer. On the basis of internal arrangement of blocks FPGAs can be divided
into three classes:
Symmetrical arrays
This architecture consists of logic elements (called CLBs) arranged in the rows
and columns of a matrix, with interconnect laid out between them. This symmetrical
matrix is surrounded by I/O blocks which connect it to the outside world. Each CLB
consists of an n-input lookup table and a pair of programmable flip-flops. I/O blocks
also control functions such as tri-state control and output transition speed.
Interconnects provide the routing paths; direct interconnects between adjacent logic
elements have smaller delay than the general purpose interconnect.
Row based architecture
Row based architecture consists of alternating rows of logic modules and
programmable interconnect tracks. Input/output blocks are located on the periphery
of the rows. One row may be connected to adjacent rows via vertical interconnect.
Logic modules can be implemented in various combinations: combinatorial modules
contain only combinational elements, whereas sequential modules contain
combinational elements along with flip-flops and can therefore implement complex
combinatorial-sequential functions. Routing tracks are divided into smaller segments
connected by anti-fuse elements between them.
Hierarchical PLDs
This architecture is designed in a hierarchical manner, with the top level
containing only logic blocks and interconnects. Each logic block contains a number
of logic modules, and each logic module has combinatorial as well as sequential
functional elements. Each of these functional elements is controlled by the
programmed memory. Communication between logic blocks is achieved by
programmable interconnect arrays. Input/output blocks surround this scheme of logic
blocks and interconnects.

5.3 FPGA CLASSIFICATION BASED ON USER-PROGRAMMABLE SWITCH
TECHNOLOGIES
FPGAs are based on an array of logic modules and a supply of uncommitted
wires to route signals. In gate arrays these wires are connected by a mask design
during manufacture. In FPGAs, however, these wires are connected by the user, and
therefore an electronic device must be used to connect them. Three types of devices
have been commonly used for this: pass transistors controlled by an SRAM cell, a
flash or EEPROM cell to pass the signal, or a direct connection using antifuses. Each
of these interconnect devices has its own advantages and disadvantages, which have
a major effect on the design, architecture, and performance of the FPGA.
SRAM Based
The major advantage of SRAM based devices is that they are infinitely
re-programmable; they can be soldered into the system and have their function
changed quickly by merely changing the contents of a PROM. They therefore have
simple development mechanics. They can also be changed in the field by uploading
new application code, a feature attractive to designers. This does, however, come at
a price: the interconnect element has high impedance and capacitance and consumes
much more area than other technologies. Hence wires are very expensive and slow,
and the FPGA architect is forced to make large, inefficient logic modules (typically
a lookup table, or LUT). The other disadvantages are that they need to be
reprogrammed each time power is applied, need an external memory to store the
program, and require a large area. There are two applications of SRAM cells:
controlling the gate nodes of pass-transistor switches and controlling the select lines
of multiplexers that drive logic block inputs.
Antifuse Based
The antifuse cell gives the highest interconnect density, being a true cross
point. The designer thus has a much larger number of interconnects, so logic modules
can be smaller and more efficient, and place-and-route software has a much easier
time. These devices, however, are only one-time programmable and therefore have to
be thrown out every time a change is made in the design.
The antifuse has inherently low capacitance and resistance, such that the
fastest parts are all antifuse based. Its disadvantage is the requirement to integrate its
fabrication into the IC process, which means the process will always lag the SRAM
process in scaling. Antifuses are suitable for FPGAs because they can be built using
a modified CMOS technology. The antifuse is positioned between two interconnect
wires and physically consists of three sandwiched layers: the top and bottom layers
are conductors, and the middle layer is an insulator. When unprogrammed, the
insulator isolates the top and bottom layers, but when programmed the insulator
changes to become a low-resistance link. One antifuse type uses poly-Si and n+
diffusion as conductors and ONO as the insulator; other antifuses rely on metal for
the conductors, with amorphous silicon as the middle layer.
EEPROM Based
The EEPROM/FLASH cell in FPGAs can be used in two ways: as a control
device, as in an SRAM cell, or as a directly programmable switch. When used as
switches they can be very efficient as interconnect while remaining reprogrammable.
They are also non-volatile, so they do not require an extra PROM for loading. They
do, however, have their drawbacks: the EEPROM process is complicated and
therefore also lags SRAM technology.

5.4 LOGIC BLOCK AND ROUTING TECHNIQUES


Crosspoint FPGA
It consists of two types of logic blocks. The first is the transistor pair tile, in
which transistor pairs run in parallel lines, as shown in figure 5.2.

Fig 5.2 Transistor pair tiles in cross-point FPGA

The second type of logic block is RAM logic, which can be used to implement
random access memory.
Plessey FPGA
The basic building block here is the 2-input NAND gate; these are connected
to each other to implement the desired function.

Fig 5.3 Plessey Logic Block


Both Crosspoint and Plessey use fine grain logic blocks. Fine grain logic
blocks achieve a high percentage usage of logic blocks, but they require a large
number of wire segments and programmable switches which occupy a lot of area.
Actel Logic Block
If the inputs of a multiplexer are connected to constants or to signals, it can be
used to implement different logic functions. For example, a 2-input multiplexer with
data inputs a and b and select line c implements the function f = ac' + bc. If b = 0 it
implements ac', and if a = 0 it implements bc. Typically an Actel logic block consists
of multiple multiplexers and logic gates.
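
In Verilog this behaviour is a one-line conditional expression. The sketch below is
illustrative only; it is not part of the Actel library or of this project's sources.

// Hypothetical sketch: a 2-input multiplexer used as a logic function.
// f = a when c = 0 and f = b when c = 1, i.e. f = ac' + bc.
module mux_fn(input a, input b, input c, output f);
assign f = c ? b : a; // tie b to 0 to get ac'; tie a to 0 to get bc
endmodule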

Fig 5.4 Actel Logic Block

Xilinx Logic block


In the Xilinx logic block, a lookup table (LUT) is used to implement any
number of different functionalities. The input lines go into the input and enable of
the lookup table, and the output of the lookup table gives the result of the logic
function that it implements. The lookup table is implemented using SRAM: a k-input
logic function is implemented using a 2^k x 1 SRAM, and the number of different
possible functions for a k-input LUT is 2^(2^k). The advantage of such an
architecture is that it supports the implementation of very many logic functions; the
disadvantage is the unusually large number of memory cells required to implement
such a logic block when the number of inputs is large. A k-input LUT based logic
block can be implemented in a number of different ways, with a tradeoff between
performance and logic density. An n-input LUT can be seen as a direct
implementation of a function truth table, with each latch holding the value of the
function for one input combination. The advantage of a large fan-in AND gate based
implementation is that few logic blocks can implement the entire functionality,
thereby reducing the amount of area required by interconnects. On the other hand,
the disadvantage is the low density usage of logic blocks in a design that requires
fewer-input logic. Another disadvantage is the use of pull-up devices (AND gates)
that consume static power. To improve power, manufacturers provide low power
logic blocks at the expense of delay: such logic blocks have gates with a high
threshold and as a result consume less power, and can be used in non-critical paths.
Altera and Xilinx FPGAs are coarse grain architectures.
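
Behaviorally, a k-input LUT is just a 2^k bit memory indexed by its inputs. The
following sketch models a 3-input LUT whose truth table is a parameter (the module
and parameter names are illustrative, not a vendor primitive); for k = 3 there are
2^8 = 256 possible functions.

// Hypothetical model of a 3-input LUT; the INIT parameter plays the role
// of the 2^3 x 1 SRAM that holds the truth table.
module lut3 #(parameter [7:0] INIT = 8'b1110_1000) // default: 3-input majority
(input [2:0] sel, output f);
wire [7:0] truth = INIT; // the eight stored truth-table bits
assign f = truth[sel]; // index the table with the three inputs
endmodule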

CHAPTER 6

RESULTS
6.1 BLOCKWISE SIMULATION RESULTS
Each block of the block diagram is individually analyzed and executed in
Xilinx ISE 10.1. After simulating the Verilog code and implementing the design on
the FPGA, the simulation results obtained for the 2x2, 4x4 and 8x8 multipliers are
shown below.
1. 2x2 VEDIC MULTIPLIER
2x2 Vedic multiplier is designed, analyzed and simulated using Xilinx ISE
10.1. The simulation results are shown in figure 6.1.

Fig 6.1 Simulation results of 2x2 Vedic multiplier


DESCRIPTION:
a: 2 bit input data (multiplicand)
b: 2 bit input data (multiplier)
s: 4 bit output data

Example 1: a = 01, b = 00, s = 0000
Example 2: a = 01, b = 10, s = 0010
Example 3: a = 01, b = 11, s = 0011
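
As a hand check of Example 2 using the vertically and crosswise scheme: with
a = 01 and b = 10, the vertical product of the low bits is a0.b0 = 1.0 = 0 (bit s0), the
crosswise sum is a1.b0 + a0.b1 = 0.0 + 1.1 = 1 (bit s1), and the vertical product of the
high bits is a1.b1 = 0.1 = 0 with no carry (bits s3 s2 = 00), giving s = 0010 as seen in
the waveform.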

The factors on which the speed of the multiplier depends are obtained from
the synthesis report and are shown in table 6.1.

Table 6.1 Synthesis Results of 2 bit multiplier
FPGA Device Package                  3s50pq208-5
Number of slices                     2 out of 768
Number of 4 input LUTs               4 out of 1536
Number of IOs                        8
Number of bonded IOBs                8 out of 124
Maximum combinational path delay     7.858 ns
Memory usage                         141072 kilobytes

2. 4x4 VEDIC MULTIPLIER


4x4 Vedic multiplier is designed, analyzed and simulated using Xilinx ISE
10.1. The simulation results are shown in figure 6.2.

Fig 6.2 Simulation results of 4x4 Vedic multiplier


DESCRIPTION:
a: 4 bit input data (multiplicand)
b: 4 bit input data (multiplier)
s: 8 bit output data

Example 1: a = 0101, b = 0001, s = 0000 0101
Example 2: a = 0001, b = 0001, s = 0000 0001

The factors on which the speed of the multiplier depends are obtained from
the synthesis report and are shown in table 6.2.

Table 6.2 Synthesis Results of 4 bit multiplier
FPGA Device Package                  3s50pq208-5
Number of slices                     19 out of 768
Number of 4 input LUTs               33 out of 1536
Number of IOs                        16
Number of bonded IOBs                16 out of 124
Maximum combinational path delay     18.089 ns
Memory usage                         141072 kilobytes

3. 8x8 VEDIC MULTIPLIER


8x8 Vedic multiplier is designed, analyzed and simulated using Xilinx ISE
10.1. The simulation results are shown in figure 6.3.

Fig 6.3 Simulation results of 8x8 Vedic multiplier


DESCRIPTION:
a: 8 bit input data (multiplicand)
b: 8 bit input data (multiplier)
s: 16 bit output data

Example 1: a = 0001 0110, b = 0000 0001, s = 0000 0000 0001 0110
Example 2: a = 0001 0110, b = 0000 0011, s = 0000 0000 0100 0010
Example 3: a = 0100 0010, b = 0000 0010, s = 0000 0000 1000 0100

The factors on which the speed of the multiplier depends are obtained from
the synthesis report and are shown in table 6.3.

Table 6.3 Synthesis Results of 8 bit multiplier
FPGA Device Package                  3s50pq208-5
Number of slices                     94 out of 768
Number of 4 input LUTs               165 out of 1536
Number of IOs                        32
Number of bonded IOBs                32 out of 124
Maximum combinational path delay     28.451 ns
Memory usage                         145168 kilobytes

6.2 OUTPUTS OF 8 BIT MULTIPLIER DISPLAY ON FPGA


Fig 6.4 Display on Spartan 3E FPGA


As shown in figure 6.4, when programming is complete, the "program
succeeded" message is displayed. Figure 6.5 shows the 8 bit Vedic multiplier output
on the LEDs of the Spartan kit for the multiplication of 00000001 (multiplicand) and
00000111 (multiplier).

Fig 6.5 Output of 8 bit multiplier on FPGA

CHAPTER 7

CONCLUSION
7.1 Conclusion
The design of an 8 bit Vedic multiplier has been implemented on a Xilinx
Spartan 3 FPGA board. It is a hierarchical multiplier design which clearly shows the
computational advantages offered by Vedic methods. The computation delay for the
8 bit Vedic multiplier is 28.451 ns, so the Vedic multiplier is considerably faster than
the conventional multiplier. Awareness of Vedic mathematics can be effectively
increased if it is included in engineering education. The comparison between the
proposed multiplier and an 8 bit Booth radix-4 multiplier is shown in table 7.1; as the
table indicates, this multiplier can help build faster processors in the future.

Table 7.1 Comparison of Vedic and Booth multipliers

                  Vedic multiplier      Booth multiplier
Delay             28.451 ns             29.549 ns
Memory usage      145168 kilobytes      151860 kilobytes

7.2 Future scope

Future work includes the integration of a divider block and a multiply and
accumulate (MAC) unit, thereby extending the design into a Vedic Arithmetic and
Logic Unit (ALU). Reducing the time delay is an essential requirement for many
applications, and the Vedic multiplication technique is well suited for this purpose.

APPENDIX-I
VERILOG SOURCE CODE
HALF ADDER
A dataflow description of the half adder is given below. The 1 bit inputs to the half
adder are a and b, and the sum and carry outputs are s and c.
// define a 1-bit half adder by using data flow statements
module halfadder(s,c,a,b);
// I/O port declarations where s is sum and c is carry
output s,c;
input a,b;
// specify the function of a half adder
assign s = a^b;
assign c = a&b;
endmodule

4 BIT CARRY LOOKAHEAD ADDER


// define a 4-bit carry lookahead adder by using data flow statements
module cla(s,cout,a,b,cin);
// I/O port declarations where s is the 4 bit sum and cout is the carry out
output [3:0]s;// an array of 4 bit sum values
output cout;
input [3:0]a,b;// arrays of 4 bit input values a and b
input cin;
wire [3:0]g,p,c;
// specify the function of the carry lookahead adder
assign g=a&b; // generate terms
assign p=a^b; // propagate terms
assign c[0]=cin;
assign c[1]=g[0]|(p[0]&c[0]);
assign c[2]=g[1]|(p[1]&g[0])|(p[1]&p[0]&c[0]);
assign c[3]=g[2]|(p[2]&g[1])|(p[2]&p[1]&g[0])|(p[2]&p[1]&p[0]&c[0]);
assign cout=g[3]|(p[3]&g[2])|(p[3]&p[2]&g[1])|(p[3]&p[2]&p[1]&g[0])|(p[3]&p[2]&p[1]&p[0]&c[0]);
assign s=p^c; // sum bits
endmodule

8 BIT CARRY LOOKAHEAD ADDER


// define an 8-bit carry lookahead adder using data flow statements
module cla8(
//I/O port declarations where s is 8 bit sum cout is carry out
output [7:0]s,
output cout,
input [7:0]a,b,
input cin
);
wire c1;//declare net c1 for the circuit
wire [3:0]t1,t2;//declare net t1,t2 for the circuit
//instantiate cla,call it m1 and m2 respectively

cla m1(t1[3:0],c1,a[3:0],b[3:0],cin);
cla m2(t2[3:0],cout,a[7:4],b[7:4],c1);
assign s[3:0]=t1[3:0];
assign s[7:4]=t2[3:0];
endmodule

2x2 VEDIC MULTIPLIER


// Module 2x2 multiplier. Port list is taken exactly from the I/O diagram
module vm2(a,b,s);
// Port declarations from the I/O diagram
input [1:0]a;
input [1:0]b;
output [3:0]s;
// Internal wire declarations
wire [3:0]temp;
//stage 1
// four bit-products of the Vedic (vertically and crosswise) logic, using and gates
assign s[0] = a[0]&b[0];
assign temp[0] = a[1]&b[0];
assign temp[1] = a[0]&b[1];
assign temp[2] = a[1]&b[1];
//stage 2
// instantiate two half adders, called z1 and z2; their sum and carry bits
// drive the remaining output bits directly
halfadder z1(s[1],temp[3],temp[0],temp[1]);
halfadder z2(s[3],s[2],temp[2],temp[3]);
endmodule

4x4 VEDIC MULTIPLIER


// Module 4x4 multiplier. Port list is taken exactly from the I/O diagram
module vm4(a,b,s);
// Port declarations from the I/O diagram
input [3:0]a;
input [3:0]b;
output [7:0]s;
// Internal wire declarations
wire [3:0]q0,q1,q2,q3,t,t1;
wire c,c1,c2,x;
wire cout;
// Stage 1
// Instantiate 4 2x2 multipliers, called z1, z2, z3, z4
vm2 z1(a[1:0],b[1:0],q0[3:0]);
vm2 z2(a[3:2],b[1:0],q1[3:0]);
vm2 z3(a[1:0],b[3:2],q2[3:0]);
vm2 z4(a[3:2],b[3:2],q3[3:0]);
// Stage 2
// Instantiate 2 carry lookahead adders, called z5 and z6
cla z5(t[3:0],c,q1[3:0],q2[3:0],1'b0);
cla z6(t1[3:0],c1,t[3:0],{q3[1:0],q0[3:2]},1'b0);
// Assign outputs s0 to s5
assign s[1:0]=q0[1:0];
assign s[5:2]=t1[3:0];
assign x=c|c1;
// Instantiate 2 half adders, called z7 and z8, for the top two sum bits
halfadder z7(s[6],c2,x,q3[2]);
halfadder z8(s[7],cout,c2,q3[3]);
endmodule

8x8 VEDIC MULTIPLIER


// Module 8x8 multiplier. Port list is taken exactly from the I/O diagram
module vm8(a,b,s);
// Port declarations from the I/O diagram
input [7:0]a;
input [7:0]b;
output [15:0]s;
// Internal wire declarations
wire c,c1,c2,c3,c4,x,cout;
wire [7:0]q0,q1,q2,q3,t1,t2;
//stage 1
// Instantiate 4 4x4 multipliers, called z1, z2, z3, z4
vm4 z1(a[3:0],b[3:0],q0[7:0]);
vm4 z2(a[7:4],b[3:0],q1[7:0]);
vm4 z3(a[3:0],b[7:4],q2[7:0]);
vm4 z4(a[7:4],b[7:4],q3[7:0]);
// Stage 2
// Instantiate 2 carry lookahead adders, called r1 and r2
cla8 r1(t1[7:0],c,q1[7:0],q2[7:0],1'b0);
cla8 r2(t2[7:0],c1,t1[7:0],{q3[3:0],q0[7:4]},1'b0);
assign x=c|c1;
// assign outputs s0 to s11
assign s[3:0]=q0[3:0];
assign s[11:4]=t2[7:0];
// instantiate 4 half adders, called m5, m6, m7, m8, for the top four sum bits
halfadder m5(s[12],c2,x,q3[4]);
halfadder m6(s[13],c3,c2,q3[5]);
halfadder m7(s[14],c4,c3,q3[6]);
halfadder m8(s[15],cout,c4,q3[7]);
endmodule
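
The waveforms in Chapter 6 were produced by applying stimulus to these modules.
A minimal stimulus block for the 8x8 multiplier might look like the sketch below;
the testbench module and its $display checks are illustrative additions, not part of
the original sources (the expected values follow from the examples in section 6.1).

// Hypothetical testbench sketch for the 8x8 Vedic multiplier
module vm8_tb;
reg [7:0] a, b;
wire [15:0] s;
vm8 dut(a, b, s); // device under test
initial begin
a = 8'b0001_0110; b = 8'b0000_0001; #10; // expect s = 22
$display("a=%d b=%d s=%d", a, b, s);
a = 8'b0001_0110; b = 8'b0000_0011; #10; // expect s = 66
$display("a=%d b=%d s=%d", a, b, s);
a = 8'b0100_0010; b = 8'b0000_0010; #10; // expect s = 132
$display("a=%d b=%d s=%d", a, b, s);
$finish;
end
endmodule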

APPENDIX-II
VERILOG FUNCTIONS

OPERATORS

Operator type     Symbol    Operation performed
Arithmetic        +         addition
                  -         subtraction
                  *         multiplication
                  /         division
                  %         modulus
Logical           !         negation
                  &&        logical and
                  ||        logical or
Relational        >         greater than
                  <         less than
                  >=        greater than or equal
                  <=        less than or equal
Equality          ==        equality
                  !=        inequality
                  ===       case equality
                  !==       case inequality
Bitwise           ~         bitwise negation
                  &         bitwise and
                  |         bitwise or
                  ^         bitwise xor
                  ~^        bitwise xnor
Shift             >>        right shift
                  <<        left shift
                  >>>       arithmetic right shift
                  <<<       arithmetic left shift
Concatenation     {}        concatenation
Replication       {{}}      replication
Conditional       ?:        conditional

NUMBER SPECIFICATION
1. Sized numbers: these are represented as <size>'<base format><number>.
   Ex: 4'b1111 is a 4 bit binary number.
2. Unsized numbers: numbers specified without a <base format> are decimal
   numbers by default.
   Ex: 23456 is a 32 bit decimal number by default.
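
A few literal forms side by side, in an illustrative module (the names are arbitrary):

// Hypothetical examples of sized and unsized number specification
module literals_demo;
wire [3:0] n1 = 4'b1111; // sized: 4 bit binary
wire [7:0] n2 = 8'hA5; // sized: 8 bit hexadecimal
integer n3 = 23456; // unsized: decimal, 32 bits by default
endmodule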

Keywords:
always        starts an always begin ... end sequential code block
and           gate primitive, and
assign        parallel continuous assignment
automatic     a function attribute, basically reentrant and recursive
begin         starts a block that ends with end (no semicolon)
buf           gate primitive, buffer
bufif0        gate primitive, buffer if control==0
bufif1        gate primitive, buffer if control==1
case          starts a case statement
casex         starts a case statement where x matches
casez         starts a case statement where z matches
cell          library, cell identifier, in configuration
cmos          switch primitive, cmos
config        starts a configuration
default       optional last clause in a case statement
defparam      used to over-ride parameter values
design        top level module, in configuration
disable       disables a task or block
edge          edge control specifier
else          execute if no previous clause was true
end           end of a block, paired with a begin
endcase       end of a case statement
endconfig     end of a configuration
endfunction   end of a function definition
endgenerate   end of a generate
endmodule     end of a module definition
endprimitive  end of a primitive definition
endspecify    end of a specify
endtable      end of a table definition
endtask       end of a task definition
for           starts a for statement
force         starts a forced net or variable assignment
forever       starts a loop statement
fork          begins parallel execution of sequential code
function      starts a function definition
generate      starts a generate block
genvar        defines a generate variable
if            starts an if statement, if(condition) ...
ifnone        state dependent path declaration
incdir        file path for library
include       include file specification
initial       starts an initial begin ... end sequential block
inout         declares a port name to be both input and output
input         declares a port name to be input
instance      specify instance name, in configuration
integer       variable data type, 32 bit integer
join          end of a parallel fork
large         charge strength, 4, of trireg
liblist       library search order for modules, in configuration
library       location of modules, libraries and files
localparam    starts a local parameter statement, not over-ridden
macromodule   same as module with possibly extra meanings
medium        charge strength, 2, of trireg
module        begins a module definition
nand          gate primitive, nand
negedge       event expression, negative edge
nmos          switch primitive, nmos
nor           gate primitive, nor
not           gate primitive, not
notif0        gate primitive, not if control==0
notif1        gate primitive, not if control==1
or            gate primitive, or
output        declares a port name to be an output
parameter     starts a parameter statement
posedge       event expression, positive edge
primitive     starts the definition of a primitive module
pulldown      gate primitive
pullup        gate primitive
rcmos         switch primitive, rcmos
real          variable data type, floating point
realtime      variable data type, floating point time
reg           variable data type, starts a declaration of name(s)
release       releases a forced net or variable assignment
repeat        starts a loop statement
rnmos         switch primitive, rnmos
rpmos         switch primitive, rpmos
rtran         bidirectional switch primitive, rtran
rtranif0      bidirectional switch primitive, rtranif0
rtranif1      bidirectional switch primitive, rtranif1
scalared      property of a vector type
signed        type modifier, reg signed
specify       starts a specify block
specparam     starts a parameter statement for timing delays
table         starts a table definition in a primitive
task          starts a task definition
time          variable data type, 64 bit integer
tran          bidirectional switch primitive, tran
tranif0       bidirectional switch primitive, tranif0
tranif1       bidirectional switch primitive, tranif1
tri           net data type
tri0          net data type, connected to VSS
tri1          net data type, connected to VDD
triand        net data type, tri state wired and
trior         net data type, tri state wired or
trireg        register data type that associates capacitance to the net
unsigned      type modifier, unsigned
use           library, cell identifier, in configuration
vectored      property of a vector type
wait          starts a wait statement
wand          net data type, wired and
weak0         drive strength 3
weak1         drive strength 3
while         starts a sequential looping statement
wire          net data type, a basic wire connection
wor           net data type, wired or
xnor          gate primitive, xnor, not of exclusive or
xor           gate primitive, xor, exclusive or

STRUCTURE OF A VERILOG PROGRAM


A Verilog program is structured as a set of modules, which may represent anything
from a collection of logic gates to a complete system. Modules are similar to classes
in C++, although not nearly as powerful. A module specifies its input and output
ports, which describe the incoming and outgoing connections of a module.
A module may also declare additional variables. The body of a module consists of:
1. initial constructs, which can initialize reg variables
2. Continuous assignments, which define only combinational logic
3. always constructs, which can define either sequential or combinational logic
4. Instances of other modules, which are used to implement the module
being defined
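
The sketch below shows all four kinds of body items in one module; the names are
illustrative, with the instance reusing the vm2 module from Appendix I.

// Hypothetical module containing all four kinds of body items
module body_demo(input clk, input [1:0] a, b, output [3:0] p, output reg any);
wire [3:0] prod;
initial any = 1'b0; // 1. initial construct initializing a reg variable
assign p = prod; // 2. continuous assignment (combinational logic)
always @(posedge clk) // 3. always construct (sequential logic here)
any <= |prod; // reduction-or of the product bits
vm2 u0(a, b, prod); // 4. instance of another module
endmodule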

REFERENCES
[1] Wallace, C.S., "A Suggestion for a Fast Multiplier", IEEE Trans. Electronic
Computers, vol. EC-13, no. 1, pp. 14-17, Feb. 1964.
[2] Booth, A.D., "A Signed Binary Multiplication Technique", Quarterly Journal of
Mechanics and Applied Mathematics, vol. 4, pt. 2, pp. 236-240, 1951.
[3] Jagadguru Swami Sri Bharati Krisna Tirthaji Maharaja, "Vedic Mathematics or
Sixteen Simple Mathematical Formulae from the Veda", Delhi (1965), Motilal
Banarsidass, Varanasi, India, 1986.
[4] Neil H.E. Weste, David Harris, Ayan Banerjee, "CMOS VLSI Design: A Circuits
and Systems Perspective", Third Edition, Pearson Education, pp. 327-328.
[5] A.P. Nicholas, K.R. Williams, J. Pickles, "Application of Urdhava Sutra",
Spiritual Study Group, Roorkee (India), 1984.
[6] Jeganathan Sriskandarajah, "Secrets of Ancient Maths: Vedic Mathematics",
Journal of Indic Studies Foundation, California, pages 15 and 16.
[7] Rabaey, Nikolic, and Chandrakasan, "Digital Integrated Circuits: A Design
Perspective", 2nd Edition, Prentice Hall, pp. 586-594, 2003.
[8] Roy, Kaushik and Yeo, Kiat-Seng, "Low Voltage, Low Power VLSI
Subsystems", McGraw-Hill, pp. 124-141.
[9] Weste, Neil H.E. and Eshraghian, Kamran, "CMOS VLSI Design: A Circuits and
Systems Perspective", 3rd Edition, Pearson Education, pp. 345-356, 2005.
[10] Parth Mehta and Dhanashri Gawali, "Conventional versus Vedic Mathematics
Method for Hardware Implementation of a Multiplier", International Conference on
Advances in Computing, Control, and Telecommunication Technologies, pp.
640-642, 2009.
[11] Ramalatha, M., Dayalan, K.D., Dharani, P., Deborah Priya, S., "High Speed
Energy Efficient ALU Design using Vedic Multiplication Techniques", International
Conference on Advances in Computational Tools for Engineering Applications
(ACTEA), IEEE, pp. 600-603, July 15-17, 2009.
[12] www.xilinx.com
