
Abstract

The binary adder is the critical element in most digital circuit designs, including digital signal processor (DSP) and microprocessor datapath units. As such, extensive research continues to be focused on improving the performance of the adder. Binary adders are among the most essential logic elements within a digital system. In this project a DCT design is implemented using these adders; the DCT is a natural application for adders and is widely used in DSP blocks. A comparison is made between the conventional carry skip adder and the proposed carry skip adder with D-latch.

Approximation of the discrete cosine transform (DCT) is useful for reducing its computational complexity without significant impact on its coding performance. Most of the existing algorithms for approximation of the DCT target only small transform lengths, and some of them are non-orthogonal. The proposed algorithm is highly scalable for hardware as well as software implementation of the DCT of higher lengths, and it can make use of an existing approximation of the 8-point DCT to obtain an approximate DCT of any power-of-two length. We demonstrate that the proposed approximation of the DCT provides comparable or better image and video compression performance than the existing approximation methods, and that it involves lower arithmetic complexity than the existing ones. We have presented a fully scalable, reconfigurable parallel architecture for the computation of the approximate DCT based on the proposed algorithm. One uniquely interesting feature of the proposed design is that it can be configured for the computation of a 32-point DCT or for parallel computation of two 16-point DCTs or four 8-point DCTs with a marginal control overhead. The project is developed in the Verilog language and implemented in Xilinx ISE.

INTRODUCTION

The discrete cosine transform (DCT) is popularly used in image and video compression. Since the DCT is computationally intensive, several algorithms have been proposed in the literature to compute it efficiently. Recently, significant work has been done to derive approximations of the 8-point DCT for reducing the computational complexity. The main objective of the approximation algorithms is to get rid of multiplications, which consume most of the power and computation time, while still obtaining a meaningful estimate of the DCT. Haweel has proposed the signed DCT (SDCT) for 8 x 8 blocks, where the basis vector elements are replaced by their sign, i.e., ±1. Bouguezel, Ahmad, and Swamy (BAS) have proposed a series of methods that provide a good estimation of the DCT by replacing the basis vector elements by 0, ±1/2, and ±1. In the same vein, Bayer and Cintra have proposed two transforms derived from 0 and ±1 as elements of the transform kernel, and have shown that their methods perform better than the SDCT, particularly for low- and high-compression-ratio scenarios. The need for approximation is more important for larger DCT sizes, since the computational complexity of the DCT grows nonlinearly. Moreover, modern video coding standards such as High Efficiency Video Coding (HEVC) [10] use DCTs of larger block sizes (up to 32 x 32) in order to achieve higher compression ratios. However, the extension of the design strategy used in H.264/AVC to larger transform sizes, such as 16-point and 32-point, is not possible. Besides, several image processing applications, such as tracking and simultaneous compression and encryption, require higher DCT sizes. In this context, Cintra has introduced a new class of integer transforms applicable to several block lengths. Cintra et al. have also proposed a new 16 x 16 matrix for approximation of the 16-point DCT, and have validated it experimentally. Recently, two new transforms have been proposed for 8-point DCT approximation: Cintra et al. have proposed a low-complexity 8-point approximate DCT based on integer functions, and Potluri et al. have proposed a novel 8-point DCT approximation that requires only 14 additions. On the other hand, Bouguezel et al. have proposed two methods for a multiplication-free approximate form of the DCT. The first method is for lengths 8, 16, and 32, and is based on an appropriate extension of the integer DCT. Also, a systematic method for developing a binary version of high-size DCT (BDCT) by using the sequency-ordered Walsh-Hadamard transform (SO-WHT) has been proposed. This transform is a permuted version of the WHT which approximates the DCT very well and maintains all the advantages of the WHT. A scheme of approximation of the DCT should have the following features:
i) It should have low computational complexity.
ii) It should have low error energy in order to provide compression performance close to the
exact DCT, and preferably should be orthogonal.
iii) It should work for higher lengths of DCT to support modern video coding standards, and
other applications like tracking, surveillance, and simultaneous compression and encryption.

However, the existing DCT approximation algorithms do not satisfy all three of the above requirements at once. Some of the existing methods are deficient in terms of scalability, generalization for higher sizes, and orthogonality. We intend to maintain orthogonality in the approximate DCT for two reasons. Firstly, if the transform is orthogonal, we can always find its inverse, and the kernel matrix of the inverse transform is obtained by simply transposing the kernel matrix of the forward transform. This feature of the inverse transform can be used to compute the forward and inverse DCT with similar computing structures.

Discrete Cosine Transform


The DFT is not the only transform that is widely used in applications. Published standards for image and video coding (compression) make use of the DCT:

1. JPEG (1989)
2. MPEG1, 2, and 4
2.1. MPEG1 (1992): video CD players, storage and retrieval of moving pictures and audio on storage
media
2.2. MPEG2 (1994): HDTV, DVD, standard for Digital TV (cable)
2.3. MPEG3, originally targeted for HDTV, was incorporated into MPEG2
2.4. MPEG4 (late 1998): standard for multimedia applications, targeted for wireless video
3. H.261, H.263
3.1. H.261 (circa 1993): video conferencing
3.2. H.263 (circa 1995): wireless video
These standards provide instructions for decoding the signal, but there is often considerable freedom for
encoding the signal. For more information go to http://drogo.cselt.it/mpeg/.

Compression
Two classes of compression algorithms try to reduce the number of bits required to represent a signal.

• lossless: compression ratios around 2-3:1 for data files
• lossy: compression ratios up to 1000:1 for video

For wireless video, compression ratios up to 1000:1 are needed. Near-lossless video compression can be achieved at about 8:1 with little degradation.

Compression algorithms work by removing redundancy in the signal. In video signals, the redundancy
can be of three forms.

• statistical (e.g., Huffman codes, arithmetic coding, Lempel-Ziv)
• spatial (e.g., vector quantization, DCT, subband coders, wavelets)
• temporal (e.g., motion compensation)

Wavelet compression is used in JPEG-2000, MPEG4, and H.263+.

Transform coders decompose a frame into blocks, typically 8 x 8. In MPEG2 they are called macroblocks, and the frame is divided into luminance (intensity) and chrominance (color) images (YUV).

• luminance image: one 16 x 16 macroblock or four 8 x 8 blocks (Y)
• chrominance image: two 8 x 8 blocks (UV)
A 2-D DCT of each block is computed and the transform coefficients are quantized. Quantized
coefficients are coded losslessly. The choice of quantization affects the transmission rate and distortion.

Advantages of the DCT (relative to the DFT):

• real-valued
• better energy compaction (much of the signal energy can be represented by only a few coefficients)
• coefficients are nearly uncorrelated
• experimentally observed to work well
2-D DCT

$$X_c(k_1, k_2) = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} 4\, x(n_1, n_2) \cos\!\left[\frac{\pi (2 n_1 + 1) k_1}{2 N_1}\right] \cos\!\left[\frac{\pi (2 n_2 + 1) k_2}{2 N_2}\right]$$

$$x(n_1, n_2) = \frac{1}{N_1 N_2} \sum_{k_1=0}^{N_1-1} \sum_{k_2=0}^{N_2-1} C(k_1)\, C(k_2)\, X_c(k_1, k_2) \cos\!\left[\frac{\pi (2 n_1 + 1) k_1}{2 N_1}\right] \cos\!\left[\frac{\pi (2 n_2 + 1) k_2}{2 N_2}\right]$$

$$C(k) = \begin{cases} 1/2, & k = 0 \\ 1, & k \neq 0 \end{cases}$$

1. The DFT is related to the Fourier series coefficients of a periodically extended sequence.
2. The DCT is related to the Fourier series coefficients of a symmetrically extended sequence.
3. The 2-D DCT is a separable transform. It can be evaluated using a row-column decomposition, as sketched below. For an 8 x 8 DCT, we need 16 1-D DCTs. Calling 16 separate functions may lead to unacceptable overhead for 8-point DCTs.
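Since the 2-D kernel factors into the product of two 1-D cosine kernels, the row-column decomposition can be written out explicitly; a sketch of the factorization, using the definitions above:

$$X_c(k_1, k_2) = \sum_{n_1=0}^{N_1-1} 2\cos\!\left[\frac{\pi (2 n_1 + 1) k_1}{2 N_1}\right] \left( \sum_{n_2=0}^{N_2-1} 2\, x(n_1, n_2) \cos\!\left[\frac{\pi (2 n_2 + 1) k_2}{2 N_2}\right] \right)$$

The inner sum is the 1-D DCT of row $n_1$, and the outer sum is the 1-D DCT of the resulting columns; for an 8 x 8 block this amounts to 8 row transforms plus 8 column transforms, i.e., the 16 1-D DCTs mentioned above.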

1-D DCT

$$X_c(k) = \sum_{n=0}^{N-1} 2\, x(n) \cos\!\left[\frac{\pi (2n+1) k}{2N}\right]$$

Define the symmetric extension of $x(n)$ as $y(n) = x(n) + x(2N-1-n)$, with $x(n)$ taken as zero outside $0 \le n \le N-1$, so that $y(n) = x(n)$ for $n = 0, 1, \ldots, N-1$ and $y(n) = x(2N-1-n)$ for $n = N, \ldots, 2N-1$.

[Sketch: $x(n)$ defined on $0 \le n \le N-1$, and its symmetric extension $y(n)$ on $0 \le n \le 2N-1$.]

Now consider the 2N-point DFT of $y(n)$:

$$Y(k) = \sum_{n=0}^{2N-1} y(n)\, e^{-j\frac{2\pi}{2N}kn}$$

$$= \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi}{2N}kn} + \sum_{n=N}^{2N-1} x(2N-1-n)\, e^{-j\frac{2\pi}{2N}kn}$$

$$= \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi}{2N}kn} + \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi}{2N}k(2N-1-n)}$$

$$= \sum_{n=0}^{N-1} x(n)\, e^{j\frac{\pi k}{2N}} \left( e^{-j\frac{\pi k (2n+1)}{2N}} + e^{j\frac{\pi k (2n+1)}{2N}} \right)$$

$$= e^{j\frac{\pi k}{2N}} \sum_{n=0}^{N-1} 2\, x(n) \cos\!\left[\frac{\pi (2n+1) k}{2N}\right]$$

Algorithm #1 for 1-D DCT

1. Set $y(n) = x(n) + x(2N-1-n)$
2. Calculate $Y(k)$ using a 2N-point DFT
3. Set $X_c(k) = \exp\!\left(-j\frac{\pi k}{2N}\right) Y(k)$, for $k = 0, 1, \ldots, N-1$.

This requires N + N log2(2N) complex multiplies. Another algorithm can be developed that requires fewer multiplies by using a shorter DFT.


Review of 1-D Decimation-in-Time (DIT) FFT
Consider the 1-D DFT

$$Y(k) = \sum_{n=0}^{2N-1} y(n)\, W_{2N}^{nk}$$

Divide the sum into two components, one over the even samples and one over the odd samples:

$$Y(k) = \sum_{r=0}^{N-1} y(2r)\, W_{2N}^{2rk} + \sum_{r=0}^{N-1} y(2r+1)\, W_{2N}^{(2r+1)k}, \qquad k = 0, 1, \ldots, 2N-1$$

$$= \sum_{r=0}^{N-1} y(2r)\, W_{N}^{rk} + W_{2N}^{k} \sum_{r=0}^{N-1} y(2r+1)\, W_{N}^{rk}, \qquad k = 0, 1, \ldots, 2N-1$$

$$= G(k) + W_{2N}^{k}\, H(k)$$

Note that $W_{2N}^{2rk} = \exp\!\left(-j\frac{2\pi \cdot 2rk}{2N}\right) = \exp\!\left(-j\frac{2\pi rk}{N}\right) = W_{N}^{rk}$.

$G(k)$ and $H(k)$ are N-point DFTs:

Set

$$g(n) = y(2n), \qquad n = 0, 1, \ldots, N-1$$
$$h(n) = y(2n+1), \qquad n = 0, 1, \ldots, N-1$$

For example, if $x(n) = (a, b, c, d)$ then $y(n) = (a, b, c, d, d, c, b, a)$, so that

$$g(n) = (a, c, d, b), \qquad h(n) = (b, d, c, a) = g(N-1-n)$$

$$\Rightarrow\; H(k) = W_{N}^{-k}\, G\big((N-k)\big)_{N} = W_{N}^{-k}\, G^{*}(k)$$

Therefore,

$$Y(k) = G(k) + W_{2N}^{k}\, W_{N}^{-k}\, G^{*}(k) = G(k) + e^{-j\frac{\pi k}{N}}\, e^{j\frac{2\pi k}{N}}\, G^{*}(k) = G(k) + e^{j\frac{\pi k}{N}}\, G^{*}(k), \qquad k = 0, 1, \ldots, 2N-1$$

so that

$$X_c(k) = \mathrm{Re}\!\left\{ e^{-j\frac{\pi k}{2N}}\, Y(k) \right\} = \mathrm{Re}\!\left\{ e^{-j\frac{\pi k}{2N}}\, G(k) + e^{j\frac{\pi k}{2N}}\, G^{*}(k) \right\}$$

$$X_c(k) = 2\, \mathrm{Re}\!\left\{ e^{-j\frac{\pi k}{2N}}\, G(k) \right\} = 2\left[ \cos\!\left(\frac{\pi k}{2N}\right) \mathrm{Re}\, G(k) + \sin\!\left(\frac{\pi k}{2N}\right) \mathrm{Im}\, G(k) \right], \qquad k = 0, 1, \ldots, N-1$$

Algorithm #2 for 1-D DCT

1. Set
$$g(n) = x(2n), \qquad n = 0, 1, \ldots, \tfrac{N}{2}-1$$
$$g(N-1-n) = x(2n+1), \qquad n = 0, 1, \ldots, \tfrac{N}{2}-1$$

2. Calculate the N-point DFT of $g(n)$, giving $G(k)$

3. Set
$$X_c(k) = 2\left[ \cos\!\left(\frac{\pi k}{2N}\right) \mathrm{Re}\, G(k) + \sin\!\left(\frac{\pi k}{2N}\right) \mathrm{Im}\, G(k) \right], \qquad k = 0, 1, \ldots, N-1$$

The coefficients in front of Re G(k) and Im G(k) can be pre-computed for different values of k for a given N.
DCT APPROXIMATION

The elements of -point DCT matrix are given by:

where , , and for 0. The DCT given by (1) is referred


to as exact DCT in order to distinguish it from approximated forms of DCT. For
and , for any even value of we can

Since , (2) can be rewritten as:

Hence, the cosine transform kernel on the right-hand side of (3) corresponds to -point DCT
and its elements can be assumed to be , for . Therefore,
the first elements of even rows of DCT matrix of size correspond to the -point
DCT matrix. Accordingly, the recursive decomposition of can be performed as detailed.
Using the even/odd symmetries of its row vectors, DCT matrix can be represented by the
following matrix product.

Where is a block sparse matrix expressed by:


Where is the zero matrix. Block sub matrix consists of odd
rows of the first columns of . is a permutation matrix expressed
by:

Where is a row of zeros and is a matrix defined by its row


vectors as:

Where is the th row vector of the identity


matrix. Finally, the last matrix in (4), is defined by:

Where is an matrix having all ones on the anti-diagonal and


zeros elsewhere.

To reduce the computational complexity of DCT, the computational cost of matrices presented in
(4) is required to be assessed. Since does not involve any arithmetic or logic operation,
and requires additions and subtractions, they contribute very little to the
total arithmetic complexity and cannot be reduced further. Therefore, for reducing the
computational complexity of -point DCT, we need to approximate in (5). Let and
denote the approximation matrices of and , respectively. To find these
approximated sub matrices, we take 8 as the smallest size of DCT matrix at which to terminate the approximation procedure, since the 4-point DCT and the 2-point DCT can be implemented by adders only. Consequently, a good approximation of , where is an integral power of
two, for , leads to a proper approximations of and . For approximation of
we can choose the 8-point DCT since that presents the best trade-off between the number
of required arithmetic operators and quality of the reconstructed image. The trade-off analysis
shows that approximating by where denotes the rounding-off
operation outperforms the current state-of-the-art of 8-point approximation methods.
When we closely look at (4) and (5), we note that operates on sums of pixel pairs

while operates on differences of the same pixel pairs. Therefore, if we replace by ,


we shall have two main advantages. Firstly, we shall have good compression performance due to

the efficiency of and secondly the implementation will be much simpler, scalable and
reconfigurable. For approximation of we have investigated two other low-complexity
alternatives, and in the following we discuss here three possible options of approximation of
:
i) The first one is to approximate by the null matrix, which implies that all even-indexed DCT coefficients are assumed to be zero. The transform obtained by this approximation is far from the exact values of the even-indexed DCT coefficients, and those coefficients then carry no information.
ii) The second solution is obtained by approximating by an 8X8 matrix where each row
contains one 1 and all other elements are zeros. Here, elements equal to 1 correspond to the
maximum of elements of the exact DCT in each row. The approximate transform in this case is
closer to the exact DCT than the solution obtained by null matrix.

iii) The third solution consists of approximating by . Since , as well as , is a sub matrix of and both operate on matrices generated by sums and differences of pixel pairs at a distance of 8, approximating by has attractive computational properties: regularity of the signal-flow graph, orthogonality (since is orthogonalizable), and good compression efficiency, in addition to scalability and scope for reconfigurable implementation.
We have not done an exhaustive search of all possible solutions, so there could be other possible low-complexity implementations of . However, other solutions are not expected to have the potential for reconfigurability that we achieve by the replacement of by . Based on this third possible approximation of , we have obtained the proposed approximation of as:

As stated before, matrix is orthogonalizable. Indeed, for each we can calculate


given by:

where denotes matrix transposition. For data compression, we can use

instead of since . Since is a diagonal matrix, it can be integrated into the


scaling in the quantization process (without additional computational complexity). Therefore, as

adopted, the computational cost of is equal to that of . Moreover, the term can be
integrated in the quantization step in order to have a multiplierless architecture. The procedure for
the generation of the proposed orthogonal approximated DCT is stated in Algorithm 1.
Fig. 1. Signal flow graph (SFG) of the proposed 8-point approximation. Dashed arrows represent multiplications by -1.

SCALABLE AND RECONFIGURABLE ARCHITECTURE FOR DCT COMPUTATION

In this section, we discuss the proposed scalable architecture for the computation of the approximate DCT of lengths 8, 16, and 32. We have derived a theoretical estimate of its hardware complexity and discuss the reconfiguration scheme.

Scalable Design

The basic computational block of the algorithm for the proposed DCT approximation is given above. The block diagram of the computation of the DCT based on it is shown in Fig. 1. For a given input sequence, the approximate DCT coefficients are obtained accordingly. An example of the block diagram of the 16-point structure is illustrated in Fig. 2, where two units for the computation of the 8-point approximation are used along with an input adder unit and an output permutation unit. The functions of these two blocks are shown respectively in (8) and (6). Note that the structure of the 16-point DCT of Fig. 2 can be extended to obtain the DCT of higher sizes. For example, the structure for the computation of the 32-point DCT can be obtained by combining a pair of 16-point DCTs with an input adder block and an output permutation block. A behavioural sketch of the input adder stage is given below.
Complexity Comparison
To assess the computational complexity of proposed –point approximate DCT
, we need to determine the computational cost of matrices quoted in (9).
As shown in Fig. 1 the approximate 8-point DCT involves 22 additions. Since has no
computational cost and requires additions for –point DCT, the overall arithmetic
complexity of 16-point, 32-point, and 64-point DCT approximations are 60, 152, and 368
additions, respectively. More generally, the arithmetic complexity of -point DCT is equal to
additions. Moreover, since the structures for the computation of the DCT of different lengths are regular and scalable, the computational time for the DCT coefficients can be found to be proportional to the addition-time. The number of arithmetic operations involved


in proposed DCT approximation of different lengths and those of the existing competing
approximations are shown in Table I. It can be found that the proposed method requires the
lowest number of additions, and does not require any shift operations. Note that shift operation
does not involve any combinational components, and requires only rewiring during hardware
implementation. But it has indirect contribution to the hardware complexity since shift-add
operations lead to increase in bit-width which leads to higher hardware complexity of arithmetic
units which follow the shift-add operation. Also, we note that all considered approximation
methods involve significantly less computational complexity over that of the exact DCT
algorithms. According to the Loeffler algorithm, the exact DCT computation requires 29, 81,
209, and 513 additions along with 11, 31, 79, and 191 multiplications, respectively for 8, 16, 32,
and 64-point DCTs.
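The addition counts quoted above are consistent with a simple recurrence. Assuming, as described, that an N-point stage places an N-addition input adder stage in front of two parallel (N/2)-point units, a sketch of the count is:

$$A(N) = 2\,A(N/2) + N, \qquad A(8) = 22$$
$$A(16) = 2(22) + 16 = 60, \qquad A(32) = 2(60) + 32 = 152, \qquad A(64) = 2(152) + 64 = 368$$

which matches the figures above and, in closed form, gives $A(N) = N \log_2 N - N/4$ for $N \ge 8$ a power of two (this closed form is inferred from the quoted numbers, not stated in the source).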
Fig. 2. Block diagram of the proposed DCT for

Fig. 3. Reconfigurable architecture for approximate DCT of lengths


Reconfiguration Scheme

As specified in the recently adopted HEVC standard, DCTs of different lengths such as 8, 16, and 32 are required to be used in video coding applications. Therefore, a given DCT architecture should potentially be reused for the DCT of different lengths instead of using separate structures
for different lengths. We propose here such reconfigurable DCT structures which could be
reused for the computation of DCT of different lengths. The reconfigurable architecture for the
implementation of approximated 16-point DCT is shown in Fig. 3. It consists of three computing
units, namely two 8-point approximated DCT units and a 16-point input adder unit that generates the sums and differences of the mirrored input pairs. The input to the first 8-point DCT approximation unit is fed through 8 MUXes that select either the outputs of the input adder unit or the external inputs directly, depending on whether the unit is used for 16-point DCT


calculation or 8-point DCT calculation. Similarly, the input to the second 8-point DCT unit (Fig.
3) is fed through 8 MUXes that select either the input adder outputs or the external inputs directly, depending on whether it is used for 16-point
DCT calculation or 8-point DCT calculation. On the other hand, the output permutation unit uses

14 MUXes to select and re-order the outputs depending on the size of the selected DCT. A control input to the MUXes is used to select the inputs and to perform the permutation according to the size of the DCT to be computed. Specifically, one value of the control input enables the computation of the 16-point DCT, and the other enables the computation of a pair of 8-point DCTs in parallel. Consequently, the architecture of Fig. 3 allows the calculation of a 16-point DCT or two 8-point DCTs in parallel.
Literature survey

INTRODUCTION TO DIFFERENT ADDERS

Arithmetic is the oldest and most elementary branch of mathematics. The name arithmetic comes from the Greek word άριθμός (arithmos). Arithmetic is used by almost everyone, for tasks ranging from simple day-to-day work like counting to advanced science and business calculations. As a result, the need for faster and more efficient adders in computers has been a topic of interest over decades. Addition is a fundamental operation for any digital system, digital signal processing, or control system. A fast and accurate operation of a digital system is greatly influenced by the performance of its resident adders. Adders are also a very important component in digital systems because of their extensive use in other basic digital operations such as subtraction, multiplication, and division. Hence, improving the performance of the digital adder would greatly advance the execution of binary operations inside a circuit composed of such blocks. The performance of a digital circuit block is gauged by analyzing its power dissipation, layout area, and operating speed.

To humans, decimal numbers are easy to comprehend and use for performing arithmetic. However, in digital systems such as a microprocessor, DSP (Digital Signal Processor), or ASIC (Application-Specific Integrated Circuit), binary numbers are more pragmatic for a given computation, because two-valued signals map directly onto the switching behaviour of the underlying logic devices.

Binary adders are one of the most essential logic elements within a digital system. In addition, binary adders are also used in units other than Arithmetic Logic Units (ALUs), such as multipliers, dividers, and memory addressing. Binary addition is so fundamental that any improvement in it can result in a performance boost for any computing system and, hence, help improve the performance of the entire system.


The major problem for binary addition is the carry chain. As the width of the input operands increases, the length of the carry chain increases. Figure 2.1 demonstrates an example of an 8-bit binary add operation and how the carry chain is affected. This example shows that the worst case occurs when the carry travels the longest possible path, from the least significant bit (LSB) to the most significant bit (MSB). In order to improve the performance of carry-propagate adders, it is possible to accelerate the carry chain, but not eliminate it. Consequently, digital designers often resort to building faster adders when optimizing a computer architecture, because adders tend to set the critical path for most computations.

Fig 2.1 Binary Adder Example.

The binary adder is the critical element in most digital circuit designs including digital signal
processors (DSP) and microprocessor data path units. As such, extensive research continues to
be focused on improving the power delay performance of the adder. In VLSI implementations,
parallel-prefix adders are known to have the best performance. Reconfigurable logic such as
Field Programmable Gate Arrays (FPGAs) has been gaining in popularity in recent years
because it offers improved performance in terms of speed and power over DSP-based and
microprocessor-based solutions for many practical designs involving mobile DSP and
telecommunications applications and a significant reduction in development time and cost over
Application Specific Integrated Circuit (ASIC) designs.

The power advantage is especially important with the growing popularity of mobile and portable
electronics, which make extensive use of DSP functions. However, because of the structure of
the configurable logic and routing resources in FPGAs, parallel-prefix adders will have a
different performance than VLSI implementations. In particular, most modern FPGAs employ a
fast-carry chain which optimizes the carry path for the simple Ripple Carry Adder (RCA). In this
paper, the practical issues involved in designing and implementing tree-based adders on FPGAs
are described. Several tree-based adder structures are implemented and characterized on an FPGA
and compared with the Ripple Carry Adder (RCA) and the Carry Skip Adder (CSA). Finally,
some conclusions and suggestions for improving FPGA designs to enable better tree-based adder
performance are given.

As mentioned previously, every VLSI design makes use of binary arithmetic blocks such as adders.

Fig: 1-bit Half Adder.

Consider a simple addition of two n-bit inputs A and B and a one-bit carry-in Cin, producing an n-bit output S:

S = A + B + Cin

where A = an-1, an-2, ..., a0 and B = bn-1, bn-2, ..., b0.

The addition operation is represented by + in the above equation. However, in Boolean algebra we use binary notation, and different operations are performed with different logic gates such as AND, OR, and XOR. In the following documentation, the dot operation between two variables represents the AND operation, i.e., a . b denotes 'a AND b'; similarly, a + b denotes 'a OR b' and a ^ b denotes 'a XOR b'.
Considering the situation of adding two bits, the sum s and carry c can be expressed
using Boolean operations mentioned above.

Si = ai ^ bi
Ci+1 = ai . bi

The equation for Ci+1 can be implemented as shown in Fig. 2.1, which shows a half adder with 2 input bits. The longest path from the input to the output is called the critical path and is drawn with a solid line. The half adder outputs can be fed to another half adder to build a full adder, in which a carry is also taken as an input:

Si = ai ^ bi ^ ci
Ci+1 = ai . bi + ai . ci + bi . ci

Fig: 1-bit Full Adder.


A full adder circuit can be built based on the equations above. The block diagram of a 1-bit full adder is shown in Fig. 4.1.2. The full adder can be constructed from two half adders, so no additional external logic is needed, and the circuit can be verified from the Boolean expressions. To make the carry easy to compute, it can be expressed in terms of carry-generate and carry-propagate signals:

Gi = ai . bi
Pi = ai + bi
Ti = ai ^ bi

where i is an integer and 0 ≤ i < n. With the help of the literals above, the output carry and sum at each bit can be written as:

Ci+1 = gi + pi . ci
Si = ti ^ ci

In some of the literature, the carry-propagate pi is replaced with the temporary sum ti in order to save logic gates. Here these two terms are kept separate in order to clarify the concepts. For example, for Ling adders, only pi is used as the carry-propagate.

The single-bit carry generate/propagate signals can be extended to the group versions G and P. The following equations show the inherent relations:

Gi:k = Gi:j + Pi:j . Gj-1:k
Pi:k = Pi:j . Pj-1:k

where i:k denotes the group term from i through k. Using the group carry generate/propagate, the carry can be expressed as in the following equation, and a sketch of a single full-adder bit slice written with these signals is given below.

Ci+1 = Gi:j + Pi:j . Cj
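A minimal Verilog sketch of a 1-bit full-adder slice written in terms of the generate, propagate, and temporary-sum signals defined above (illustrative, not the project RTL):

module full_adder_gp (
    input  a, b, cin,
    output sum, cout
);
    wire g = a & b;    // carry generate:   g = a . b
    wire p = a | b;    // carry propagate:  p = a + b
    wire t = a ^ b;    // temporary sum:    t = a ^ b

    assign sum  = t ^ cin;        // s = t ^ c
    assign cout = g | (p & cin);  // c_next = g + p . c
endmodule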

2.1.2. Ripple carry adder

A ripple carry adder can be built by connecting n full adders in cascade. Fig. 2.1.2 shows a 4-bit ripple carry adder. One full adder is responsible for the addition of two binary digits at any stage of the ripple carry. The carry-out from each full adder is the carry-in of the next full adder, so to add n-bit numbers, n full adders must be connected in cascade.

Fig. 2.1.2: 4-bit Ripple Carry Adder

One of the important drawbacks of this adder is that the delay increases linearly with the bit length: because the output carry of one full adder must propagate to the next, it takes more time to generate the output. The worst-case delay is approximated by:

T = (n-1) tc + ts

Delay:
The latency of a 4-bit ripple carry adder can be derived by considering the worst-case signal propagation path. We can thus write the following expressions:

TRCA-4bit = TFA(A0,B0→C0) + TFA(Cin→C1) + TFA(Cin→C2) + TFA(Cin→S3)

and it is easy to extend this to a k-bit RCA:

TRCA-kbit = TFA(A0,B0→C0) + (k-2) * TFA(Cin→Ci) + TFA(Cin→Sk-1)

Drawbacks:

The delay increases linearly with the bit length, and the adder is not very efficient when large bit-widths are used.
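A parameterised behavioural sketch of the ripple carry adder described above (the parameter name and width are illustrative assumptions):

module ripple_carry_adder #(parameter N = 4) (
    input  [N-1:0] a, b,
    input          cin,
    output [N-1:0] sum,
    output         cout
);
    // Carry chain: c[0] is the external carry-in, c[N] the final carry-out
    wire [N:0] c;
    assign c[0] = cin;
    assign cout = c[N];

    genvar i;
    generate
        for (i = 0; i < N; i = i + 1) begin : stage
            assign sum[i] = a[i] ^ b[i] ^ c[i];
            assign c[i+1] = (a[i] & b[i]) | (a[i] & c[i]) | (b[i] & c[i]);
        end
    endgenerate
endmodule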

2.1.3. Carry Look-Ahead Adder

The carry look-ahead algorithm speeds up the addition because the carries for the later stages are calculated in advance by combinational circuits at the inputs of the full adders. In a CLA, the carry propagation time is reduced to O(log2(Wd)) by using a tree-like circuit to compute the carries rapidly. Fig. 2.1.3 shows the 4-bit Carry Look-Ahead Adder.

Fig2.1.3: 4-bit Carry Look Ahead Adder

The CLA exploits the fact that the carry generated at a bit position depends on the three inputs to that position. If X and Y are the two input bits, then a carry is generated independently of the carry from the previous bit position when X = Y = 1, and no carry is generated when X = Y = 0. If X ≠ Y, a carry is generated if and only if the previous bit position generates a carry. With C the initial carry and S and Cout the output sum and carry respectively, the Boolean expressions for calculating the next carry and the sum are:

Pi = Xi xor Yi -- Carry Propagation
Gi = Xi and Yi -- Carry Generation
Ci+1 = Gi or (Pi and Ci) -- Next Carry
Si = Xi xor Yi xor Ci -- Sum Generation
Thus, for 4-bit adder, we can extend the carry, as shown below:
C1 = G0 + P0 · C0
C2 = G1 + P1 · C1 = G1 + P1 · G0 + P1 · P0 · C0
C3 = G2 + P2 · G1 + P2 · P1 · G0 + P2 · P1 · P0 · C0
C4 = G3 + P3 · G2 + P3 · P2 · G1 + P3 · P2 · P1 · G0+ P3 · P2 · P1 · P0 · C0
As with many design problems in digital logic, we can make tradeoffs between area and performance (delay). In the case of adders, we can create faster (but larger) designs than the RCA. The Carry Look-Ahead Adder (CLA) is one of these designs; a 4-bit sketch is shown below.
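A sketch of a 4-bit carry look-ahead adder using the expanded carry equations above, computing all carries directly from the generate and propagate signals (names and width are illustrative):

module cla_4bit (
    input  [3:0] x, y,
    input        c0,
    output [3:0] s,
    output       c4
);
    wire [3:0] g = x & y;   // G_i = X_i and Y_i
    wire [3:0] p = x ^ y;   // P_i = X_i xor Y_i

    // Carries computed directly from G, P and the initial carry (no rippling)
    wire c1 = g[0] | (p[0] & c0);
    wire c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0);
    wire c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0);
    assign c4 = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
              | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0);

    // Sums: S_i = X_i xor Y_i xor C_i
    assign s = p ^ {c3, c2, c1, c0};
endmodule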
Drawbacks:
The disadvantage of the CLA is that, for wide operands, the carry logic becomes increasingly complex.
2.1.4. Carry Save Adder
The main purpose of a carry-save adder is to reduce the addition of three (or more) numbers to the addition of two numbers. Its propagation delay does not depend on the number of bits. The carry-save adder consists of n full adders, each of which produces a sum bit and a carry bit from its inputs.

The entire sum is calculated by shifting the carry sequence left by one place, appending a 0 to the front (most significant bit) of the partial sum sequence, and adding the two sequences with an RCA to produce the resulting (n+1)-bit value.

This process can be continued further by adding an extra input operand at each stage of full adders, without any intermediate carry propagation. These stages can be arranged in a binary tree structure, making the cumulative delay logarithmic in the number of inputs to be added, while the number of bits per input does not change. The main application of the carry-save algorithm is the efficient CMOS implementation of a wide variety of algorithms; it is well known in multiplier architectures for high-speed DSP. A carry-save adder speeds up carry propagation in the array when applied in the partial-product lines of array multipliers.

Fig.2.1.4: 4-bit Carry Save Adder


Basically, a carry save adder is used for calculating the sum of three or more n-bit binary numbers, and each bit position behaves like an independent full adder. As shown in Fig. 2.1.4, to compute the sum of 4-bit binary numbers we take 4 full adders at the first stage. The carry save unit consists of 4 full adders, each of which produces a sum bit and a carry bit from the given inputs. If X and Y are two of the 4-bit inputs, the partial sum and carry are produced as shown below:

Si = Xi xor Yi ; Ci = Xi and Yi
The final addition is then computed as:
1. Shifting the carry sequence C left by one place.
2. Placing a 0 to the front (MSB) of the partial sum sequence S.
3. Finally,a ripple carry adder is used to add these two together and computing the
resulting sum.
Carry Save Adder Computataion :

X: 10011
Y: 11001
Z: + 01011
S: 00001
C: + 11011
SUM: 1 1 0 1 1 1

In this design, a 126-bit carry save adder is used, since the output of the multiplier is 126 bits (2N). Its main purpose is to reduce the addition of three numbers to the addition of two numbers. The propagation delay is 3 gates irrespective of the number of bits. The carry save adder contains n full adders, each computing a single sum and carry bit based on the corresponding bits of the input numbers. The entire sum can then be calculated by shifting the carry sequence left by one place, appending a 0 to the most significant bit of the partial sum sequence, and adding the partial sum sequence with a ripple carry unit, resulting in an (n+1)-bit value; the carry-out from one stage is fed directly to the next, and the process continues without any intermediate carry propagation. Since drawing a 126-bit carry save adder is infeasible, a representative 6-bit carry save adder is shown in the figure. When computing the sum of two 126-bit binary numbers, 126 half adders are used at the first stage instead of 126 full adders; therefore the carry save unit comprises 126 half adders, each of which computes a single sum and carry bit based only on the corresponding bits of the two input numbers.

Figure: 6-bit carry save adder

If x and y are the two 126-bit numbers, the partial sums and carries S and C are produced as:

Si = xi ^ yi (4)
Ci = xi & yi (5)

Adding the two numbers in this way, without carry propagation, using the half adders followed by the ripple carry adders, the delay of the device equals the delay of the full-adder chain in the final addition. All the output values of the carry save stage are produced in parallel, so it takes less time than a ripple carry adder alone; the accumulators use parallel-in parallel-out processing. A minimal sketch of a carry-save stage is given below.
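A minimal sketch of an n-bit carry-save stage, as assumed in the discussion above: independent full adders per bit position, with no carry propagation between positions (names and widths are illustrative):

module carry_save_adder #(parameter N = 4) (
    input  [N-1:0] x, y, z,
    output [N-1:0] s,   // bitwise sum:   s_i = x_i ^ y_i ^ z_i
    output [N-1:0] c    // bitwise carry: c_i = majority(x_i, y_i, z_i)
);
    assign s = x ^ y ^ z;
    assign c = (x & y) | (x & z) | (y & z);
endmodule

The final result would then be formed as s + (c << 1) with an ordinary adder (e.g., a ripple carry adder), exactly as in the worked example above.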
2.1.5. Carry Select Adder
A carry-select adder is divided into sectors, each of which, except for the least significant one, performs two additions in parallel: one assuming a carry-in of zero, the other a carry-in of one. The carry select adder is typically built from pairs of 4-bit ripple carry adders. It is not the fastest adder but it is simple in nature, having a gate-level depth of O(√n). Each upper sector performs its addition twice, once with the assumption of the carry being zero and once assuming it is one. After the sums and carries of the ripple carry adders are calculated, the correct results are selected with a multiplexer once the actual incoming carry is known. The design schematic of the Carry Select Adder is shown in Fig. 2.4.

Fig. 2.4: The N-bit Ripple Carry Adder constructed from N single-bit full adders

In the N-bit carry ripple adder, the delay time can be expressed as:
TCRA = (N-1) Tcarry + Tsum

In the N-bit carry select adder, the delay time is:
TCSA = Tsetup + (N/M) Tcarry + M Tmux + Tsum

In our proposed N-bit area-efficient carry select adder, the delay time is:
Tnew = Tsetup + (N-1) Tmux + Tsum
The carry select adder belongs to the category of conditional sum adders. A conditional sum adder works on a condition: the sum and carry are calculated by assuming the input carry to be 1 and 0 before the actual input carry arrives. When the actual carry input arrives, the correct pre-computed values of sum and carry are selected using a multiplexer.

The conventional carry select adder consists of one k/2-bit adder for the lower half of the bits, i.e., the least significant bits, and two k/2-bit adders for the upper half, i.e., the most significant bits (MSBs). Of the two MSB adders, one assumes a carry input of one and the other assumes a carry input of zero. The carry-out calculated from the last stage of the least significant bit adder is used to select the actual values of the output carry and sum, the selection being done by a multiplexer. This technique of dividing the adder into two stages increases the area utilization but speeds up the addition, as in the sketch below.
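A behavioural sketch of the idea for an 8-bit adder split into two 4-bit sectors; the partitioning, widths, and names are illustrative assumptions rather than the referenced Fig. 2.4 design:

module carry_select_adder_8 (
    input  [7:0] a, b,
    input        cin,
    output [7:0] sum,
    output       cout
);
    // Lower sector: one addition with the real carry-in
    wire [4:0] lo  = {1'b0, a[3:0]} + {1'b0, b[3:0]} + cin;

    // Upper sector: two additions in parallel, assuming carry-in = 0 and carry-in = 1
    wire [4:0] hi0 = {1'b0, a[7:4]} + {1'b0, b[7:4]} + 1'b0;
    wire [4:0] hi1 = {1'b0, a[7:4]} + {1'b0, b[7:4]} + 1'b1;

    // 2:1 multiplexer driven by the lower sector's carry-out selects the correct result
    wire       sel = lo[4];
    wire [4:0] hi  = sel ? hi1 : hi0;

    assign sum  = {hi[3:0], lo[3:0]};
    assign cout = hi[4];
endmodule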

EXISTING
Carry Skip adder:
A carry skip adder consists of a simple ripple carry adder with a sped-up carry chain called a skip chain. The skip chain defines how the adder is partitioned into ripple carry blocks, which together compose the skip adder. The carry skip adder is divided into blocks, where a special circuit quickly detects whether, in the entire block, the two input bits of every position are different (Pi = 1 throughout the block). The carry skip adder provides a compromise between a ripple carry adder and a CLA adder. It divides the words to be added into blocks, and the signal produced by the detection circuit is called the block propagation signal. If the carry is propagated at all positions in the block, then the carry signal entering the block can directly bypass it and be transmitted through a multiplexer to the next block. The important features of the CSA are as follows.

Figure:- 16-Bit Ripple Carry adder.

The carry skip adder speeds up the computation compared with the ripple carry adder by reducing the path delay.

• When X[i] != Y[i], that is, when the two inputs of a full adder are not equal, the CSA can skip that particular stage, since the carry propagate is an XOR operation, which is always high when the two inputs are not equal.

With carry skip adders, the linear growth of carry-chain delay with the size of the input operands is improved by allowing carries to skip across blocks of bits rather than rippling through them. The carry skip adders considered here are of three types:
2-Blocks carry skip adder
The 2-block CSA comprises two 8-bit RCAs, a 9-input AND gate, and an OR gate, with the carry propagate of each full adder of the second RCA connected as an input to the AND gate.

Figure . 2-Block CSA.

4-Blocks carry skip adder


The 4-block CSA comprises four 4-bit RCAs, three 5-input AND gates, and an OR gate, with the carry propagate of each full adder of an RCA connected as an input to the corresponding AND gate.

Figure:- 4-Block CSA.

8-Block carry skip adder


The 8-block CSA comprises eight 2-bit RCAs, seven 3-input AND gates, and an OR gate, with the carry propagate of each full adder of an RCA connected as an input to the corresponding AND gate.

Fig:- 8-Block CSA.

A carry-skip adder (also known as a carry-bypass adder) is an adder implementation that improves on the delay of a ripple-carry adder with little extra hardware compared to other fast adders. The improvement of the worst-case delay is achieved by using several carry-skip blocks to form a block-carry-skip adder; a sketch of one such block is given below.
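A sketch of one block of such an adder, combining a 4-bit ripple section with the block-propagate detection and bypass multiplexer described above (the block size and names are illustrative):

module carry_skip_block_4 (
    input  [3:0] a, b,
    input        cin,
    output [3:0] sum,
    output       cout
);
    // Ripple carry section of the block
    wire [4:0] c;
    assign c[0] = cin;
    genvar i;
    generate
        for (i = 0; i < 4; i = i + 1) begin : rca
            assign sum[i] = a[i] ^ b[i] ^ c[i];
            assign c[i+1] = (a[i] & b[i]) | (a[i] & c[i]) | (b[i] & c[i]);
        end
    endgenerate

    // Skip (bypass) path: when every bit of the block propagates (P_i = 1 for all i),
    // the incoming carry bypasses the block through the multiplexer
    wire bp = &(a ^ b);
    assign cout = bp ? cin : c[4];
endmodule

A 16-bit carry skip adder is then obtained by cascading four such blocks, corresponding to the 4-block organisation above.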
Based on the discussion presented in Section III, it is concluded that by reducing the delay of the
skip logic, one may lower the propagation delay of the CSKA significantly. Hence, in this paper,
we present a modified CSKA structure that reduces this delay.

General Description of the CSKA Structure

The structure is based on combining the concatenation and the incrementation schemes
with the Conv-CSKA structure, and hence, is denoted by CI-CSKA. It provides us with the
ability to use simpler carry skip logics. The logic replaces 2:1 multiplexers by AOI/OAI
compound gates (Fig. 2). The gates, which consist of fewer transistors, have lower delay, area,
and smaller power consumption compared with those of the 2:1 multiplexer. Note that, in this
structure, as the carry propagates through the skip logics, it becomes complemented. Therefore,
at the output of the skip logic of even stages, the complement of the carry is generated. The
structure has a considerable lower propagation delay with a slightly smaller area compared with
those of the conventional one. Note that while the power consumptions of the AOI (or OAI) gate
are smaller than that of the multiplexer, the power consumption of the proposed CI-CSKA is a
little more than that of the conventional one. This is due to the increase in the number of the
gates, which imposes a higher wiring capacitance (in the noncritical paths). Now, we describe
the internal structure of the proposed CI-CSKA shown in Fig. 2 in more detail. The adder
contains two N bits inputs, A and B, and Q stages. Each stage consists of an RCA block with the
size of Mj ( j = 1, . . . , Q). In this structure, the carry input of all the RCA blocks, except for the
first block which is Ci , is zero (concatenation of the RCA blocks). Therefore, all the blocks
execute their jobs simultaneously. In this structure, when the first block computes the summation
of its corresponding input bits (i.e., $S_{M_1}, \ldots, S_1$) and $C_1$, the other blocks simultaneously compute the intermediate results [i.e., $\{Z_{K_j+M_j}, \ldots, Z_{K_j+2}, Z_{K_j+1}\}$ for $K_j = \sum_{r=1}^{j-1} M_r$ ( j = 2, . . . , Q)], and also the Cj signals. In the proposed structure, the first stage has only one block,
which is RCA. The stages 2 to Q consist of two blocks of RCA and incrementation. The
incrementation block uses the intermediate results generated by the RCA block and the carry
output of the previous stage to calculate the final summation of the stage. The internal structure
of the incrementation block, which contains a chain of half-adders (HAs), is shown in Fig. 3. In
addition, note that, to reduce the delay considerably, for computing the carry output of the stage,
the carry output of the incrementation block is not used. As shown in Fig. 2, the skip logic
determines the carry output of the j th stage (CO, j ) based on the intermediate results of the j th
stage and the carry output of the previous stage (CO, j−1) as well as the carry output of the
corresponding RCA block (Cj ). When determining CO, j , these cases may be encountered.
When Cj is equal to one, CO, j will be one. On the other hand, when Cj is equal to zero, if the
product of the intermediate results is one (zero), the value of CO, j will be the same as CO, j−1
(zero). The reason for using both AOI and OAI compound gates as the skip logics is the
inverting functions of these gates in standard cell libraries. This way the need for an inverter
gate, which increases the power consumption and delay, is eliminated. As shown in Fig. 2, if an
AOI is used as the skip logic, the next skip logic should use OAI gate. In addition, another point
to mention is that the use of the proposed skipping structure in the Conv-CSKA structure
increases the delay of the critical path considerably. This originates from the fact that, in the
Conv-CSKA, the skip logic (AOI or OAI compound gates) is not able to bypass the zero carry
input until the zero carry input propagates from the corresponding RCA block. To solve this
problem, in the proposed structure, we have used an RCA block with a carry input of zero (using
the concatenation approach). This way, since the RCA block of the stage does not need to wait
for the carry output of the previous stage, the output carries of the blocks are calculated in
parallel.
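As a compact illustration of the skip-logic behaviour just described (the stage carry-out is forced to one when the stage's own RCA carry is one, and otherwise follows the previous stage's carry only when the product of the intermediate results is one), the function can be written as follows. This plain-gate sketch deliberately ignores the AOI/OAI complemented-carry implementation detail, and the module and signal names are assumptions:

module cicska_skip_logic (
    input  c_j,      // carry-out of this stage's RCA block (computed with carry-in = 0)
    input  p_j,      // product (AND) of the stage's intermediate results
    input  co_prev,  // carry-out of the previous stage
    output co_j      // carry-out of this stage
);
    // CO_j = 1 when C_j = 1; otherwise CO_j follows CO_{j-1} only if P_j = 1
    assign co_j = c_j | (p_j & co_prev);
endmodule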

Fig.- CI-CSKA structure.


PROPOSED

INTRODUCTION TO D LATCH:
A flip-flop or latch is a circuit that has two stable states and can be used to store state information. A flip-flop is a bistable multivibrator. The circuit can be made to change state by signals applied to one or more control inputs and will have one or two outputs. It is the basic storage element in sequential logic. Flip-flops and latches are fundamental building blocks of digital electronics systems used in computers, communications, and many other types of systems.

Flip-flops and latches are used as data storage elements. Such data storage can be used
for storage of state, and such a circuit is described as sequential logic. When used in a finite-state
machine, the output and next state depend not only on its current input, but also on its current
state (and hence, previous inputs). It can also be used for counting of pulses, and for
synchronizing variably-timed input signals to some reference timing signal.

Flip-flops can be either simple (transparent or opaque) or clocked (synchronous or edge-triggered). Although the term flip-flop has historically referred generically to both simple and clocked circuits, in modern usage it is common to reserve the term flip-flop exclusively for discussing clocked circuits; the simple ones are commonly called latches.

Using this terminology, a latch is level-sensitive, whereas a flip-flop is edge-sensitive.


That is, when a latch is enabled it becomes transparent, while a flip flop's output only changes on
a single type (positive going or negative going) of clock edge.

A latch is an electronic device that can be used to store one bit of information. The D latch is used to capture, or 'latch', the logic level present on the data line when the clock input is high. If the data on the D line changes state while the clock pulse is high, then the output, Q, follows the input, D. When the CLK input falls to logic 0, the last state of the D input is trapped and held in the latch.

TIMING DIAGRAM:
From the timing diagram it is clear that the output Q's waveform resembles that of input
D's waveform when the clock is high whereas when the clock is low Q retains the previous value
of D (the value before clock dropped down to 0).
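A behavioural Verilog sketch of the level-sensitive D latch just described (illustrative, not the project RTL):

module d_latch (
    input      clk,   // level-sensitive clock / enable
    input      d,
    output reg q
);
    // Transparent while clk = 1: Q follows D.
    // When clk falls to 0, the last value of D is held.
    always @(clk or d)
        if (clk)
            q <= d;
endmodule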

3.1. THE D-TYPE FLIP FLOP:

The working of the D flip-flop is similar to the D latch, except that the output of the D flip-flop takes the state of the D input at the moment of a positive edge at the clock pin (or a negative edge if the clock input is active low) and delays it by one clock cycle. That is why it is commonly known as a delay flip-flop. The D flip-flop can be interpreted as a delay line or zero-order hold. The advantage of the D flip-flop over the D-type "transparent latch" is that the signal on the D input pin is captured the moment the flip-flop is clocked, and subsequent changes on the D input will be ignored until the next clock event.
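For comparison, the corresponding edge-triggered D flip-flop differs from the latch sketch above only in that Q is updated solely on the clock edge, so later changes on D are ignored until the next edge (again an illustrative sketch):

module d_flip_flop (
    input      clk,
    input      d,
    output reg q
);
    // Capture D on the rising clock edge; the output is delayed by one clock cycle.
    always @(posedge clk)
        q <= d;
endmodule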

3.2 CHARACTERISTICS AND APPLICATIONS OF D LATCH AND D FLIP FLOP:

1. D-latch is a level Triggering device while D Flip Flop is an Edge triggering device.

2. The disadvantage of the D FF is its circuit size, which is about twice as large as that of a D
latch. That's why delay and power consumption in Flip flop is more as compared to D latch.

3. Latches are used as temporary buffers whereas flip flops are used as registers.

4. Flip flop can be considered as a basic memory cell because it stores the value on the data line
with the advantage of the output being synchronized to a clock.

5. Many logic synthesis tools use only D flip-flops or D latches.

6. FPGAs contain edge-triggered flip-flops.

7. D flip flops are also used in finite state machines.

3.3 EDGE TRIGGERING VS. LEVEL CLOCKING:


1. When a circuit is edge triggered the output can change only on the rising or falling edge of
the clock. But in the case of level-clocked, the output can change when the clock is high (or
low).

2. In edge triggering, the output can change only at one instant during the clock cycle; with level clocking, the output can change during an entire half cycle of the clock.

One of the main disadvantages of the basic SR NAND Gate bistable circuit is that the
indeterminate input condition of “SET” = logic “0” and “RESET” = logic “0” is forbidden. This
state will force both outputs to be at logic “1”, over-riding the feedback latching action and
whichever input goes to logic level “1” first will lose control, while the other input still at logic
“0” controls the resulting state of the latch.

But in order to prevent this from happening an inverter can be connected between the
“SET” and the “RESET” inputs to produce another type of flip flop circuit known as a Data
Latch, Delay flip flop, D-type Bistable, D-type Flip Flop or just simply a D Flip Flop as it is
more generally called.

The D flip-flop is by far the most important of the clocked flip-flops, as it ensures that inputs S and R are never equal to one at the same time. The D-type flip-flop is constructed from a gated SR flip-flop with an inverter added between the S and the R inputs to allow for a single D (data) input.

Then this single data input, labeled D, is used in place of the “set” signal, and the inverter
is used to generate the complementary “reset” input thereby making a level-sensitive D-type flip-
flop from a level-sensitive RS-latch as now S = D and R = not D as shown.

D Flip Flops as Data Latches:

As well as frequency division, another useful application of the D flip flop is as a Data
Latch. A data latch can be used as a device to hold or remember the data present on its data
input, thereby acting a bit like a single bit memory device and IC’s such as the TTL 74LS74 or
the CMOS 4042 are available in Quad format exactly for this purpose. By connecting together
four, 1-bit data latches so that all their clock inputs are connected together and are “clocked” at
the same time, a simple “4-bit” Data latch can be made as shown below.

4-bit Data Latch:

Fig. 3.3: 4-bit Data Latch

Transparent Data Latch:

The Data Latch is a very useful device in electronic and computer circuits. They can be
designed to have very high output impedance at both outputs Q and its inverse or complement
output Q to reduce the impedance effect on the connecting circuit when used as a buffer, I/O
port, bi-directional bus driver or even a display driver.

But a single “1-bit” data latch is not very practical to use on its own and instead
commercially available IC’s incorporate 4, 8, 10, 16 or even 32 individual data latches into one
single IC package, and one such IC device is the 74LS373 Octal D-type transparent latch.

The eight individual data latches or bi-stables of the 74LS373 are “transparent” D-type
flip-flops, meaning that when the clock (CLK) input is HIGH at logic level “1”, (but can also be
active low) the outputs at Q follows the data D inputs.
In this configuration the latch is said to be “open” and the path from D input to Q output
appears to be “transparent” as the data flows through it unimpeded, hence the name transparent
latch. When the clock signal is LOW at logic level “0”, the latch “closes” and the output at Q is
latched at the last value of the data that was present before the clock signal changed and no
longer changes in response to D.

8-bit Data Latch:

Fig. 3.4: 8-bit Data Latch

The D-type Flip Flop Summary:

The data or D-type Flip Flop can be built using a pair of back-to-back SR latches and
connecting an inverter (NOT Gate) between the S and the R inputs to allow for a single D (data)
input. The basic D flip flop circuit can be improved further by adding a second SR flip-flop to its
output that is activated on the complementary clock signal to produce a “Master-Slave D flip-
flop” device.

The difference between a D-type latch and a D-type flip-flop is that the latch is level-sensitive, responding to the level of its clock (enable) input, whereas the flip-flop changes state only on a clock edge. The D flip-flop is an edge-triggered device which transfers the input data to Q on the rising or falling clock edge, while data latches and transparent latches are level-sensitive devices.
Data latches can also be connected together to form another type of sequential logic circuit, the shift register, which is used to convert parallel data into serial data and vice versa.

CARRY -SKIP ADDER WITH D_LATCH:


This method replaces the carry skip circuit by a D-latch combined with the carry skip logic. Latches are used to store 1-bit binary information. The latch is a sequential circuit, so its output depends on the present inputs as well as on previous inputs; in other words, the latch is level-sensitive, and while it is enabled its output changes according to its input signal. The architecture of the proposed 16-bit carry skip adder is shown in Fig. 3. It uses five ripple carry adders of different bit sizes along with D-latches. The proposed method uses only one adder instead of the two separate adders of the regular carry skip adder (CSA), in order to reduce the area and power consumption; in the CSA, each of the two additions is performed in one clock cycle. In the 16-bit adder, a ripple carry adder is used for the least significant bits (LSBs) and is 2 bits wide. The upper, most significant part of the adder is 14 bits wide and operates depending on the clock: the addition with the incoming carry is performed when the carry goes high.

While the clock is low, the carry input is assumed to be zero and the sum of the adder is stored in the adder itself. From the figure it can be understood that the latch is used to store the sum and carry for Cin = 1 and Cin = 0. The carry-out from the previous stage, i.e., the least significant bit adder, is used as the control signal for the multiplexer to select the final output carry and sum of the 16-bit adder. If the actual carry input is one, the sum and carry stored in the latch for a carry-in of one are selected; for a carry input of zero, the MSB adder result computed with a carry-in of zero provides the output carry and sum.
Figure 3: 4-BIT CSA WITH D-LATCH ARCHITECTURE

Figure 4: 16-BIT CSA WITH D-LATCH ARCHITECTURE

The architectures of the 4-Bit, 16-Bit CSA with D-LATCH are shown in figure 3, figure
4 . The 16-Bit CSA with D-LATCH Architecture is designed based on the cascading of four 4-
Bit Architectures. In the D-LATCH architecture first stage is designed based on Ripple Carry
adders and the second stage is designed based on the D-LATCH logic.

WORKING PRINCIPLE OF CSA ADDER USING D LATCH:

In the above diagram, a D-latch is used to pass the output carry of each RCA to the multiplexer, where the carry value is selected according to the block propagate value.

The bits of a and b (i.e., a[3:0] and b[3:0]) are the inputs of the (3:0) RCA, along with cin. The bits a[7:4] and b[7:4] are the inputs of the (7:4) RCA, together with the block propagate signal BP. When BP = 1, the carry output C0 of the (3:0) RCA is fed to the (1-bit) D-latch; the D-latch, whose other input is the initial carry cin, follows its input and drives one input of the (2:1) multiplexer.

Similarly, the bits a[11:8] and b[11:8] are the inputs of the (11:8) RCA, along with c0, and the bits a[15:12] and b[15:12] are the inputs of the (15:12) RCA, together with BP. When BP = 1, the carry output C1 of the (11:8) RCA is fed to the (1-bit) D-latch; the D-latch, whose other input is the carry c0, follows its input and drives one input of the (2:1) multiplexer. A simplified behavioural sketch of one such stage is given after this description.
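The following is a heavily simplified behavioural sketch of one 4-bit stage of the proposed carry skip adder with D-latch, based on the description above: a 4-bit RCA computes the block sum, the block-propagate signal BP enables a level-sensitive D-latch on the skip path, and a 2:1 multiplexer chooses between the latched incoming carry and the block's rippled carry-out. The block size, signal names, and exact latch/multiplexer wiring are assumptions for illustration and are not claimed to reproduce the exact structure of Figures 3 and 4.

module cska_dlatch_stage (
    input  [3:0] a, b,
    input        cin,    // incoming (skip) carry from the previous stage
    output [3:0] sum,
    output       cout
);
    // Ripple carry adder for the block sum
    wire [4:0] c;
    assign c[0] = cin;
    genvar i;
    generate
        for (i = 0; i < 4; i = i + 1) begin : rca
            assign sum[i] = a[i] ^ b[i] ^ c[i];
            assign c[i+1] = (a[i] & b[i]) | (a[i] & c[i]) | (b[i] & c[i]);
        end
    endgenerate

    // Block propagate: high when every bit position of the block propagates
    wire bp = &(a ^ b);

    // Level-sensitive D-latch on the skip path, enabled by BP
    reg cin_latched;
    always @(bp or cin)
        if (bp)
            cin_latched <= cin;

    // 2:1 multiplexer: take the latched skip carry when BP = 1, else the rippled carry-out
    assign cout = bp ? cin_latched : c[4];
endmodule

In a full 16-bit adder, four such stages would be cascaded (or the 2-bit LSB / 14-bit MSB split described earlier used), with the latch holding the bypassed carry while the enable condition is met.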
INTRODUCTION TO VLSI
3.1 VLSI TECHNOLOGY
Digital systems are highly complex at their most detailed level. They may consist of millions of elements, i.e., transistors or logic gates. For many decades, logic schematics served as the basis of logic design, but not any more. Today, hardware complexity has grown to such a degree that a schematic with logic gates is almost useless, as it shows only a web of connectivity and not the functionality of the design. Since the 1970s, computer engineers, electrical engineers and electronics engineers have moved towards hardware description languages (HDLs).

Digital circuits have rapidly evolved over the last twenty-five years. The earliest digital circuits were designed with vacuum tubes and transistors. Integrated circuits were then invented, in which logic gates were placed on a single chip. The first IC chips were small-scale integration (SSI) chips, where the gate count was small. When technology became more sophisticated, designers were able to place circuits with hundreds of gates on a chip; these were called MSI (medium-scale integration) chips. With the advent of LSI, designers could put thousands of gates on a single chip. At this point, the design process was getting complicated, and designers felt the need to automate it.

For these reasons, hardware description languages have played an important role in the describe-and-synthesize design methodology. They are used for the specification, simulation, and synthesis of an electronic system. This helps to reduce the design complexity and lets products reach the market quickly.

The components of a digital system can be classified as being specific to an application or as being standard circuits. Standard components are taken from a set that has been used in other systems. MSI components are standard circuits, and their use results in a significant reduction in the total cost as compared to the cost of using SSI circuits. In contrast, specific components are particular to the system being implemented and are not commonly found among the standard components.
Design Flow:

System specification
Functional design (e.g., X = (ABCD + A + D + A(B + C)))
Logic design (e.g., Y = (A(B + C) + AC + D + A(BC + D)))
Circuit design
Physical design
Fabrication
Packaging and testing

Fig 3.1: Steps involved in the manufacturing of a chip


3.2 FPGA DESIGN FLOW:

The process of implementing a design on an FPGA can be broken down into several stages, loosely definable as design entry (or capture), synthesis, and place and route. Along the way, the design is simulated at various levels of abstraction, as in ASIC design. The availability of sophisticated and coherent tool suites for FPGA design makes FPGAs all the more attractive.

At one time, design entry was performed in the form of schematic capture. Most designers have now moved over to hardware description languages (HDLs) for design entry, and some prefer a mixture of the two techniques. Schematic-based design-capture tools gave designers a great deal of control over the physical placement and partitioning of logic on the device, but it is becoming less likely that designers will take that route. Meanwhile, language-based design entry is faster, though often at the expense of performance or density.

For many designers, the choice between schematic- and HDL-based design entry comes down to how they conceive of their design. For those who think in software or algorithmic-like terms, HDLs are the better choice. HDLs are well suited to highly complex designs, especially when the designer has a good handle on how the logic must be structured. They can also be very useful for designing smaller functions when there is neither the time nor the inclination to work through the actual hardware implementation.
Fig 3.2: FPGA Design flow

On the other hand, HDLs represent a level of abstraction that can isolate designers from the details of the hardware implementation. Schematic-based entry gives designers much more visibility into the hardware, and is a better method for those who are hardware-oriented. The downside of schematic-based entry is that it makes the design more difficult to modify or to port to another FPGA.

A third option for design entry, state-machine entry, works well for designers who can see their logic design as a series of states through which the system steps, since such designs can be clearly represented in visual formats. Tool support for finite state-machine entry is limited, though.
After design entry, the design is simulated at the register-transfer level (RTL). This is the
first of several simulation stages, because the design must be simulated at successive levels of
abstraction as it moves down the chain toward physical implementation on the FPGA itself. RTL
simulation offers the highest performance in terms of speed. As a result, designers can perform
many simulation runs in an effort to refine the logic. At this stage, FPGA development isn’t
unlike software development. Signals and variables are observed, procedures and functions
traced, and breakpoints set. The good news is that it is a very fast simulation. But because the design has not yet been synthesized to the gate level, properties such as timing and resource usage are still unknown.

The step that follows RTL simulation is to convert the RTL representation of the design into a bit-stream file that can be loaded onto the FPGA. The interim step is FPGA synthesis, which translates the VHDL or Verilog code into a device netlist format that can be understood by a bit-stream converter.

The synthesis process can be broken down into three steps. First, the HDL code is converted into a device netlist format. The resulting file is then converted into a hexadecimal bit-stream file, or bit file; this step is necessary to change the list of required devices and interconnects into hexadecimal bits that can be downloaded to the FPGA. The final step completes the procedure by programming the design onto the physical FPGA.

It is important to fully constrain designs before synthesis. A constraint file is an input to the synthesis process just as the RTL code itself is. Constraints can be applied globally or to specific portions of the design, and the synthesis engine uses them to optimize the netlist. However, it is equally important not to over-constrain the design, which will generally result in less-than-optimal results from the next step in the implementation process: physical device placement and interconnect routing.
Following synthesis, device implementation begins. After netlist synthesis, the design is automatically converted into the format supported internally by the FPGA vendor's place-and-route tools. Design-rule checking and optimization are performed on the incoming netlist, and the design is partitioned in software as required to achieve high routing completion and high performance. Increasingly, FPGA designers are turning to floorplanning after synthesis and design partitioning. FPGA floorplanners work from the netlist hierarchy as defined by the RTL coding. Floorplanning can help if area is tight; when possible, it is a good idea to place critical logic in separate blocks.

After partitioning and floorplanning, the placement tool tries to place the logic blocks so as to achieve efficient routing. The tool monitors routing length and track congestion while placing the blocks. It may also track the absolute path delays to meet the user's timing constraints. Overall, the process mimics PCB place and route.

Functional simulation is performed after synthesis and before physical implementation; this step ensures correct logic functionality. After implementation, there is a final verification step with full timing information. After placement and routing, the logic and routing delays are back-annotated to the gate-level netlist for this simulation. At this point, simulation is a much longer process, because timing is also a factor. Often, designers substitute static timing analysis, which calculates the timing of combinational paths between registers and compares it against the designer's timing constraints.

Once the design is successfully verified and found to meet timing, the final step is to actually program the FPGA itself. At the completion of placement and routing, a binary programming file is created and used to configure the device. No matter what the device's underlying technology, the FPGA interconnect fabric has cells that configure it to connect to the inputs and outputs of the logic blocks; in turn, these cells connect the logic blocks to each other. Most programmable-logic technologies, including the PROMs for SRAM-based FPGAs, require some sort of device programmer. Devices can also be programmed through their configuration ports using a set of dedicated pins.
Modern FPGAs also incorporate a JTAG port that, happily, can be used for more than
boundary-scan testing. The JTAG port can be connected to the device’s internal SRAM
configuration-cell shift registers, which in turn can be instructed to connect to the chip’s JTAG
scan chain.

Integrated flows for FPGAs make sense in general, considering that FPGA vendors will continue to introduce more complex, powerful, and economical devices over time. An integrated third-party flow makes it easier to re-target a design to different technologies from different vendors as conditions warrant.

3.3 INTRODUCTION TO VLSI

Very-large-scale integration (VLSI) is the process of creating integrated circuits by combining thousands of transistor-based circuits into a single chip. VLSI began in the 1970s
when complex semiconductor and communication technologies were being developed. The
microprocessor is a VLSI device. The term is no longer as common as it once was, as chips have
increased in complexity into the hundreds of millions of transistors.

The first semiconductor chips held one transistor each. Subsequent advances added more
and more transistors, and, as a consequence, more individual functions or systems were
integrated over time. The first integrated circuits held only a few devices, perhaps as many as ten
diodes, transistors, resistors and capacitors, making it possible to fabricate one or more logic
gates on a single device. Now known retrospectively as "small-scale integration" (SSI), improvements in technique led to devices with hundreds of logic gates, known as medium-scale integration (MSI), and then to large-scale integration (LSI), i.e., systems with at least a thousand logic gates. Current technology has moved
far past this mark and today's microprocessors have many millions of gates and hundreds of
millions of individual transistors.

Advantages of ICs over discrete components

While we will concentrate on integrated circuits, the properties of integrated circuits (what we can and cannot efficiently put in an integrated circuit) largely determine the architecture of the entire system. Integrated circuits improve system characteristics in several critical ways. ICs have three key advantages over digital circuits built from discrete components:

 Size. Integrated circuits are much smaller; both transistors and wires are shrunk to micrometer sizes, compared with the millimeter or centimeter scales of discrete components. Small size leads to advantages in speed and power consumption, since smaller components have smaller parasitic resistances, capacitances, and inductances.

 Speed. Signals can be switched between logic 0 and logic 1 much quicker within a chip than they can between chips. Communication within a chip can occur hundreds of times faster than communication between chips on a printed circuit board. The high speed of circuits on-chip is due to their small size; smaller components and wires have smaller parasitic capacitances to slow down the signal.

 Power consumption. Logic operations within a chip also take much less power. Once again, lower power consumption is largely due to the small size of circuits on the chip; smaller parasitic capacitances and resistances require less power to drive them.

Applications of VLSI

Electronic systems now perform a wide variety of tasks in daily life. Electronic
systems in some cases have replaced mechanisms that operated mechanically, hydraulically, or
by other means; electronics are usually smaller, more flexible, and easier to service. In other
cases electronic systems have created totally new applications. Electronic systems perform a
variety of tasks, some of them visible, some more hidden:

 Personal entertainment systems such as portable MP3 players and DVD players perform sophisticated algorithms with remarkably little energy.

 Electronic systems in cars operate stereo systems and displays; they also control fuel injection systems, adjust suspensions to varying terrain, and perform the control functions required for anti-lock braking (ABS) systems.

 Digital electronics compress and decompress video, even at high-definition data rates, on-the-fly in consumer electronics.

 Low-cost terminals for Web browsing still require sophisticated electronics, despite their dedicated function.

 Personal computers and workstations provide word-processing, financial analysis, and games. Computers include both central processing units (CPUs) and special-purpose hardware for disk access, faster screen display, etc.

 Medical electronic systems measure bodily functions and perform complex processing algorithms to warn about unusual conditions. The availability of these complex systems, far from overwhelming consumers, only creates demand for even more complex systems.

The growing sophistication of applications continually pushes the design and manufacturing of integrated circuits and electronic systems to new levels of complexity. Perhaps the most amazing characteristic of this collection of systems is its variety: as systems become more complex, we build not a few general-purpose computers but an ever wider range of special-purpose systems. Our ability to do so is a testament to our growing mastery of both integrated circuit manufacturing and design, but the increasing demands of customers continue to test the limits of design and manufacturing.
LANGUAGE USED

4.1 VERILOG HDL

Verilog HDL is a hardware description language that can be used to model a digital
system at many levels of abstraction ranging from the algorithmic-level to the gate-level to the
switch-level. The complexity of the digital system being modeled could vary from that of a
simple gate to a complete electronic digital system, or anything in between. The digital system
can be described hierarchically and timing can be explicitly modeled within the same
description.

The Verilog HDL language includes capabilities to describe the behavioral nature of a design, the dataflow nature of a design, a design's structural composition, delays, and a waveform generation mechanism including aspects of response monitoring and verification, all modeled using one single language. In addition, the language provides a programming language interface through which the internals of a design can be accessed during simulation, including the control of a simulation run.

The language not only defines the syntax but also defines very clear simulation semantics for each language construct. Therefore, models written in this language can be verified using a Verilog simulator. The language inherits many of its operator symbols and constructs from the C programming language. Verilog HDL provides an extensive range of modeling capabilities, some of which are quite difficult to comprehend initially. However, a core subset of the language is quite easy to learn and use, and this is sufficient to model most applications.

4.1.1 Major Capabilities:

Listed below are the major capabilities of the Verilog hardware description language:

 Primitive logic gates, such as and, or and nand, are built into the language.
 Flexibility of creating a user-defined primitive (UDP). Such a primitive could either be a combinational logic primitive or a sequential logic primitive.
 Switch-level modeling primitives, such as pmos and nmos, are also built into the language.
 Explicit language constructs are provided for specifying pin-to-pin delays, path delays
and timing checks of a design.
 A design can be modeled in three different styles or in a mixed style. These styles are: behavioral style, modeled using procedural constructs; dataflow style, modeled using continuous assignments; and structural style, modeled using gate and module instantiations. A short sketch illustrating these styles is given after this list.
 There are two data types in Verilog HDL: the net data type and the register data type. The net type represents a physical connection between structural elements, while a register type represents an abstract data storage element.
 Figure 4.1 shows the mixed-level modeling capability of Verilog HDL; that is, in one design, each module may be modeled at a different level.

Fig:4.1 Mixed level modelling

 Verilog HDL also has built-in logic functions such as & (bitwise-and) and | (bitwise-or).

 High-level programming language constructs such as conditionals, case statements, and loops are available in the language.

 Notion of concurrency and time can be explicitly modeled.

 Powerful file read and write capabilities are provided.

 The language is non-deterministic under certain situations, that is, a model may produce
different results on different simulators; for example, the ordering of events on an event
queue is not defined by the standard.
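As a brief, hedged illustration of the points above (the module names are chosen only for this example and are not from the report), the same 2-to-1 multiplexer is shown below in dataflow style, driving a net with a continuous assignment, and in behavioral style, driving a register from a procedural block.

// Dataflow style: the output is a net (wire) driven by a continuous assignment.
module mux2_dataflow (
    input  a, b, sel,
    output y
);
    assign y = sel ? b : a;
endmodule

// Behavioral style: the output is a reg driven from a procedural block.
module mux2_behavioral (
    input      a, b, sel,
    output reg y
);
    always @(a or b or sel)
        if (sel) y = b;
        else     y = a;
endmodule

Either description simulates identically; the choice of style mainly affects readability and the level of abstraction at which the designer thinks about the circuit.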
4.1.2 SYNTHESIS:

Synthesis is the process of constructing a gate-level netlist from a register-transfer level model of a circuit described in Verilog HDL. Figure 4.2 shows such a process. A synthesis system may, as an intermediate step, generate a netlist that is comprised of register-transfer level blocks such as flip-flops, arithmetic-logic units, and multiplexers, interconnected by wires. In such a case, a second program called the RTL module builder is necessary. The purpose of this builder is to build, or acquire from a library of predefined components, each of the required RTL blocks in the user-specified target technology.

Figure 4.2: Synthesis process

Having produced a gate-level netlist, a logic optimizer reads in the netlist and optimizes the circuit for the user-specified area and timing constraints. These area and timing constraints may also be used by the module builder for appropriate selection or generation of RTL blocks. Here we assume that the target netlist is at the gate level; the logic gates used in the synthesized netlist are taken from the target technology library. The module building and logic optimization phases are not described further.

The above figure shows the basic elements of Verilog HDL and the elements used in hardware. A mapping or construction mechanism has to be provided that translates the Verilog HDL elements into their corresponding hardware elements, as shown in Figure 4.3.

Figure 4.3: Typical design process
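As a small, hedged example of this mapping (the module name and widths are assumptions made only for illustration, not part of the project's design), the RTL fragment below would typically be translated by a synthesis tool into an adder block from the target library plus a bank of four D flip-flops:

// A 4-bit accumulator described at the register-transfer level.
module acc4 (
    input            clk, rst,
    input      [3:0] din,
    output reg [3:0] acc
);
    // The '+' operator maps to an adder; the clocked always block
    // infers four flip-flops for 'acc'.
    always @(posedge clk) begin
        if (rst) acc <= 4'd0;
        else     acc <= acc + din;
    end
endmodule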

 Verilog synthesis tools can create logic-circuit structures directly from Verilog behavioral descriptions and target them to a selected technology for realization (i.e., translate Verilog to actual hardware).
 Using Verilog, we can design, simulate and synthesize anything from a simple combinational circuit to a complete microprocessor on a chip.
 Verilog HDL has evolved as a standard hardware description language. Verilog HDL offers many
useful features for hardware design.
 Verilog HDL is a general-purpose hardware description language that is easy to learn and easy to
use. It is similar in syntax to the C programming language. Designers with C programming
experience will find it easy to learn Verilog HDL.
 Verilog HDL allows different levels of abstraction to be mixed in the same model. Thus, a
designer can define a hardware model in terms of switches, gates, RTL, or behavioral code. Also,
a designer needs to learn only one language for stimulus and hierarchical design.
 Most popular logic synthesis tools support Verilog HDL. This makes it the language of choice for
designers.
 All fabrication vendors provide Verilog HDL libraries for post logic synthesis simulation. Thus,
designing a chip in Verilog HDL allows the widest choice of vendors.
 The Programming Language Interface (PLI) is a powerful feature that allows the user to write
custom C code to interact with the internal data structures of Verilog. Designers can customize a
Verilog HDL simulator to their needs with the PLI.

4.2 PROGRAM STRUCTURE:

 The basic unit of design and programming in Verilog is the module (a text file containing statements and declarations).
 A Verilog module has declarations that describe the names and types of the module's inputs and outputs, as well as local signals, variables, constants and functions that are used internally to the module and are not visible outside it.
 The rest of the module contains statements that specify the operation of the module's outputs and internal signals.
 Verilog is a case-sensitive language like C. Thus sense, Sense, SENSE, sENse, etc., are all treated as different entities/quantities in Verilog.

SYNTAX:

module module_name (port_name, ..., port_name);
    input declarations
    output declarations
    net declarations
    variable declarations
    parameter declarations
    function declarations
    task declarations
    concurrent statements
endmodule

The keyword module signifies the beginning of a module definition, and endmodule signifies the end of a module definition.
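As a concrete, hedged instance of this skeleton (the module name adder4 and its port names are illustrative assumptions, not part of the project's code), a 4-bit adder could be written as follows:

module adder4 (a, b, cin, sum, cout);
    parameter WIDTH = 4;            // parameter declaration
    input  [WIDTH-1:0] a, b;        // input declarations
    input              cin;
    output [WIDTH-1:0] sum;         // output declarations
    output             cout;

    wire   [WIDTH:0]   result;      // net declaration

    assign result = a + b + cin;    // concurrent statements
    assign {cout, sum} = result;
endmodule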

SYSTEM TASKS

During the simulation of any design, a number of activities have to be carried out to monitor and control the simulation. Verilog provides a number of system tasks for this purpose, and a few of the commonly used ones are described here. The "$" symbol identifies a system task, which has the format

$<keyword>

$display:

When the system encounters this task, the specified items are displayed in the formats specified and the
system advances to a new line. The structure, format, and rules for these are the same as for the “printf” /
“scanf” function in C. Refer to a standard text in “C” language for the text formatting codes in common
usage [Gottfried].

$monitor:

The $monitor task displays the specified variables whenever any one of them changes. During the running of the program, the monitor task is invoked and the concerned quantities are displayed whenever any one of them changes; following this, the system advances to the next line. A $monitor statement need appear only once in a simulation program: all the quantities specified in it are continuously monitored. In contrast, the $display command displays the quantities concerned only once, that is, when the specific line is encountered during execution. The format of the $monitor task is identical to that of the $display task.

Example

$monitor("The value of a is: a = %g", a);

With this task, whenever the value of a changes during execution of the program, its new value is printed according to the format specified. Thus, if the value of a changes to 2.4 at any time during execution of the program, we get the following display on the monitor:

The value of a is: a = 2.4

Tasks for Control of Simulation:

Two system tasks are available for control of simulation:

The $finish task, when encountered, exits the simulation and control reverts to the operating system. Normally the simulation time and location are also printed out by default as part of the exit operation.

The $stop task suspends the simulation; if necessary, the simulation can be resumed by user intervention. Thus with the $stop task the simulator is in an interactive mode, whereas with $finish the simulation has to be started afresh.
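A minimal, hedged sketch of how these tasks are typically combined in a small test module is shown below; the module name and signal values are assumptions made only for illustration.

module tasks_tb;
    reg [3:0] a;

    initial begin
        // $monitor prints a line every time 'a' changes; one call suffices.
        $monitor("time=%0t  a=%d", $time, a);

        a = 4'd0;
        #10 a = 4'd3;
        #10 a = 4'd7;

        // $display prints once, when this statement is reached.
        $display("final value of a = %d", a);

        #10 $finish;   // end the simulation and return control to the OS
    end
endmodule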
SOFTWARE TOOLS

5.1 Xilinx
Xilinx software is used by VHDL/Verilog designers to perform synthesis. Any simulated code can be synthesized and configured on an FPGA. Synthesis is the transformation of HDL code into a gate-level netlist, and it is an integral part of current design flows.

Algorithm

1. Start the ISE software by clicking the Xilinx ISE icon.
2. Create a new project and verify the project properties that are displayed.
3. Create an HDL source, specifying all inputs, outputs and buffers as required; this opens a window in which the code to be synthesized is written.
4. Check the syntax after editing the source to catch any errors, and then generate the RTL and Technology schematics to verify the synthesis.

Software Used

5.2 ModelSim
It is software used to simulate and verify the functionality of a VHDL/VERILOG code. In our
project, we use this software for the same purpose. For each module, the input values are given
and the results are observed accordingly.

Following steps have to be followed to simulate a design in Modelsim:


1. Invoke ModelSim by double-clicking the icon. A window appears containing menus for various commands, a workspace and a library space. Create a directory in which the simulation files are to be saved.
2. Create a new file or add an existing file using the "Add items to the Project" and "Create a new file" windows.
3. Then, in the workspace of the main window, you find your file; double-clicking it opens a source window in which the code can be written.
4. After the code is written, it is saved and then compiled to check for syntax errors. If there are no syntax errors, a green tick mark is shown in the workspace; otherwise a red cross appears. A red message indicates that there is an error in the code. To correct any error, just double-click on the error message and the error is highlighted in the source window.
5. Simulate the compiled code by clicking the (+) sign of "work" in the Library tab. Clicking the file directly opens a Signals window in which all the signals (internal and external) used in the module are ready for simulation.

6. The appropriate signals are selected and added to the wave window by clicking "Add to wave" in the toolbar. The selected signals are then assigned the required values to produce the desired outputs.
7. There are different options available in the waveform window to view the output values in
various representations such as Binary, Hexadecimal, Symbolic, Octal, ASCII, Decimal and
Unsigned representations.
8. The Modelsim can be exited by selecting ‘File -> quit’ from menu.

9. In this way the Modelsim software is used for functional verification or Simulation of the
user’s code.
RESULTS
CONCLUSION

In this paper, we have proposed a recursive algorithm to obtain an orthogonal approximation of the DCT, in which an approximate DCT of length N can be derived from a pair of approximate DCTs of length N/2 at the cost of a few additions for input preprocessing. The proposed approximate DCT has several advantages, such as regularity, structural simplicity, lower computational complexity, and scalability. Comparison with recently proposed competing methods shows the effectiveness of the proposed approximation in terms of error energy, hardware resource consumption, and compressed image quality. We have also proposed a fully scalable reconfigurable architecture for approximate DCT computation, which can be configured for the computation of a 32-point DCT or for the parallel computation of two 16-point DCTs or four 8-point DCTs.

REFERENCES

[1] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, "NEDA: A low-power high-performance DCT architecture," IEEE Trans. Signal Process., vol. 54, no. 3, pp. 955–964, 2006.

[2] C. Loeffler, A. Ligtenberg, and G. S. Moschytz, "Practical fast 1-D DCT algorithm with 11 multiplications," in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP), May 1989, pp. 988–991.

[3] M. Jridi, P. K. Meher, and A. Alfalou, "Zero-quantised discrete cosine transform coefficients prediction technique for intra-frame video encoding," IET Image Process., vol. 7, no. 2, pp. 165–173, Mar. 2013.

[4] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, "Binary discrete cosine and Hartley transforms," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 4, pp. 989–1002, Apr. 2013.

[5] F. M. Bayer and R. J. Cintra, "DCT-like transform for image compression requires 14 additions only," Electron. Lett., vol. 48, no. 15, pp. 919–921, Jul. 2012.

[6] R. J. Cintra and F. M. Bayer, "A DCT approximation for image compression," IEEE Signal Process. Lett., vol. 18, no. 10, pp. 579–582, Oct. 2011.

[7] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, "Low-complexity 8x8 transform for image compression," Electron. Lett., vol. 44, no. 21, pp. 1249–1250, Oct. 2008.

[8] T. I. Haweel, "A new square wave transform based on the DCT," Signal Process., vol. 81, no. 11, pp. 2309–2319, Nov. 2001.

[9] V. Britanak, P. Y. Yip, and K. R. Rao, Discrete Cosine and Sine Transforms: General Properties, Fast Algorithms and Integer Approximations. London, U.K.: Academic, 2007.

[10] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.

[11] F. Bossen, B. Bross, K. Suhring, and D. Flynn, "HEVC complexity and implementation analysis," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1685–1696, 2012.

[12] X. Li, A. Dick, C. Shen, A. van den Hengel, and H. Wang, "Incremental learning of 3D-DCT compact representations for robust visual tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 4, pp. 863–881, Apr. 2013.

[13] A. Alfalou, C. Brosseau, N. Abdallah, and M. Jridi, "Assessing the performance of a method of simultaneous compression and encryption of multiple images and its resistance against various attacks," Opt. Express, vol. 21, no. 7, pp. 8025–8043, 2013.

[14] R. J. Cintra, "An integer approximation method for discrete sinusoidal transforms," Circuits, Syst., Signal Process., vol. 30, no. 6, pp. 1481–1501, 2011.

[15] F. M. Bayer, R. J. Cintra, A. Edirisuriya, and A. Madanayake, "A digital hardware fast algorithm and FPGA-based prototype for a novel 16-point approximate DCT for image compression applications," Meas. Sci. Technol., vol. 23, no. 11, pp. 1–10, 2012.

[16] R. J. Cintra, F. M. Bayer, and C. J. Tablada, "Low-complexity 8-point DCT approximations based on integer functions," Signal Process., vol. 99, pp. 201–214, 2014.

[17] U. S. Potluri, A. Madanayake, R. J. Cintra, F. M. Bayer, S. Kulasekera, and A. Edirisuriya, "Improved 8-point approximate DCT for image and video compression requiring only 14 additions," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 6, pp. 1727–1740, Jun. 2014.

[18] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, "A novel transform for image compression," in Proc. 53rd IEEE Int. Midwest Symp. Circuits Syst. (MWSCAS), 2010, pp. 509–512.

[19] K. R. Rao and N. Ahmed, "Orthogonal transforms for digital signal processing," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 1976, vol. 1, pp. 136–140.

[20] Z. Mohd-Yusof, I. Suleiman, and Z. Aspar, "Implementation of two dimensional forward DCT and inverse DCT using FPGA," in Proc. TENCON 2000, vol. 3, pp. 242–245.

[21] "USC-SIPI image database," Univ. Southern California, Signal and Image Processing Institute, 2012 [Online]. Available: http://sipi.usc.edu/database/

[22] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004.
