
Efficient Resource Utilization of FPGAs

Kashif Latif, Arshad Aziz, Athar Mahboob
National University of Science & Technology
Habib Rahmatullah Road, Karachi, Pakistan
kashif@pnec.edu.pk, arshad@nust.edu.pk, athar@pnec.edu.pk

ABSTRACT
With the growing use of FPGAs, it is becoming increasingly important to utilize the internal resources of these devices effectively and efficiently. Normal coding techniques and synthesis tools map all logic to a LUT-based architecture, which consumes more chip area and leaves some fast, dedicated areas of the chip unused. This in turn results in slower clock rates and longer critical paths, so the design remains inefficient in terms of both speed and area. In this paper we present and discuss some techniques for utilizing FPGA resources effectively in order to increase clock rates and reduce area utilization.

Categories and Subject Descriptors
B.5.2 [Register-Transfer-Level Implementation]: Design Aids - Optimization; B.6.3 [Logic Design]: Design Aids - Hardware Description Languages; B.7.2 [Integrated Circuits]: Design Aids - Placement and routing

General Terms
Performance, Design

Keywords
FPGA, Efficient Implementation, Resource Utilization, Synthesis, Technology Mapping

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
FIT'09, December 16-18, 2009, CIIT, Abbottabad, Pakistan.
Copyright 2009 ACM 978-1-60558-642-7/09/12 ...$10.00.

1. INTRODUCTION
FPGA (Field Programmable Gate Array) technology is continuously gaining momentum and becoming an essential part of modern embedded systems. Since their invention by Xilinx in 1984, FPGAs have gone from being simple glue-logic chips to replacing custom Application-Specific Integrated Circuits (ASICs) and processors in signal processing and control applications [2]. With the growing use of these devices, it is becoming more important to utilize their internal resources effectively and efficiently. Normal coding techniques and synthesis tools map all logic to a LUT (Look-Up Table) based architecture, which consumes more chip area and leaves some fast, dedicated areas of the chip unused. This in turn results in slower clock rates and longer critical paths, so the design remains inefficient in terms of both speed and area.

Normally, the utilized chip area of an FPGA is measured as a count of CLBs (Configurable Logic Blocks). A modern FPGA's CLB contains not only LUTs but also other dedicated hardware. For example, modern Xilinx FPGAs contain the dedicated carry logic gates MUXCY and ORCY, and other dedicated functional gates such as the MUXFX multiplexers and MULT_AND. Conventional techniques map all of the logic to a LUT-based architecture and leave the dedicated area of a CLB unused. In this scenario, counting chip area in CLBs does not represent the actual area utilization of the chip, because the hardware within each CLB is not fully used.

In this paper we present and discuss some techniques for utilizing FPGA resources effectively in order to increase clock rates and reduce area utilization. The remainder of this paper is organized as follows. We briefly review the internal architecture of a modern Xilinx FPGA in Section 2. Section 3 describes the conventional coding approach with the help of an implementation example. In Section 4 we present optimized techniques for the implementation of Section 3 and compare the results of both approaches. In Section 5 we discuss some optimized design techniques for wide-input Boolean operations. Finally, we provide some conclusions.

2. ARCHITECTURE REVIEW OF XILINX FPGAS
Before turning to the main topic, we briefly review the internal architecture of some modern Xilinx FPGAs. Figure 1 illustrates the internal architecture of a Xilinx FPGA. It consists of basic building blocks called CLBs (Configurable Logic Blocks). Each CLB contains 4 slices, each of which in turn contains 2 LUTs. Figures 2 and 3 show simplified views of a Xilinx CLB and slice, respectively.

The new generation FPGA architecture includes dedicated two-input multiplexers for combining LUTs, allowing devices to support functions of eight or more inputs. These specialized multiplexers improve the performance, density, and size of wide logic that can be implemented in each CLB. In addition, the slices contain a dedicated two-input OR gate (ORCY) and a two-input AND gate (MULT_AND) for operations involving wide-input AND and OR gates. These combine the four-input LUT outputs. These gates can be cascaded in a chain to provide wide AND functionality across slices. The output of the cascaded AND gates can then be combined with the dedicated ORCY to produce a Sum of Products (SOP) function.
Figure 1: Generic FPGA Architecture
[figure: a grid of Configurable Logic Blocks (CLB) joined by Switch Matrices (SM) and wire segments, surrounded by Input/Output Blocks (IOB)]

Figure 2: Xilinx CLB
[figure: a Configurable Logic Block containing Slices 0 to 3, each with two Logic Cells (LC)]

3. THE CONVENTIONAL APPROACH
The straightforward design approach is to code the design logic in an HDL (Hardware Description Language) and then let the synthesis tool do the job. The drawback of this approach is that synthesis tools are not intelligent enough and map all of the logic to a LUT-based architecture, which consumes more chip area and produces longer path delays. Hence the design becomes larger and runs at a slower clock rate. We explain this approach with an example. Suppose we have to design an 8-input AND gate. We can simply code it with a single HDL statement; the following is an example in Verilog.

    assign out = a[0] & a[1] & a[2] & a[3] & a[4] &
                 a[5] & a[6] & a[7];

Here a is the 8-bit input and out is the output of the AND gate. This statement ANDs the 8 bits of a and drives the result onto out. The synthesis tool maps this statement to an 8-bit AND function using three 4-input LUTs. The first two LUTs AND two 4-bit groups of the input, and the resulting two bits are then ANDed by the third 4-input LUT. Figure 4 illustrates the resulting hardware.

Figure 4: 8 input AND Function - LUT Based Architecture
[figure: a[0] to a[3] feed one LUT and a[4] to a[7] feed a second LUT; a third LUT combines the two LUT outputs into out]

In a Xilinx Spartan-3 FPGA a LUT4 has a gate delay of 0.479 ns and a net delay of 0.976 ns; the overall critical path delay of this circuit is 9.215 ns.
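For reference, the same 8-input AND can be written more compactly with Verilog's unary reduction operator. The sketch below is only an illustration: the module name and8_conventional is hypothetical, and it is an assumption (not a measured result) that a synthesis tool maps it to the same LUT-only structure as the explicit expression above.

```verilog
// Hypothetical wrapper module for the conventional 8-input AND.
// "&a" is the Verilog reduction AND: it ANDs all bits of a together,
// which is logically identical to a[0] & a[1] & ... & a[7].
module and8_conventional (
    input  wire [7:0] a,    // 8-bit input
    output wire       out   // high only when every bit of a is high
);
    assign out = &a;
endmodule
```

Because both forms describe the same Boolean function, the choice between them is a matter of readability; neither form by itself directs the tool toward the dedicated slice hardware.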
Figure 3: Simplified Slice Structure
[figure: a slice with two 4-input look-up tables (LUTs), each followed by carry and control logic and a flip-flop, with carry input CIN and carry output COUT]

4. THE OPTIMIZED APPROACH
In the previous example there are two stages of LUTs, so the critical path of the output includes two stage delays. We can avoid the second LUT stage by using dedicated hardware within a slice. By using a dedicated AND gate (MULT_AND) or a dedicated multiplexer (MUXCY) we can achieve the same functionality with a shorter path delay. The following example code uses a MULT_AND gate in place of the third LUT.

    wire temp1, temp2;

    // Each partial AND fits in one 4-input LUT.
    assign temp1 = a[0] & a[1] & a[2] & a[3];
    assign temp2 = a[4] & a[5] & a[6] & a[7];

    // Dedicated AND gate replaces the third LUT.
    MULT_AND MULT_AND_inst (
        .LO(out),
        .I0(temp1),
        .I1(temp2)
    );

Figure 5 shows the resulting hardware.

Figure 5: 8 input AND Function - Using MULT_AND
[figure: a[0] to a[3] feed one LUT and a[4] to a[7] feed a second LUT; a MULT_AND gate combines the two LUT outputs into out]

The MULT_AND gate in a Spartan-3 FPGA has a gate delay of only 0.001 ns and no net delay. The overall critical path delay of the circuit is now the LUT4 delay plus the MULT_AND delay, which equals 2.171 ns. This is much less than for the previous circuit, and one LUT4 is saved, which is important in terms of area. This circuit is advantageous when the output of MULT_AND is used within the chip; a drawback, however, is that the output of MULT_AND cannot be connected directly to the IO buffers of the chip. To route its output to an IO buffer, the carry chain multiplexer (MUXCY) must be used, which adds a delay of 1.664 ns to the critical path, plus the output buffer delay of 4.909 ns. The final critical path length is 2.171 + 1.664 + 4.909 = 8.744 ns, which is still less than that of the LUT-only architecture.

MUXCY can also be used directly for the same circuit functionality. The following example code uses MUXCY in place of MULT_AND.

    wire temp1, temp2;

    assign temp1 = a[0] & a[1] & a[2] & a[3];
    assign temp2 = a[4] & a[5] & a[6] & a[7];

    // MUXCY: O = S ? CI : DI, so out = temp1 ? temp2 : 1'b0.
    MUXCY MUXCY_inst (
        .O(out),
        .CI(1'b0),
        .DI(temp2),
        .S(!temp1)
    );

Figure 6 shows the resulting hardware. The MUXCY in a Spartan-3 FPGA has a gate delay of 0.983 ns and a net delay of 0.681 ns; the overall critical path delay of the circuit is 8.743 ns.

Figure 6: 8 input AND Function - Using MUXCY
[figure: a[0] to a[3] feed one LUT and a[4] to a[7] feed a second LUT; the LUT outputs drive the DI and S inputs of a MUXCY, whose output is out]

Table 1 compares the timing results of the 8-input AND gate implemented with the conventional approach and with the optimized approach using the MULT_AND gate. Results are shown for commonly used Xilinx FPGAs. For simplicity, and to show the timing effect of the two approaches more clearly, input and output buffer delays have been omitted. The last column of the table shows the percent improvement in critical path delay for each device; the tabulated values are consistent with the delay reduction taken relative to the optimized delay, (conventional - optimized) / optimized.

Table 1: Timing Results For Different Xilinx Devices

    Device        Conventional   Optimized   Percent
                  Approach       Approach    Improvement
    Spartan-3     2.615 ns       0.480 ns    444%
    Virtex-2      1.685 ns       0.417 ns    304%
    Virtex-2Pro   1.349 ns       0.277 ns    387%
    Virtex-5      0.743 ns       0.160 ns    364%

5. SOME TECHNIQUES FOR WIDE INPUT GATES

5.1 Wide input AND Operation
MUXCY gates can combine the outputs of 4-input LUTs across slices and can be cascaded into a chain to provide wide AND functionality [6]. Figure 7 shows a 16-input AND gate implementation. The technique uses each 4-input LUT to provide the SELECT signal for a MUXCY; the SELECT signal is simply the AND of 4 inputs. The VCC at the bottom of the chain reaches the output only when all of the input signals are at logic high. This use of carry logic performs AND functions at high speed and saves hardware resources.

Figure 7: 16-bit AND Gate Implementation
[figure: four LUT/MUXCY stages in Slice 0 and Slice 1, chained from VCC at the bottom to AND_OUT at the top; each LUT output is i1 & i2 & i3 & i4]

5.2 Sum of Product Function (SOP)
The output of the cascaded AND gates (Figure 7) can be combined with the dedicated ORCY gate to produce a Sum of Product (SOP) function [6]. Several slices can be used to provide the SOP, depending on the width of the desired data. Figure 8 shows the SOP of 64-bit-wide inputs using 4 cascaded 16-bit AND operations.
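The carry-chain AND of Section 5.1 and the ORCY combination of Section 5.2 can be sketched in Verilog roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the module names wide_and16 and sop_2x16 are hypothetical, the sketch uses two product terms instead of the four shown in Figure 8, and it assumes the Xilinx UNISIM primitives MUXCY (O = S ? CI : DI) and ORCY (O = CI | I) are available in the tool flow. Whether the mapper accepts this exact connectivity depends on the device family.

```verilog
// Hypothetical 16-input AND built from four LUT + MUXCY stages.
module wide_and16 (
    input  wire [15:0] a,
    output wire        and_out
);
    wire [4:0] carry;
    assign carry[0] = 1'b1;            // the VCC fed in at the bottom of the chain

    genvar i;
    generate
        for (i = 0; i < 4; i = i + 1) begin : stage
            wire sel;
            // One 4-input LUT per stage produces the SELECT signal.
            assign sel = &a[4*i +: 4];
            // SELECT high passes the carry upward; SELECT low forces 0.
            MUXCY muxcy_i (
                .O  (carry[i+1]),
                .CI (carry[i]),
                .DI (1'b0),
                .S  (sel)
            );
        end
    endgenerate

    assign and_out = carry[4];         // VCC arrives only if all 16 inputs are high
endmodule

// Hypothetical two-term SOP: (AND of a[15:0]) OR (AND of b[15:0]).
module sop_2x16 (
    input  wire [15:0] a,
    input  wire [15:0] b,
    output wire        sop_out
);
    wire and_a, and_b, or_chain;

    wide_and16 u_a (.a(a), .and_out(and_a));
    wide_and16 u_b (.a(b), .and_out(and_b));

    // Dedicated OR gates accumulate the product terms.
    ORCY orcy_a (.O(or_chain), .CI(1'b0),     .I(and_a));
    ORCY orcy_b (.O(sop_out),  .CI(or_chain), .I(and_b));
endmodule
```

Extending the ORCY chain with two more wide_and16 instances would give the 64-bit arrangement described for Figure 8.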

Figure 8: Sum Of Product (SOP) Function using Cascaded AND Gates
[figure: two CLBs, each with Slices 0 to 3; every slice holds two LUT/MUXCY stages that each AND four inputs i[0] to i[3]; the carry chains start from VCC, and ORCY gates combine the chain outputs into SOP OUT]

Figure 9: MUXF5 and MUXFX Multiplexers [9]
[figure: Slices S0 to S3, each with F and G LUTs feeding a MUXF5 (F5); the MUXF6 in Slice S0 combines the two MUXF5 outputs from Slices S0 and S1; the MUXF6 in Slice S2 combines the two MUXF5 outputs from Slices S2 and S3; the MUXF7 in Slice S1 combines the two MUXF6 outputs from Slices S0 and S2; the MUXF8 in Slice S3 combines the two MUXF7 outputs (two CLBs)]

Figure 10: 4-input LUT as a 2:1 MUX
[figure: a LUT with two data inputs, a Select input and an Enable input driving one output]

5.3 Wide input MUX Operation


In addition to the MUXCY and ORCY gates, modern Xilinx FPGAs also contain MUXFX multiplexers dedicated to the design of wide-input multiplexers [9]. The Virtex-II architecture contains two dedicated multiplexers per slice, MUXF5 and MUXFX. The MUXFX multiplexer implements the MUXF6, MUXF7, or MUXF8, as shown in Figure 9. Each CLB has two MUXF6 multiplexers, one MUXF7 multiplexer and one MUXF8 multiplexer.
Using these multiplexers, each slice can implement a 4:1 multiplexer, each CLB a 16:1 multiplexer, and two CLBs a 32:1 multiplexer. A single 4-input LUT, however, can implement at most a 2:1 multiplexer, as shown in Figure 10.
Figure 11 shows how 8:1 and 16:1 multiplexers can be implemented using these dedicated MUXFXs.
Table 2 summarizes the hardware required to implement a particular multiplexer.
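As a rough illustration of how these multiplexers compose, the sketch below builds a 4:1 multiplexer from two LUT-level 2:1 stages plus a MUXF5, and an 8:1 multiplexer from two of those plus a MUXF6. The module names mux4_f5 and mux8_f6 are hypothetical, and the sketch assumes the Xilinx UNISIM primitives MUXF5 and MUXF6 (both O = S ? I1 : I0) are available; actual placement into one or two slices is up to the mapping tool.

```verilog
// Hypothetical 4:1 multiplexer: two LUTs + MUXF5 (one slice).
module mux4_f5 (
    input  wire [3:0] d,
    input  wire [1:0] sel,
    output wire       y
);
    // Each 2:1 stage needs two data inputs and one select,
    // so it fits in a single 4-input LUT.
    wire lo = sel[0] ? d[1] : d[0];
    wire hi = sel[0] ? d[3] : d[2];

    // The dedicated MUXF5 combines the two LUT outputs.
    MUXF5 f5 (.O(y), .I0(lo), .I1(hi), .S(sel[1]));
endmodule

// Hypothetical 8:1 multiplexer: two slices + MUXF6.
module mux8_f6 (
    input  wire [7:0] d,
    input  wire [2:0] sel,
    output wire       y
);
    wire y_lo, y_hi;

    mux4_f5 m_lo (.d(d[3:0]), .sel(sel[1:0]), .y(y_lo));
    mux4_f5 m_hi (.d(d[7:4]), .sel(sel[1:0]), .y(y_hi));

    // The dedicated MUXF6 combines the two MUXF5 outputs.
    MUXF6 f6 (.O(y), .I0(y_lo), .I1(y_hi), .S(sel[2]));
endmodule
```

The same pattern continues upward: two mux8_f6 blocks joined by a MUXF7 would give a 16:1 multiplexer, and two of those joined by a MUXF8 a 32:1 multiplexer.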
Figure 11: Multiplexer Implementation using MUXFX [6]

Table 2: Hardware Requirement for Different Multiplexers

    Hardware Resources    MUX
    2 LUTs + MUXF5        4:1
    2 Slices + MUXF6      8:1
    4 Slices + MUXF7      16:1
    2 CLBs + MUXF8        32:1

5.4 General Wide Gate Input Functions
MUXFX logic can also be used to implement other wide-input functions. These dedicated multiplexers are named so that their names describe their functionality: MUXF6 can implement any function of 6 inputs, MUXF7 any 7-input function, and MUXF8 any 8-input function. Using MUXFX logic we can implement a custom Boolean function of up to 39 inputs within a single CLB, or a 79-input function in two CLBs. Figure 12 shows an example of a 39-input custom Boolean function implemented within a single CLB.

Figure 12: 39-input wide Custom Boolean Function in a CLB [6]

6. CONCLUSIONS
In this paper we have presented some useful techniques for utilizing FPGA hardware resources effectively and efficiently. By applying the discussed techniques, not only can the utilized area of the FPGA be minimized, but the critical path lengths of designs can also be reduced. Consequently, designs can run at higher clock rates and more logic can be added to the chip. The dedicated hardware resources of Xilinx FPGAs were discussed as a way to reduce the reliance of designs on LUT-based architectures, which helps reduce area consumption and yields more timing-efficient architectures. Normally, the utilized chip area of an FPGA is measured as a count of CLBs. However, conventional mapping techniques do not use the internal hardware resources of a CLB optimally, so a CLB count does not represent the actual utilized area of the chip. With optimized techniques, a CLB's hardware resources can be used optimally and more logic can be packed into a single CLB. This reduces the overall CLB count and makes the CLB count a truer measure of utilized chip area.

7. REFERENCES
[1] The Field Programmable Gate Array (FPGA): Expanding Its Boundaries. Instant Market Research, April 2006.
[2] BDTI. FPGAs for DSP. BDTI Focus Report, BDTI Benchmarking, 2006.
[3] M. D. Ciletti. Advanced Digital Design with the Verilog HDL. Pearson Prentice Hall, 2007.
[4] National Instruments. FPGAs - Under the Hood. http://zone.ni.com/devzone/cda/tut/p/id/6983, April 2008.
[5] National Instruments. Introduction to FPGA Technology: Top Five Benefits. http://zone.ni.com/devzone/cda/tut/p/id/6984, June 2008.
[6] R. Krueger and B. Przybus. Virtex Variable-Input LUT Architecture. Xilinx White Paper: Virtex and Virtex-II Series FPGAs, January 2004.
[7] M. Thompson. FPGAs Accelerate Time to Market for Industrial Designs. EE Times, July 2004.
[8] Xilinx. Spartan-3 FPGA Family: Complete Datasheet. http://www.xilinx.com/, April 2008.
[9] Xilinx. Virtex-II Platform FPGAs: Complete Datasheet. http://www.xilinx.com/, November 2007.
[10] Xilinx. Virtex-II Platform FPGAs: User Guide. http://www.xilinx.com/, November 2007.