
Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures

Steven J.E. Wilton


Department of Electrical and Computer Engineering
University of British Columbia
Vancouver, BC, Canada, V6T 1Z4
stevew@ece.ubc.ca 

This work was supported by the Natural Sciences and Engineering Research Council of Canada, and UBC's Centre for Integrated Computer Systems Research.

Abstract

It has become clear that large embedded configurable memory arrays will be essential in future FPGAs. Embedded arrays provide high-density high-speed implementations of the storage parts of circuits. Unfortunately, they require the FPGA vendor to partition the device into memory and logic resources at manufacture-time. This leads to a waste of chip area for customers that do not use all of the storage provided. This chip area need not be wasted, and can in fact be used very efficiently, if the arrays are configured as large multi-output ROMs, and used to implement logic.

In this paper, we investigate how the architecture of the FPGA embedded arrays affects their ability to implement logic. Specifically, we focus on architectures which contain more than one size of memory array. We show that these heterogeneous architectures result in significantly denser implementations of logic than architectures with only one size of memory array. We also show that the best heterogeneous architecture contains both 2048 bit arrays and 128 bit arrays.

1 Introduction

On-chip storage has become an essential component of high-density FPGAs. The large systems that will be implemented on these FPGAs often require storage; implementing this storage on-chip results in faster clock frequencies and lower system costs. Two implementations of on-chip memory in FPGAs have emerged: fine-grained and coarse-grained. In FPGAs employing fine-grained on-chip storage, such as the Xilinx 4000 FPGAs, each lookup table can be configured as a small RAM, and these RAMs can be combined to implement larger user memories [1].

FPGAs employing the coarse-grained approach, on the other hand, contain large embedded arrays which are used to implement the storage parts of circuits. Examples of such devices are the Altera 10K, Apex, and Stratix devices [2, 3, 4], the Xilinx Virtex and Virtex II FPGAs [5], the Actel 3200DX and SPGA parts [6, 7], and the Lattice ispLSI FPGAs [8].

The coarse-grained approach results in significantly denser memory implementations, since the per-bit overhead is much smaller [9]. Unfortunately, it also requires the FPGA vendor to partition the chip into memory and logic regions when the FPGA is designed. Since circuits have widely-varying memory requirements, this average-case partitioning may result in poor device utilizations for logic-intensive or memory-intensive circuits. In particular, if a circuit does not use all the available memory arrays to implement storage, the chip area devoted to the unused arrays is wasted.

This chip area need not be wasted, however, if the unused memory arrays are used to implement logic. Configuring the arrays as ROMs results in large multi-output lookup-tables that can very efficiently implement some logic circuits. In [10], a new tool, SMAP, was presented that packs as much circuit information as possible into the available memory arrays, and maps the rest of the circuit into four-input lookup-tables. It was shown that this technique results in extremely dense logic implementations for many circuits; not only is the chip area of the unused arrays not wasted, but it is used more efficiently than if the arrays were replaced by logic blocks. Thus, even customers that do not require storage can benefit from embedded memory arrays.

The effectiveness of this mapping technique, however, is very dependent on the architecture of the embedded memory arrays. If the arrays are too small, the amount of logic that can be packed into each will be small, while if the arrays are too large, much of each array will be
unused. Previous studies have focused on the architecture of these memory resources when implementing storage [11, 12, 13]. Since they are so effective at implementing logic, however, it is important that the design of the embedded memory arrays also consider this.

In [14], the effects of array depth, width, and flexibility of memory arrays when they are used to implement logic were explored. That paper, however, only considered homogeneous memory architectures, i.e., architectures in which each memory array is identical. In this paper, we show that significant density improvements are possible if the FPGA contains a heterogeneous memory architecture, that is, an architecture with more than one size of memory array.

The goals of this paper are as follows:

1. The first goal is to quantify the density improvements that are possible with a heterogeneous memory architecture (compared to a homogeneous memory architecture) when used to implement logic.

2. There are many possible heterogeneous memory architectures (different array sizes, numbers, etc.). The second goal of this paper is to find the heterogeneous memory architecture that can most efficiently implement logic.

The architectural space explored in this paper is described in Section 2. Section 3 describes the experimental methodology and reviews the SMAP algorithm. Finally, Section 4 presents experimental results.

2 Embedded Array Architectures

Table 1 summarizes the parameters that define the FPGA embedded memory array architecture, along with values of these parameters for several commercial devices. In this paper we are considering architectures with two different array sizes; we denote the number of bits in each type of array as $B_1$ and $B_2$. The number of each type of array is denoted $N_1$ and $N_2$. We assume that all arrays have the same set of allowable data widths, and denote that set by $w_{\mathit{eff}}$. For a fixed size, a wider memory implies fewer memory words in each array. In the Altera FLEX10K for example, $B_1 = 2048$ bits, and $w_{\mathit{eff}} = \{1, 2, 4, 8\}$, meaning each array can be configured to be one of 2048x1, 1024x2, 512x4, or 256x8.

3 Methodology

To compare memory array architectures, we employed an experimental methodology in which we varied the various architectural parameters, and mapped a set of 28 benchmark circuits to each architecture. Each circuit contained between 527 and 6598 4-LUTs. Fifteen of the circuits were sequential. The combinational circuits and 9 of the sequential circuits were obtained from the Microelectronics Corporation of North Carolina (MCNC) benchmark suite, while the remaining sequential circuits were obtained from the University of Toronto and were the result of synthesis from VHDL and Verilog. All circuits were optimized using SIS [15] and mapped to four-input lookup-tables using Flowmap and Flowpack [16]. The SMAP algorithm was then used to pack as much circuit information as possible into the available memory arrays. The number of nodes that can be packed into the available arrays is used as a metric to compare memory array architectures.

The results in this paper depend heavily on the SMAP algorithm, which was originally developed for architectures in which all arrays are the same size. The following subsection reviews SMAP, while the subsequent subsection shows how SMAP can be used to map logic to a heterogeneous memory architecture.

3.1 Review of SMAP

This section briefly reviews SMAP; for more details, see [10].

The SMAP algorithm is based on Flowpack, a post-processing step of Flowmap [16]. Given a seed node, the algorithm finds the maximum-volume $k$-feasible cut, where $k$ is the number of address inputs to each memory array. A $k$-feasible cut is a set of no more than $k$ nodes in the fanin-network of the seed such that the seed can be expressed entirely as a function of the $k$ nodes; the maximum-volume $k$-feasible cut is the cut which contains the most nodes between the cut and the seed. The nodes that make up the cut become the memory array inputs. Figure 1(a) shows an example circuit along with the maximum 8-feasible cut for seed node A.

a) Original Circuit   b) Final Implementation

Figure 1: Example Mapping to an 8-Input, 3-Output Memory Block

Given a seed node and a cut, SMAP then selects which nodes will become the memory array outputs. Any node that can be expressed as a function of the cut nodes is a potential memory array output. The selection of the outputs
Parameter            Meaning                   Commercial Devices                          Range in
                                               Altera 10K   Vantis VF1   Lattice isp6192   this paper
$N_1$                Number of Type-1 Arrays   3-16         28-48        1                 1-9
$N_2$                Number of Type-2 Arrays   -            -            -                 1-9
$B_1$                Bits per Type-1 Array     2048         128          4608              128-8192
$B_2$                Bits per Type-2 Array     -            -            -                 128-8192
$w_{\mathit{eff}}$   Allowable Data Widths     1,2,4,8      4            9,18              1,2,4,8

Table 1: Architectural Parameters
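
The relationship between array size, data width, and depth summarized in Table 1 can be made concrete with a short sketch. The function name `array_configurations` and the tuple layout are illustrative choices, not part of SMAP or any vendor tool. The sketch assumes every depth is a power of two, so that the number of address inputs (the $k$ used when searching for a $k$-feasible cut) is simply the base-2 logarithm of the depth.

```python
def array_configurations(bits, widths):
    """List (depth, width, address_inputs) for one memory array.

    bits   -- total number of bits in the array
    widths -- allowable data widths for the array
    Assumes each resulting depth is a power of two (an assumption of
    this sketch, true for the sizes considered in Table 1's last column).
    """
    configs = []
    for width in widths:
        if bits % width != 0:
            raise ValueError(f"width {width} does not divide {bits} bits")
        depth = bits // width
        address_inputs = depth.bit_length() - 1
        if 1 << address_inputs != depth:
            raise ValueError(f"depth {depth} is not a power of two")
        configs.append((depth, width, address_inputs))
    return configs


# Altera FLEX10K example from Section 2: 2048 bits, widths 1, 2, 4, 8.
for depth, width, k in array_configurations(2048, (1, 2, 4, 8)):
    print(f"{depth}x{width}: {k} address inputs")
```

For the FLEX10K values this lists the four configurations named in Section 2 (2048x1, 1024x2, 512x4, 256x8), with between 11 and 8 address inputs. A wider configuration offers more outputs but a smaller cut, which is the trade-off the mapping algorithm has to work with.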

is an optimization problem, since different combinations of outputs will lead to different numbers of nodes that can be packed into the arrays. In [10], a heuristic was presented; the outputs with the largest number of nodes in their maximum fanout-free cone (maximum cone rooted at the potential output such that no node in the cone drives a node not in the cone) are selected. As shown in [10], those nodes in the maximum fanout-free cones of the outputs can be packed into the array. All other nodes in the network must be implemented using logic blocks. In Figure 1(a), nodes C, A, and F are the selected outputs; Figure 1(b) shows the resulting circuit implementation.

Since the selection of the seed node is so important, we repeat the algorithm for each seed node, and choose the best results.

If there is more than one array available, we map to the first array as described above. Then, we remove the nodes implemented by that array, and repeat the entire algorithm for the second array. This is repeated for each available array.

3.2 Extension to Heterogeneous Memory Architectures

The SMAP algorithm was developed assuming a homogeneous memory architecture; that is, one in which each memory array is identical. Since the arrays are packed one at a time, the above algorithm can be applied directly to architectures with different sized memory arrays. The only issue is whether the large or small arrays should be filled first. Experimentally, we have determined that the best results are obtained if we fill all of the large arrays first. The SMAP algorithm is greedy, in that, for each array, the largest portion of logic that can be mapped to the array is selected. Thus, the largest gains are likely to be obtained from the first few arrays that are filled; therefore it makes sense that these first few arrays are the large ones.

4 Results

4.1 Homogeneous Architecture Results

We first consider architectures in which all arrays are of the same size (this is the homogeneous case considered in [14]). Figure 2 shows how the effectiveness of each memory array in implementing logic depends on the array size, assuming 8 arrays are available. Figure 2(a) shows the number of logic blocks that can be packed into the arrays (averaged over our 28 benchmark circuits) vs. array size. Figure 2(b) shows the estimated chip area of the 8 memory arrays, also as a function of array size. The area estimates were obtained from a detailed area model [17] and are expressed in logic block equivalents (LBE). One LBE is the area required to implement one logic block.

Figure 2(c) shows the packing density as a function of array size. The packing density is defined as the ratio of the number of logic blocks that can be packed into the available memory arrays to the area required to implement the memory arrays (in LBEs). A packing density of 1 means that logic implemented in memory arrays is exactly as dense as the same logic implemented in logic blocks. A packing density greater than 1 means that logic implemented in memory arrays is denser than it would be if logic blocks were used. As Figure 2(c) shows, the packing density is greater than 1 for all but the largest memory array. The highest packing density occurs when the arrays each contain 512 bits. See [14] for a more thorough coverage of homogeneous architectures.

4.2 Heterogeneous Architecture Results

In this section, we consider architectures which contain two different sizes of memory arrays. Using the terminology of Section 2, each FPGA will have $N_1$ arrays of $B_1$ bits each and $N_2$ arrays of $B_2$ bits each. We restrict our attention to architectures with three different ratios of $N_1 : N_2$: 1:1, 1:2, and 1:3.

Figure 3 shows the packing density for several sizes of $B_1$ and $B_2$, assuming the ratio $N_1 : N_2 = 4 : 4$ (that is, there
a) Logic Blocks Packed   b) Area (equiv. logic blocks)   c) Packing Ratio
(each panel plotted against Bits per Array, from 128 to 8192)

Figure 2: Homogeneous Architecture Results, 8 arrays
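
The packing density plotted in Figure 2(c) is the ratio defined in Section 4.1: logic blocks packed into the arrays divided by the area of those arrays in logic block equivalents. A minimal sketch of that calculation follows. The function name and the sample numbers are made up for illustration and are not measurements from the paper.

```python
def packing_density(packed_logic_blocks, array_area_lbe):
    """Logic blocks absorbed by the memory arrays per LBE of array area."""
    if array_area_lbe <= 0:
        raise ValueError("array area must be positive")
    return packed_logic_blocks / array_area_lbe


# Hypothetical values, chosen only to illustrate the arithmetic.
density = packing_density(packed_logic_blocks=200, array_area_lbe=80)
print(f"packing density = {density:.2f}")
if density > 1:
    print("the arrays implement this logic more densely than logic blocks would")
```

With these illustrative inputs the density is 2.5, meaning each LBE of memory area replaces 2.5 logic blocks. A value below 1 would mean the same area spent on logic blocks would have held more logic.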

a) Numerical Results (packing density)

Array 2 size                   Array 1 size ($B_1$)
($B_2$)       128    256    512   1024   2048   4096   8192
 128         2.04   2.17   2.67   2.79   3.42   2.43   1.55
 256                2.10   2.61   2.73   3.33   2.41   1.56
 512                       2.77   2.86   3.27   2.40   1.57
1024                              2.73   2.98   2.28   1.53
2048                                     2.63   2.04   1.43
4096                                            1.63   1.24
8192                                                   0.99

b) Graphical Results (packing density plotted against $B_1$ and $B_2$)

Figure 3: Heterogeneous Architectures, 4 arrays of each type
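
The comparison between the best heterogeneous and best homogeneous architecture can be reproduced from the numerical results of Figure 3(a). The sketch below transcribes those values and assumes the table is triangular: each row starts at the column whose size equals the row's size, so the first entry of every row is the homogeneous case. That alignment is an interpretation of the extracted table, and the variable names are illustrative.

```python
SIZES = [128, 256, 512, 1024, 2048, 4096, 8192]

# Packing densities from Figure 3(a); each row is keyed by the smaller
# array size and begins at the column with that same size (assumed layout).
ROWS = {
    128: [2.04, 2.17, 2.67, 2.79, 3.42, 2.43, 1.55],
    256: [2.10, 2.61, 2.73, 3.33, 2.41, 1.56],
    512: [2.77, 2.86, 3.27, 2.40, 1.57],
    1024: [2.73, 2.98, 2.28, 1.53],
    2048: [2.63, 2.04, 1.43],
    4096: [1.63, 1.24],
    8192: [0.99],
}

density = {}
for small, values in ROWS.items():
    start = SIZES.index(small)
    for large, value in zip(SIZES[start:], values):
        density[(small, large)] = value

best_pair = max(density, key=density.get)
best_homogeneous = max((p for p in density if p[0] == p[1]), key=density.get)
improvement = density[best_pair] / density[best_homogeneous] - 1

print("best overall:", best_pair, density[best_pair])
print("best homogeneous:", best_homogeneous, density[best_homogeneous])
print(f"improvement: {improvement:.0%}")
```

Under that reading the best pair is 128 and 2048 bits (3.42), the best homogeneous size is 512 bits (2.77), and the ratio works out to roughly 23%, matching the figures quoted in the text.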

are four of each kind of array). As the results show, the best packing density occurs when there are four arrays of 2048 bits each, and four arrays of 128 bits each (we did not consider array sizes smaller than 128 bits, since such small arrays would not be suitable for implementing the memory parts of circuits, and thus, would not likely be considered by an FPGA manufacturer). The packing density at this point is 23% higher than the best packing density obtained for homogeneous architectures.

We repeated the experiments for several values of $N_1$ and $N_2$; selected graphical results are shown in Figure 4. In Figure 4(a), one of each type of array is assumed. In this case, the best architecture is a homogeneous architecture in which both arrays contain 2048 bits. This was the only configuration for which a homogeneous architecture was found to be the best.

Results for FPGAs with the ratio $N_1 : N_2 = 1 : 2$ (that is, FPGAs for which there are twice as many type-2 arrays as type-1 arrays) are shown in Figure 4(c) and (d). Results for FPGAs with the ratio $N_1 : N_2 = 1 : 3$ (three times as many type-2 arrays as type-1 arrays) are shown in Figure 4(e) and (f). In both cases, the best architecture was found to consist of 2048 bit arrays and 128 bit arrays (this was the case for all architectures which we investigated, except the $N_1 = N_2 = 1$ case as described above).

It is interesting to note that although an FPGA with both 128 bit arrays and 2048 bit arrays was found to be best, in some cases (Figures 4(c) and (e)) the majority of the arrays should contain 2048 bits, while in other cases, the majority of the arrays should contain 128 bits (Figures 4(d) and (f)). This can be observed in the graphs by noticing that in Figures 4(c) and (e), the highest point is to the left of the center of the graph, while in Figures 4(d) and (f), the highest point is to the right of the center of the graph.

We have investigated other architectures with an $N_1 : N_2$ ratio of 1:2 and 1:3, and have confirmed that, as the total number of arrays increases, the preference for smaller arrays increases. Intuitively, if there are more arrays, the SMAP tool is less able to effectively fill the larger arrays with logic.

A second conclusion that can be drawn from the results in Figure 4 (and confirmed by other experiments we have performed) is that as the total number of arrays increases, the advantage due to heterogeneous architectures (compared to homogeneous architectures) tends to increase. If there are only two arrays, a homogeneous architecture is
better, while if there are 12 arrays (Figures 4(d) and (f)), the heterogeneous architecture is considerably better (22% better in each case).

5 Conclusions

Although embedded arrays in FPGAs were developed in order to implement on-chip storage, it is clear that these arrays can also be configured as ROMs and used to implement logic. In this paper, we have shown that significant density improvements are possible if the FPGA contains a heterogeneous memory architecture, that is, an architecture with more than one size of memory array. The amount of improvement depends on how many memory arrays are present; if there are eight arrays, we have shown that the best heterogeneous architecture can implement logic 23% more efficiently than the best homogeneous architecture.

In virtually all cases, we have found that the best heterogeneous architecture consists of some 2048 bit arrays, and some 128 bit arrays. The exact number of each size of array depends on the total number of arrays available; the more arrays that are present, the larger the proportion that should be 128 bits.

We have also shown that the benefits of heterogeneous architectures become more significant as the number of arrays increases. This is a compelling argument for heterogeneous memory architectures. Future architectures are likely to contain more memory than they do now; FPGAs with such large memory capacities would significantly benefit if a heterogeneous architecture is used.

References

[1] Xilinx, Inc., Virtex 2.5 V Field Programmable Gate Arrays, ver. 1.6, July 1999.
[2] Altera Corporation, FLEX 10K Embedded Programmable Logic Family Data Sheet, ver. 4.1, Mar 2001.
[3] Altera Corporation, APEX 20K Programmable Logic Device Family Data Sheet, ver. 2.1, Feb 2002.
[4] Altera Corporation, Stratix Programmable Logic Device Family Datasheet, 2002.
[5] Xilinx, Inc., XC4000E and XC4000X Series Field Programmable Gate Arrays, ver. 1.6, May 1999.
[6] Actel Corporation, Datasheet: 3200DX Field-Programmable Gate Arrays, 1995.
[7] Actel Corporation, Actel's Reprogrammable SPGAs, 1996.
[8] Lattice Semiconductor Corporation, Datasheet: ispLSI and pLSI 6192 High Density Programmable Logic with Dedicated Memory and Register/Counter Modules, July 1996.
[9] T. Ngai, J. Rose, and S. J. E. Wilton, "An SRAM-Programmable field-configurable memory," in Proceedings of the IEEE 1995 Custom Integrated Circuits Conference, pp. 499-502, May 1995.
[10] S. J. E. Wilton, "SMAP: heterogeneous technology mapping for FPGAs with embedded memory arrays," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 171-178, February 1998.
[11] S. J. E. Wilton, J. Rose, and Z. G. Vranesic, "Architecture of centralized field-configurable memory," in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 97-103, 1995.
[12] S. J. E. Wilton, J. Rose, and Z. G. Vranesic, "Memory/logic interconnect flexibility in FPGAs with large embedded memory arrays," in Proceedings of the IEEE 1996 Custom Integrated Circuits Conference, pp. 144-147, May 1996.
[13] S. J. E. Wilton, J. Rose, and Z. G. Vranesic, "Memory-to-memory connection structures in FPGAs with embedded memory arrays," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 10-16, February 1997.
[14] S. J. E. Wilton, "Implementing logic in FPGA embedded memory arrays: Architectural implications," in IEEE Custom Integrated Circuits Conference, May 1998.
[15] E. Sentovich, "SIS: A system for sequential circuit synthesis," Tech. Rep. UCB/ERL M92/41, Electronics Research Laboratory, University of California, Berkeley, May 1992.
[16] J. Cong and Y. Ding, "FlowMap: an optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 13, pp. 1-12, January 1994.
[17] S. J. E. Wilton, Architectures and Algorithms for Field-Programmable Gate Arrays with Embedded Memory. PhD thesis, University of Toronto, 1997.
Panels a) through f): packing density plotted against $B_1$ and $B_2$ (128 to 8192 bits each), one panel per combination of $N_1$ and $N_2$

Figure 4: Other Selected Heterogeneous Architecture Results
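
The output-selection heuristic reviewed in Section 3.1 ranks candidate outputs by the size of their maximum fanout-free cone. The following is a minimal sketch of that cone computation, not SMAP's actual code. It assumes the netlist is an acyclic graph stored as a dictionary from each node to the list of nodes driving it, that the cut nodes are given, and that no node inside a cone is also a primary output. The function names, the helper `rank_outputs`, and the tiny example netlist are all hypothetical.

```python
def max_fanout_free_cone(root, fanins, cut):
    """Nodes in the maximum fanout-free cone rooted at `root`.

    fanins -- dict mapping a node to the nodes that drive it (assumed DAG)
    cut    -- set of cut nodes (memory array inputs); the cone stops there
    """
    # Invert the fanin lists to obtain each node's fanout set.
    fanouts = {}
    for node, sources in fanins.items():
        for src in sources:
            fanouts.setdefault(src, set()).add(node)

    cone = {root}
    grew = True
    while grew:  # repeat until no further node can be absorbed
        grew = False
        for node in list(cone):
            for src in fanins.get(node, ()):
                # Absorb src only if every node it drives is already in the cone.
                if src not in cone and src not in cut and fanouts[src] <= cone:
                    cone.add(src)
                    grew = True
    return cone


def rank_outputs(candidates, fanins, cut, max_outputs):
    """Keep the candidates with the largest cones (ties keep input order)."""
    by_size = sorted(
        candidates,
        key=lambda n: len(max_fanout_free_cone(n, fanins, cut)),
        reverse=True,
    )
    return by_size[:max_outputs]


# Hypothetical netlist: x, y, z are the cut nodes feeding b and c.
fanins = {"a": ["b", "c"], "b": ["x", "y"], "c": ["y", "z"], "d": ["c"]}
cut = {"x", "y", "z"}
print(max_fanout_free_cone("a", fanins, cut))
print(rank_outputs(["a", "c", "d"], fanins, cut, max_outputs=2))
```

In this example node `c` drives both `a` and `d`, so it belongs to neither cone, while `b` drives only `a` and is absorbed into the cone of `a`. Scoring each candidate independently, as done here, ignores overlap between cones; it is a simplification for illustration only.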
