Escolar Documentos
Profissional Documentos
Cultura Documentos
1 Introduction
chronously,
multiple custom VLSI chips, communicating synchronously, with common clock, and
250
levels are
oso
rocessgP'
mult=p
SOS
Synthesis
I o| Systems [
|
.....
I mixed
~.~,,,,custom
clJ .....VL~I
f styles ~
......... clock
for Systems
I Tas,k .Style I
I ::~elecuon I
partition evaluation I
used by
syste.m-fevel
teem
,
I
/
^
. .ff
.
~ascaoe uesign Automation ]
SMASH
uatapam memory
synthesis
Module allocation for dal
ath memory.
are coordinated in the USC tool set. SMASH [9] (Figure 2), the tool set for memory-intensive design, is used
for systems designed for specific applications, where the
memory-access pattern is not only relatively fixed but
also known before hand. This mostly-deterministic access characteristic helps us in being more specific, hence
more efficient in our designs.
The original MIMOLA system was the first system to
make tradeoffs in the use of multiport memories [14].
Lippens et al. [13] describe techniques to perform automatic memory allocation and address allocation for
high speed applications. They synthesize memory after the design of arithmetic units, datapath scheduling and allocation. IMEC's CATHEDRAL-II compiles
multi-dimensional data structures into distributed dualport register files and single-port SRAMs. They use a
polyhedral-based model for high-level memory management for linear, piecewise linear and data dependent signal indexing[7]. SMASH, however, combines the tradeoffs
in datapath as well as storage architecture as explained
below.
SMASH performs high-level synthesis of an integrated
system consisting of a datapath and a two-level memory
hierarchy, from a given behavioral specification and with
constraints on cost and performance. The two memory
251
of the stepwise construction of the system takes into account the next step by looking ahead so that the next
step is not overly constrained. Global design parameters,
like the memory bandwidth and timing constraints, are
considered when constructing the partial design in each
step, tying the whole synthesis process together.
I ProcessTransformations I
J PIL'I'H'r'L'"I"I!
, '""
Scheduling/Allocationfor [
CommunicatingProcesses
I InterconnecvControlsynmesi
ler Is
t
I SiliconOompiler I
Figure 3: A flow chart of the synthesis system for asynchronous multi-chip designs
3
Synthesis
Systems
of Asynchronous
Multi-chip
In practice, we find that many DSP and other ASIC designs consist of multiple concurrent and interacting processes. Though high-level synthesis has received enormous
attention over the years, most approaches were concentrated on synthesizing single process (one thread of control) designs. Synthesizing a design with multiple concurrent processes poses many new challenges. For example,
since the processes interact with each other, the synthesis tool has to solve all the timing constraints imposed
by one process on another concurrently [18]. Furthermore, resource allocation for each process on a chip cannot be done without taking into account the area versus
performance characteristics of each process since the total resources taken by all processes on a chip are limited
by the chip package. The goals of this research are to
provide an integrated system for synthesizing multi-chip
designs with multiple concurrent processes as well as to
speed up the redesign of multi-chip systems. Figure 3
shows a flow chart which illustrates the approach used
in this synthesis system. First, the multi-process speci-
fication is translated from VHDL to a synthesizable representation called the Design Data Structure (DDS) [3].
The next step is to perform a number of process transformations in order to trade off among hardware sharing,
control complexity, communication overhead and cost. A
process-level chip partitioner, ProPart [4], is then used to
find new cost-effective chip boundaries according to the
up-to-date packaging library. In addition, the partitioner
will distribute chip resources to the processes according
to their performance-versus-area characteristics and determine the interconnection structure as well. Next, a
concurrent approach for multiple-process synthesis is used
to synthesize each process into its own datapath and control path. The objective is to meet the timing, area and
performance constraints as well as to synchronize the communication among the processes. Finally, we use a hybrid symbolic/numeric simulation to verify the functional
and timing correctness of the RTL implementation [5].The
RTL implementation is submitted to the ADAM system
to obtain the final chip layout.
ProPart was used in the experiment of a JPEG image
compression system to be described later in this paper.
Unlike most of the previous behavioral partitioning approaches which focus on partitioning design behaviors at
the operation level into a number of synchronized chips.
ProPart tries to partition a set of sequential and/or concurrent behaviors into custom chips. There are several
advantages to process-level partitioning. For example,
there are far fewer objects at the process level than those
at the operation level, which allows us to utilize much
more comprehensive techniques like mixed integer-linear
programming and at the same time to take into account
more partitioning issues, like chip package selection and
chip resource distribution.
SpecPart [19] is the first system-level behavioral partitioning work which elevates the objects to be partitioned
to a higher level of abstraction (such as processes and procedures), and uses a group migration technique similar to
the Kernighan-Lin algorithm for partitioning. A comprehensive survey of other behavioral partitioning approaches
at the operation level has been done by Vahid [20].
4
Multiprocessor
Synthesis
Multiprocessors of various styles can be synthesized using the SOS (Synthesis of Systems) [17] set of tools. The
input to SOS is a specification in the form of a task flow
graph. SOS decides on the number and types of processors to be used, the interconnections between them, and
the schedule of execution of tasks onto processors.
Related work in the subject of system synthesis include: graph-based theoretical approaches [1]; analytical
modeling approach [10]; and mathematical programming
formulation [6]. The approach used by SOS considers a
more general case than the ones studied previously, that is
scheduling and allocation of tasks related by a precedence
252
SOS
Design Inputconstraints
SMASH output
i
Functional
BWon.o#
Buffer Functional Execution
number area (words/Rt, uJ Wb~ size
time
(1061.tm2) cycle)
(words) resources (cycles)
30
20
18
10
6
I Schedulingand
Multiprocessor
system
I
Synthesis
Allocation
Structure ~ j ~
Schedule
Mapping of Subtasks
to Processors
Figure 4: Overall operation of the SOS tools
graph and with non-zero communication costs over a set
of heterogenic processors connected by a interconnection
network.
SOS tools use mixed integer-linear programming
(MILP) to model the synthesis problem. Some of the
tools can generate the MILP constraints automatically,
and others currently rely on constraints written by the
user. The tools rely on a branch-and-bound solver called
BOZO [11]. Figure 4 shows the overall operation of the
SOS tools.
5
Area
(~tm2)
Delay
(ns)
Multiplier 2398490
55
Adder
Subtractor
30
30
81558
81558
Module
name
Dist./Join
Register
0
96624
6
5
4
3
3
3
3
3
1
3
3+,4-, 12"
3 +,3-,8"
3 +,3-,7"
2+,2-,4" [
2+,2-,2" !
8
13
15
20
26
30
0
3
Interconnect area
(106 gm2)
area
(10 6 ~tm2) Muxes and; Wiring
A
29.35
19.68
17.20
9.92
5.12
12
6
4
3
3
Area Delay
(gtm2) (ns)
Comparator 81558
5
3
3
3
2
Total
arm
(106 ixm2)
Registers
B
C = 2(A+B)
3.74
66.18
99.27
3.87
47.09
70.63
4.05
42.50
63.74
3.67
27.18
40.77
4.12
18.48
27.73
253
Image O'ai~
3ompressed
ImageData
Decompose 2D-DCT
into two 1D-DCTs ~ .
m-ocr
VDHL
Descfiation
S M A S H and
MABAL
Propart
RTLFive]1DD e sOCT
i g n s).
Choice 1:~ii~
2D-DCT estimation
. J ~ _ L ~ L . _ r ' C h i p
Choice :2
SOS
(dct
(b) PackageLibrary
Package Area
Pins
Cost
kl00
k209
k271
"'k464
260
304
344
452
1214
4240
6849
18923
84174
175528
227306
389800
i~:~:~! .....
I - ~ - .'~
~ ~:~. .:""'~"
!II~ _at---:
b:"
'i
'1 -
".~..xt,~,
...
.....
~"~=~
i ~g~:~..'.
i--
,.,.
_.
il
.......
--,:?~'~
' ....
ili ~~ l ~ ~ " ~
.....
~-,
' l!l]t]llli i
'::
:. : . z , L . ~ , , . , , ~ . ~ . ;.~-. ...... ~ : ~ ;
~ , ' ~ . - , ' ; : ~ :"
~ i
:: iz~.'-.~.,~'~:-.,~,..--16
'-.'~ ;.
,n~.ci~,'~,ji~" .~.,,~,3i .J,,,
~
:":. ~'.;.~,%-~
:,$131.~(~.
- - ' i , hI; ~ f.-.,,i,,~-,,'-.,~'
' !l
'
;:~.~'.r~llll
~"
__
"
,
f ,n
II
'~-,41', ; , I
. ....
.'"
.....
....
,;I,,
'1[ e ~ : | l ~ l t
-.~
.....
,
. ".
_ _ 1
b, "
, ' ; ' " " ~:~'~~-'|~"~
'"
"" . . . . .
;~,t:'"l
.". . . . . . . . . l ~:
n~, ~,~.'.'.: -.... . . . ~.: .
[" [ ~ " I rl~ I I ;..
. . . .
~,
o:
""'
.=',:
'
-~..~
.~
!11,$1
of pixels. Task T 9 is a join-distribute operator, which indicates the second set of 1D-DCTs cannot start until the
first are completed. Tasks T10 . . . T17 operate columnwise on the results of tasks T1 . . . T8. We assumed a
macro-pipeline execution between the set of tasks T1 . . .
T8 and T10 . . . T17. While tasks T 1 . . . T8 are being performed for frame i + 1, T10 . . . T17 are being performed
on the previous result of tasks T1 . . . T8.
!1111|
.t,1.
J l III
*.,
,0*
~b
~ L ~
The design space was searched for various performance constraints with the objective of minimizing the
cost for the target architecture shown in Figure 9.
Cost/performance parameters predicated from the RTL
netlists for the 1D-DCT implementations (Table 3) were
input to SOS, so that it could choose from all three 1DD C T implementations.
FqBuffer
Memory
Bellcore
We considered three possible execution-time constraints, where the execution time is defined as the time
to compute T1 . . . TS. All processors send results to the
buffer m e m o r y over a c o m m o n 8-pixel wide bus. The
sets of processors found by SOS for various timing constraints are shown in Table 7. These results clearly show
the cost/performance tradeoff at the architectural level.
LSI
9.5x9.5
Quant.~equant.(4.4x4.4)
9.0x9.0
9.9x9.9
Enc~er
(3.5x3.5)
6.6x6.2
7.4x7.4
Decker
(3.5x3.5)
7.5x8.4
7.4x7.4
Conclusions
Input to i
sos
Design
number constraint
lime
Processors*
ns
Cost
Execution
time/ Pixel rate
8x8 flame
106 I.tm2
ns
106pixel/s
6400
4 P5
110.90
6350
10.08
3200
2 PI,
2 P5
255.08
3200
20.00
950
5 P1,
2 P2,
1 P3
1402.93
950
67.37
Outputs from s o s
five 1D-DCTdesignsfromSMASH.