(Doi 10.1145/196244.196367) Gupta, Pravil Chen, Chih-Tung DeSouza-Batista, J. C. Parker, - (ACM Press The 31st Annual Conference - San Diego, California, United States (1994.06.06-1994.06.10) ) P

Experience with Image Compression Chip Design
using Unied System Construction T ools

Pravil Gupta, Chih-Tung Chen, J.C. DeSouza-Batistayand Alice C. Parker
Departmen t of Electrical Engineering - Systems
University of Southern California
EEB 300, MC 2562
Los Angeles, CA 90089-2562, USA
The user of the USC software rst selects a style for the
Abstract { This paper describes the use of Unied System Construction tools under development at the Univer- system. Styles currently supported include
sity of Southern California. The goal of the project is to
heterogeneous multiprocessors consisting of
automate the construction of heterogeneous, application{ processors interconnected with point-to-point
specic systems. Key elements of the USC system inconnections, operating in a non-pipelined fashclude multiprocessor synthesis, multi-chip datapath synion.
thesis, memory-intensive synthesis, and multi-chip par{ non-pipelined processors in a ring,
{ non-pipelined processors connected by a bus,
titioning. The tools were applied to design of an image
{ pipelined processors with point-to-point conneccompression chip set, and results of using these tools are
tions,
reported on here. Our results are comparable to manual
designs reported in the literature.
multiple custom VLSI chips, communicating asyn-
1 Introduction
chronously,
Communications, entertainment, and other electronic

systems are in widespread use. These systems are generally multi-chip, heterogeneous, and application-specic.
Chip-level synthesis tools are invaluable for the rapid production of such systems, and such tools are becoming
available for general use. System-level tools can also be
used to signicantly increase a designer's ability to meet
a schedule along with a set of performance and cost constraints, but only a few of these tools have been available
in the past.
The Unied System Construction (USC) project at the
University of Southern California involves the production of an integrated set of system-level tools for synthesizing multi-chip, heterogeneous application-specic systems which meet cost, performance and power constraints.
multiple custom VLSI chips, communicating synchronously, with common clock, and
memory-intensive modules consisting of a custom
VLSI chip and a separate memory chip.

Many other styles of systems are currently under development. Once a style is selected, specialized tools
are invoked to complete the design process. Ultimately,
any custom VLSI chips which must be synthesized are
then processed by the ADAM high-level synthesis system, which produces a cell netlist. This netlist is input to
the Cascade Design Automation Chipcrafter Silicon Compiler, and a chip layout is produced.
The following sections give an overview of each major
style of design, in the order each was applied to the compression example. The remaining sections describe the
This paper presents the use of these system-level tools to image compression system to be designed and detail v arperform a multi-chip design exercise, a JPEG image com- ious design activities conducted using the USC tools.
pression chip set. The focus of the USC project is on real- 2 Synthesis of Memory-Intensive Systime systems, such as entertainment and communication
tems
technologies, but does not exclude other applications reA subset of the USC tools performs automatic synthequiring specialized system design. A block diagram of the
sis
of memory-intensive application-specic systems, with
system is shown in Figure 1.
emphasis
on hierarchical storage architecture design. The
This work was supported by the Advanced Research Projects
storage
architecture
is closely connected to the datapath
Agency and monitored by the Federal Bureau of Investigation under of the system, and isolating its synthesis from datapath
Contract No. JFBI90092.
y Supported by Conselho Nacional de Desenvolvimento Cientico
synthesis may not result in an ecient solution. Theree Tecnologico - CNPQ/Brazil.
fore, the design of the datapath and storage architecture
31ST ACM/IEEE Design Automation Conference
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage,
the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying it is by permission of the Association for
Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
1994 ACM 0-89791-653-0/94/0006 3.50
250
levels are
oso
rocessgP'
mult=p
SOS
Synthesis
I o| Systems [
|
.....
I mixed
~.~,,,,custom
clJ .....VL~I
f styles ~
......... clock
for Systems
I Tas,k .Style I
I ::~elecuon I
CHOP: 2-way partitioning, I
partition evaluation I
1. On-chip foreground memory. This consists of two

subparts: datapath memory to store the intermediate or temporary variables in the datapath, and I/O
buffers to temporarily store the inputs and outputs
to the datapath chip, allowing fast access and off-chip
storage of the data.
2. Off-chip background memory. In our model this is
the bulk storage required for the inputs and outputs.
All the I/O data values from/to the external world
are stored here.
Module
Description
=chitecturebehoffA is
beginprocess
end process,
used by
syste.m-fevel
teem
,
I
/
^
. .ff
.
~ascaoe uesign Automation ]
SMASH
Data Path Operations,

I/O buffer reads/writes, and
Datapath memory reads/writes.
Figure 1: Block diagram of the Unified System Construction (USC) Project
uatapam memory
synthesis
Module allocation for dal
ath memory.
are coordinated in the USC tool set. SMASH [9] (Figure 2), the tool set for memory-intensive design, is used
for systems designed for specific applications, where the
memory-access pattern is not only relatively fixed but
also known before hand. This mostly-deterministic access characteristic helps us in being more specific, hence
more efficient in our designs.
The original MIMOLA system was the first system to
make tradeoffs in the use of multiport memories [14].
Lippens et al. [13] describe techniques to perform automatic memory allocation and address allocation for
high speed applications. They synthesize memory after the design of arithmetic units, datapath scheduling and allocation. IMEC's CATHEDRAL-II compiles
multi-dimensional data structures into distributed dualport register files and single-port SRAMs. They use a
polyhedral-based model for high-level memory management for linear, piecewise linear and data dependent signal indexing[7]. SMASH, however, combines the tradeoffs
in datapath as well as storage architecture as explained
below.
SMASH performs high-level synthesis of an integrated
system consisting of a datapath and a two-level memory
hierarchy, from a given behavioral specification and with
constraints on cost and performance. The two memory
251
!a. I/0 transfer schedulinh between

i off-chip memory and I/0 buffers.
;b. Module allocation for I/0 buffers.
a. I/0 transfer schedule between ext. wol

and off-chip memory.
b. Module allocation for off-chip memory.
Figure 2: SMASH: Synthesis of Memory-intensive Application Specific Hardware

The synthesis is performed in the following two steps:
First, datapath synthesis with operation scheduling is performed combined with scheduling of local data transfers
to/from memory. As a result of this scheduling, constraints are placed on the memory structure. During the
second step, the storage hierarchy design is completed,
which includes determining the data transfers between
different levels of memory hierarchy and completing synthesis of the storage structures. We ensure that each step
of the stepwise construction of the system takes into account the next step by looking ahead so that the next
step is not overly constrained. Global design parameters,
like the memory bandwidth and timing constraints, are
considered when constructing the partial design in each
step, tying the whole synthesis process together.
I ProcessTransformations I
J PIL'I'H'r'L'"I"I!
, '""
Scheduling/Allocationfor [
CommunicatingProcesses
I InterconnecvControlsynmesi
ler Is
t
I SiliconOompiler I
Figure 3: A flow chart of the synthesis system for asynchronous multi-chip designs
3
Synthesis
Systems
of Asynchronous
Multi-chip
In practice, we find that many DSP and other ASIC designs consist of multiple concurrent and interacting processes. Though high-level synthesis has received enormous
attention over the years, most approaches were concentrated on synthesizing single process (one thread of control) designs. Synthesizing a design with multiple concurrent processes poses many new challenges. For example,
since the processes interact with each other, the synthesis tool has to solve all the timing constraints imposed
by one process on another concurrently [18]. Furthermore, resource allocation for each process on a chip cannot be done without taking into account the area versus
performance characteristics of each process since the total resources taken by all processes on a chip are limited
by the chip package. The goals of this research are to
provide an integrated system for synthesizing multi-chip
designs with multiple concurrent processes as well as to
speed up the redesign of multi-chip systems. Figure 3
shows a flow chart which illustrates the approach used
in this synthesis system. First, the multi-process speci-
fication is translated from VHDL to a synthesizable representation called the Design Data Structure (DDS) [3].
The next step is to perform a number of process transformations in order to trade off among hardware sharing,
control complexity, communication overhead and cost. A
process-level chip partitioner, ProPart [4], is then used to
find new cost-effective chip boundaries according to the
up-to-date packaging library. In addition, the partitioner
will distribute chip resources to the processes according
to their performance-versus-area characteristics and determine the interconnection structure as well. Next, a
concurrent approach for multiple-process synthesis is used
to synthesize each process into its own datapath and control path. The objective is to meet the timing, area and
performance constraints as well as to synchronize the communication among the processes. Finally, we use a hybrid symbolic/numeric simulation to verify the functional
and timing correctness of the RTL implementation [5].The
RTL implementation is submitted to the ADAM system
to obtain the final chip layout.
ProPart was used in the experiment of a JPEG image
compression system to be described later in this paper.
Unlike most of the previous behavioral partitioning approaches which focus on partitioning design behaviors at
the operation level into a number of synchronized chips.
ProPart tries to partition a set of sequential and/or concurrent behaviors into custom chips. There are several
advantages to process-level partitioning. For example,
there are far fewer objects at the process level than those
at the operation level, which allows us to utilize much
more comprehensive techniques like mixed integer-linear
programming and at the same time to take into account
more partitioning issues, like chip package selection and
chip resource distribution.
SpecPart [19] is the first system-level behavioral partitioning work which elevates the objects to be partitioned
to a higher level of abstraction (such as processes and procedures), and uses a group migration technique similar to
the Kernighan-Lin algorithm for partitioning. A comprehensive survey of other behavioral partitioning approaches
at the operation level has been done by Vahid [20].
4
Multiprocessor
Synthesis
Multiprocessors of various styles can be synthesized using the SOS (Synthesis of Systems) [17] set of tools. The
input to SOS is a specification in the form of a task flow
graph. SOS decides on the number and types of processors to be used, the interconnections between them, and
the schedule of execution of tasks onto processors.
Related work in the subject of system synthesis include: graph-based theoretical approaches [1]; analytical
modeling approach [10]; and mathematical programming
formulation [6]. The approach used by SOS considers a
more general case than the ones studied previously, that is
scheduling and allocation of tasks related by a precedence
252
Task Flow Graph
Processor Style Library
SOS
Design Inputconstraints
SMASH output
i
Functional
BWon.o#
Buffer Functional Execution
number area (words/Rt, uJ Wb~ size
time
(1061.tm2) cycle)
(words) resources (cycles)
30
20
18
10
6
I Schedulingand
Multiprocessor
system
I
Synthesis
Allocation
MultiprocessorSystem/,~-.~.~,. Task Execution
Structure ~ j ~
Schedule
Mapping of Subtasks
to Processors
Figure 4: Overall operation of the SOS tools
graph and with non-zero communication costs over a set
of heterogenic processors connected by a interconnection
network.
SOS tools use mixed integer-linear programming
(MILP) to model the synthesis problem. Some of the
tools can generate the MILP constraints automatically,
and others currently rely on constraints written by the
user. The tools rely on a branch-and-bound solver called
BOZO [11]. Figure 4 shows the overall operation of the
SOS tools.
5
J P E G Image Compression Design
Due to the bandwidth constraints imposed by both still

and video image transmission, data compression is a key
function. As a result, we chose to focus our design activity on a standard for still image compression, J P E G [21],
and to eventually expand the example to cover MPEG
standards as well.
Module
name
Area
(~tm2)
Delay
(ns)
Multiplier 2398490
55
Adder
Subtractor
30
30
81558
81558
Module
name
Dist./Join
Register
0
96624
6
5
4
3
3
3
3
3
1
3
3+,4-, 12"
3 +,3-,8"
3 +,3-,7"
2+,2-,4" [
2+,2-,2" !
8
13
15
20
26
Design Activities and Results

We began with the synthesis of the D C T (Discrete Cosine Transform) function. The 2D-DCT was decomposed
into repeated row-column 1D-DCTs prior to the application of the system-level tools. The 1D-DCT macro was
synthesized first and used to construct a 2D-DCT, clearly
a bottom-up step. SMASH was used to generate five
schedules for a 1D-DCT macro from a behavioral VHDL
description of the DCT described in the referenced article
[8]. The module library used is shown in Table 1. These
datapath schedules with varying cost and performance are
shown in Table 2. SMASH also determined buffer-size and
bandwidth requirement as shown in Table 2.
The 1D-DCT schedules were then processed by the
ADAM tool MABAL [12] to generate the RTL datapath
netlists. These netlists were analyzed to obtain the area
characteristic of the datapath as shown in Table 3. The
area for functional units, multiplexers and registers was
determined from the netlists, and wiring area was estimated manually using a rule-of-thumb which we observed
in our earlier experiments [16]. (We did not use our wiring
area estimation tools here due to lack of time.)
Functional
Design
30
0
3
Interconnect area
(106 gm2)
area
(10 6 ~tm2) Muxes and; Wiring
A
29.35
19.68
17.20
9.92
5.12
Table 1: Module library used by SMASH

Figure 5 shows the input JPEG-specification, the design flow used for this experiment and the output of our
system. It is important to note that the design flow has
some bottom-up portions, which represent the flow between the application of each tool, each of which operates
in essentially a top-down fashion. Thus the design flow is
both top-down and bottom-up.
12
6
4
3
3
Table 2: 1D DOT design parameters from SMASH
Area Delay
(gtm2) (ns)
Comparator 81558
5
3
3
3
2
Total
arm
(106 ixm2)
Registers
B
C = 2(A+B)
3.74
66.18
99.27
3.87
47.09
70.63
4.05
42.50
63.74
3.67
27.18
40.77
4.12
18.48
27.73
Table 3: 1D-DCT RTL designs from MABAL

For multi-chip partitioning of the compression system
we estimated the performance and silicon area of all the
parts in the system. A 2D-DCT architecture consisting
of two 1D-DCT modules ~/Od an 8 8 frame buffer was
selected as shown in the literature [8]. The worst-case
253
JPEG SYSTEM SPECIFICATION

Sourc~
Image O'ai~
3ompressed
ImageData
Decompose 2D-DCT
into two 1D-DCTs ~ .
m-ocr
VDHL
Descfiation
S M A S H and
MABAL
Propart
Partitioned 3 Chip Implementation of
RTLFive]1DD e sOCT
i g n s).
Choice 1:~ii~
2D-DCT estimation
. J ~ _ L ~ L . _ r ' C h i p
Choice :2
SOS
Layout Generation using ChipCrafter
Three 2D- DCT

Multiprocessor Architectures
2D- DCT Layout (Chip 1)
Figure 5: Design flow for still inaage compression system example

datapath delay was used to compute the performance for
each implementation with a two-phase non-overlapping
clock. The quantizer performance and silicon area were
also estimated, and the parameters used are comparable
to those reported in the literature [8]. We used parameters
of an existing chip for the Huffman coding [15].
After estimating the performance and silicon area of all
the parts in the compression system, it was partitioned by
ProPart. The d a t a used for each function in the system is
shown in Table 4a. The package library used for ProPart
is derived from a commercial ASIC library and is shown
in Table 4b.
(a) Process Characteristics
P~occss
(dct
Estimated Area/Delay points
217096/480 159819/780 146024/900

10010611200 7400411560
quan 19036/89 (fixed)
IICO 12200/100 (fixed)
deo 12200/100 (fixed)
dequ 19036/89 (fixed)
I
idct 217096/480 159819/7/80 146024/9001
10010611200 74004/I 560
the second 1D-DCT design produced by SMASH. Note

that ProPart placed the DCT and IDCT on separate dies,
and lumped the remaining functions on a single die.
Finally, we generated the layouts of the 1D-DCT macro
and 2D-DCT chip using Cascade Design Automation's
ChipCrafter (Figures 6 and 7), and analyzed the area distribution (Table 5).
iF
(b) PackageLibrary
Package Area
Pins
Cost
kl00
k209
k271
"'k464
260
304
344
452
1214
4240
6849
18923
84174
175528
227306
389800
i~:~:~! .....
I - ~ - .'~
~ ~:~. .:""'~"
!II~ _at---:
b:"
'i
'1 -
".~..xt,~,
"' ' , a'~.
...
.....
~"~=~
i ~g~:~..'.
i--
,.,.
_.
il
.......
--,:?~'~
' ....
ili ~~ l ~ ~ " ~
.....
~-,
' l!l]t]llli i
'::
:. : . z , L . ~ , , . , , ~ . ~ . ;.~-. ...... ~ : ~ ;
~ , ' ~ . - , ' ; : ~ :"
~ i
I., j,~. Ii, flOl.~

"'
:: iz~.'-.~.,~'~:-.,~,..--16
'-.'~ ;.
,n~.ci~,'~,ji~" .~.,,~,3i .J,,,
~
:":. ~'.;.~,%-~
:,$131.~(~.
- - ' i , hI; ~ f.-.,,i,,~-,,'-.,~'
' !l
'
;:~.~'.r~llll
Note: 1. A r e a is 103 lain 2.

2. D e l a y is in ns.
3. C o s t is a f u n c t i o n o f
a r e a a n d pin c a p a c i t i e s .
~"
__
"
,
f ,n
II
'~-,41', ; , I
. ....
.'"
.....
....
,;I,,
'1[ e ~ : | l ~ l t
-.~
.....
,
. ".
_ _ 1
b, "
, ' ; ' " " ~:~'~~-'|~"~
'"
"" . . . . .
;~,t:'"l
.". . . . . . . . . l ~:
n~, ~,~.'.'.: -.... . . . ~.: .
[" [ ~ " I rl~ I I ;..
. . . .
~,
o:
""'
.=',:
Table 4: Parameters used by Propart

Figure 6: Layout of 1D-DCT module
The results produced by ProPart are shown in Figure 5. ProPart selected the 2D-DCT design which uses
A comparison of our chip set with others [2] is shown

254
s..,tl, s,,,.~ c ~ r + , t , + (~ttv) ~.z
'
-~..~
.~
!11,$1
of pixels. Task T 9 is a join-distribute operator, which indicates the second set of 1D-DCTs cannot start until the
first are completed. Tasks T10 . . . T17 operate columnwise on the results of tasks T1 . . . T8. We assumed a
macro-pipeline execution between the set of tasks T1 . . .
T8 and T10 . . . T17. While tasks T 1 . . . T8 are being performed for frame i + 1, T10 . . . T17 are being performed
on the previous result of tasks T1 . . . T8.
!1111|
.t,1.
J l III
*.,
,0*
~b
Figure 7: Layout of 2 D - D C T chip

Interconnect
Total Functional Controller Muxes+
Wiring
Registers
ID-DCT 8 3 . 7 2 19.83
1.03
4.67 l 58.19
2D-DCT 209.94 39.66
2.06" 13.91- ] 154.31
* controller for the frame-buffer not included
** a 64 word on-chip RAM included.
Area
~ L ~
Figure 8: Task flow graph for SOS
Table 5: Area analysis for the layouts

in Table 6. Since we obtained the Huffman coding chip
parameters from another source, they are only compared
here to show t h a t the parameters we are using are comparable to those in the literature. The die we did design,
the DCT, has s o m e w h a t larger die size than the industrial chips, but the performance was comparable. The
technologies used by the industrial chips was not mentioned in [2], so we were not able to determine whether
our 1.2 micron CMOS technology was inherently larger
than the industrial technologies.
The design space was searched for various performance constraints with the objective of minimizing the
cost for the target architecture shown in Figure 9.
Cost/performance parameters predicated from the RTL
netlists for the 1D-DCT implementations (Table 3) were
input to SOS, so that it could choose from all three 1DD C T implementations.
FqBuffer
Memory
Figure 9: Target architecture for SOS

Area (mm2)
Ours
Bellcore
We considered three possible execution-time constraints, where the execution time is defined as the time
to compute T1 . . . TS. All processors send results to the
buffer m e m o r y over a c o m m o n 8-pixel wide bus. The
sets of processors found by SOS for various timing constraints are shown in Table 7. These results clearly show
the cost/performance tradeoff at the architectural level.
LSI
2DDCT/IDCT 11.8x17.7 10.7x10.2
9.5x9.5
Quant.~equant.(4.4x4.4)
9.0x9.0
9.9x9.9
Enc~er
(3.5x3.5)
6.6x6.2
7.4x7.4
Decker
(3.5x3.5)
7.5x8.4
7.4x7.4
Table 6: Chip-set parameters

Architecture tradeoff study
To search the design space for a wider range of implementations, we also applied the SOS multiprocessor
synthesis tool to the first stage of the 2D-DCT task flow
graph shown in Figure 8. In this graph, tasks T1 . . . T8
are 1D-DCTs which operate row-wise on the 8 8 array
255
Conclusions
The entire design exercise took less than a week of time

for three graduate students familiar with all the tools, and
able to write VHDL descriptions rapidly. We found during the course of our tool usage t h a t the tools tended to be
used in a b o t t o m - u p fashion. We also found out that our
software was not designed to take advantage of already designed macros like the 1D-DCT, so coding changes were
Input to i
sos
Design
number constraint
lime
Processors*
ns
Cost
Execution
time/ Pixel rate
8x8 flame
106 I.tm2
ns
106pixel/s
6400
4 P5
110.90
6350
10.08
3200
2 PI,
2 P5
255.08
3200
20.00
950
5 P1,
2 P2,
1 P3
1402.93
950
67.37
* PI ... P5 are the
[8] H. Fujiwara, M.L. Liou, M.T. Sun, K.M. Yang,

M.M. Maruyama, K. Shomura, and K. Ohyama. An
AII-ASIC Implementation of a Low Bit-Rate Video
Codec. IEEE Trans. on Circuits and Systems for
Video Technology, 2(2):123-134, June 1902.
[9] P. Gupta and A. C. Parker. SMASH: A Program
for Scheduling Memory-Intensive Application Specific Hardware. In 7th lnt'l Syrup. on High-Level Synthesis, May 1994.
[10] E. K. Haddad. Optimal Load Allocation for Parallel and Distributed Processing. Tech. Rep. T R 89-12,
Department of Computer Science, Virginia Polytechnic Institute and State Univ., April 1989.
Outputs from s o s
five 1D-DCTdesignsfromSMASH.
Table 7: 2D-DCT implementations from SOS

required in order to obtain the final layouts from the RTL
datapaths. It was clear from the exercise that many more
dimensions of the design space could have been searched
by our tools, given more time. For example, SOS produced a variety of architectures for 2D-DCT that can be
used to meet different design requirements. Our estimation tools, which were not used, will provide valuable information early for our tools when used in actual design
situations.
References
[1] S. H. Bokhari. Assignment Problems in Parallel and
Distributed Computing. Kluwer Academic Publishers, 1987.
[2] C.F. Chang and B.J. Sheu. A Multi-Chip Module
Design for Portable Video Compression Systems. In
IEEE Multi-Chip Module Conf., pages 39-44, 1993.
[3] C.T. Chen and A.C. Parker. VHDL2DDS: A VHDL
Language to DDS Data Structure Translator. Tech.
Rep. C Eng 91-21, Dept. of EE-Systems, Univ. Southern California, July 1991.
[4] C.T. Chen and A. C. Parker. ProPart: A ProcessLevel Behavioral Partitioner. Tech. Rep. CEng 9338, Dept of EE-Systems, Univ. of Southern California, Sept. 1993.
[5] C.T. Chen and A. C. Parker.
A Hybrid Numeric/Symbolic Program for Checking Functional
and Timing Compatibility of Synthesized Designs. In
7th lnt'l Syrup. on High-Level Synthesis, May 1994.
[6] W. W. Chu, L. J. Hollaway, M.--T. Lan, and K. Ere.
Task allocation in distributed data processing. Computer, 13(11):57-69, Nov. 1980.
[7] F. Franssen, F. Balasa, M. Swaaij, F. Catthoor, and
H. De Man. Modeling multidimenstional data and
control flow. IEEE Tran. on VLSI Systems, 1(3):319327, 1993.
256
[11] L. J. Haler and E. Hutchings. Bringing up Bozo.

Tech. Rep. CMPT T R 90-2, School of Computing
Science, Simon Fraser Univ., Burnaby, B.C., V5A
1S6, Mar. 1990.
[12] K. Kucukcakar and A. C. Parker. Data Path Tradeoffs using MABAL. 27th Design Automation Conf.,
June 1990.
[13] P.E.R. Lippens, J.L. van Meerbergen, A. van der
Weft, W.F.J. Verhaegh, and B.T. McSweeney. Memory Synthesis for High Speed DSP Applications. In
Proc. of the IEEE Custom Integrated Circuits Conf.,
May 1991.
[14] P. Marwedel. The MIMOLA Design System: Detailed Description of the Software System. In 16th
Design Automation Conf., June 1979.
[15] H. Park and V.K. Prasanna. Area Efficient VLSI Architectures for Huffman Coding. Int. Conf. on Aco.ustics, Speech and Signal Processing, 1993.
[16] A. C. Parker, P. Gupta, and A. Hussain. The Effects of Physical Design Characteristics on the AreaPerformance Tradeoff Curve. In 28th Design Automation Conf., June 1991.
[17] S. Prakash and A. Parker.
SOS: Synthesis of
Application-Specific Heterogeneous Multiprocessor
Systems. Journal of Parallel and Distributed Computing, 16:338-351, Dec. 1992.
[18] A. Takach and W. Wolf. Scheduling Constraint Generation for Communicating Processes. Tech. Rep.
SRC Pub C93069, Princeton Univ., Feb. 1993.
[19] F. Vahid and D. D. Gajski. Specification Partitioning
for System Design. In 29th Design Automation Conf.,
June 1992.
[20] F. Vahid. A Survey of Behavioral-Level Partitioning
Systems. Tech. Rep. T R ICS 91-71, UC Irvine, 1991.
[21] G. K. Wallace. The JPEG Still Picture Compression
Standard. Communication of the ACM, 34(4):31-44,
April 1991.

(Doi 10.1145/196244.196367) Gupta, Pravil Chen, Chih-Tung DeSouza-Batista, J. C. Parker, - (ACM Press The 31st Annual Conference - San Diego, California, United States (1994.06.06-1994.06.10) ) P

Enviado por

Dados do documento

Título original

Direitos autorais

Formatos disponíveis

Compartilhar este documento

Compartilhar ou incorporar documento

Opções de compartilhamento

Você considera este documento útil?

Este conteúdo é inapropriado?

Direitos autorais:

Formatos disponíveis

(Doi 10.1145/196244.196367) Gupta, Pravil Chen, Chih-Tung DeSouza-Batista, J. C. Parker, - (ACM Press The 31st Annual Conference - San Diego, California, United States (1994.06.06-1994.06.10) ) P

Enviado por

Direitos autorais:

Formatos disponíveis

Experience with Image Compression Chip Design

using Uni ed System Construction T ools

Communications, entertainment, and other electronic

 memory-intensive modules consisting of a custom

VLSI chip and a separate memory chip.

CHOP: 2-way partitioning, I

1. On-chip foreground memory. This consists of two

Data Path Operations,

Figure 1: Block diagram of the Unified System Construction (USC) Project

!a. I/0 transfer schedulinh between

a. I/0 transfer schedule between ext. wol

Figure 2: SMASH: Synthesis of Memory-intensive Application Specific Hardware

Task Flow Graph

Processor Style Library

MultiprocessorSystem/,~-.~.~,. Task Execution

J P E G Image Compression Design

Due to the bandwidth constraints imposed by both still

Design Activities and Results

Table 1: Module library used by SMASH

Table 2: 1D DOT design parameters from SMASH

Table 3: 1D-DCT RTL designs from MABAL

JPEG SYSTEM SPECIFICATION

Partitioned 3 Chip Implementation of

Layout Generation using ChipCrafter

Three 2D- DCT

2D- DCT Layout (Chip 1)

Figure 5: Design flow for still inaage compression system example

Estimated Area/Delay points

217096/480 159819/780 146024/900

the second 1D-DCT design produced by SMASH. Note

"' ' , a'~.

I., j,~. Ii, flOl.~

Note: 1. A r e a is 103 lain 2.

Table 4: Parameters used by Propart

A comparison of our chip set with others [2] is shown

s..,tl, s,,,.~ c ~ r + , t , + (~ttv) ~.z

Figure 7: Layout of 2 D - D C T chip

Figure 8: Task flow graph for SOS

Table 5: Area analysis for the layouts

Figure 9: Target architecture for SOS

2DDCT/IDCT 11.8x17.7 10.7x10.2

Table 6: Chip-set parameters

The entire design exercise took less than a week of time

* PI ... P5 are the

[8] H. Fujiwara, M.L. Liou, M.T. Sun, K.M. Yang,

Table 7: 2D-DCT implementations from SOS

[11] L. J. Haler and E. Hutchings. Bringing up Bozo.

Você também pode gostar

using Unied System Construction T ools

memory-intensive modules consisting of a custom