Exploring VLIW ASIP Design Space Using Trimaran Based Framework

Exploring VLIW ASIP Design Space using
Trimaran Based Framework
A dissertation submitted in partial fulfillment

of the requirements for the degree of
MASTER OF TECHNOLOGY
in
COMPUTER SCIENCE & ENGINEERING
Submitted by
Vundela Srinivas Reddy
Under the guidance of

Prof. Anshul Kumar
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

INDIAN INSTITUTE OF TECHNOLOGY, DELHI
May 2006
Certificate
This is to certify that the thesis titled Exploring VLIW ASIP Desgin
Space using Trimaran Framework being submitted by Vundela Srini-
vasa Reddy for the award of Masters of Technology in Computer Sci-
ence & Engineering is a record of bonafide work carried out by him under
my guidance and supervision at the Department of Computer Science &
Engineering, Indian Institute of Technology, Delhi and EPFL, Lau-
sanne, Switzerland. The work presented in this thesis has not been sub-
mitted elsewhere, either in part or full, for the award of any other degree or
diploma.
Prof. Anshul Kumar

Department of Computer Science & Engg.
Indian Institute of Technology, Delhi.
Acknowledgments
I am greatly indebted to my supervisor Prof. Anshul Kumar & Prof. M. Bal-

akrishnan IIT Delhi, for their invaluable technical guidance and moral support
during the course of the project. I am greatful to Prof. Paolo Ienne for his
help during the project.
I would like to thanks Nagaraju, Neeraj, Ajay Kumar Verma for their
feedbacks during the informal discussions and my other colleagues for their
cooperations and support.
Vundela Srinivas Reddy

Abstract
The main aim of the project was to explore the VLIW ASIP design space
to find out the optimal benifit of using ASIP for VLIW processor through a
fully automated design methodology using the Trimaran framework. Pluggable
modules for automatic identification, evaluation and selection of the coarse
grained functional units are implemented in Elcor phase of Trimaran to take
advantage of architectural dependent optimizations available in the early phase
of Elcor. An optimal algorithm and a heuristic methodology for estimating the
the coarse grained fuctional units are proposed. This methodology enabled to
demonostrate the ASIP design flow model for several benchmarks.
Contents
1 Introduction & Motivation 1

1.1 ASIPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 VLIW Architectures . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 VLIW & ASIPs . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Processor Specialization Techniques . . . . . . . . . . . . . . . . 4
1.5 Objectives of the current work . . . . . . . . . . . . . . . . . . . 5
1.6 Outline of the report . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Background 7
2.1 ASIP Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Framework - Trimaran Compiler Infrastructure . . . . . . . . . 11
2.2.1 MDES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 HPL-PD Architecture . . . . . . . . . . . . . . . . . . . 15
2.2.3 IMPACT . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.4 ELCOR . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.5 Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Design space exploration for single issue machines . . . . . . . . 17
2.3.1 Identification of Coarse Grained FU’s . . . . . . . . . . . 18
2.3.2 Objective Function based Estimation . . . . . . . . . . . 20
2.4 Static List Scheduling Algorithm . . . . . . . . . . . . . . . . . 20
3 Design Space Exploration for VLIW processors 23
i
3.1 Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Motivation Example . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Objective Function . . . . . . . . . . . . . . . . . . . . . 28
3.2.3 Scheduling based Estimation . . . . . . . . . . . . . . . . 30
3.3 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4 Implementation 32
4.1 Earlier Framework . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Enhanced Framework . . . . . . . . . . . . . . . . . . . . . . . . 35
5 Results 39
5.1 rawdaudio application . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 fir application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3 mmdyn application . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4 fact application . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.5 sqrt application . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6 Conclusion 47
6.1 Conclusions and contributions . . . . . . . . . . . . . . . . . . . 47
6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.2.1 Multiple AFUs per region . . . . . . . . . . . . . . . . . 48
6.2.2 Extending Trimaran to remove input/output port con-
straint . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.2.3 Control Flow in AFUs . . . . . . . . . . . . . . . . . . . 49
6.2.4 Multiple Objective Selection . . . . . . . . . . . . . . . . 49
6.2.5 AFUs with memory accessing capability . . . . . . . . . 50
ii
List of Figures
2.1 ASIP Synthesis Methodology . . . . . . . . . . . . . . . . . . . 8

2.2 Components of Trimaran Compiler Infrastructure . . . . . . . . 12
2.3 MDES in Trimaran . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 General MIMO Identification Algortihm. . . . . . . . . . . . . . 19
2.5 The static list scheduling algorithm . . . . . . . . . . . . . . . . 22
3.1 Pseudo convex cut . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.2 Motivation Example . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Motivation Example 2 . . . . . . . . . . . . . . . . . . . . . . . 29
4.1 Earlier Framework . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.2 Enhanced Framework . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 VLIW ASIP exploration . . . . . . . . . . . . . . . . . . . . . . 37
iii
List of Tables
3.1 Hardware and sofware latencies of the operations in Fig. 3.1 . . 23
5.1 rawcaudio benchmark results . . . . . . . . . . . . . . . . . . . . 41

5.2 fir benchmark results . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 mmdyn benchmark results . . . . . . . . . . . . . . . . . . . . . 44
5.4 fact benchmark results . . . . . . . . . . . . . . . . . . . . . . . 45
5.5 sqrt benchmark results . . . . . . . . . . . . . . . . . . . . . . . 46
iv
Chapter 1
Introduction & Motivation
Over the last decade, It has been experimentally proven that Application Spe-
cific Instruction set Processors (ASIPs) which falls in between the conven-
tionaly methodology for high performance and low power implementations for
the applications, Application Specific Integrated Circuits (ASICs) and flexible
methodology, General Purpose Processors (GPPs) for single issue machines
like Reduced Instruction Set Computers (RISC) and Complex Instruction Set
Computers (CISC). The main aim of this work focusses on different aspects
of ASIP synthesis for multiple issue machines mainly Very Long Instruction
Word (VLIW) processors.
-
1.1 ASIPs
For high performance and lower power implemention of the applications, ASICs
are the best solution. But this solution leads to larger area, most rigid solution,
much time taking and coslty solution to the problem. GPPs are the most
flexible implementation for all applications. But this leads to low performance
and high power consumption. Research has been shifted towards the medium
solution, which takes the advantage of both the ASICs and GPPs termed as
ASIPs. ASIPs gives higher performance than GPPs and more flexible than
1
ASICs.
In the last decade, research in design methodologies for system-on-chip
has been mainly revolving around the synthesis of ASIPs. This involved the
automatic generation of complete instruction sets for specific applications. In
that contex, the goal is typically to design an instruction set which minimises
some important metric(e.g., run time, program memory size, execution unit
count).
Most recently, the attention has shifted towards extending generic proces-
sors with units specialised for a given domain, rather than desiging completely
custom processors. The main goal of this process is exploit the frequently
executing kernals of the application and map them to custumised hardware
units which finally leads to better performance. ASIPs can be very efficient
when applied to a specific applications such as digital signal processing, servo
motor control, image processing applications etc. However, ASIPs can save
the area of the processor by removing the redundant units not utilized by the
application domain.
1.2 VLIW Architectures

High performance computing is based on the exploitation of fine grain prar-
allelism occuring within a given application. A number of alternatives have
been devised to perform spatial computation. These include Pipelining, use
of Multiple Processors, multiple issue and execution of instructions using
Superscalar or VLIW architectures.
Though both the VLIW and the superscalar architectures rely on extraction
of instruction level parallelism from the application, VLIW based architecture
provides for some implementation advantages.
The VLIW based processors have relatively simple control logic because
they do not perform any dynamic scheduling nor reordering of operations as
2
in the case in most contemporary superscalar processors. Since ILP is exploited
with software, it eliminates the need for the instruction and reorder buffers,
and hence benefit is less logic, and thus less lost.
A large size of the window of instructions to be looked at to extract ILP
would require a more sophisticated dispatcher and would directly increase the
amount of hardware and chip logic used in case of Superscalare implementa-
tions. Alternatively, VLIW uses a software window and hence is able to exploit
the ability of spatial computation more successfully.
Some of the advantages sought by the superscalar implementations, in-
cluding access to dynamic information, and speculative execution are easily
available to VLIW based architecture too.. VLIW can easily mimic the spec-
ulative execution using register space. Speculative results are placed in tem-
porary registers whose values are discarded, once the brach in known to be
mis-predicated. Trace driven compilation techniques are able to record the
dynamic program behaviour and facilitates a better exploitation of ILP.
VLIW based architecture moves the complexity from hardware to the com-
piler. Thus the complexity is paid for only once, when the compiler is written
and not every time the chip is fabricated.
1.3 VLIW & ASIPs

ASIPs are programmable processors, specialised for a particular application or
for a family of applications. For processor specialisation, two design tools are of
great importance: Retargetable Compiler and Configurable Simulator.
These requirements directly point to the use of VLIW based architecture for
ASIP synthesis.
Retargetable Compiler can be used for a range of different target archi-
tectures. The actual code optimization and code generation is done by the
3
compiler, based on a description of the target processor architecture. The ar-
chitecture description is formulated in an architecture descritpiton language.
Configurable Simulator can be configured for a particular architecture(based
on an architecture description) and can generate statistics regarding through-
put, delay, power/energy consumption etc.
1.4 Processor Specialization Techniques

Instruction set, microarchitecture and/or memory system of a processor can be
customised for an application or family of applications to maximise efficiency
in terms of latency, power etc.
Various forms of specialisations include Instruction Set Specialisation, Func-
tion unit and Data path Specialisation, Memory Specialisation, Cache config-
uration and Interconnet specialisation.
Instruction set specilisation involves automatic generation of complete in-
struction sets for specific applications. In that context, the goal is typically to
cluster all the atomic operations required for an application into an instruc-
tion set which minimises some important metric(e.g., execution time, program
memory size, number of execution units etc.). It some times excludes atomic
operations which are not used for the application or set of applications, which
leads to minimum number of functional units in the customized processor.
Functional unit and Data path specialisation involves adaption of word
length, register number, and functional units. Highly specialised functional
units can be introduced for string matching and manipulation, pixel oper-
ation, arithmetics and even complex units to perform certain sequences of
computations (co-processors).
Memory Specialisation may adapt the number and size of memory banks
and ports. They both influence the degree of parallelism in memory acess.
Having several smaller memory blocks(instead of one big) increases parallelism
4
and speed, and reduces power consumption.
Cache configuration includes parameters like seperate instruction/date caches,
associativity, cache size, line size, which depend very much on the characteris-
tics of the application and, in particular, on the properties related to locality.
This form of specialisation can have a very large impact on performance and
power consumption,
Interconnect Specialization which affects the interconnnection of functional
modules and registers, momory and chche increases the potential of parallelism.
The current work tires to achieve processor specialisation through Instruc-
tion Set Specialisation and Function Unit Specilisation. Special instruc-
tion, specific to the application and synthesised are mapped on to spefial FUs
designed especially for them. These FUs are so desinged as to exploit many
of the optimisations possible including bit width reduction, hard wiring of
constants, spatial computation, chaining of operations etc.
1.5 Objectives of the current work

The main objective of the present work is to develop the complete automatic
design framework for VLIW ASIP synthesis to evaluate the hardware/software
tradeoffs provided by ASIPs.
This process includes automoatic identification of hot spots(kernals) of the
application at the coarse level that can be implemented using the hardware
functional units to improve the performance of the applciation. Pluggable
modules are designed for automatic identification of such kernal of the appli-
cation.
The major objective of this project is to find the better estimation method-
ology in terms of base architecture that estimates each Application Specific
Functional units(AFUs). Two algorithms are proposed and implemented to
correctly estimate the each AFU.
5
In order to take the advantages of the architectural dependent optimiza-
tions available in the Trimaran framework and eliminate the need for two
compiler infrastructures used in the earlier work, the pluggable modules are
now implemented in the Elcor phase of the Trimaran infrastructure. It also
eliminate the need to modification of the source code of the compiler framework
and also the need of several scripts used in the earlier work.
1.6 Outline of the report

Chapter 2 gives the background work behind this project. It includes literatrue
survay, ASIP design methodology, algorithms used for single issue machines to
get the optimal benifit by using the ASIPs. Chapter 3 gives the approach we
are proposing for multiple issue machines. Chapter 4 gives the procedure how
to extend the trimaran compiler framework to use the architectrue dependent
optimizations available in the base architecture. Chapter 5 gives the results
of several bechmark applications taken from media bench. Finally Chapter 6
gives the summarizes the work we have done and discusses the possible future
directions.
6
Chapter 2
Background
2.1 ASIP Synthesis

Interest in synthesis of Application Specific Instruction Processors(ASIPs) has
increased considerbly and a number of methodologies have been proposed in
the last decade. The main requirements of the design of application-specific
architectures [10] are as follows.
• Design starts with the application behavior.
• Evaluate several architectural options.
• Identify hardware functionalities to speed up the application.
• Introduce hardware resources for frequently used operations only if it

can be supported during the compilation.
Fig. 2.1 shows the main steps, which meets the above requirements for
Compiler based frameworks.
1. Application Analysis : Input to the ASIP design process is an ap-

plication or a set of applications, along with their test data and design
constraints. It is essential to get the desired charecteristics/requirements
which can guide the estimation of each Application specific Functional
7
Application Code &
Design Constriants
Application Analysis
Identification Algorithms Automatic Identification of Course Grained FUs
Estimation Algorithms Evaluation of FU’s
Selection Algorithms Selection of FU’s
Customization of Instruction Set

Machine Architecture Specialization
Synthesis Code Generation
Simlation
Figure 2.1: ASIP Synthesis Methodology
8
Units(AFUs) as well as the selection of AFUs. An application written
in a high level language is analysed statically and dynamically using the
test data and the analysed information is stored in some suitable inter-
mediate format, which is used in the subsequent steps.
2. Identification of Course Grained FU’s : In this step all the pos-

sible AFU’s are searched in the basic block depending on the design
constraints like inputs/ouputs to the AFU, area of the AFU. Here some
search control techniques also called pruning techinques are used depend-
ing on the architectural constriants to limit the search space.
3. Estimation of AFU’s : In this step all the identified AFUs are esti-
mated depending on the base architecture. Estimation further removes
the inferior AFUs, which decreses the search space for the next step
selection.
The following techniques are generally used for estimation individualy or

a combination at different stages of the ASIP Synthesis methodology.
(a) Objective Function based : Since the general techniques for es-
timation are very costly to be used at early stages of ASIP synthesis
due to large search space we need an heuristic estimation methodol-
ogy to decrease the estimation time at early stages. Depending on
the architecture under consideration a heuristic objective function
is formed which indeed gives the estimation of gain when the AFU
is implemented in the hardware. Wrong objective function leads to
poor estimation and poor selection of AFUs, which intern decreases
the performance.
(b) Scheduler based : In this type of approach the problem is formu-

lated as a resource constrained scheduling problem with the selected
9
AFUs and the FUs available on the base architecture and the ap-
plication is scheduled to generate an estimate of the cycle count.
(c) Simulation based: In this approach a simulation model of the

architecture based on the selected features is generated and the
application is simulated on this model to compute the performance
4. Selection of AFU’s : In this a few AFUs which are superior among the
AFUs, which passed trough the Estimation step are selected depending
on some criteria.
5. Customization of Machine Architecture : In this step the machine

architecture is specialized to include the selected AFU’s which gives bet-
ter performance for the appliation, or a set of applications.
6. Instruction Set Specialization : This step involves specifying the

behaviour of the selection AFUs. This helps the retargetable compiler
to identify the patterns for the new AFUs and assigns them to the new
AFUs.
7. Code Generation : This step replaces the set of operations with the
new intructions, which can use the new AFUs now available on the ma-
chine architecture.
8. Simulation : In this step the applications are simulated with the new
AFUs avaliable in the architecture, to evaluate the gain of using the
AFUs in the base architecture.
10
2.2 Framework - Trimaran Compiler Infras-
tructure
ASIP synthesis is dependent on the availability of Retargetable Compiler
and Configurable Simulator. Hence, the Trimaran Framwork which has
these tools inherently built in, is used as a natural choice to facilitate the
design methodology.
Trimaran [1] is a compiler infrastructure for supporting state of art research
in compiling for Instruction Level Parallel(ILP) architectures. Trimaran com-
piler infrastructure is comprised of the following components as shown in the
Fig. 2.2
• A machine description facility, MDES, for describing ILP architectures.
• A parameterized ILP Architecture called HPL-PD.
• A compiler front-end, called IMPACT, for C. It performs parsing, type

checking, and a large suite of high-level(i.e., machine independent) opti-
mizations.
• A compiler back end, called Elcor, parameterized by a machine descrip-

tion, performing instruction scheduling, register allocation, and machine-
dependent optimizations. Each stage of the back-end may easily be re-
placed or modified by a compiler researcher.
• An extensible IR (intermediate program representation) which has both

an internal and textual representation, with conversion routines between
the two. The textual language is called Rebel. This IR supports mod-
ern compiler techniques by representing control flow, data and control
dependence, and many other attributes. It is easy to use in its internal
representation and its textual representation.
11
IMPACT
* C Parsing, Renaming, Flattening
Application Code & * Control Flow Profiling, Function Inlining Bridge Code
Test Data * Machine independent optimizations
* Block Formation
* ILP Transformations
* Machine dependent
Code Optimizations
* Code Scheduling
* Register Allocation
Elcor IR
SIMULATOR GENERATOR
Execution SIMULATOR * Elcor IR to low level C files

Statistics
* HPL−PD virtual machine
* Cache Simulation
HMDES
Figure 2.2: Components of Trimaran Compiler Infrastructure
12
• A cycle-level simulator of the HPL-PD architecture which is configurable
by a machine description and provides run-time information on execution
time, branch frequencies, and resource utilization. This information can
be used for profile-driven optimizations as well as to provide validation
of new optimizations.
2.2.1 MDES
The machine description model (MDES) in TRIMARAN allows the user to

develop a machine description for the HPL-PD family of processors in a high-
level language, which is then translated into a low-level representation for
effiecient use by the compiler. The high- level language allows the specification
of detailed execution constraints in an easy-to- understand, maintainable, and
retargetable manner. The low-level representation is designed to allow the
compiler to check execution constraints with high space/time efficiency.
The MDES [12] infrastrcture in Trimaran is shown in Fig. 2.3. The target
(HPL-PD processor) is described in a general relational database description
language called MD language. Though the MD language allows for a variety
of representations, the HPL-PD machine descriptions are restricted to follow a
particular format called HMDES (High-level Machine Description) version 2.
After the macro processing and compilation of the high-level descripiton, the
corresponding low-level specification, expressed in LMDES version 2, is loaded
within the compiler using the customize module that reads the textual LMDES
specification and builds the internal data-structures of the MDES database.
The information contained within the machine description database is made
available to various modules of the Trimaran infrastructure through a query
interface called mQS.
13
Compiler
hmdes2 Hmdes2pp Lmdes
scheduler
register
mQS allocation
mQS
Optimizer & RU Map
Pre−processor Translator interface
Mdes
DB Simulator
Customize
mGen
Figure 2.3: MDES in Trimaran
14
2.2.2 HPL-PD Architecture
HPL-PD [11]is a parameterized ILP architecture from HP labs. It servers as

a vahicle for processor architecture and compiler optimization research. It
admits both VLIW and superscalar implementations. The HPL-PD parame-
ter space includes number and types of fuctional units, number and type of
registers, width of the instruction word and instruction latencies.
HPL-PD has a number of intersting architectural features, which includes
• Support for speculative execution
– data speculation
– control speculation
• Predicated Execution
• Memory System
– Compiler-visible cache hierarchy
– Serial behavior of parallel reads/writes
• Brach architecture
• Efficient boolean reduction support
• Software loop pipelining support
2.2.3 IMPACT
The IMPACT forms the front end compiler of the Trimaran Infrastructre.
It is a generalized C compiler that can generate optimized code for various
architectures and machine resource configurations. It also implements new
research optimizations so that their effect on code can be analyzed.
15
IMPACT is divided into three distinct parts based on the intermediate code
representation used. The highest level is called PCode and is a parallel C code
representation with intact loop constructs. Memory system optimizations,
loop- level transformations and memory dependence analysis takes place at
this level.
The next lower level is called Hcode and is a simplified C representation
with only simple if-then-else and goto control flow constructs. Statement-
level profiling, profile-guided code layout and function inline expansion are
performed at this level.
The final and lowest code representation is called Lcode. Lcode is a gener-
alized register transfer language. It is similar to most RISC instruction-based
assembly languages. At the Lcode level, classic machine independent optimiza-
tions are performed. Lcode is an instruction set for a load/store architecture
that supports unlimited virtual registers. It is broken down into data and
function blocks. The functions are composed of control blocks containing the
Lcode instrucions.
2.2.4 ELCOR
Elcor forms the back end compiler of the Trimaran Framework. It has sev-
eral components, and each of this component reads, analyzes and tranforms
the Rebel. The researcher can remove some of the components which are not
required with out effecting the compilation process. It is possible to add plug-
gable modules which reads the intemediate data structures and can do some
transformations on the Rebel.
Elcor has the following main components
• Elementary data structures
• Intermediate Representation data structures
16
• I/O modules
• Mdes interface
• Analysis modules
• Transformation and optimization modules
• Scheduling modules
• Rotating register allocator
• Static register allocator
2.2.5 Simulator
The simulator provided by the Trimaran Framework converts the Elcor gener-
ated IR into an executable code and emulates its execution on a virtual HPL-
PD processor. The simulator is capable of gathering the execution statistics
like the number of cycles taken for the exection, average number of operations
executed per cycle etc, and also emits the execution trace.
2.3 Design space exploration for single issue

machines
ASIP design space exploration involves the following three steps
• Identification of Course Grained FU’s
• Evaluation of FU’s
• Selection of FU’s
All the steps specified for ASIP design space exploration are explored for
single issue machines in [2],[4],[7].
17
2.3.1 Identification of Coarse Grained FU’s
Idenfication step searches the whole design space for all possible Coarse Grained
FU’s given the design constraints like number of maximum input/output ports,
area, the type of operations that can be part of FU’s. Unfortunately the de-
sign space for ASIPs is exponential on the number of nodes present in the
design space. In order to limit the search space, some heuristics are needed to
purne the search space. This pruning should discard only the inferior FUs. In
the last decade a number of Identification algorithms with some good pruning
techinques are developed to decrease the search space to subexponential.
Sec. 2.3.1 gives the problem statement for identification of MIMOs. Sec 2.4
gives the optimal identification algorithm, based on the publication [2].
Problem Statement
We call G(V, E) the DAGs representing the dataflow of each basic block; the
nodes V represent primitive operations and the edges E represent data depen-
dencies. Each graph G is associated to a graph G+ (V ∪ V + , E ∪ E + ) which
contains additional nodes V + and edges E + . The additional nodes V + rep-
resent input and ouput varibles of the basic block. The additional edges E +
connect nodes V + to V , and nodes V to V + .
A cut S of G : S ⊆ G. There are 2|V | possible cuts, where |V | is the number
of nodes in G. An arbitrary function M(S) measures the merit of a cut S. It
is the objective function of the optimisation problem introduced below and
typically represents an estimation of the speedup achievable by implementing
S as a special instruction.
We call IN(S) the number of predecessor nodes of those edges which enter
the cut S from the rest of the graph G+ . They represent the number of
input values used by the operations in S. Similarly, OUT is the number of
predecessor nodes in S of edges exiting the cut S. They represent the number
18
of values produced by S and used by other operations, either in G or in other
basic blocks.
Finally, we call the cut S convex if there exists no path from a node u ∈ S
to another node v ∈ S which involves a node w ∈
/ S.
Considering each basic block independently, the identification problem can
now be formally stated as follows.
Problem 1 Given a graph G+ , find the cut S which maximises M(S) under
the following constraints:
1. IN(S) ≤ Nin
2. OUT(S) ≤ Nout
3. S is convex
The user-defined values Nin and Nout indicate the register-file read and
write ports, respectively, which can be used by the special instruction. The
convexity constraint ensures the feasible schedule for single issue machines like
RISCs. Unfortunately it is not true for multiple issue machines. Some non-
convex cuts have feasible schedule in multiple issue machines. The convexity is
one of the best legility check, which prunes the search space to subexponential.
identification () {
for (i = 0; i < NODES; i++) cut [i] = 0;
topological sort() ;
search(1,0) ;
search(0,0) ; }
search ( current choice, current index ) {
cut[current index] = current choice;
if (current choice == 1) {
if (!output port check ()) return;
if (!convexity check ()) return;
if (!input port check ()) {
calculate speedup ();
update best solution (); } }
if ((current choice + 1) == NODES) return;
current index = current index + 1;
search (1,current index);
search (0,current index); }
Figure 2.4: General MIMO Identification Algortihm.
19
The algorithm shown in the Fig 2.4 is a general identification algorithm. It
is a naive algorithm which searches all the possible cuts. But the pruning based
on the Output port check and convexity check decreases the search space to
subexponential. Since the nodes are considered in the topological sorting order
the number of output ports doesnot decrease by including the higher number
nodes in the topological sort. So, we can prune all these cuts by not searching
the branch. Similarly, once the convexity constraints fails, it is impossible to
regain convexity by including the higher number nodes in the topological sort.
These two pruning techniques decreases the search space to sub exponential.
2.3.2 Objective Function based Estimation
Let λs w be the latency of the operation when executed on GPP and λhw be
the latnecy when the operation executed on the dedicated hardware. Let γ be
the frequency of the region i.e., number of times the region is executed.
P
Total number of cycles taken to execute the nodes in the cut is λsw
The time taken to execute the cut when implemented a dedicate hardware
is the hardware critical path of cut. Let CPcut .
The gain by implementing the cut using the dedicated hardware funtional
P
unit is γ ∗ ( λsw − dCPcut e).
Therefore the objective is
P
Maximize γ ∗ ( λsw − dCPcut e)
Since this heuristic correctly estimates each cut for single issue machines,
we can say that it is the optimal estimation for single issue machines.
2.4 Static List Scheduling Algorithm

For multiple issue machines like VLIW processors we need best scheduling al-
gorithm to schedule the nodes in the region. Since it is a resourse constraint
scheduling this is an NP-hard problem. A number of heuristics exits in the
20
literature for resourse constraint scheduling for different situations. In this
present work we have used the following Static List based Scheduling Algo-
rithm.
The Static list scheduling algorithm [9]also termed as as-late-as-possible(ALAP )
heuristic is based on a combination of the as-soon-as-poosible (ASAP ) schedul-
ing and the list-scheduling heuristic using ALAP and ASAP values of an oper-
ation as its priority function. Given a performance constraint, the ALAP (ASAP )
value of an operation is the latest (earliest) time-step in which the operation
can start execution without violating a performance constraint. The use of
ALAP and ASAP values in high-level synthesis sheduling has been limited to
deriving other priority functions such as ”mobility”, ”freedom” and ”force”.
The ALAP heuristic accepts a data flow graph, delay associated with each
operation in the data flow graph, a clock cycle, and a set of resources. The
output of the heuristic is a time-step partition of the data flow graph with
the mapping of the start of the execution of an operation to a time-step. The
ALAP heuristic schedules the graph minimizing overall delay subject to a
fixed resource count. The ALAP value computation requires a performance
constraint, but since the ALAP value is used for ordering the nodes only,
the absolute ALAP values are not important and any performance constraint
value (even less than the critical path value) would suffice. By this argument
the designer need not specify the performance constraint, and the program
can just as well use the critical path delay, which is done. The nodes of the
data flow graph are assigned the ALAP value by traversing the graph in a
bottom-up fashion from sink to source.
The worst-case computational complexity of the ALAP heuristic is
O(nlog(n) + Elog(E) + pE, where n is the number of nodes (operations) in
the data flow graph, E is the number of edges (excluding edges with root as
source and outport as sink) denoting precedence constraints in the data flow
graph, and p is the maximum number of resources of any operation type.
21
1. Read the data flow graph, number of modules of each type, performance
constraint and clock cycle.
2. Find the ALAP value for each node. In case of a large clock cycle
value, two or more operations connected by a precedence relation may be
assigned identical ALAP values indicating potential for chaining these
operations, Chaining is possible if the combined delays of the chained
operations does not exceed clock cycle value and is subject to resource
availability.
3. Sort nodes in ascending order of ALAP value. All the nodes with iden-
tical ALAP values are sorted in increasing order of their ASAP value.
Since operations with precedence relationship may have identical ALAP
values (in case of possible chaining), sorting according to precedence re-
lationship is also done. Place the sorted result in list L .
4. While (L not empty)
(a) Let nextnode be the next node in list L .

(b) Assign node n to start execution in hte earliest time-step ensuring
that its precedence relationship is satisfied and a module of correct
type is available. Chain the node if possible.
(c) Delete node n form list L
Figure 2.5: The static list scheduling algorithm
22
Chapter 3
Design Space Exploration for

VLIW processors
In this chapter we show the design space exploration techniques used for single
issues machines may not be optimal for multiple issue machines and proposes
the best exploration methedologies for VLIW processors. [8], [3] started the
design space exploration in multiple issue machines.
3.1 Identification
The general identification algorithm used for single issue machines shown in
subsection 2.3.1 can be used for VLIW processors, if the pruning techniques
used for single issues machines are also valid for VLIW machines.
In the pruning techniques used for single issue machines Output port con-
straint is also true for multiple issue machines. But, the convexity principle
Operation H/W latency S/W latency

∗ 2.70 3
+, − 0.33 1
>>, << 0.10 1
Table 3.1: Hardware and sofware latencies of the operations in Fig. 3.1
23
Cycle− 0 >>
Cycle−1
Cycle− 2 * +
Cycle− 3
Cycle− 4 +
Figure 3.1: Pseudo convex cut
24
may not be true for multiple issue machines. Fig. 3.1 shows a non convex cut
for which scheduling exits.
Table. 3.1 shows the hardware and software values for the operations used
in the Fig. 3.1. COLUMN 1 has operations, column 2 have the hardware syn-
thesized hardware values, and column 3 represents the corresponing software
cycles to execute the operations on the GPP.
Since the single issue machines has only one functional unit active at a time
the cut shown in the Fig. 3.1 cannot be scheduled. But, in VLIW processor we
have multiple functional units active at a time, the cut is scheduled to a new
AFU and “+” operation can be scheduled to one of the available functional
units in the processor. After cycle-0 the AFU allocated for the cut gives an
output and takes an input in the cycle-2. Meanwhile the “+” operation is
executed by the functional unit and gives the required input for the AFU.
Therefore, there exits non convex cuts for which scheduling exists for mul-
tiple issue machines. Lets call them as pseudo convex cuts. But, these pseudo
convex cuts are subsets of some convex cut and number of outputs for the
pseudo convex cut are greater than or equal to the number of outputs of
the convex cut.
As shown in the Fig. 3.1 the “+” operation which is not in the cut can be
included with out decreasing the performance and increasing the number of
inputs and outputs.
Since our main objective is to maximize the performance, including the
nodes which are not in the pseudo convex cut in the hardware does not increase
the output ports and increases the performance. So we can use the same
identification methodology used in Sec. 2.3.1 for multiple issues machines with
out decrease in performance.
25
3.2 Estimation
In this section we give a motivate example which shows that the objective
function used for single issue machine is not good estimation heuristic for
multiple issue machines. Next, we propose an estimation heuristic for multiple
issue machines, which involves forming an objective function for multiple issue
machines under some assumptions. Next, we will give an estimation based on
resourse scheduling.
3.2.1 Motivation Example
Fig. 3.2 shows a motivation example where the heuristic used for the single
issue machines cannot estimate correctly the performance of the cut for mul-
tiple issue machines. For multiple issue machines it is the critical path that
matters then the performance of the individual cut.
Assume that we have a constraint of two inputs and one output. We have
to select the best cut of the DAG shown in the Fig. 3.2 satisfying the above
constraints. we do not count Const inputs as inputs to the cut, Since we
can hardwire there Const in the hardware. The heuristic used for single issue
machines selects the left connected component of the DAG, Because the gain
by implementing it using heuristic is 2 cycles.
For multiple issue machines it is the critical path that matters than the
individual performance gain of the cut. The above cut which was selected
using heuristic of single issue machines does not decrease the critical path
length. The best cut is the connected component on the right side of the DAG
shown in the figure.
Therefore we need a differnt method for the estimation of the cut. The
next two subsections gives two different methods for estimation of the cut for
multiple issue machines.
26
Const x y Const w z
<< +
_ *
Const
>>
Figure 3.2: Motivation Example
27
3.2.2 Objective Function
The main objective of this project is to form an better objective function for
estimating the cut for multiple issue machines. For multiple issue machines it
is the critical path that matters than the local speed up of the cut. So, we need
to estimate the critical path length of the region when this cut is implemented
as a new instrution.
Multiple issue machines are meant to improve performance using the in-
struction level parallelism(ILP) available in the applications. Since ILP is
available is the VLIW processor, we assumed infinite parallel processors i.e.,
there is no resource constraint in the VLIW processors and incorporting extra
functional units to the processor will not improve performance. The results
shows that this assumption is resonable assumption and gives results almost
same or near to the results of scheduling method. The down side of this as-
sumption is that it fails to estimate correctly when the VLIW processor does
not have sufficient parallelism.
Fig. 3.3 explains the parameters to form the objective function available
for multiple issue machines. We will try to find the critical path length of the
region after introducing the Special AFU for the cut under the assumption of
infinite parallel available in the VLIW processors.
The notations used in the Fig. 3.3 are as follows
• Cpre : estimated number of cycles before the execution of AFU
• Cpost : estimated number of cycles required after the execution of AFU
• CPhw : hardware latency of the cut [5]
• CS 0 : The length of the critical path of the subgraph G − S
Lets assume that the cut includes the nodes in the critical path of the
region. Since, some nodes of the region are now implemented in hardware
28
Cycle − 0
Cpre Cycle − 1
Cycle − 2
CPhw Cycle − 3
Cs’
Cycle − 4
Cycle − 5
Cpost Cycle − 6
Cycle − 7
Figure 3.3: Motivation Example 2
29
functional units the length of the critical path changes and it may no longer
be the critical path of the region. So, we need to find out the length of the
critical path after for each cut. CS 0 keeps track of the estimated critical path
length for each cut dynamically. This varible can be undated intermentally
while identification path with appropriate data structures. We can devide the
nodes into O(n) individual components where n is the number of nodes in the
region, for which CS 0 remains constant. So we need to update CS 0 atmost O(n)
times. Therefore the total time complexity to change the varible in varible CS 0
is atmost O(n) * time complexity to find CS 0 .All the remaining variables can
be updated in constant time at each node of the identification search tree.
Let CG reprents the number of scheduled cylces to execute the region with
out any AFU. The estimated number of cylces to execute the region with
special AFU for the cut is shown in the eq. 3.1
M ax({Cpre + CPhw + Cpost }, CS 0 ) (3.1)
The number of cycles gained by implementing a special AFU for the cut is
shown in the equation eq. 3.2
CG − M ax({Cpre + CPhw + Cpost }, CS 0 ) (3.2)
Since our objective is to maximize the gain obtained by implementing spe-

cial AFU for the cut, the objective function is as shown in the eq. 3.3
M aximize(CG − M ax({Cpre + CPhw + Cpost }, CS 0 )) (3.3)
3.2.3 Scheduling based Estimation
This method we can call it as naive method of estimation for multiple issue
machines like VLIW processors. Let CG be the number of scheduled cycles with
out any special AFU and Ccut be the number of scheduled cycles assuming the
AFU is implemented by a new instruction.
30
Therefore the gain by using a special AFU for the region is
M aximize(CG − Ccut ) (3.4)
3.3 Selection
Let γi be the frequency of the region and Gaini be the gain of region i by
selecting the best AFU for each region. The total number of cycles gained for
each region is γi ∗ Gaini . Sort these in descending order and select the top k
AFU’s which gives the maximum benifit.
31
Chapter 4
Implementation
In this chapter we find discuss about the earlier implementation [3] and then
we give the limitations of the earlier work. Next section gives the enhanced
implementation which eliminates the most of the implementations discussed
in the earlier work.
4.1 Earlier Framework

Fig. 4.1 shows the earlier implementation of VLIW ASIP design space explo-
ration. The components of the framework which are shown in the pink color
are modified in the earlier framework to incoperate the modules of the design
space exploration.
Identification, estimation and selection modules are written in the Mach-
SUIF compiler infrastructure environment. Since, MachSUIF environment is
mainly developed for research in single issue machines like RISC and CISC ma-
chines it does not have the support for multiple issue machines. For evaluation
we need an compiler infrastructure which gives cycle level estimation for VLIW
processors. Trimaran compiler infrastructure is mainly developed for research
in multiple issue machines. Therefore they used Trimaran for evaluation.
Since, all the explorations are designed in MachSUIF we need an interface
between MachSUIF and Trimaran. After selection all the AFU’s are replaced
32
Application code &
Design Constraints
Identification IMPACT
&
MachSUIF
Estimation Hcode Bridge Code

Instrumented Pcode IS Extension Lcode with new instruction
C Program
Selection
Modified
ELCOR
Recognise
New Instrcution
SIMULATOR GENERATOR
Elcor IR
Exececution SIMULATOR with semantics of new operations
Statistics
HM DES
33 Framework
Figure 4.1: Earlier
by function calls as cited in [3]. These functional call are identified in the
PCODE phase of IMPACT component of Trimaran. The identified special
functional calls are replaced by new instructions by appropriately modifying
the intermediate data structure of Trimaran to identify the new instructions.
Afther replacement of the functional calls by new instructions the source
code of the ELCOR component are modified to make it to identify the new
instructions. The source code of the SIMU component are modified to identify
the new behaviour of the new instructions.
Execution statistics are compared with out adding any special AFU’s and
with special AFU’s. Basically, Trimaran is used for evaluation purpose and
not for the design space exploration.
4.1.1 Limitations
• The main limitation of the earlier framework is the estimation method

used for estimating the AFU’s. Basically estimation methodology used
is based on the single issue machines. This methodology over estimates
some of the AFU’s and under estimates the better AFU’s which may
lead to inferior selection of AFU’s.
• Another limitation is use of two compiler infrastructures. Trimaran have

lot of in built architecture dependent and architecture independent op-
timisations available. Since, the implementation of design space explo-
ration steps in MachSUIF we could not use the optimisations available
in the Trimaran compiler infrastructure.
• The ISA available in MachSUIF is alpha machine which is RISC based

single issue machine. Therefore the exploration was done in the ISA
which is also not a multiple isssue machine.
• In order to make the Trimaran to identify the new instruction earlier
34
framework modified the source code of the framework, which requires
the compilation of whole framework for each benchmark.
• Lot of perl scripts are written to modify the source code to make the
Trimaran to identify the new instruction.
4.2 Enhanced Framework

Fig.4.2 shows the enhanced implementation, which eliminate the limitations
of the earlier framework. The design space exploration modules are now im-
plemented in elcor phase of trimaran to take the advantages of architecture
dependent and architecture independent optimisations avaible in the ELCOR
phase of Trimaran. The phase is shown in tbe light blue specifies the design
space exploration modules implemented in the ELCOR phase. The modules
which are shown in the pink are implemented and not incorporated due to the
limitations in the Trimaran. Trimaran supports only four 4 inputs or 4 out-
puts for each AFU. Due to this limitation these modules are not incorporated.
But, these can be incorporated if the Trimaran infrastructure is modified for
as many inputs or outputs.
ASIP design space exploration methodology specific to VLIW processors is
shown in the Fig. 4.2. Heuristic-1 is the estimation methodology used in the
earlier framework. This estimation methodology is used only for comparison
purpose. Heurisitc-2 is the new estimation method proposed in the earlier
chapter for multiple issue machines. The scheduling method for estimation is
also implemented.
We implemented one scheduling algorithm which assumes atmost one spe-
cial AFU and schedules the instructions for VLIW using the static scheduling
algorithm which was given in sec. 2.4.
Identification and estimation are combined and searches the whole design
space and finds best AFU for each region using the three estimation methods.
35
Application IMPACT Bridge Code
Code
ELCOR−I
VLIW ASIP
Exploration
AFU’s data
Extended Extended Processor

Behaviour HM DES Extension
ELCOR−II
SIMULATOR
Results & SIMULATOR Elcor IR
GENERATOR
Statistics
36
Figure 4.2: Enhanced Framework
Identification &Estimation
(Heuristic 1,2 &Scheduling )
Select top K AFUs
Scheduler
( At most one AFU for region )
Statistics
Figure 4.3: VLIW ASIP exploration
37
After that top k AFU’s are selected for three estimation methods. These AFU’s
are again passed through the static scheduling module to correctly estimate
the gain of k top AFU’s.
Three estimatation methodologies are compared. In all the cases the Heuristic-
2 estimation methodoly gives as good as scheduling method. But, one can use
the incremental approach for static scheduling method which can be imple-
mented in O(1)amortized cost.
Heuristic-1 estimates better than Heuristic-2 where the base architecture
doesnot have sufficient parallelism. But, according to the result for all generic
architecture Heuristic-2 gives better results than Heuristic-1 with the same
time complexity.
38
Chapter 5
Results
We have done case study on several applications from media benchmarks, as

well as several applications available in the trimaran benchmarks directory. All
the applications are passed through Trimaran and searched the design space
of the applications for the best top k AFU’s using the same identification
algorithm, but the estimation methods are different. We have shown that
multiple issue machines needs different estimation methods and we also shown
that the estimation objective function(heuristic-2) proposed for the multiple
issue machines good results than the earlier objective function based (heuristic-
1) estimation method. These two estimation methods are again compared with
the scheduled method.
For all the applications we have started with some basic architecture with
number of integer FU’s, float FU’s , memory units, branch units, input ports
and output ports available. For simplicity we assumed two inputs for each
functional units and one output port for each functional unit.
The basic question with multiple issue machines is whether providing infi-
nite resources to the processor gives better performance than ASIP’s. The
present work answers this question by increasing the functional units and
adding AFUs for those architectures. For simpilicity we have assumed that
AFU’s donot add any extra input and output ports. i.e., we assumed that
these new AFU’s shares the availble input and output ports available in the
39
processor.
The architectures are specified in the format
#Int FU’s/#Float FU’s/#Branch Units/#Memory Units/#input ports/#output
ports
5.1 rawdaudio application

Table. 5.1 shows the results of the media benchmark application rawcaudo. In
this experiment we have increased the number of FU’s and AFU’s. Increasing
the architecture to 4/4/4/4/32/16 gives a perfomance of approximately 1.3
times the base architecture.
By adding 8 AFUs to the architecture 1/1/2/2/12/6 we improved the per-
formace to 1.817. The rawcaudio is an application which is fully computational
based and requires more number of input ports to find out the best AFUs. In
this case increasing the functional units increasing the number of ports but
not explores the parallelism available in the application.
With sufficient number of ports, heuristic-2 and scheduling methods gives
the same performance i.e., they are able to estimate correctly than heuristic-1.
5.2 fir application

Table. 5.2 shows the results for fir application. For this application heuristic-1
and heuristic-2 estimates the AFUs same way, but they are not able to estimate
as correct as scheduling method. In any case all the three methods gives better
performance than just increasing the FUs of the base processor.
40
Heuristic-1(afu’s) Heuristic-2(afu’s) Sched(afu’s)
architecture 0 2 4 8 2 4 8 2 4 8
1/1/1/1/8/4 1 1.202 1.301 1.461 1.173 1.284 1.439 1.202 1.319 1.484
1/1/2/2/12/6 1.279 1.477 1.629 1.718 1.525 1.629 1.75 1.525 1.687 1.817
1/1/4/4/20/10 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
1/2/1/1/10/5 1.218 1.397 1.533 1.658 1.397 1.485 1.629 1.397 1.533 1.687
1/2/2/2/14/7 1.279 1.477 1.629 1.718 1.525 1.629 1.75 1.525 1.687 1.817
1/2/4/4/22/11 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
1/4/1/1/14/7 1.218 1.397 1.533 1.658 1.397 1.485 1.629 1.397 1.533 1.687
1/4/2/2/18/9 1.279 1.478 1.63 1.718 1.525 1.63 1.75 1.525 1.687 1.817
1/4/4/4/26/13 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
2/1/1/1/10/5 1.08 1.218 1.32 1.485 1.218 1.284 1.44 1.218 1.32 1.485
2/1/2/2/14/7 1.279 1.477 1.629 1.718 1.525 1.629 1.75 1.525 1.687 1.817
2/1/4/4/22/11 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
2/2/1/1/12/6 1.218 1.397 1.533 1.658 1.397 1.485 1.629 1.397 1.533 1.687
2/2/2/2/16/8 1.279 1.477 1.629 1.718 1.525 1.629 1.75 1.525 1.687 1.817
2/2/4/4/24/12 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
2/4/1/1/16/8 1.218 1.397 1.533 1.658 1.397 1.485 1.629 1.397 1.533 1.687
2/4/2/2/20/10 1.279 1.478 1.63 1.718 1.525 1.63 1.75 1.525 1.687 1.817
2/4/4/4/28/14 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
4/1/1/1/14/7 1.08 1.218 1.32 1.485 1.218 1.284 1.44 1.218 1.32 1.485
4/1/2/2/18/9 1.279 1.477 1.63 1.718 1.525 1.63 1.75 1.525 1.687 1.817
4/1/4/4/26/13 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
4/2/1/1/16/8 1.218 1.397 1.533 1.658 1.397 1.485 1.629 1.397 1.533 1.687
4/2/2/2/20/10 1.279 1.477 1.63 1.718 1.525 1.63 1.75 1.525 1.687 1.817
4/2/4/4/28/14 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
4/4/1/1/20/10 1.218 1.397 1.533 1.658 1.397 1.485 1.629 1.397 1.533 1.687
4/4/2/2/24/12 1.279 1.478 1.63 1.718 1.525 1.63 1.75 1.525 1.687 1.817
4/4/4/4/32/16 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
Table 5.1: rawcaudio benchmark results
41
architecture 0 2 4 8 2 4 8 2 4 8
1/1/1/1/8/4 1 1.139 1.223 1.314 1.139 1.223 1.314 1.139 1.223 1.34
1/1/2/2/12/6 1.219 1.372 1.452 1.495 1.372 1.452 1.495 1.372 1.452 1.502
1/1/4/4/20/10 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
1/2/1/1/10/5 1.123 1.252 1.318 1.354 1.252 1.318 1.354 1.252 1.318 1.359
1/2/2/2/14/7 1.219 1.372 1.452 1.495 1.372 1.452 1.495 1.372 1.452 1.502
1/2/4/4/22/11 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
1/4/1/1/14/7 1.198 1.345 1.376 1.386 1.345 1.376 1.386 1.345 1.379 1.392
1/4/2/2/18/9 1.284 1.455 1.491 1.503 1.455 1.491 1.503 1.455 1.497 1.512
1/4/4/4/26/13 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
2/1/1/1/10/5 1.018 1.162 1.25 1.345 1.162 1.25 1.345 1.162 1.25 1.347
2/1/2/2/14/7 1.221 1.374 1.455 1.498 1.374 1.455 1.498 1.374 1.455 1.505
2/1/4/4/22/11 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
2/2/1/1/12/6 1.125 1.253 1.32 1.356 1.253 1.32 1.356 1.253 1.32 1.361
2/2/2/2/16/8 1.221 1.374 1.455 1.498 1.374 1.455 1.498 1.374 1.455 1.505
2/2/4/4/24/12 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
2/4/1/1/16/8 1.198 1.345 1.376 1.386 1.345 1.376 1.386 1.345 1.379 1.392
2/4/2/2/20/10 1.284 1.455 1.491 1.503 1.455 1.491 1.503 1.455 1.497 1.512
2/4/4/4/28/14 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
4/1/1/1/14/7 1.034 1.183 1.274 1.373 1.183 1.274 1.373 1.183 1.274 1.376
4/1/2/2/18/9 1.221 1.375 1.455 1.499 1.375 1.455 1.499 1.375 1.455 1.506
4/1/4/4/26/13 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
4/2/1/1/16/8 1.143 1.276 1.346 1.383 1.276 1.346 1.383 1.276 1.346 1.388
4/2/2/2/20/10 1.221 1.375 1.455 1.499 1.375 1.455 1.499 1.375 1.455 1.506
4/2/4/4/28/14 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
4/4/1/1/20/10 1.198 1.345 1.354 1.362 1.345 1.376 1.387 1.345 1.379 1.392
4/4/2/2/24/12 1.284 1.455 1.491 1.503 1.455 1.491 1.503 1.455 1.497 1.512
4/4/4/4/32/16 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
Table 5.2: fir benchmark results
42
5.3 mmdyn application
Table. 5.3 shows the results for the application mmdyn. mmdyn is one of
the benchmark available in the Trimaran benchmark directory. It is the better
benchmark to explain the benfit of using the heuristic-2 for estimation. By
using heuristic-1 we are able to get a perfomance improment of 1.28 which is
near to the performance improvement acheivable by increasing the FU’s.
Heuristic-2 estimates nearer to the scheduling method. This benchmark
is the good example for the need of different estimation method for multiple
issue machines.
5.4 fact application

Table. 5.4 shows the results for this application. This is a good application to
show the benifit of using the AFU’s than increasing the FU’s. The application
is fully computation based and require special FU’s to get the performance
improvement. We are able to increase performance upto 2X by specail AFUs
which is huge improvement than increasing the FU’s of the processor.
5.5 sqrt application

Table. 5.5 shows the resutls of the application sqrt. This application also is
the best example to show the benfit of using special AFUs. In this case also
using special AFUs increases the performance better than just adding FUs.
43
architecture 0 2 4 8 2 4 8 2 4 8
1/1/1/1/8/4 1 1.034 1.056 1.058 1.17 1.198 1.21 1.179 1.227 1.25
1/1/2/2/12/6 1.209 1.249 1.27 1.272 1.468 1.497 1.514 1.468 1.527 1.545
1/1/4/4/20/10 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
1/2/1/1/10/5 1.035 1.064 1.072 1.074 1.218 1.238 1.25 1.218 1.249 1.261
1/2/2/2/14/7 1.229 1.249 1.27 1.272 1.482 1.512 1.515 1.497 1.527 1.545
1/2/4/4/22/11 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
1/4/1/1/14/7 1.043 1.072 1.081 1.082 1.229 1.25 1.262 1.229 1.26 1.272
1/4/2/2/18/9 1.24 1.26 1.271 1.273 1.497 1.528 1.531 1.497 1.528 1.546
1/4/4/4/26/13 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
2/1/1/1/10/5 1.021 1.05 1.065 1.066 1.2 1.229 1.24 1.2 1.239 1.261
2/1/2/2/14/7 1.23 1.25 1.271 1.273 1.483 1.512 1.515 1.497 1.528 1.546
2/1/4/4/22/11 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
2/2/1/1/12/6 1.035 1.065 1.08 1.081 1.219 1.239 1.251 1.219 1.249 1.261
2/2/2/2/16/8 1.23 1.25 1.271 1.273 1.483 1.512 1.515 1.497 1.528 1.546
2/2/4/4/24/12 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
2/4/1/1/16/8 1.043 1.072 1.081 1.082 1.229 1.25 1.262 1.229 1.26 1.272
2/4/2/2/20/10 1.24 1.26 1.271 1.273 1.497 1.528 1.531 1.497 1.528 1.546
2/4/4/4/28/14 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
4/1/1/1/14/7 1.029 1.057 1.073 1.074 1.21 1.239 1.251 1.21 1.25 1.262
4/1/2/2/18/9 1.23 1.25 1.271 1.273 1.483 1.512 1.515 1.497 1.528 1.546
4/1/4/4/26/13 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
4/2/1/1/16/8 1.036 1.065 1.073 1.074 1.22 1.24 1.251 1.22 1.25 1.262
4/2/2/2/20/10 1.23 1.25 1.271 1.273 1.483 1.512 1.515 1.497 1.528 1.546
4/2/4/4/28/14 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
4/4/1/1/20/10 1.043 1.072 1.081 1.082 1.229 1.25 1.262 1.229 1.26 1.272
4/4/2/2/24/12 1.24 1.26 1.271 1.273 1.497 1.528 1.531 1.497 1.528 1.546
4/4/4/4/32/16 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
Table 5.3: mmdyn benchmark results
44
architecture 0 2 4 8 2 4 8 2 4 8
1/1/1/1/8/4 1 1.625 1.696 1.773 1.625 1.696 1.773 1.625 1.696 1.773
1/1/2/2/12/6 1.04 1.733 1.814 1.857 1.733 1.814 1.902 1.733 1.814 1.902
1/1/4/4/20/10 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
1/2/1/1/10/5 1.04 1.733 1.773 1.814 1.733 1.773 1.814 1.733 1.773 1.814
1/2/2/2/14/7 1.04 1.733 1.814 1.857 1.733 1.814 1.902 1.733 1.814 1.902
1/2/4/4/22/11 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
1/4/1/1/14/7 1.054 1.773 1.814 1.814 1.773 1.814 1.814 1.773 1.814 1.814
1/4/2/2/18/9 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
1/4/4/4/26/13 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
2/1/1/1/10/5 1.013 1.66 1.733 1.814 1.66 1.733 1.814 1.66 1.733 1.814
2/1/2/2/14/7 1.04 1.733 1.814 1.857 1.733 1.814 1.902 1.733 1.814 1.902
2/1/4/4/22/11 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
2/2/1/1/12/6 1.04 1.733 1.773 1.814 1.733 1.773 1.814 1.733 1.773 1.814
2/2/2/2/16/8 1.04 1.733 1.814 1.857 1.733 1.814 1.902 1.733 1.814 1.902
2/2/4/4/24/12 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
2/4/1/1/16/8 1.054 1.773 1.814 1.814 1.773 1.814 1.814 1.773 1.814 1.814
2/4/2/2/20/10 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
2/4/4/4/28/14 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
4/1/1/1/14/7 1.013 1.66 1.733 1.814 1.66 1.733 1.814 1.66 1.733 1.814
4/1/2/2/18/9 1.04 1.733 1.814 1.857 1.733 1.814 1.902 1.733 1.814 1.902
4/1/4/4/26/13 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
4/2/1/1/16/8 1.04 1.733 1.773 1.814 1.733 1.773 1.814 1.733 1.773 1.814
4/2/2/2/20/10 1.04 1.733 1.814 1.857 1.733 1.814 1.902 1.733 1.814 1.902
4/2/4/4/28/14 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
4/4/1/1/20/10 1.054 1.773 1.814 1.814 1.773 1.814 1.814 1.773 1.814 1.814
4/4/2/2/24/12 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
4/4/4/4/32/16 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
Table 5.4: fact benchmark results
45
architecture 0 2 4 8 2 4 8 2 4 8
1/1/1/1/8/4 1 1.182 1.201 1.215 1.251 1.28 1.296 1.251 1.28 1.296
1/1/2/2/12/6 1.082 1.374 1.399 1.402 1.374 1.399 1.402 1.374 1.399 1.402
1/1/4/4/20/10 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
1/2/1/1/10/5 1.025 1.283 1.29 1.29 1.283 1.297 1.299 1.29 1.305 1.307
1/2/2/2/14/7 1.082 1.374 1.399 1.402 1.383 1.409 1.411 1.383 1.409 1.411
1/2/4/4/22/11 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
1/4/1/1/14/7 1.087 1.29 1.305 1.306 1.373 1.39 1.392 1.382 1.399 1.401
1/4/2/2/18/9 1.093 1.383 1.399 1.402 1.392 1.409 1.411 1.392 1.409 1.411
1/4/4/4/26/13 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
2/1/1/1/10/5 1.02 1.268 1.28 1.283 1.356 1.372 1.382 1.356 1.381 1.391
2/1/2/2/14/7 1.088 1.383 1.399 1.402 1.392 1.409 1.411 1.392 1.409 1.411
2/1/4/4/22/11 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
2/2/1/1/12/6 1.082 1.29 1.305 1.306 1.373 1.39 1.392 1.373 1.399 1.401
2/2/2/2/16/8 1.088 1.383 1.399 1.402 1.392 1.409 1.411 1.392 1.409 1.411
2/2/4/4/24/12 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
2/4/1/1/16/8 1.087 1.29 1.305 1.306 1.373 1.39 1.392 1.382 1.399 1.401
2/4/2/2/20/10 1.093 1.383 1.399 1.402 1.392 1.409 1.411 1.392 1.409 1.411
2/4/4/4/28/14 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
4/1/1/1/14/7 1.025 1.283 1.297 1.306 1.365 1.381 1.391 1.365 1.39 1.4
4/1/2/2/18/9 1.093 1.383 1.399 1.402 1.392 1.409 1.411 1.392 1.409 1.411
4/1/4/4/26/13 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
4/2/1/1/16/8 1.082 1.29 1.305 1.306 1.373 1.39 1.392 1.373 1.399 1.401
4/2/2/2/20/10 1.093 1.383 1.399 1.402 1.392 1.409 1.411 1.392 1.409 1.411
4/2/4/4/28/14 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
4/4/1/1/20/10 1.087 1.29 1.305 1.306 1.373 1.39 1.392 1.382 1.399 1.401
4/4/2/2/24/12 1.093 1.383 1.399 1.402 1.392 1.409 1.411 1.392 1.409 1.411
4/4/4/4/32/16 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
Table 5.5: sqrt benchmark results
46
Chapter 6
Conclusion
6.1 Conclusions and contributions

• We moved a step further by moving the design space exploration modules
into the backend of Trimaran compiler infrastructure called Elcor to take
the advantage of architecture dependent optimizations available in the
Trimaran compiler infrastructure.
• We eleminated the need of two compiler infrastructures for ASIP synthe-

sis by moving the Identification, Estimation and Selection modules into
ELCOR backend of Trimaran. This indeed eleminates the need of lot of
scripts and modifying the source code of the framework.
• We moved a step futher by proposing good Identification, Estimation

and Selection methods for VLIW architectures which searches the design
space and selects the best AFU’s by correct estimation. A heuristic
based estimation for multiple issue machines as well as a scheduling based
methods are proposed and implemented.
• We have designed static list based scheduling algorithm which schedules

atmost one AFU per region by assuming a new instruction for the AFU.
47
• We evaluated the design by using the static list best scheduling which
schedules all the three methods and compares the results.
• We have acheived good performance improved compared to the earlier

method by using the enhaced Framework.
• Trimaran is enhaced for design space exploration of VLIW processor,

which considers the generic machine architecture.
The work done is able to demonostrate the automatic design space explo-
ration for VLIW processors. The framework is generic in the sence any of the
modules can be upgrated for future enhancement with out effecting the whole
framework. The Identification algorithm presently support at most one AFU
for the region. But, it can be changed for multiple AFU with out effecthing
the whole framework.
The static list based scheduling algorithm can also be modifed for multiple
AFUs with appropriated data structures. One can use the incremental ap-
porach for the implementation of scheduling algorithm by using appropriate
data structures to make the amortized cost for estimatation to a constant.
Presently scheduling alrorithm is not implemented in incremental approach.
6.2 Future work
6.2.1 Multiple AFUs per region
Presently the framework support at most one AFU per region. The frequency
of execution of some regions is very high and for some regions it is very low.
Instead of selecting AFUs from the low freqently executed region, if we select
multiple AFUs from the high frequently executed regions we can improve the
performance more. Presently all the algorithms and data structures supports
48
at most one AFU per region. One can use appropriate data structures and
algorithms to extend it for multiple AFUs.
6.2.2 Extending Trimaran to remove input/output port

constraint
Presently Trimaran framwork supports atmost 4 inputs and amost 4 outputs

for the functional unit. Since, it limits the design space exploration we have
estimated by using static list based scheduling which dont have any input and
output port constraint. If one want to fully implement the synthesis phases
using Trimaran, there is a need for enhancing the mdes source files to remove
the input and output port constriant.
6.2.3 Control Flow in AFUs
Identification of coarse grained function units with control is another intersting

research. Since, VLIW processors are compiler driven based we need fixed
dealy for AFUs. With control statements the delay of the AFU may vary.
But, if we observe the LOAD instruction may take varible time depending on
the availability of data. If the data is not available then page fault occurs.
This means there is a possibility for variable delay functional units. One need
to enhance the simulator for variable delay functional units.
6.2.4 Multiple Objective Selection
In present framework the selection is purely based on the estimated perfor-

mance gain. One can form objective functions using multiple selection objec-
tives like power consumption, area etc. One need to form objective function
for estimation and selection using multiple objectives.
49
6.2.5 AFUs with memory accessing capability
Trimaran supports different levels of memories, some are very nearer to the
processor and some are very far. One can use the in built memory heirarchy
of trimaran and model AFUs with memory accessing capability.
50
Bibliography
[1] The trimaran compiler infrastructure, http://www.trimaran.org
[2] K.Atasu, L.Pozzi, and P.Ienne. Automatic application-specific instruction-set

extensions under microarchitectural constraints. In Proceedings of the 40th De-
sign Automation Conference, pages 256-61, Anaheim, Calif, June 2003.
[3] Diviya Jain, Anshul Kumar, Laura Pozzi, and Paolo Ienne. Automatically cus-
tomising VLIW architectures with coarse grained application-specific functional
units.In Proceedings of the 8th International Workshop on Software and Com-
pilers for Embedded Systems, Amsterdam, September 2004.
[4] Laura Pozzi, Kubilay Atasu, and Paolo Ienne. Exact and approximate al-
gorithms for the extension of embedded processor instruction sets. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems,
25(7):120929, July 2006.
[5] Paolo Ienne, Laura Pozzi, and Miljan Vuletic. On the limits of processor spe-
cialisation by mapping dataflow sections on ad-hoc functional units. Technical
Report 01/376, Swiss Federal Institute of Technology Lausanne (EPFL), Com-
puter Science Department (DI), Lausanne, December 2001.
[6] Paolo Ienne, Laura Pozzi, and Miljan Vuletic. On the limits of processor spe-
cialisation by mapping dataflow sections on ad-hoc functional units. Technical
Report 01/376, Swiss Federal Institute of Technology Lausanne (EPFL), Com-
puter Science Department (DI), Lausanne, December 2001.
51
[7] N. Clark, H. Zhong, and S. Mahlke. Processor acceleration through automated
instruction set customisation. In Proceedings of the 36th Annual International
Symposium on Microarchitecture, San Diego, Calif., Dec. 2003
[8] Bhuvan Middha, Varun Raj, Anup Gangwar, Anshul Kumar, M. Balakrishnan,
and Paolo Ienne. A Trimaran based framework for exploring the design space
of VLIW ASIPs with coarse grain functional units. In Proceedings of the 15th
International Symposium on System Synthesis, Kyoto, October 2002.
[9] R. Jain, A. Mujumdar, A. Sharma, and H. Wang. Empirical evaluation of some

high-level synthesis scheduling heuristics. In DAC ’91, pages 210–215, 1991.
[10] Manoj Kumar Jain, M.Balakrishnan, Anshul Kumar. Asip design methodolo-
gies: Survey and issues. In Proceedings of the 14th IEEE International Confer-
ence on VLSI Design, January 2001, India.
[11] V. Kathail, M. Schlansker and B. R. Rau. HPL-PD Architecture Specification:

Version 1.1. Technical Report HPL-93-80 (R.1). Hewlett-Packard Laboratories,
February 1994 (revised July 1999).
[12] J. C. Gyllenhaal, W.-m. W. Hwu and B. R. Rau. HMDES Version 2.0 Specifica-
tion. Technical Report IMPACT-96-3. University of Illinois at Urbana- Cham-
paign, 1996.
52

Exploring VLIW ASIP Design Space Using Trimaran Based Framework

Enviado por

Dados do documento

Título original

Direitos autorais

Formatos disponíveis

Compartilhar este documento

Compartilhar ou incorporar documento

Opções de compartilhamento

Você considera este documento útil?

Este conteúdo é inapropriado?

Direitos autorais:

Formatos disponíveis

Exploring VLIW ASIP Design Space Using Trimaran Based Framework

Enviado por

Direitos autorais:

Formatos disponíveis

Exploring VLIW ASIP Design Space using

Trimaran Based Framework

A dissertation submitted in partial fulfillment

Vundela Srinivas Reddy

Under the guidance of

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Prof. Anshul Kumar

I am greatly indebted to my supervisor Prof. Anshul Kumar & Prof. M. Bal-

Vundela Srinivas Reddy

1 Introduction & Motivation 1

3 Design Space Exploration for VLIW processors 23

2.1 ASIP Synthesis Methodology . . . . . . . . . . . . . . . . . . . 8

3.1 Pseudo convex cut . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.1 Earlier Framework . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.1 Hardware and sofware latencies of the operations in Fig. 3.1 . . 23

5.1 rawcaudio benchmark results . . . . . . . . . . . . . . . . . . . . 41

Introduction & Motivation

1.2 VLIW Architectures

1.3 VLIW & ASIPs

1.4 Processor Specialization Techniques

1.5 Objectives of the current work

1.6 Outline of the report

2.1 ASIP Synthesis

• Design starts with the application behavior.

• Evaluate several architectural options.

• Identify hardware functionalities to speed up the application.

• Introduce hardware resources for frequently used operations only if it

1. Application Analysis : Input to the ASIP design process is an ap-

Identification Algorithms Automatic Identification of Course Grained FUs

Estimation Algorithms Evaluation of FU’s

Selection Algorithms Selection of FU’s

Customization of Instruction Set

Synthesis Code Generation

Figure 2.1: ASIP Synthesis Methodology

2. Identification of Course Grained FU’s : In this step all the pos-

The following techniques are generally used for estimation individualy or

(b) Scheduler based : In this type of approach the problem is formu-

(c) Simulation based: In this approach a simulation model of the

5. Customization of Machine Architecture : In this step the machine

6. Instruction Set Specialization : This step involves specifying the

• A machine description facility, MDES, for describing ILP architectures.

• A parameterized ILP Architecture called HPL-PD.

• A compiler front-end, called IMPACT, for C. It performs parsing, type

• A compiler back end, called Elcor, parameterized by a machine descrip-

• An extensible IR (intermediate program representation) which has both

Execution SIMULATOR * Elcor IR to low level C files

Figure 2.2: Components of Trimaran Compiler Infrastructure

The machine description model (MDES) in TRIMARAN allows the user to

Figure 2.3: MDES in Trimaran

HPL-PD [11]is a parameterized ILP architecture from HP labs. It servers as

• Support for speculative execution

– Compiler-visible cache hierarchy

– Serial behavior of parallel reads/writes

• Efficient boolean reduction support

• Software loop pipelining support

• Elementary data structures

• Intermediate Representation data structures

• Transformation and optimization modules

• Rotating register allocator

• Static register allocator

2.3 Design space exploration for single issue

• Identification of Course Grained FU’s

Figure 2.4: General MIMO Identification Algortihm.