MASTER OF TECHNOLOGY
in
COMPUTER SCIENCE & ENGINEERING
Submitted by
This is to certify that the thesis titled Exploring VLIW ASIP Design
Space using Trimaran Framework being submitted by Vundela Srinivasa
Reddy for the award of Master of Technology in Computer Science
& Engineering is a record of bona fide work carried out by him under
my guidance and supervision at the Department of Computer Science &
Engineering, Indian Institute of Technology, Delhi and EPFL, Lausanne,
Switzerland. The work presented in this thesis has not been submitted
elsewhere, either in part or in full, for the award of any other degree or
diploma.
The main aim of the project was to explore the VLIW ASIP design space
to find the optimal benefit of using an ASIP for a VLIW processor through a
fully automated design methodology using the Trimaran framework. Pluggable
modules for automatic identification, evaluation and selection of coarse
grained functional units are implemented in the Elcor phase of Trimaran to take
advantage of the architecture-dependent optimizations available in the early phase
of Elcor. An optimal algorithm and a heuristic methodology for estimating
the coarse grained functional units are proposed. This methodology enabled us to
demonstrate the ASIP design flow model for several benchmarks.
Contents
2 Background 7
2.1 ASIP Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Framework - Trimaran Compiler Infrastructure . . . . . . . . . 11
2.2.1 MDES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 HPL-PD Architecture . . . . . . . . . . . . . . . . . . . 15
2.2.3 IMPACT . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.4 ELCOR . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.5 Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Design space exploration for single issue machines . . . . . . . . 17
2.3.1 Identification of Coarse Grained FU’s . . . . . . . . . . . 18
2.3.2 Objective Function based Estimation . . . . . . . . . . . 20
2.4 Static List Scheduling Algorithm . . . . . . . . . . . . . . . . . 20
3.1 Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Motivation Example . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Objective Function . . . . . . . . . . . . . . . . . . . . . 28
3.2.3 Scheduling based Estimation . . . . . . . . . . . . . . . . 30
3.3 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4 Implementation 32
4.1 Earlier Framework . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Enhanced Framework . . . . . . . . . . . . . . . . . . . . . . . . 35
5 Results 39
5.1 rawdaudio application . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 fir application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3 mmdyn application . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4 fact application . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.5 sqrt application . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6 Conclusion 47
6.1 Conclusions and contributions . . . . . . . . . . . . . . . . . . . 47
6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.2.1 Multiple AFUs per region . . . . . . . . . . . . . . . . . 48
6.2.2 Extending Trimaran to remove input/output port con-
straint . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.2.3 Control Flow in AFUs . . . . . . . . . . . . . . . . . . . 49
6.2.4 Multiple Objective Selection . . . . . . . . . . . . . . . . 49
6.2.5 AFUs with memory accessing capability . . . . . . . . . 50
List of Figures
List of Tables
Chapter 1
Over the last decade, it has been experimentally shown that Application
Specific Instruction set Processors (ASIPs) fall in between the two conventional
implementation methodologies for single issue machines such as Reduced
Instruction Set Computers (RISC) and Complex Instruction Set Computers
(CISC): Application Specific Integrated Circuits (ASICs), which give high
performance and low power, and General Purpose Processors (GPPs), which
give flexibility. This work focuses on different aspects of ASIP synthesis for
multiple issue machines, mainly Very Long Instruction Word (VLIW)
processors.
1.1 ASIPs
For high performance and low power implementation of applications, ASICs
are the best solution. But this solution has larger area and is the most rigid,
time consuming and costly solution to the problem. GPPs are the most
flexible implementation for all applications, but this leads to low performance
and high power consumption. Research has therefore shifted towards the middle
ground, which takes advantage of both ASICs and GPPs, termed
ASIPs. ASIPs give higher performance than GPPs and are more flexible than
ASICs.
In the last decade, research in design methodologies for system-on-chip
has been mainly revolving around the synthesis of ASIPs. This involved the
automatic generation of complete instruction sets for specific applications. In
that context, the goal is typically to design an instruction set which minimises
some important metric (e.g., run time, program memory size, execution unit
count).
More recently, attention has shifted towards extending generic processors
with units specialised for a given domain, rather than designing completely
custom processors. The main goal of this process is to exploit the frequently
executing kernels of the application and map them to customised hardware
units, which finally leads to better performance. ASIPs can be very efficient
when applied to specific applications such as digital signal processing, servo
motor control, image processing etc. Moreover, ASIPs can save processor
area by removing the redundant units not utilized by the application
domain.
as is the case in most contemporary superscalar processors. Since ILP is exploited
in software, the instruction and reorder buffers are no longer needed,
and hence the benefit is less logic, and thus lower cost.
A large window of instructions to be examined to extract ILP
would require a more sophisticated dispatcher and would directly increase the
amount of hardware and chip logic used in superscalar implementations.
Alternatively, VLIW uses a software window and hence is able to exploit
spatial computation more successfully.
Some of the advantages sought by superscalar implementations, including
access to dynamic information and speculative execution, are easily
available to VLIW based architectures too. VLIW can easily mimic speculative
execution using register space. Speculative results are placed in temporary
registers whose values are discarded once the branch is known to be
mis-predicted. Trace driven compilation techniques are able to record the
dynamic program behaviour and facilitate a better exploitation of ILP.
VLIW based architecture moves the complexity from hardware to the com-
piler. Thus the complexity is paid for only once, when the compiler is written
and not every time the chip is fabricated.
compiler, based on a description of the target processor architecture. The
architecture description is formulated in an architecture description language.
A configurable simulator can be configured for a particular architecture (based
on an architecture description) and can generate statistics regarding throughput,
delay, power/energy consumption etc.
and speed, and reduces power consumption.
Cache configuration includes parameters like separate instruction/data caches,
associativity, cache size and line size, which depend very much on the characteristics
of the application and, in particular, on the properties related to locality.
This form of specialisation can have a very large impact on performance and
power consumption.
Interconnect specialization, which affects the interconnection of functional
modules and registers, memory and cache, increases the potential for parallelism.
The current work tries to achieve processor specialisation through Instruction
Set Specialisation and Functional Unit Specialisation. Special instructions,
specific to the application, are synthesised and mapped onto special FUs
designed especially for them. These FUs are so designed as to exploit many
of the possible optimisations, including bit width reduction, hard wiring of
constants, spatial computation, chaining of operations etc.
In order to take advantage of the architecture-dependent optimizations
available in the Trimaran framework and to eliminate the need for the two
compiler infrastructures used in the earlier work, the pluggable modules are
now implemented in the Elcor phase of the Trimaran infrastructure. This also
eliminates the need to modify the source code of the compiler framework,
and the need for the several scripts used in the earlier work.
Chapter 2
Background
Fig. 2.1 shows the main steps, which meet the above requirements, for
compiler based frameworks.
[Figure 2.1: Compiler based ASIP synthesis flow — Application Code & Design Constraints, Application Analysis, …, Simulation]
Units (AFUs) as well as the selection of AFUs. An application written
in a high level language is analysed statically and dynamically using the
test data, and the analysed information is stored in a suitable intermediate
format, which is used in the subsequent steps.
3. Estimation of AFU’s : In this step all the identified AFUs are estimated
depending on the base architecture. Estimation further removes
the inferior AFUs, which decreases the search space for the next step,
selection.
(a) Objective Function based : Since the general techniques for estimation
are very costly to use at the early stages of ASIP synthesis
due to the large search space, we need a heuristic estimation methodology
to decrease the estimation time at these stages. Depending on
the architecture under consideration, a heuristic objective function
is formed which gives an estimate of the gain when the AFU
is implemented in hardware. A wrong objective function leads to
poor estimation and poor selection of AFUs, which in turn decreases
the performance.
AFUs and the FUs available on the base architecture, and the application
is scheduled to generate an estimate of the cycle count.
4. Selection of AFU’s : In this step, a few AFUs which are superior among
the AFUs that passed through the Estimation step are selected depending
on some criteria.
7. Code Generation : This step replaces sets of operations with the
new instructions, which can use the new AFUs now available in the
machine architecture.
8. Simulation : In this step the applications are simulated with the new
AFUs available in the architecture, to evaluate the gain of using the
AFUs in the base architecture.
2.2 Framework - Trimaran Compiler Infras-
tructure
ASIP synthesis is dependent on the availability of a Retargetable Compiler
and a Configurable Simulator. Hence, the Trimaran Framework, which has
these tools inherently built in, is used as a natural choice to facilitate the
design methodology.
Trimaran [1] is a compiler infrastructure for supporting state-of-the-art research
in compiling for Instruction Level Parallel (ILP) architectures. The Trimaran
compiler infrastructure comprises the following components, as shown in
Fig. 2.2
[Figure 2.2: The Trimaran compiler infrastructure — Application Code & Test Data feed IMPACT (C parsing, renaming, flattening; control flow profiling, function inlining; machine independent optimizations; block formation; ILP transformations), which emits Bridge Code to Elcor (machine dependent code optimizations; code scheduling; register allocation); the Elcor IR drives the Simulator Generator, configured by the HMDES machine description.]
• A cycle-level simulator of the HPL-PD architecture which is configurable
by a machine description and provides run-time information on execution
time, branch frequencies, and resource utilization. This information can
be used for profile-driven optimizations as well as to provide validation
of new optimizations.
2.2.1 MDES
[Figure: The MDES tool flow — an hmdes2 description is pre-processed (Hmdes2pp) and translated to lmdes; the compiler modules (scheduler, register allocation, optimizer) and the simulator access the Mdes DB through the mQS and RU Map interfaces.]
2.2.2 HPL-PD Architecture
– data speculation
– control speculation
• Predicated Execution
• Memory System
• Branch architecture
2.2.3 IMPACT
IMPACT forms the front end compiler of the Trimaran Infrastructure.
It is a generalized C compiler that can generate optimized code for various
architectures and machine resource configurations. It also implements new
research optimizations so that their effect on code can be analyzed.
IMPACT is divided into three distinct parts based on the intermediate code
representation used. The highest level is called PCode and is a parallel C code
representation with intact loop constructs. Memory system optimizations,
loop-level transformations and memory dependence analysis take place at
this level.
The next lower level is called Hcode and is a simplified C representation
with only simple if-then-else and goto control flow constructs. Statement-
level profiling, profile-guided code layout and function inline expansion are
performed at this level.
The final and lowest code representation is called Lcode. Lcode is a gener-
alized register transfer language. It is similar to most RISC instruction-based
assembly languages. At the Lcode level, classic machine independent optimiza-
tions are performed. Lcode is an instruction set for a load/store architecture
that supports unlimited virtual registers. It is broken down into data and
function blocks. The functions are composed of control blocks containing the
Lcode instructions.
2.2.4 ELCOR
Elcor forms the back end compiler of the Trimaran Framework. It has several
components, and each of these components reads, analyzes and transforms
the Rebel representation. A researcher can remove components which are not
required without affecting the compilation process. It is possible to add pluggable
modules which read the intermediate data structures and perform
transformations on the Rebel.
Elcor has the following main components
• I/O modules
• Mdes interface
• Analysis modules
• Scheduling modules
2.2.5 Simulator
The simulator provided by the Trimaran Framework converts the Elcor generated
IR into executable code and emulates its execution on a virtual HPL-PD
processor. The simulator is capable of gathering execution statistics
such as the number of cycles taken for the execution, the average number of
operations executed per cycle etc., and also emits the execution trace.
• Evaluation of FU’s
• Selection of FU’s
All the steps specified for ASIP design space exploration are explored for
single issue machines in [2], [4], [7].
2.3.1 Identification of Coarse Grained FU’s
The identification step searches the whole design space for all possible coarse
grained FU’s, given design constraints like the maximum number of input/output
ports, area, and the type of operations that can be part of FU’s. Unfortunately
the design space for ASIPs is exponential in the number of nodes present in
the design space. In order to limit the search, some heuristics are needed to
prune the search space. This pruning should discard only the inferior FUs. In
the last decade a number of identification algorithms with good pruning
techniques have been developed to decrease the search space to subexponential.
Sec. 2.3.1 gives the problem statement for identification of MIMOs. Sec. 2.4
gives the optimal identification algorithm, based on [2].
Problem Statement
We call G(V, E) the DAGs representing the dataflow of each basic block; the
nodes V represent primitive operations and the edges E represent data dependencies.
Each graph G is associated with a graph G+ (V ∪ V+, E ∪ E+) which
contains additional nodes V+ and edges E+. The additional nodes V+ represent
input and output variables of the basic block. The additional edges E+
connect nodes V+ to V, and nodes V to V+.
A cut S of G is a subgraph of G: S ⊆ G. There are 2^|V| possible cuts, where
|V| is the number of nodes in G. An arbitrary function M(S) measures the
merit of a cut S. It is the objective function of the optimisation problem
introduced below and typically represents an estimate of the speedup achievable
by implementing S as a special instruction.
We call IN(S) the number of predecessor nodes of those edges which enter
the cut S from the rest of the graph G+. They represent the number of
input values used by the operations in S. Similarly, OUT(S) is the number of
predecessor nodes in S of edges exiting the cut S. They represent the number
of values produced by S and used by other operations, either in G or in other
basic blocks.
Finally, we call the cut S convex if there exists no path from a node u ∈ S
to another node v ∈ S which involves a node w ∉ S.
Considering each basic block independently, the identification problem can
now be formally stated as follows.
Problem 1 Given a graph G+ , find the cut S which maximises M(S) under
the following constraints:
1. IN(S) ≤ Nin
2. OUT(S) ≤ Nout
3. S is convex
The user-defined values Nin and Nout indicate the register-file read and
write ports, respectively, which can be used by the special instruction. The
convexity constraint ensures a feasible schedule for single issue machines like
RISCs. Unfortunately this is not true for multiple issue machines: some non-convex
cuts have a feasible schedule on multiple issue machines. Convexity is
one of the best legality checks, and it prunes the search space to subexponential.
identification() {
    for (i = 0; i < NODES; i++) cut[i] = 0;
    topological_sort();
    search(1, 0);
    search(0, 0);
}

search(current_choice, current_index) {
    cut[current_index] = current_choice;
    if (current_choice == 1) {
        if (!output_port_check()) return;   /* monotonic: prune the branch */
        if (!convexity_check()) return;     /* monotonic: prune the branch */
        if (input_port_check()) {           /* not monotonic: record only  */
            calculate_speedup();
            update_best_solution();
        }
    }
    if ((current_index + 1) == NODES) return;
    current_index = current_index + 1;
    search(1, current_index);
    search(0, current_index);
}
The algorithm shown in Fig. 2.4 is a general identification algorithm. It
is a naive algorithm which searches all possible cuts, but pruning based on
the output port check and the convexity check decreases the search space to
subexponential. Since the nodes are considered in topological sorting order,
the number of output ports does not decrease by including higher numbered
nodes in the topological sort, so we can prune all these cuts by not searching
the branch. Similarly, once the convexity constraint fails, it is impossible to
regain convexity by including higher numbered nodes in the topological sort.
These two pruning techniques decrease the search space to subexponential.
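As an illustration, the search of Fig. 2.4 can be sketched in Python (a hypothetical re-implementation, not the thesis code): nodes are tried in topological order, output-port and convexity violations prune the whole branch, and the input-port constraint only suppresses recording a cut, since the number of inputs can still shrink as more nodes are added. The example DAG, the merit function and all names are illustrative, and values read by other basic blocks are ignored.

```python
def identify(nodes, edges, ext_in, merit, n_in, n_out):
    """Enumerate cuts of a DAG with output-port/convexity pruning.

    nodes:  node ids in topological order
    edges:  set of (u, v) pairs, u precedes v
    ext_in: node -> number of external inputs (edges from V+)
    merit:  cut -> value to maximise (the function M(S))
    """
    preds = {v: set() for v in nodes}
    succs = {v: set() for v in nodes}
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)

    def inputs(cut):        # IN(S): external inputs + values produced outside
        outside = {u for v in cut for u in preds[v] if u not in cut}
        return sum(ext_in.get(v, 0) for v in cut) + len(outside)

    def outputs(cut):       # OUT(S): cut nodes whose value is used outside
        return len({v for v in cut if succs[v] - cut})

    def reach(start, nbrs):  # all nodes reachable from 'start' via 'nbrs'
        seen, stack = set(), list(start)
        while stack:
            for y in nbrs[stack.pop()]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen

    def convex(cut):        # no path cut -> outside node -> cut
        middle = reach(cut, succs) & reach(cut, preds)
        return not (middle - cut)

    best = [set(), float("-inf")]

    def search(idx, cut):
        if idx == len(nodes):
            return
        for choice in (1, 0):
            new = cut | {nodes[idx]} if choice else cut
            if choice:
                if outputs(new) > n_out or not convex(new):
                    continue              # monotonic checks: prune the branch
                if inputs(new) <= n_in:   # not monotonic: record, keep going
                    m = merit(new)
                    if m > best[1]:
                        best[0], best[1] = set(new), m
            search(idx + 1, new)

    search(0, set())
    return best[0], best[1]
```

On a four-node diamond (a feeding b and c, both feeding d) with at most two inputs and one output, no multi-node cut survives the checks, so with merit M(S) = |S| the search returns a singleton.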
Let λ_sw be the latency of an operation when executed on the GPP and λ_hw be
the latency when the operation is executed on the dedicated hardware. Let γ be
the frequency of the region, i.e., the number of times the region is executed.
The total number of cycles taken to execute the nodes in the cut is Σ λ_sw.
The time taken to execute the cut when implemented as dedicated hardware
is the hardware critical path of the cut, CP_cut.
The gain by implementing the cut using a dedicated hardware functional
unit is γ · (Σ λ_sw − ⌈CP_cut⌉).
Therefore the objective is
    Maximize γ · (Σ λ_sw − ⌈CP_cut⌉)
Since this heuristic correctly estimates each cut for single issue machines,
we can say that it is the optimal estimation for single issue machines.
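A direct transcription of this objective into code might look as follows (an illustrative sketch; the function name, the latency values and the example cut are made up). The hardware critical path CP_cut is the longest path through the cut weighted by hardware latencies:

```python
import math

def single_issue_gain(cut, edges, lat_sw, lat_hw, freq):
    """gain = freq * (sum of software latencies - ceil(hardware critical path)).

    cut:   nodes of the cut, listed in topological order
    edges: (u, v) dependence pairs inside the region
    """
    finish = {}
    for v in cut:
        # longest path to v through the cut, in (fractional) hardware cycles
        finish[v] = lat_hw[v] + max(
            (finish[u] for u, w in edges if w == v and u in finish), default=0.0)
    cp_cut = max(finish.values())
    return freq * (sum(lat_sw[v] for v in cut) - math.ceil(cp_cut))
```

A chained shl → mul → add cut with software latencies 1, 2, 1 and hardware latencies 0.2, 0.9, 0.3 collapses 4 software cycles into ⌈1.4⌉ = 2 hardware cycles; executed 100 times, the estimated gain is 100 · (4 − 2) = 200 cycles.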
literature for resource-constrained scheduling in different situations. In the
present work we have used the following static list based scheduling algorithm.
The static list scheduling algorithm [9], also termed the as-late-as-possible (ALAP)
heuristic, is based on a combination of as-soon-as-possible (ASAP) scheduling
and the list-scheduling heuristic, using the ALAP and ASAP values of an operation
as its priority function. Given a performance constraint, the ALAP (ASAP)
value of an operation is the latest (earliest) time-step in which the operation
can start execution without violating the performance constraint. The use of
ALAP and ASAP values in high-level synthesis scheduling has been limited to
deriving other priority functions such as “mobility”, “freedom” and “force”.
The ALAP heuristic accepts a data flow graph, delay associated with each
operation in the data flow graph, a clock cycle, and a set of resources. The
output of the heuristic is a time-step partition of the data flow graph with
the mapping of the start of the execution of an operation to a time-step. The
ALAP heuristic schedules the graph minimizing overall delay subject to a
fixed resource count. The ALAP value computation requires a performance
constraint, but since the ALAP value is used for ordering the nodes only,
the absolute ALAP values are not important and any performance constraint
value (even less than the critical path value) would suffice. By this argument
the designer need not specify the performance constraint, and the program
can just as well use the critical path delay, which is what is done. The nodes of the
data flow graph are assigned the ALAP value by traversing the graph in a
bottom-up fashion from sink to source.
The worst-case computational complexity of the ALAP heuristic is
O(n log n + E log E + pE), where n is the number of nodes (operations) in
the data flow graph, E is the number of edges (excluding edges with root as
source and outport as sink) denoting precedence constraints in the data flow
graph, and p is the maximum number of resources of any operation type.
1. Read the data flow graph, number of modules of each type, performance
constraint and clock cycle.
2. Find the ALAP value for each node. In the case of a large clock cycle
value, two or more operations connected by a precedence relation may be
assigned identical ALAP values, indicating potential for chaining these
operations. Chaining is possible if the combined delay of the chained
operations does not exceed the clock cycle value, and is subject to resource
availability.
3. Sort the nodes in ascending order of ALAP value. All nodes with identical
ALAP values are sorted in increasing order of their ASAP value.
Since operations with a precedence relationship may have identical ALAP
values (in case of possible chaining), sorting according to the precedence
relationship is also done. Place the sorted result in list L.
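The steps above can be sketched as a small scheduler (a simplified, hypothetical rendering: integer delays, fully pipelined units, no operation chaining, one unit type per operation):

```python
def alap_schedule(ops, edges, delay, resources, kind):
    """List scheduling with (ALAP, ASAP) as the priority function.

    ops: operations in topological order; edges: (u, v) precedence pairs;
    delay: op -> integer cycles; resources: unit type -> count;
    kind: op -> unit type. Units are assumed fully pipelined.
    """
    preds = {v: [] for v in ops}
    succs = {v: [] for v in ops}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    # ASAP: earliest start, computed top-down
    asap = {}
    for v in ops:
        asap[v] = max((asap[u] + delay[u] for u in preds[v]), default=0)

    # ALAP: latest start against the critical-path length, bottom-up;
    # any performance constraint would do, since only the ordering matters
    cp = max(asap[v] + delay[v] for v in ops)
    alap = {}
    for v in reversed(ops):
        alap[v] = min((alap[w] for w in succs[v]), default=cp) - delay[v]

    order = sorted(ops, key=lambda v: (alap[v], asap[v]))
    start, t = {}, 0
    while len(start) < len(ops):
        free = dict(resources)          # units issuing in this time step
        for v in order:
            ready = v not in start and free.get(kind[v], 0) > 0 and all(
                u in start and start[u] + delay[u] <= t for u in preds[v])
            if ready:
                start[v] = t
                free[kind[v]] -= 1
        t += 1
    makespan = max(start[v] + delay[v] for v in ops)
    return start, makespan
```

With one ALU and one multiplier, two adds a and b feeding a third add d (plus an independent multiply c), the single ALU serialises a and b, so d starts at cycle 2 and the schedule takes 3 cycles.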
Chapter 3
In this chapter we show that the design space exploration techniques used for
single issue machines may not be optimal for multiple issue machines, and we
propose the best exploration methodologies for VLIW processors. [8] and [3]
started the design space exploration of multiple issue machines.
3.1 Identification
The general identification algorithm for single issue machines shown in
subsection 2.3.1 can be used for VLIW processors if the pruning techniques
used for single issue machines are also valid for VLIW machines.
Of the pruning techniques used for single issue machines, the output port
constraint also holds for multiple issue machines. But the convexity principle
Table 3.1: Hardware and software latencies of the operations in Fig. 3.1
[Figure 3.1: Schedule of a non convex cut over cycles 0–4 — cycle 0: >>; cycle 2: * and +; cycle 4: +.]
may not be true for multiple issue machines. Fig. 3.1 shows a non convex cut
for which a schedule exists.
Table 3.1 shows the hardware and software values for the operations used
in Fig. 3.1. Column 1 lists the operations, column 2 the synthesized hardware
latencies, and column 3 the corresponding software cycles needed to execute
the operations on the GPP.
Since a single issue machine has only one functional unit active at a time,
the cut shown in Fig. 3.1 cannot be scheduled. But in a VLIW processor we
have multiple functional units active at a time, so the cut is scheduled to a new
AFU and the “+” operation can be scheduled to one of the available functional
units in the processor. After cycle-0 the AFU allocated for the cut gives an
output, and it takes an input in cycle-2. Meanwhile the “+” operation is
executed by the functional unit and gives the required input for the AFU.
Therefore, there exist non convex cuts for which a schedule exists on multiple
issue machines. Let us call them pseudo convex cuts. But these pseudo
convex cuts are subsets of some convex cut, and the number of outputs of the
pseudo convex cut is greater than or equal to the number of outputs of
the convex cut.
As shown in Fig. 3.1, the “+” operation which is not in the cut can be
included without decreasing the performance or increasing the number of
inputs and outputs.
Since our main objective is to maximize performance, including in hardware
the nodes which are not in the pseudo convex cut does not increase the
output ports and increases the performance. So we can use the same
identification methodology of Sec. 2.3.1 for multiple issue machines without
a decrease in performance.
3.2 Estimation
In this section we give a motivating example which shows that the objective
function used for single issue machines is not a good estimation heuristic for
multiple issue machines. Next, we propose an estimation heuristic for multiple
issue machines, which involves forming an objective function for multiple issue
machines under some assumptions. Finally, we give an estimation based on
resource scheduling.
Fig. 3.2 shows a motivating example where the heuristic used for single
issue machines cannot correctly estimate the performance of a cut for multiple
issue machines. For multiple issue machines it is the critical path that matters
rather than the performance of the individual cut.
Assume that we have a constraint of two inputs and one output. We have
to select the best cut of the DAG shown in Fig. 3.2 satisfying these
constraints. We do not count Const inputs as inputs to the cut, since we
can hardwire these constants in the hardware. The heuristic used for single issue
machines selects the left connected component of the DAG, because the gain
estimated by the heuristic for implementing it is 2 cycles.
For multiple issue machines it is the critical path that matters rather than the
individual performance gain of the cut. The cut selected using the heuristic
for single issue machines does not decrease the critical path length.
The best cut is the connected component on the right side of the DAG
shown in the figure.
Therefore we need a different method for the estimation of the cut. The
next two subsections give two different methods for estimating the cut for
multiple issue machines.
[Figure 3.2: Motivating DAG — left component: Const, x, y feeding << and −; right component: Const, w, z feeding +, *, a Const, and >>.]
3.2.2 Objective Function
The main objective of this project is to form a better objective function for
estimating the cut for multiple issue machines. For multiple issue machines it
is the critical path that matters rather than the local speedup of the cut. So we
need to estimate the critical path length of the region when the cut is implemented
as a new instruction.
Multiple issue machines are meant to improve performance using the instruction
level parallelism (ILP) available in applications. Since ILP is exposed
by the VLIW processor, we assume infinitely many parallel functional units,
i.e., there is no resource constraint in the VLIW processor and incorporating
extra functional units in the processor would not improve performance. The
results show that this is a reasonable assumption and gives results almost the
same as, or near to, the results of the scheduling method. The downside of this
assumption is that it fails to estimate correctly when the VLIW processor does
not have sufficient parallelism.
Fig. 3.3 explains the parameters used to form the objective function for
multiple issue machines. We will try to find the critical path length of the
region after introducing the special AFU for the cut, under the assumption of
infinite parallelism available in the VLIW processor.
The notation used in Fig. 3.3 is as follows.
Let us assume that the cut includes the nodes in the critical path of the
region. Since some nodes of the region are now implemented in hardware
[Figure 3.3: Parameters of the objective function over a schedule of cycles 0–7 — C_pre, CP_hw, C_S′ and C_post.]
functional units, the length of the critical path changes and it may no longer
be the critical path of the region. So we need to find the length of the
critical path for each cut. C_S′ keeps track of the estimated critical path
length for each cut dynamically. This variable can be updated incrementally
during the identification search with appropriate data structures. We can divide
the nodes into O(n) individual components, where n is the number of nodes in the
region, for which C_S′ remains constant. So we need to update C_S′ at most O(n)
times. Therefore the total time taken to maintain C_S′ is at most O(n) times the
time complexity to find C_S′. All the remaining variables can
be updated in constant time at each node of the identification search tree.
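One plausible way to compute this estimate (a hypothetical sketch, since eq. 3.1 is not reproduced here): collapse the cut into a single AFU node of latency ⌈CP_hw⌉ and take the dataflow critical path of the region, which is exact under the infinite-resource assumption. All names and the example DAG are illustrative; the cut must be convex so that the collapsed graph stays acyclic.

```python
import math
from functools import lru_cache

def estimated_cycles(ops, edges, lat_sw, lat_hw, cut):
    """Critical path of the region with the cut collapsed into one AFU node."""
    cut = set(cut)

    # hardware critical path of the cut (fractional hardware latencies);
    # ops must be in topological order
    hw = {}
    for v in ops:
        if v in cut:
            hw[v] = lat_hw[v] + max((hw[u] for u, w in edges
                                     if w == v and u in cut and u in hw),
                                    default=0.0)
    afu_latency = math.ceil(max(hw.values()))

    node = lambda v: "AFU" if v in cut else v
    latency = lambda v: afu_latency if v == "AFU" else lat_sw[v]
    cedges = {(node(u), node(v)) for u, v in edges if node(u) != node(v)}

    @lru_cache(maxsize=None)
    def finish(v):   # longest-path finish time in the collapsed DAG
        return latency(v) + max((finish(u) for u, w in cedges if w == v),
                                default=0)

    return max(finish(node(v)) for v in ops)
```

For a chain a → b → c with an independent d → c, software latencies 1, 2, 1, 1 and a cut {a, b} whose hardware critical path is 1.2, the estimated cycle count of the region drops from 4 to 3.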
Let C_G represent the number of scheduled cycles needed to execute the region
without any AFU. The estimated number of cycles to execute the region with
the special AFU for the cut is shown in eq. 3.1.
The number of cycles gained by implementing a special AFU for the cut is
shown in eq. 3.2.
We can call this the naive method of estimation for multiple issue machines
like VLIW processors. Let C_G be the number of scheduled cycles without
any special AFU and C_cut be the number of scheduled cycles assuming the
AFU is implemented by a new instruction.
Therefore the gain by using a special AFU for the region is C_G − C_cut.
3.3 Selection
Let γi be the frequency of region i and Gaini be the gain of region i when
the best AFU is selected for it. The total number of cycles gained for
each region is γi ∗ Gaini. Sort these in descending order and select the top k
AFU’s which give the maximum benefit.
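As a sketch (function and field names are illustrative), the selection step is a single sort over the per-region savings:

```python
def select_afus(regions, k):
    """regions: (name, freq, gain) triples, one best AFU per region;
    return the names of the k AFUs with the largest savings freq * gain."""
    ranked = sorted(regions, key=lambda r: r[1] * r[2], reverse=True)
    return [name for name, freq, gain in ranked[:k]]
```

A hot inner loop with a modest per-invocation gain can outrank a large gain in rarely executed code: a region executed 1000 times gaining 2 cycles saves 2000 cycles and wins over one executed once gaining 100.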
Chapter 4
Implementation
In this chapter we first discuss the earlier implementation [3] and then
give the limitations of the earlier work. The next section presents the enhanced
implementation, which eliminates most of the limitations discussed
for the earlier work.
[Figure 4.1: Earlier Framework — Application code & Design Constraints; Identification & MachSUIF; IMPACT; HMDES.]
by function calls, as cited in [3]. These function calls are identified in the
PCODE phase of the IMPACT component of Trimaran. The identified special
function calls are replaced by new instructions by appropriately modifying
the intermediate data structures of Trimaran to identify the new instructions.
After replacement of the function calls by new instructions, the source
code of the ELCOR component is modified to make it identify the new
instructions. The source code of the SIMU component is modified to identify
the new behaviour of the new instructions.
Execution statistics are compared without adding any special AFU’s and
with special AFU’s. Basically, Trimaran is used for evaluation purposes and
not for design space exploration.
4.1.1 Limitations
framework modified the source code of the framework, which requires
recompilation of the whole framework for each benchmark.
• A lot of Perl scripts were written to modify the source code to make
Trimaran identify the new instructions.
[Figure 4.2: Enhanced Framework — Application Code is compiled by IMPACT into Bridge Code for ELCOR-I; the VLIW ASIP Exploration module produces the AFU's data used by ELCOR-II; the Elcor IR feeds the Simulator Generator and Simulator, which produce the Results & Statistics.]
[Figure: The VLIW ASIP Exploration module — Identification & Estimation (Heuristic 1, 2 & Scheduling) followed by a Scheduler (at most one AFU per region), producing Statistics.]
After that, the top k AFU’s are selected for the three estimation methods. These
AFU’s are again passed through the static scheduling module to correctly estimate
the gain of the top k AFU’s.
The three estimation methodologies are compared. In all cases the Heuristic-2
estimation methodology performs as well as the scheduling method. But one
can use the incremental approach for the static scheduling method, which can
be implemented at O(1) amortized cost.
Heuristic-1 estimates better than Heuristic-2 where the base architecture
does not have sufficient parallelism. But, according to the results, for all generic
architectures Heuristic-2 gives better results than Heuristic-1 with the same
time complexity.
Chapter 5
Results
processor.
The architectures are specified in the format
#Int FUs/#Float FUs/#Branch Units/#Memory Units/#input ports/#output ports.
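A small helper (hypothetical, for readers working with the tables) can decode this notation:

```python
from collections import namedtuple

# Fields follow the order of the architecture notation used in the tables.
Arch = namedtuple(
    "Arch", "int_fus float_fus branch_units mem_units in_ports out_ports")

def parse_arch(spec):
    """Parse a spec such as '1/1/2/2/12/6' into its six fields."""
    return Arch(*map(int, spec.split("/")))

print(parse_arch("1/1/2/2/12/6"))
```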
Heuristic-1 (AFUs) Heuristic-2 (AFUs) Sched (AFUs)
architecture 0 (base) 2 4 8 2 4 8 2 4 8
1/1/1/1/8/4 1 1.202 1.301 1.461 1.173 1.284 1.439 1.202 1.319 1.484
1/1/2/2/12/6 1.279 1.477 1.629 1.718 1.525 1.629 1.75 1.525 1.687 1.817
1/1/4/4/20/10 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
1/2/1/1/10/5 1.218 1.397 1.533 1.658 1.397 1.485 1.629 1.397 1.533 1.687
1/2/2/2/14/7 1.279 1.477 1.629 1.718 1.525 1.629 1.75 1.525 1.687 1.817
1/2/4/4/22/11 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
1/4/1/1/14/7 1.218 1.397 1.533 1.658 1.397 1.485 1.629 1.397 1.533 1.687
1/4/2/2/18/9 1.279 1.478 1.63 1.718 1.525 1.63 1.75 1.525 1.687 1.817
1/4/4/4/26/13 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
2/1/1/1/10/5 1.08 1.218 1.32 1.485 1.218 1.284 1.44 1.218 1.32 1.485
2/1/2/2/14/7 1.279 1.477 1.629 1.718 1.525 1.629 1.75 1.525 1.687 1.817
2/1/4/4/22/11 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
2/2/1/1/12/6 1.218 1.397 1.533 1.658 1.397 1.485 1.629 1.397 1.533 1.687
2/2/2/2/16/8 1.279 1.477 1.629 1.718 1.525 1.629 1.75 1.525 1.687 1.817
2/2/4/4/24/12 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
2/4/1/1/16/8 1.218 1.397 1.533 1.658 1.397 1.485 1.629 1.397 1.533 1.687
2/4/2/2/20/10 1.279 1.478 1.63 1.718 1.525 1.63 1.75 1.525 1.687 1.817
2/4/4/4/28/14 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
4/1/1/1/14/7 1.08 1.218 1.32 1.485 1.218 1.284 1.44 1.218 1.32 1.485
4/1/2/2/18/9 1.279 1.477 1.63 1.718 1.525 1.63 1.75 1.525 1.687 1.817
4/1/4/4/26/13 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
4/2/1/1/16/8 1.218 1.397 1.533 1.658 1.397 1.485 1.629 1.397 1.533 1.687
4/2/2/2/20/10 1.279 1.477 1.63 1.718 1.525 1.63 1.75 1.525 1.687 1.817
4/2/4/4/28/14 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
4/4/1/1/20/10 1.218 1.397 1.533 1.658 1.397 1.485 1.629 1.397 1.533 1.687
4/4/2/2/24/12 1.279 1.478 1.63 1.718 1.525 1.63 1.75 1.525 1.687 1.817
4/4/4/4/32/16 1.314 1.525 1.63 1.718 1.576 1.687 1.817 1.576 1.687 1.817
Heuristic-1 (AFUs) Heuristic-2 (AFUs) Sched (AFUs)
architecture 0 (base) 2 4 8 2 4 8 2 4 8
1/1/1/1/8/4 1 1.139 1.223 1.314 1.139 1.223 1.314 1.139 1.223 1.34
1/1/2/2/12/6 1.219 1.372 1.452 1.495 1.372 1.452 1.495 1.372 1.452 1.502
1/1/4/4/20/10 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
1/2/1/1/10/5 1.123 1.252 1.318 1.354 1.252 1.318 1.354 1.252 1.318 1.359
1/2/2/2/14/7 1.219 1.372 1.452 1.495 1.372 1.452 1.495 1.372 1.452 1.502
1/2/4/4/22/11 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
1/4/1/1/14/7 1.198 1.345 1.376 1.386 1.345 1.376 1.386 1.345 1.379 1.392
1/4/2/2/18/9 1.284 1.455 1.491 1.503 1.455 1.491 1.503 1.455 1.497 1.512
1/4/4/4/26/13 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
2/1/1/1/10/5 1.018 1.162 1.25 1.345 1.162 1.25 1.345 1.162 1.25 1.347
2/1/2/2/14/7 1.221 1.374 1.455 1.498 1.374 1.455 1.498 1.374 1.455 1.505
2/1/4/4/22/11 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
2/2/1/1/12/6 1.125 1.253 1.32 1.356 1.253 1.32 1.356 1.253 1.32 1.361
2/2/2/2/16/8 1.221 1.374 1.455 1.498 1.374 1.455 1.498 1.374 1.455 1.505
2/2/4/4/24/12 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
2/4/1/1/16/8 1.198 1.345 1.376 1.386 1.345 1.376 1.386 1.345 1.379 1.392
2/4/2/2/20/10 1.284 1.455 1.491 1.503 1.455 1.491 1.503 1.455 1.497 1.512
2/4/4/4/28/14 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
4/1/1/1/14/7 1.034 1.183 1.274 1.373 1.183 1.274 1.373 1.183 1.274 1.376
4/1/2/2/18/9 1.221 1.375 1.455 1.499 1.375 1.455 1.499 1.375 1.455 1.506
4/1/4/4/26/13 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
4/2/1/1/16/8 1.143 1.276 1.346 1.383 1.276 1.346 1.383 1.276 1.346 1.388
4/2/2/2/20/10 1.221 1.375 1.455 1.499 1.375 1.455 1.499 1.375 1.455 1.506
4/2/4/4/28/14 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
4/4/1/1/20/10 1.198 1.345 1.354 1.362 1.345 1.376 1.387 1.345 1.379 1.392
4/4/2/2/24/12 1.284 1.455 1.491 1.503 1.455 1.491 1.503 1.455 1.497 1.512
4/4/4/4/32/16 1.307 1.484 1.522 1.535 1.484 1.522 1.535 1.484 1.528 1.545
5.3 mmdyn application
Table 5.3 shows the results for the application mmdyn, one of the benchmarks
available in the Trimaran benchmark directory. It is a good benchmark to
explain the benefit of using Heuristic-2 for estimation. Using Heuristic-1 we
are able to get a performance improvement of 1.28, which is close to the
performance improvement achievable by increasing the number of FUs.
Heuristic-2 estimates closer to the scheduling method. This benchmark is a
good example of the need for a different estimation method for multiple-issue
machines.
Heuristic-1 (AFUs) Heuristic-2 (AFUs) Sched (AFUs)
architecture 0 (base) 2 4 8 2 4 8 2 4 8
1/1/1/1/8/4 1 1.034 1.056 1.058 1.17 1.198 1.21 1.179 1.227 1.25
1/1/2/2/12/6 1.209 1.249 1.27 1.272 1.468 1.497 1.514 1.468 1.527 1.545
1/1/4/4/20/10 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
1/2/1/1/10/5 1.035 1.064 1.072 1.074 1.218 1.238 1.25 1.218 1.249 1.261
1/2/2/2/14/7 1.229 1.249 1.27 1.272 1.482 1.512 1.515 1.497 1.527 1.545
1/2/4/4/22/11 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
1/4/1/1/14/7 1.043 1.072 1.081 1.082 1.229 1.25 1.262 1.229 1.26 1.272
1/4/2/2/18/9 1.24 1.26 1.271 1.273 1.497 1.528 1.531 1.497 1.528 1.546
1/4/4/4/26/13 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
2/1/1/1/10/5 1.021 1.05 1.065 1.066 1.2 1.229 1.24 1.2 1.239 1.261
2/1/2/2/14/7 1.23 1.25 1.271 1.273 1.483 1.512 1.515 1.497 1.528 1.546
2/1/4/4/22/11 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
2/2/1/1/12/6 1.035 1.065 1.08 1.081 1.219 1.239 1.251 1.219 1.249 1.261
2/2/2/2/16/8 1.23 1.25 1.271 1.273 1.483 1.512 1.515 1.497 1.528 1.546
2/2/4/4/24/12 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
2/4/1/1/16/8 1.043 1.072 1.081 1.082 1.229 1.25 1.262 1.229 1.26 1.272
2/4/2/2/20/10 1.24 1.26 1.271 1.273 1.497 1.528 1.531 1.497 1.528 1.546
2/4/4/4/28/14 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
4/1/1/1/14/7 1.029 1.057 1.073 1.074 1.21 1.239 1.251 1.21 1.25 1.262
4/1/2/2/18/9 1.23 1.25 1.271 1.273 1.483 1.512 1.515 1.497 1.528 1.546
4/1/4/4/26/13 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
4/2/1/1/16/8 1.036 1.065 1.073 1.074 1.22 1.24 1.251 1.22 1.25 1.262
4/2/2/2/20/10 1.23 1.25 1.271 1.273 1.483 1.512 1.515 1.497 1.528 1.546
4/2/4/4/28/14 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
4/4/1/1/20/10 1.043 1.072 1.081 1.082 1.229 1.25 1.262 1.229 1.26 1.272
4/4/2/2/24/12 1.24 1.26 1.271 1.273 1.497 1.528 1.531 1.497 1.528 1.546
4/4/4/4/32/16 1.25 1.271 1.282 1.284 1.512 1.543 1.562 1.528 1.559 1.578
Heuristic-1 (AFUs) Heuristic-2 (AFUs) Sched (AFUs)
architecture 0 (base) 2 4 8 2 4 8 2 4 8
1/1/1/1/8/4 1 1.625 1.696 1.773 1.625 1.696 1.773 1.625 1.696 1.773
1/1/2/2/12/6 1.04 1.733 1.814 1.857 1.733 1.814 1.902 1.733 1.814 1.902
1/1/4/4/20/10 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
1/2/1/1/10/5 1.04 1.733 1.773 1.814 1.733 1.773 1.814 1.733 1.773 1.814
1/2/2/2/14/7 1.04 1.733 1.814 1.857 1.733 1.814 1.902 1.733 1.814 1.902
1/2/4/4/22/11 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
1/4/1/1/14/7 1.054 1.773 1.814 1.814 1.773 1.814 1.814 1.773 1.814 1.814
1/4/2/2/18/9 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
1/4/4/4/26/13 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
2/1/1/1/10/5 1.013 1.66 1.733 1.814 1.66 1.733 1.814 1.66 1.733 1.814
2/1/2/2/14/7 1.04 1.733 1.814 1.857 1.733 1.814 1.902 1.733 1.814 1.902
2/1/4/4/22/11 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
2/2/1/1/12/6 1.04 1.733 1.773 1.814 1.733 1.773 1.814 1.733 1.773 1.814
2/2/2/2/16/8 1.04 1.733 1.814 1.857 1.733 1.814 1.902 1.733 1.814 1.902
2/2/4/4/24/12 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
2/4/1/1/16/8 1.054 1.773 1.814 1.814 1.773 1.814 1.814 1.773 1.814 1.814
2/4/2/2/20/10 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
2/4/4/4/28/14 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
4/1/1/1/14/7 1.013 1.66 1.733 1.814 1.66 1.733 1.814 1.66 1.733 1.814
4/1/2/2/18/9 1.04 1.733 1.814 1.857 1.733 1.814 1.902 1.733 1.814 1.902
4/1/4/4/26/13 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
4/2/1/1/16/8 1.04 1.733 1.773 1.814 1.733 1.773 1.814 1.733 1.773 1.814
4/2/2/2/20/10 1.04 1.733 1.814 1.857 1.733 1.814 1.902 1.733 1.814 1.902
4/2/4/4/28/14 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
4/4/1/1/20/10 1.054 1.773 1.814 1.814 1.773 1.814 1.814 1.773 1.814 1.814
4/4/2/2/24/12 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
4/4/4/4/32/16 1.054 1.773 1.814 1.857 1.773 1.857 1.902 1.773 1.857 1.902
Heuristic-1 (AFUs) Heuristic-2 (AFUs) Sched (AFUs)
architecture 0 (base) 2 4 8 2 4 8 2 4 8
1/1/1/1/8/4 1 1.182 1.201 1.215 1.251 1.28 1.296 1.251 1.28 1.296
1/1/2/2/12/6 1.082 1.374 1.399 1.402 1.374 1.399 1.402 1.374 1.399 1.402
1/1/4/4/20/10 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
1/2/1/1/10/5 1.025 1.283 1.29 1.29 1.283 1.297 1.299 1.29 1.305 1.307
1/2/2/2/14/7 1.082 1.374 1.399 1.402 1.383 1.409 1.411 1.383 1.409 1.411
1/2/4/4/22/11 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
1/4/1/1/14/7 1.087 1.29 1.305 1.306 1.373 1.39 1.392 1.382 1.399 1.401
1/4/2/2/18/9 1.093 1.383 1.399 1.402 1.392 1.409 1.411 1.392 1.409 1.411
1/4/4/4/26/13 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
2/1/1/1/10/5 1.02 1.268 1.28 1.283 1.356 1.372 1.382 1.356 1.381 1.391
2/1/2/2/14/7 1.088 1.383 1.399 1.402 1.392 1.409 1.411 1.392 1.409 1.411
2/1/4/4/22/11 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
2/2/1/1/12/6 1.082 1.29 1.305 1.306 1.373 1.39 1.392 1.373 1.399 1.401
2/2/2/2/16/8 1.088 1.383 1.399 1.402 1.392 1.409 1.411 1.392 1.409 1.411
2/2/4/4/24/12 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
2/4/1/1/16/8 1.087 1.29 1.305 1.306 1.373 1.39 1.392 1.382 1.399 1.401
2/4/2/2/20/10 1.093 1.383 1.399 1.402 1.392 1.409 1.411 1.392 1.409 1.411
2/4/4/4/28/14 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
4/1/1/1/14/7 1.025 1.283 1.297 1.306 1.365 1.381 1.391 1.365 1.39 1.4
4/1/2/2/18/9 1.093 1.383 1.399 1.402 1.392 1.409 1.411 1.392 1.409 1.411
4/1/4/4/26/13 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
4/2/1/1/16/8 1.082 1.29 1.305 1.306 1.373 1.39 1.392 1.373 1.399 1.401
4/2/2/2/20/10 1.093 1.383 1.399 1.402 1.392 1.409 1.411 1.392 1.409 1.411
4/2/4/4/28/14 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
4/4/1/1/20/10 1.087 1.29 1.305 1.306 1.373 1.39 1.392 1.382 1.399 1.401
4/4/2/2/24/12 1.093 1.383 1.399 1.402 1.392 1.409 1.411 1.392 1.409 1.411
4/4/4/4/32/16 1.093 1.383 1.399 1.402 1.392 1.409 1.412 1.392 1.409 1.412
Chapter 6
Conclusion
• We evaluated the design by using static list-based scheduling, which
schedules the selections of all three methods and compares the results.
The work done is able to demonstrate automatic design space exploration for
VLIW processors. The framework is generic in the sense that any of the
modules can be upgraded for future enhancement without affecting the whole
framework. The identification algorithm presently supports at most one AFU
per region, but it can be extended to multiple AFUs without affecting the
whole framework.
The static list based scheduling algorithm can also be modified for multiple
AFUs with appropriate data structures. One can use an incremental approach
for the implementation of the scheduling algorithm, using appropriate data
structures to make the amortized cost of estimation constant. Presently the
scheduling algorithm is not implemented with the incremental approach.
Presently the framework supports at most one AFU per region. The frequency
of execution of some regions is very high, while for others it is very low.
Instead of selecting AFUs from rarely executed regions, if we select multiple
AFUs from the frequently executed regions we can improve the performance
further. Presently all the algorithms and data structures support at most one
AFU per region. One can use appropriate data structures and algorithms to
extend them to multiple AFUs.
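The proposed extension can be sketched as a selection with a configurable per-region cap; the tuple layout and the cap parameter are illustrative and not part of the implemented framework (the current framework corresponds to a cap of one):

```python
# Greedily pick the k candidates with the highest freq * gain, optionally
# limiting how many AFUs a single region may contribute.
def select_afus(candidates, k, per_region_cap=None):
    ranked = sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)
    chosen, used = [], {}
    for region, freq, gain in ranked:
        if per_region_cap is not None and used.get(region, 0) >= per_region_cap:
            continue  # region already contributed its quota of AFUs
        chosen.append((region, freq, gain))
        used[region] = used.get(region, 0) + 1
        if len(chosen) == k:
            break
    return chosen

hot_cold = [("hot", 1000, 5), ("hot", 1000, 4), ("cold", 10, 3)]
# With cap 1 the second AFU from the hot region is skipped; without a
# cap both hot-region AFUs are taken, giving a larger total benefit.
print(select_afus(hot_cold, 2, per_region_cap=1))
print(select_afus(hot_cold, 2))
```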
6.2.5 AFUs with memory accessing capability
Trimaran supports different levels of memory, some very near to the processor
and some farther away. One can use the built-in memory hierarchy of Trimaran
to model AFUs with memory accessing capability.
Bibliography
[3] Diviya Jain, Anshul Kumar, Laura Pozzi, and Paolo Ienne. Automatically
customising VLIW architectures with coarse grained application-specific
functional units. In Proceedings of the 8th International Workshop on Software
and Compilers for Embedded Systems, Amsterdam, September 2004.
[4] Laura Pozzi, Kubilay Atasu, and Paolo Ienne. Exact and approximate
algorithms for the extension of embedded processor instruction sets. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems,
25(7):1209–1229, July 2006.
[5] Paolo Ienne, Laura Pozzi, and Miljan Vuletic. On the limits of processor spe-
cialisation by mapping dataflow sections on ad-hoc functional units. Technical
Report 01/376, Swiss Federal Institute of Technology Lausanne (EPFL), Com-
puter Science Department (DI), Lausanne, December 2001.
[7] N. Clark, H. Zhong, and S. Mahlke. Processor acceleration through
automated instruction set customisation. In Proceedings of the 36th Annual
International Symposium on Microarchitecture, San Diego, Calif., December
2003.
[8] Bhuvan Middha, Varun Raj, Anup Gangwar, Anshul Kumar, M. Balakrishnan,
and Paolo Ienne. A Trimaran based framework for exploring the design space
of VLIW ASIPs with coarse grain functional units. In Proceedings of the 15th
International Symposium on System Synthesis, Kyoto, October 2002.
[10] Manoj Kumar Jain, M. Balakrishnan, and Anshul Kumar. ASIP design
methodologies: Survey and issues. In Proceedings of the 14th IEEE
International Conference on VLSI Design, India, January 2001.
[12] J. C. Gyllenhaal, W.-m. W. Hwu, and B. R. Rau. HMDES Version 2.0
Specification. Technical Report IMPACT-96-3, University of Illinois at
Urbana-Champaign, 1996.