Yue-Bong Wong
Qi Ning
Most scientific computations spend a significant part of their time in the execution of
loops. As a result, compiler designers for high-performance computer architectures have
been focusing their attention on the effective and efficient exploitation of parallelism within
loops.
In this paper, we use dataflow graphs as our program representation for a class of loops,
and we develop a timed Petri-net model to model these loops. The behavior graph of the Petri
net is used to determine, at compile time, a repetitive execution pattern. Such a pattern,
once detected, can be used to generate a parallel instruction schedule during code generation.
We show that, for an ideal model, after the earliest firing schedule is relaxed to satisfy
an initial token-distribution constraint, the repetitive pattern of a loop can be detected in
O(n^2) iterations. This improves considerably the polynomial bound reported earlier [17].
Furthermore, we show that our results apply to all nodes on or off any critical cycles.
We have examined the application of the behavior graph approach to a target pipelined
architecture with multiple clean execution pipelines. A number of typical benchmarks (loops)
from scientific programs have been simulated, and we find that the repetitive patterns can
be reached very quickly. This verifies the feasibility of employing the proposed method in a
compiler.
1 Introduction
Most scientific computations spend a significant part of their time in the execution of loops. As a
result, compiler designers for high-performance computer architectures have been focusing their
attention on the effective and efficient exploitation of parallelism within loops. Unfortunately,
many optimizing compilers rely on ad-hoc schemes to handle loop parallelism and so far have
achieved only limited success. In contrast, the technology and architectures for high-performance
machines have been advancing rapidly, and aggressive fine-grain loop scheduling techniques are
needed to take advantage of the extensive parallelism available in these machines.
This paper was partially inspired by recent work in software pipelining proposed for fine-grain
loop scheduling. Software pipelining performs loop scheduling by computing a static parallel
schedule to overlap instructions of a loop body from different iterations. An advantage of software
pipelining is that it provides a direct way of exploiting parallelism across all iterations of the
loop. This is achieved without the explicit use of loop unrolling, resulting in highly compact
object code. Software pipelining has been proposed for synchronous parallel machines as well as
pipelined machines [1, 2, 3, 10, 11, 22].
Software pipelining requires the exploitation of parallelism from a partially ordered set of
operations of the loop body which are to be performed repetitively over a sequence of iterations.
When a loop contains loop-carried dependences, the data dependence graph of the loop is no longer
acyclic. Therefore, it is important that the execution of the loop can be modeled at compile time,
and a steady-state pattern can be derived for code scheduling. To effectively model the code
scheduling process for such loops is the primary objective of this paper.
This work is related to our work in dataflow software pipelining [14, 15, 16]. Dataflow software
pipelining is a compiler method for structuring the fine-grain parallelism in loops that are to be
executed by a dataflow computer. The generalization of our earlier work, to cover loops with
loop-carried dependences as well as pipelined machines with resource constraints, motivates
us to find a model for fine-grain loop scheduling.
We use dataflow graphs as our program representation for a class of loops, and we develop a
timed Petri-net model to model these loops. The behavior graph of the Petri-net model is used
to determine, at compile time, a repetitive execution pattern. After the execution pattern is
determined, the compiler can then generate a time-optimal schedule to guide code generation.
Therefore, our work provides a model for the software pipelining mentioned earlier.
In a companion paper, we established O(n^3) iterations as the polynomial bound for the
scheduling pattern to occur in a loop with a single critical cycle under the earliest firing schedule
for the ideal machine model [17]. In the case of loops with multiple critical cycles, we were only
able to show that a polynomial time bound exists for nodes on the critical cycles. In this paper we
show that, after the earliest firing schedule is relaxed to satisfy a certain initial token-distribution
constraint, the repetitive execution pattern of a loop can be detected in O(n^2) iterations. This
improves considerably the polynomial bound reported previously. Furthermore, we show that
our results apply to all loops with single or multiple critical cycles, including nodes on or off
critical cycles. We have also examined the application of the behavior graph approach to a target
pipelined architecture with multiple clean execution pipelines.1 We simulate the execution of a
number of typical benchmarks (loops) from scientific programs and find that repetitive patterns
can be reached very quickly. This verifies the feasibility of employing such methods in a compiler.
Section 2 defines timed Petri nets and reviews the basic theory of timed marked graphs (for
an introduction to basic Petri-net theory, see [26, 28, 27]). Section 3 defines a class of loops known
as a static dataflow software pipeline (SDSP). This class includes loops both with and without
loop-carried dependences. In the third section we also describe how to obtain a corresponding
Petri-net loop representation, the SDSP-PN. In Section 4, the technique for constructing the behavior
graph to obtain steady-state behavior for an SDSP-PN operated under the earliest firing rule is
discussed. In Section 5 we show that steady-state behavior for the ideal machine model, the SDSP-PN,
can always be reached in a polynomial number of steps. In Section 6 we compose a new model,
called SDSP-MCP-PN, integrating the concept of a clean-pipelined processor architecture (with
multiple clean pipelines, MCP) into the basic SDSP-PN model. We then provide experimental
evidence to show that the cyclic frustum for both an SDSP-PN and its extension SDSP-MCP-PN
can be quickly reached for a set of Livermore loops. Finally, our conclusions are presented in
Section 7.

1 A pipeline is clean if it is free of structural hazards: resource conflicts that arise when the hardware cannot
support simultaneous operations by two possibly-independent instructions [20].
Adding the notion of time to the basic Petri-net model enables the characterization of system
performance. In this paper we assign a deterministic time, expressed by a non-negative integer,
to each transition in the basic Petri net. The model described below combines the original
timed model introduced by Ramchandani [28] with the concept of an instantaneous state subsequently
developed by Chretienne [7].
Formally, a timed Petri net is defined by a pair (PN, τ), where PN is the basic Petri-net tuple
(P, T, A). P is a non-empty set of places denoted by {p1, p2, ..., pn}, T is a non-empty set of
transitions denoted by {t1, t2, ..., tm}, and A is a non-empty set of directed arcs such that P ≠ ∅,
T ≠ ∅, P ∩ T = ∅, and A ⊆ (P × T) ∪ (T × P). Pictorially, P, T, and A are represented by circles, bars,
and directed arcs, respectively. The symbol τ is a function that assigns a non-negative integer
τi to each transition ti in the net. The value τi denotes the execution time (or the firing time)
taken by transition ti.
The state of the timed Petri net at time u is no longer described only by the current marking
at time u (Mu), because some transitions may still be processing at time u. A new concept, the
residual firing time vector R, is introduced to keep track of on-going executions at each time step.
Ru(ti) stores the remaining execution time of transition ti at time u. Accordingly, Mu and Ru
together define the instantaneous state of a timed Petri net. We also make the following two
assumptions regarding the firing rule of enabled transitions:
Assumption 2.1.1 Two distinct firings of the same transition cannot overlap. To formally
enforce this rule, each transition in the net is assigned a distinct self-loop of its own with only one
token on it. Though we do not draw these self-loops explicitly, they are implicitly assumed. This
assumption is also known as simple-server semantics.

Assumption 2.1.2 Transitions are fired as soon as they are enabled. This is known as the
earliest firing rule.
This section defines and reviews previously known results for a class of Petri nets known as marked
graphs [8]. These graphs are important to the development of our work.

Definition 2.2.1 A Petri net PN = (P, T, A) is called a marked graph if and only if there is exactly one
input transition and one output transition for each place in P.

Theorem 2.2.1 A marking is live if and only if the token count of every simple cycle is positive.2

Theorem 2.2.2 A live marking is safe if and only if every edge in the graph is in a simple cycle
with token count 1.
2 A simple cycle is a path pi tl ... tm pm such that all places and transitions are different except pi and pm.
Theorem 2.2.3 If σ is a cyclic firing sequence leading from marking M back to M, then all transitions have been fired
an equal number of times.
Timed Petri nets have been applied in the study of concurrent systems to determine the computation rate (or, equivalently, the cycle time), which describes the number of firings of a transition
per unit time when the modeled system is operating at its maximum rate. Listed below is a review
of results regarding the cycle time of a timed marked graph [27]:
- The number of tokens in a simple cycle remains the same after any firing sequence.

- All transitions in a marked graph have the same cycle time.

- The cycle time is computed by

      α = max { Θ(Ck)/M(Ck), τ(ti) },   where k = 1, 2, ..., q and ti ∈ T;

  Θ(Ck) = Σ_{ti ∈ Ck} τ(ti) = sum of the execution times of the transitions in simple cycle Ck;
  M(Ck) = Σ_{pi ∈ Ck} M(pi) = total number of tokens in simple cycle Ck;
  q = number of simple cycles in the net, excluding the self-loop implicitly assumed for each
  transition;
  the cycle time of each self-loop is reflected by τ(ti), ∀ ti ∈ T.
- The computation rate β of a transition is the average number of firings of that transition in
  unit time and is computed by the reciprocal of the cycle time:

      β = min { M(Ck)/Θ(Ck), 1/τ(ti) },   where k = 1, 2, ..., q and ti ∈ T.

- The simple cycle Ck which gives the maximum cycle time, or equivalently the minimum
  computation rate, is known as the critical cycle.
The cycle time of a timed marked graph can be obtained by enumerating every simple cycle
in the graph; however, the enumeration can take exponential time because there exist marked
graphs with an exponential number of simple cycles [23]. A more efficient approach is to formulate
the cycle-time problem as a linear programming problem, which has a theoretical polynomial bound [23].
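As a concrete illustration of the enumeration approach (with the exponential caveat just noted), the formula α = max{Θ(Ck)/M(Ck), τ(ti)} can be evaluated directly once the simple cycles are listed. The net and names below are our own toy example, not one from the paper.

```python
# Sketch: cycle time of a timed marked graph from an explicit cycle list.
from fractions import Fraction

def cycle_time(cycles, tau, marking):
    """cycles: list of simple cycles, each a list of (transition, place)
    pairs traversed in order; tau: firing times; marking: tokens per place."""
    best = Fraction(0)
    for cyc in cycles:
        theta = sum(tau[t] for t, _ in cyc)        # value sum  Θ(Ck)
        tokens = sum(marking[p] for _, p in cyc)   # token sum  M(Ck)
        best = max(best, Fraction(theta, tokens))
    # Each implicit self-loop contributes τ(ti) to the maximum.
    return max(best, max(Fraction(tau[t]) for t in tau))

# Example: the cycle A -> p1 -> B -> p2 -> A has Θ = 2 and M = 1,
# so the cycle time is 2.
alpha = cycle_time([[("A", "p1"), ("B", "p2")]],
                   {"A": 1, "B": 1}, {"p1": 0, "p2": 1})
```

Using exact fractions avoids rounding when comparing Θ/M ratios across cycles.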
Figure 1: Two example loops and their dataflow graphs, with forward data arcs, feedback data
arcs, and acknowledgement arcs (adjacency lists A(B,C), C(A,D), E(D), ...).

(a) L1:  doall i from 1 to n
           A[i] := X[i] + 5;
           B[i] := Y[i] + A[i];
           C[i] := A[i] + Z[i];
           D[i] := B[i] + C[i];
           E[i] := W[i] + D[i];
         endall

(b) L2:  do i from 1 to n
           A[i] := X[i] + 5;
           B[i] := Y[i] + A[i];
           C[i] := A[i] + E[i-1];
           D[i] := B[i] + C[i];
           E[i] := W[i] + D[i];
         end
A static dataflow software pipeline is a class of loops which have the following features: First,
these loops are non-nested.3 Second, loop-carried dependences (if any) are from one iteration to
the next.

The body of a loop can be represented as a dataflow graph G. Formally, it is expressed as a
tuple (V, E, E', F, F') where V is the set of nodes in G, and sets E and E' contain forward data arcs

3 If a loop is nested, we consider the innermost loop. It is known that an innermost loop often accounts for most
of the execution time; hence, it is critical to loop scheduling.
Figure 2: The Petri-net representations (a) SDSP-PN for L1 and (b) SDSP-PN for L2, with places
p1-p12 and transitions A-E.
Figure 3: The behavior graph constructed for the SDSP-PN of L1, with the markings of the
initial and terminal instantaneous states indicated.
The construction of the behavior graph provides an alternative way to describe the behavior of
a Petri net, besides a reachability tree [26]. It becomes particularly useful when we want to
describe the concurrency and cyclic firing sequences of a Petri net. From a different standpoint,
the behavior graph is actually a trace, generated while executing the SDSP-PN according to the
earliest firing rule. At each time step, the behavior graph records the set of newly marked places
and the set of enabled transitions to be fired at that step. In addition, directed arcs are introduced
among them to denote the token-flow relation from place to transition (token consumption) and
from transition to place (token production). The instantaneous state of the behavior graph at
time i can be described by the current residual firing time vector Ri and the current marking
Mi. The algorithm for constructing the behavior graph is omitted [18]. Figure 3 illustrates the
behavior graph constructed for the marked graph, the SDSP-PN of L1, shown in Figure 2(a),
where the execution times of all transitions are assumed to be equal.

As can be seen, the construction process of the behavior graph can continue forever, and the
behavior graph can be infinitely extended. One key observation is that the behavior graph exhibits
repetitive behavior after an initial period, the amount of time elapsed before the repetitive
behavior is reached. This is shown by the following lemmas.
Lemma 4.2.2 There exists an instantaneous state in the behavior graph of an SDSP-PN that appears
repeatedly.

Proof of Lemma 4.2.2.
The total number of distinct Mi is finite because the SDSP-PN has a safe marking. Similarly, the
total number of distinct Ri is also finite because each transition in the SDSP-PN has a known firing
time. As a result, the total number of possible instantaneous states is also finite. Hence, if the
behavior graph is infinitely extended, some instantaneous states must be repeated. □
From Lemmas 4.2.1 and 4.2.2 we can see that an instantaneous state, once repeated, will repeat
forever. As a result, the region of the behavior graph between two repeated instantaneous states
can be used to represent the steady-state behavior of the studied SDSP-PN operated under the
earliest firing rule. Thus we have the following definition:

Definition 4.2.1 A cyclic frustum of a behavior graph B is the portion of B between two consecutive
occurrences of some repeated instantaneous state. In addition, the two instantaneous states
that surround the frustum are termed the initial instantaneous state and the terminal instantaneous
state.

The marking portions of both the initial and terminal instantaneous states found in the behavior
graph for L1 are marked in Figure 3, where the two associated residual firing time vectors are
composed of zero entries. Notice that the cyclic frustum is actually a cyclic firing sequence
since it fires each transition at least once and returns the net to its initial state. Once the behavior
graph reaches its frustum, it will keep repeating itself. This suggests a way of capturing the
repetitive behavior of the system. Instead of extending the behavior graph indefinitely, we extract the
cyclic frustum and coalesce the initial and terminal instantaneous states to form another strongly-connected
Petri net, known as the steady-state equivalent net.
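Operationally, extracting the cyclic frustum amounts to hashing the instantaneous states as the behavior graph is extended and stopping at the first repeat. A minimal sketch follows; the trace shown is a made-up stand-in for a real sequence of (Mi, Ri) states.

```python
# Sketch of frustum detection: find the first repeated instantaneous state
# in the trace produced while extending the behavior graph.

def find_frustum(states):
    """states: sequence of hashable instantaneous states (M_i, R_i).
    Returns (start_time, repeat_time): the times at which the initial and
    terminal instantaneous states are identified, or None if no repeat."""
    seen = {}
    for time, state in enumerate(states):
        if state in seen:
            return seen[state], time
        seen[state] = time
    return None

# A toy trace: a transient state "s0" followed by a period-3 steady state.
trace = ["s0", "a", "b", "c", "a", "b", "c", "a"]
start, repeat = find_frustum(trace)
```

The frustum length is `repeat - start`; real instantaneous states would need to be encoded as hashable values (e.g., tuples of sorted marking and residual entries).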
in P . Similarly,
(P ), the value sum, denotes the sum of i of each transition ti in P . The i
of transition ti is taken in the sum as many times as the transition is run through in P . Let
Ph (ti; tj ) denote the set of possible paths in G from ti to tj having exactly h tokens along the
path, and let ah(ti; tj ) denote the value sum of the maximum value path in Ph(ti; tj ). We
also use the notation Phy (ti; tj ) to denote the subset y of Ph (ti; tj ) and the notation ayh(ti; tj )
to denote the maximum path value of subset Phy (ti; tj ). Since each transition has a self-loop
with one token on it (Assumption 2.1.1), Ph (ti; tj ) 6= ;, for h h0 where h0 is a positive
integer.
A simple cycle C in G is critical if the ratio of the value sum to the token sum is maximal,
i.e., if

    Θ(C)/M(C) ≥ Θ(Ci)/M(Ci),

where Ci denotes any other simple cycle in G. Let αi denote the cycle time of the simple
cycle Ci in G, that is, αi = Θ(Ci)/M(Ci).
Recall that the average cycle time of firing a transition in an SDSP-PN is determined by
Θ(C)/M(C), where C is the critical cycle in G [27, 28]. In Chretienne's thesis, a more precise
description of the firing time of a transition executed under the earliest firing rule is developed
and expressed as the time constraint Xi^{h+k} − Xi^h = p, where k equals the least common multiple
of the token sums of all critical cycles in G and p equals kα [6]. Thus, for G with only one
critical cycle, say C, k equals M(C) and p equals Θ(C). This time constraint means that every
k-th firing of a transition ti must be p time steps apart.
Recently, it has been observed that the repetitive execution pattern can be detected
after a polynomial number of iterations of the loop body [1, 2]. However, the effect of multiple
critical cycles on the length of the steady state, raised by Chretienne, has been ignored.
The problem of determining the repetitive execution pattern was reinvestigated in a companion
paper [17]. In that paper, we showed that for G with one critical cycle C, operated under the
earliest firing rule, a repeated execution pattern can be found after O(n^3) iterations of G, i.e.,
the time constraint mentioned above holds. Note that, in this case, k always equals M(C). For G with
multiple critical cycles, the problem of finding the repetitive execution pattern then involves the
determination of an upper bound for the least common multiple of the token sums of all critical
cycles. We are unaware of any polynomial result. Instead, we have shown that transitions tj,
residing on the critical cycles Ci, always settle into a repetitive firing pattern after O(n^2) iterations
of G, i.e., the time constraint is obeyed again. In this case, k = M(Ci) and p = Θ(Ci) for tj ∈ Ci.
Accordingly, transitions from different critical cycles have different repeating periods, but they
all keep the same computation rate M(Ci)/Θ(Ci).

Through our simulation results (Section 6), we have found surprisingly short initial periods
for all benchmarks we tested. In fact, the observed bound was within O(n). It appears that the
bound derived above is too pessimistic, and a tighter bound should be possible. The rest of this
section is devoted to answering this question.
The initial token distribution depicted by Theorem 5.2.1 characterizes an initial marking for G
such that, starting from it, a repeated pattern can be found after O(n^2) iterations of G. The
length of the pattern and the number of firings of each transition in it equal Θ(C) and
M(C), respectively, where C is the critical cycle holding all enabled transitions initially. Formally, the time
constraints Xi^{h+k} − Xi^h = p, ∀ti ∈ G, are satisfied after O(n^2) iterations of G, where k = M(C)
and p = Θ(C). Notice that Theorem 5.2.1 is independent of the number of critical cycles in
G. The validity of Theorem 5.2.1 is critical because the required initial condition can always be
reached after at most O(n) iterations of G, as discussed in the next paragraph. Consequently,
the repetitive firing pattern for a general SDSP-PN G can be found after O(n^2) iterations of G,
regardless of the number of critical cycles.
Token-Distribution Constraint Satisfaction: Assume that ti is a transition which resides on
a critical cycle. To meet the initial condition, one simply executes G using the earliest firing
rule, but prohibits any firing of transition ti. The firing process soon stops and deadlocks.
Since G is strongly connected, there always exists a cycle-free path P from ti to tj for every tj
in G. If ti is never fired, tj stops firing soon after all tokens along P have been consumed.
Note also that there can be at most n tokens along a cycle-free path; that is, tj can be fired
at most n times before the initial condition is met. Equivalently, it requires the scheduling
of at most O(n) iterations of G to reach the required state.
Before Theorem 5.2.1 is proven, we first introduce several important lemmas. The first is
Lemma 5.2.1, which relates the time at which transition tj starts its (h+1)-st firing to the computation
of ah(ti, tj), the value sum of the maximum-value path in Ph(ti, tj) (Chretienne and others [5, 6, 7]).
Lemma 5.2.2 states that from a given set of k integers a subset can always be found whose
total sum is a multiple of k [2]. Lemma 5.2.3 is an inequality based upon the fact that the
value-per-token ratio on a critical cycle is always equal to or greater than the ratio along any
cycle [17].
Lemma 5.2.1 For any G executed under the earliest firing rule, the time Xj^h at which transition
tj starts its (h+1)-st firing equals

    max_{ti} ah(ti, tj),   ti ∈ set of enabled transitions at time 0.
Lemma 5.2.2 Given k integers I1, ..., Ik, there is a subset S of the Ii such that

    ( Σ_{Ii ∈ S} Ii ) mod k = 0.
Lemma 5.2.3

    Σ_{Ci ∈ R} Θ(Ci) ≤ m Θ(Cj),

where R = {Ca, ..., Cb | M(Ca) + ... + M(Cb) = m M(Cj) and Ca, ..., Cb are simple cycles in G}
and Cj is a critical cycle.
for all ti in the set of initially enabled transitions at time 0. Equivalently, we show that for
h ≥ O(n^2),

    ah+k(ti, tj) = ah(ti, tj) + p,   ∀tj ∈ G,

for all ti in the set of initially enabled transitions at time 0.

Notice that Pz(ti, tj), the set of paths from ti to tj with exactly z tokens, can be partitioned
into two disjoint subsets Pz^a(ti, tj) and Pz^b(ti, tj), where z > 0. Subset Pz^a(ti, tj) denotes the set of
paths that iterate through C at least once, while subset Pz^b(ti, tj) denotes the set of paths which
only touch C (i.e., C is not embedded entirely within the path).
We show that for every path in subset Ph^b(ti, tj) there always exists a corresponding path in
subset Ph^a(ti, tj) which has an equal or greater value sum, provided h ≥ (n + 1)k + n, where
k = M(C). Consequently, the maximum-value path in Ph(ti, tj) for h ≥ O(n^2) can always be
found in subset Ph^a(ti, tj). Note that, for h ≥ (n + 1)k + n, there exist at least k cycles along any
possible path in Ph(ti, tj). By Lemma 5.2.2 there exists a subset S of those cycles Ci such that
Σ_{Ci ∈ S} M(Ci) is a multiple of k. Assume Σ_{Ci ∈ S} M(Ci) = m M(C), with m a positive integer, for
any path Px ∈ Ph(ti, tj). Either S is composed of C m times (i.e., Px ∈ Ph^a(ti, tj)) or Px may
not have the maximum path value. This is because a path Py can be constructed from Px by
replacing all Ci ∈ S with exactly m copies of C. Py must also exist in Ph(ti, tj), and by Lemma 5.2.3, it
has a greater or equal value sum. Therefore, the maximum-value path of Ph(ti, tj) for h ≥ O(n^2)
is always a member of subset Ph^a(ti, tj).
In addition, notice that subset Ph+k^a(ti, tj) can be constructed by having every path in the
subsets Ph^a(ti, tj) and Ph^b(ti, tj) iterate through C one more time. However, as was shown, the
maximum-value path can always be found in subset Ph^a(ti, tj). Consequently,

    ah+k(ti, tj) = ah+k^a(ti, tj)
                 = ah^a(ti, tj) + Θ(C)
                 = ah^a(ti, tj) + p
                 = ah(ti, tj) + p.
As can be seen, to meet the initial token-distribution constraint, the search for a transition
on the critical cycle is significant. One possible approach to find such a transition is to first
determine the computation rate restricted by the critical cycle using Magott's linear programming
formulation [23] and then apply the shortest-path algorithm with the distance formulation defined
by Ramamoorthy and Ho to obtain a critical cycle [27]. Using the critical cycle, the required
transition can be selected arbitrarily. Since Magott's formulation can be solved by linear programming
within a theoretical polynomial bound and the shortest-path problem can be solved in O(n^3)
steps, the final problem of determining the repetitive pattern is also polynomially bounded.
An alternative approach to the problem is to designate a transition as the initially enabled one
and then construct the behavior graph from then on for n^2 + n iterations. If the repetitive execution
pattern cannot be found, an untouched transition is selected and the procedure is repeated. In the
process a maximum of n transitions will be checked, and at most n iterations are required to satisfy
the initial token-distribution constraint for a selected transition. The time complexity of
the entire approach is therefore bounded by the amount of time required to schedule n(n + n^2 + n) iterations,
i.e., O(n^3) iterations. Note that this algorithm also suggests a totally different way of approaching
the problem of determining the computation rate for G.
5.2.1 Remarks
Note that the requirement of an acknowledgement arc for each data arc in the model and the
resulting safeness property are both characteristics of a static dataflow model. To keep the concept
of an ideal machine model, we have assumed a unit firing time for each transition. In general,
however, the proofs presented both in the previous section and in the companion paper can be
easily extended to cover a more general class of strongly-connected marked graphs where the
acknowledgement-arc restriction is eliminated, each individual transition is assigned a different firing
time, and the number of tokens residing at a place may be more than one (but bounded).
However, the assumption of having a self-loop on each transition, Assumption 2.1.1, is required.
Without the self-loop control, the relation between the earliest firing schedule and the maximum
path value, Lemma 5.2.1, cannot be established.
In this section, we illustrate the tightness of the previously derived polynomial upper bound, using
the example in Figure 4. The example ultimately illustrates the need for initiating at least n−1
iterations before the repetitive firing pattern is reached. It contains a chain of n nodes with only
one critical cycle (tn−2 tn−1 tn tn−2) located at the right end. The computation rate of the critical
cycle, and hence of the chain, is 1/3. All other simple cycles have a computation rate of 1/2. In
addition, note that there is a total of n−2 tokens along the path from tn to t1. Initially, at time
0, t1 is the only enabled node. By Lemma 5.2.1, the time for t1 to commence its (h+1)-st firing
can be computed by ah(t1, t1), the maximum path value among the set of possible paths from t1
to t1 with exactly h tokens. However, due to the chain of n−2 tokens from tn to t1, the set of
paths from t1 to t1 with fewer than n−2 tokens can never reach the critical cycle. This indicates
that t1 is required to initiate at least n−1 times (i.e., n−1 iterations, or O(n) iterations) before
the effect of the critical cycle can be propagated back to t1.
Besides using the behavior graph to derive the schedule, Ramamoorthy and Ho [27] demonstrated
another way to compute a static schedule. They showed that the schedule for each transition ti
can be derived with the following time constraint, once the cycle time α is determined:

    Si^h = ai + hα.
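Read this way, the constraint gives each transition a fixed offset ai and spaces its successive firings by the cycle time. A tiny sketch, where the offsets and cycle time are made-up values rather than ones from the paper:

```python
# Sketch of a Ramamoorthy/Ho-style static schedule: the h-th firing of
# transition t_i is placed at time a_i + h * alpha.

def static_schedule(offsets, alpha, iterations):
    """offsets: {transition: a_i}; alpha: cycle time of the marked graph.
    Returns firing times for the first `iterations` firings of each t_i."""
    return {t: [a + h * alpha for h in range(iterations)]
            for t, a in offsets.items()}

# Two transitions with offsets 0 and 1 on a net with cycle time 2.
sched = static_schedule({"A": 0, "B": 1}, 2, 3)
```

The per-transition firing lists are evenly spaced by α, so every transition achieves the computation rate 1/α.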
Figure 4: A chain of n nodes t1, t2, ..., tn−2, tn−1, tn whose only critical cycle (execution rate
1/3) is located at the right end.
In this section the notion of multiple clean pipelines (MCP) is incorporated into the SDSP-PN
model, producing a unified Petri-net model, the SDSP-MCP-PN. It models the execution of the
SDSP on a machine with multiple clean execution pipelines of l stages. The construction process
consists of two steps: series expansion and run-place introduction. Notice that once an instruction
enters a pipeline, it runs through the pipeline without interference from other enabled instructions.
This implies that the detailed structure of the pipeline does not need to be explicit. Figure 5(a)
shows a model of two execution pipelines.

Run-place introduction: We introduce a place pr, known as the run place, to denote the MCP
and modify all transitions ti in the SDSP-PN to include pr as both an input and an output
place. Place pr is initially assigned a number of tokens, each representing the existence
of one pipeline. When a transition becomes enabled, it competes for pr to get fired.

Series expansion: To denote the fact that one traversal through the MCP takes l time units, a
series expansion procedure is performed which introduces a new transition for each place
Figure 5: (a) A model of two execution pipelines; (b) the SDSP-MCP-PN of L1 and its behavior
graph, with the markings of the initial and terminal instantaneous states indicated.
in the SDSP-PN to account for the time delay. We call the transitions originally appearing in
the SDSP-PN the SDSP transitions, and the ones newly introduced in series expansion the
dummy transitions. Every SDSP transition is assigned an execution time of 1, while every
dummy transition is assigned an execution time of l−1, where l denotes the length of the
execution pipeline. When l = 1, no dummy transitions remain in the final model.
In the figure we distinguish dummy transitions by bars of different length.
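The two construction steps can be sketched as a net-to-net transformation. The encoding below (place triples, dummy names `d0`, `d1`, ..., and the run place carried as a bare token count rather than wired into every SDSP transition) is our own simplification for illustration, not the paper's construction procedure.

```python
# Sketch of the SDSP-MCP-PN construction: series-expand each place with a
# dummy transition of firing time l-1, and introduce a run place p_r that
# initially holds one token per clean pipeline.

def to_mcp_pn(places, tau, n_pipelines, l):
    """places: {name: (src, dst, tokens)}; tau: SDSP firing times (all 1)."""
    expanded, dummy_tau = {}, {}
    for i, (p, (src, dst, tok)) in enumerate(places.items()):
        if l > 1:
            dummy = f"d{i}"              # dummy transition for place p
            dummy_tau[dummy] = l - 1
            expanded[p] = (src, dummy, tok)
            expanded[p + "_out"] = (dummy, dst, 0)
        else:                            # l = 1: no dummy transitions remain
            expanded[p] = (src, dst, tok)
    run_place = ("pr", n_pipelines)      # one token per clean pipeline
    return expanded, {**tau, **dummy_tau}, run_place

# One place A -> p1 -> B, two pipelines of length 4: p1 is split around a
# dummy transition d0 with firing time 3.
net, times, pr = to_mcp_pn({"p1": ("A", "B", 1)}, {"A": 1, "B": 1}, 2, 4)
```

A full implementation would additionally add pr as an input and output place of every SDSP transition, so that enabled transitions compete for the pipeline tokens as described above.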
The behavior graph for the combined model can be constructed in a way similar to that for
the SDSP-PN model described previously. With the existence of the run place as a structural
conflict, choices appear whenever more than one SDSP transition is enabled. We assume that the
firing mechanism in the machine will always choose enabled transitions to fire; it will never idle
as long as there is at least one enabled node. The machine can break ties by assigning priorities to the
nodes that simultaneously become enabled. The particular priority does not matter; we assume
only that the machine exhibits repeatable behavior, i.e., it always makes the same choice given
its rule for priority and the machine condition (the instantaneous state).

Multiple tokens in a run place can be represented in the behavior graph as multiple run places,
as shown in Figure 5(b). In addition, the assumption above implies that a repeated instantaneous
state will be encountered as the behavior graph is continuously extended for a sufficient period
of time. The notions of the cyclic frustum and the steady-state equivalent net can again be defined.
A valid scheduling pattern for a multiple-clean-pipeline machine can then be derived from the
steady-state equivalent net.

With the existence of a resource constraint imposed by a limited number of pipelines, the
computation rate of an SDSP-MCP-PN is no longer reflected directly by the critical cycle. The
impact of the resource constraint is illustrated by Theorem 6.1.1. It ultimately imposes an upper
bound on the execution rate of each node in the SDSP-MCP-PN. Intuitively, there can be at most
I enabled transitions sent for execution at each time step. Thus, it takes at least n/I cycles to
compute one iteration of the loop body even if the cycle time of the critical cycles is far
less. Note also that this upper bound is the result of the resource constraint imposed by the I
multiple clean pipelines and is independent of the approach used for conflict resolution. When
such a bound is achieved, all pipelines are 100% utilized.

Theorem 6.1.1 Given an SDSP-MCP-PN G which models I clean pipelines and contains n
SDSP transitions, the computation rate of any SDSP transition in G can never be greater than
I/n.
A set of Livermore Loops was chosen for the study; all were written in SISAL [12, 24]. Our
simulations were performed on a compiler/simulator testbed developed at McGill University [16].
The testbed consists of a prototype SISAL compiler capable of producing dataflow code (known
as A-Code) [30, 31]. For this particular study, we modified the simulator to permit analysis of
the cyclic frustums generated for both the SDSP-PN and SDSP-MCP-PN models. The simulator accepts
A-Code as input and simulates the corresponding firing sequence of the code.
Table 1 shows the simulation results after executing an SDSP on an ideal machine with
infinitely many single-stage pipelines. That is, the SDSP-PN was executed under the earliest firing
rule with the firing time of each transition equal to one. In the table, the size of the loop body reflects
the number of nodes that were repeatedly executed, excluding the start-up initiation sequence.
Start time and repeat time indicate the times when the initial and the terminal instantaneous
states are identified. Length of frustum is the time difference between the terminal and initial
instantaneous states. Transition count records the number of occurrences of a transition that
appears in the cyclic frustum. Note that all transitions are fired an equal number of times in the
steady state (Theorem 2.2.3). Computation rate is the average firing rate of each SDSP transition
in the loop body and equals

    Transition count / (Repeat time - Start time)

Finally, BD is a tight bound derived by observation and is intended only for comparison purposes.
Note that in each example the repeated instantaneous state is found within 2n time steps.
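The derived columns follow mechanically from the detected pair of matching states; a trivial sketch (field names are ours):

```python
from fractions import Fraction

def frustum_metrics(start_time, repeat_time, transition_count):
    # The frustum length is the gap between the two matching
    # instantaneous states; the computation rate averages one
    # transition's firings over that gap.
    length = repeat_time - start_time
    rate = Fraction(transition_count, length)
    return length, rate
```

For instance, a frustum with start time 341, repeat time 372, and a transition count of 1 yields length 31 and rate 1/31, as in the first column of the table.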
Table 2 illustrates the corresponding pattern-detection results for the set of benchmarks with
one, two, four, and eight clean pipeline(s) assumed, respectively. The overall results demonstrate
that the steady state can be found efficiently in all cases. They also reveal the following facts:

(Footnote: Loop9 is a potential candidate for parallelizing as a DOALL loop; however, it requires subscript
analysis to expose its parallelism. Here we have examined the loop both ways, with and without LCDs, to
increase the diversity of our testing.)
1 Clean Pipeline:

    Benchmark             1      2      3      4      5      6      7      8
    Start Time          341    296    749    157     86     86    132    138
    Repeat Time         372    360    833    184    106    121    263    172
    Length of Frustum    31     64     84     27     20     35    131     34
    Transition Count      1      1      1      1      1      1      1      1
    Computation Rate   1/31   1/64   1/84   1/27   1/20   1/35  1/131   1/34
    Processor Usage   98.9%  99.7%  98.4%  55.8%  72.9%  51.0%  64.1%  40.9%

2 Clean Pipelines:

    Benchmark             1      2      3      4      5      6      7      8
    Start Time          282    206    395    147     82     78    115    133
    Repeat Time         308    244    438    145    100    145    230    166
    Length of Frustum    26     38     43     25     18     67    115     33
    Transition Count      1      1      1      1      1      2      1      1
    Computation Rate   1/26   1/38   1/43   1/25   1/18   2/67  1/115   1/33

4 Clean Pipelines:

    Benchmark             1      2      3      4      5      6      7      8
    Start Time          268    207    242    145     81     74    113    154
    Repeat Time         293    240    277    194     98    139    226    186
    Length of Frustum    25     33     35     49     17     65    113     32
    Transition Count      1      1      1      2      1      2      1      1
    Computation Rate   1/25   1/33   1/35   2/49   1/17   2/65  1/113   1/32

8 Clean Pipelines:

    Benchmark             1      2      3      4      5      6      7      8
    Start Time          265    174    223    128     64     72    112    128
    Repeat Time         338    206    323    152     80    104    224    160
    Length of Frustum    73     32    100     24     16     32    112     32
    Transition Count      3      1      3      1      1      1      1      1
    Computation Rate   3/73   1/32  3/100   1/24   1/16   1/32  1/112   1/32
The condition raised by Theorem 6.1.1 is verified in the case of one clean pipeline, where
some test programs can keep the single pipeline fully busy. In Loop1, Loop7, and Loop9,
the upper bound on the computation rate, 1/n, imposed by the single pipeline is reached.
All three cases indicated that the respective pipelines were fully utilized at all times except
when they were filling and draining. The varying processor usage in the three loops also
reflects the impact of the prelude and postlude execution sequences. Though the length of
the postlude sequence was not recorded, the relatively shorter prelude sequence in Loop7,
indicated by its start time, is obviously a major factor in utilization.
Though each transition was fired an equal number of times in the steady state of a marked
graph, the number of firings was not necessarily one.
The amount of time required for the emergence of the steady state decreased as the number of
pipelines increased, except in a few cases where the transition count was greater than one.
As the number of pipelines exceeded the amount of parallelism in the loop, the behavior
graph obtained was exactly the same as the one obtained for the ideal model. For instance,
when Loop12, Loop3, Loop5, Loop9, and Loop11 were run with eight pipelines, their start times
and their repeat times were simply eight times the corresponding times derived for the ideal
model.
6.3 Comparison
To construct a schedule for multiple-pipeline machines, Aiken and Nicolau suggested that the
same schedule obtained from the ideal case be applied directly by scheduling the steady state one
row at a time [2]. It was also shown that when such a schedule is adopted for a multiple-pipeline
machine, the total expected run time is always bounded by two times the optimal run time
obtained for the same machine [25]. Nevertheless, the resulting schedule is still unsatisfactory
because, after all instructions from row i are scheduled for execution, a period of l-1 idle cycles
(where l is the length of the pipeline) is always required to delay the initiation of row i+1, in
order to avoid possible data conflicts between the last operation of row i and the first operation
of row i+1. Consider the use of the steady state of L1, shown in Figure 3, as the schedule for
a machine with two clean pipelines, each having two stages. The part of the schedule
which involves the steady state will be
    processor1:  A  noop  B  E     noop
    processor2:  D  noop  C  noop  noop
At each iteration, A and D cannot be sent for execution until B, C, and E all complete firing,
even though transition A is free for execution right after B and C complete their firing. Similarly,
since Ramamoorthy and Ho's schedule is derived on a marked graph, which is equivalent to an ideal
machine model, it incurs the same inefficiency when it is applied to the multiple clean pipeline
case.
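The cost of those forced idle cycles shows up directly in processor usage; a small sketch that scores the two-pipeline schedule above (the "noop" slot encoding is ours):

```python
def utilization(schedule):
    # schedule: one list of issue slots per processor; "noop" marks an
    # idle slot inserted to let the pipeline drain.
    slots = [op for pipe in schedule for op in pipe]
    busy = sum(1 for op in slots if op != "noop")
    return busy / len(slots)

# The row-at-a-time schedule for L1 on two 2-stage clean pipelines:
row_at_a_time = [["A", "noop", "B", "E", "noop"],
                 ["D", "noop", "C", "noop", "noop"]]
```

Here `utilization(row_at_a_time)` is 0.5: half of the ten issue slots are idle, which is exactly the waste that gap filling recovers.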
In the SDSP-MCP-PN model, the problem of data conflicts in a multiple pipeline was considered
in the process of constructing the behavior graph. While imposing the earliest firing rule, the
gap required in the former case is filled with enabled instructions that are safe to execute.
The corresponding schedule which involves the steady state, derived from the behavior graph, is

    processor1:  B  E  A
    processor2:  D  C  noop  noop  noop
Thus, this scheme will always render better processor usage. In addition, the assurance of a
repeatable state in the SDSP-MCP-PN, together with the simulation results obtained so far, reveals
the feasibility of employing the behavior graph to generate a static schedule in practical compilers.
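Turning the steady-state portion of the behavior graph into a static schedule is then a matter of packing each time step's firings onto the available pipelines; a hedged sketch (the trace format and function name are our own):

```python
def schedule_frustum(firings, n_pipes):
    # firings: (time, transition) pairs taken from one cyclic frustum.
    by_time = {}
    for t, tr in firings:
        by_time.setdefault(t, []).append(tr)
    rows = []
    for t in sorted(by_time):
        ops = by_time[t]
        # The behavior graph already respects the resource constraint,
        # so no time step carries more than n_pipes firings.
        assert len(ops) <= n_pipes
        rows.append(ops + ["noop"] * (n_pipes - len(ops)))
    return rows
```

Each returned row is one issue cycle; padding the short rows with noops yields the per-pipeline instruction streams directly.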
7 Conclusions
The application of Petri-net theory to compiler design received attention as early as 1970 [29].
Similar work, reported recently, is the study of microprogram scheduling using Petri nets, where
resource constraints such as registers and functional units are modeled in a unified Petri-net
model [19]. There, the search for an optimal schedule is believed to be NP-complete.
In this paper, we have established a much simpler timed Petri-net model for loop scheduling.
The goal is to exploit fine-grain parallelism across iteration boundaries in a loop, and the scheduling
method should have a manageable complexity so that it is feasible for compiler implementation.
The intuition is that as long as the steady state is repeated a large number of times and there
is enough parallelism to keep the architecture pipelines fully utilized during each steady-state
period, our scheduling approach should achieve close to time-optimal throughput.
Based upon the model, we have established a new and improved polynomial time bound for
our scheduling technique to reach the steady state in a class of loops with single or multiple critical
cycles for the ideal machine model. An example has been given to demonstrate the tightness of
the new bound. We have also refined our scheduling approach to exploit fine-grain parallelism for
clean pipelined machines. Experimental results show that the steady states occur very quickly in real
scientific benchmark programs.
8 Acknowledgment
We thank the Natural Sciences and Engineering Research Council (NSERC) for a grant supporting
this work. During the development of this paper, we enjoyed the support of the
members of the ACAPS (Advanced Computer Architecture and Program Structures) Group at
McGill University. In particular, we thank Herbert Hum and Jean-Marc Monti for their valuable
discussions. We also thank Russell Olsen for proofreading the final draft and for his suggestions
for its improvement.
References
[1] A. Aiken. Compaction-based parallelization. PhD thesis, Technical Report 88-922, Cornell
University, 1988.
[2] A. Aiken and A. Nicolau. Optimal loop parallelization. In Proceedings of the 1988 ACM
SIGPLAN Conference on Programming Language Design and Implementation, June 1988.
[18] G. R. Gao, Y. B. Wong, and Qi Ning. A Petri-net model for fine-grain loop scheduling.
ACAPS Technical Memo 18, School of Computer Science, McGill University, Montreal, January 1991.
[19] C. Hanen. Optimizing microprograms for recurrent loops on pipelined architectures using
timed Petri nets. In G. Rozenberg, editor, Advances in Petri Nets, LNCS 424, pages 236-261.
Springer-Verlag, 1989.
[20] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach.
Morgan Kaufmann Publishers, Inc., 1990.
[21] K. M. Kavi, B. P. Buckles, and U. N. Bhat. Isomorphisms between Petri nets and dataflow
graphs. IEEE Transactions on Software Engineering, 13(10):1127-1134, October 1987.
[22] Monica Lam. Software pipelining: An effective scheduling technique for VLIW machines. In
Proceedings of the 1988 ACM SIGPLAN Conference on Programming Language Design and
Implementation, pages 318-328, Atlanta, GA, June 1988.
[23] J. Magott. Performance evaluation of concurrent systems using Petri nets. Information
Processing Letters, 18:7-13, January 1984.
[24] J. R. McGraw et al. SISAL: Streams and iteration in a single assignment language --
language reference manual, version 1.2. Technical Report M-146, Lawrence Livermore National
Laboratory, 1985.
[25] A. Nicolau, K. Pingali, and A. Aiken. Fine-grain compilation for pipelined machines. Technical
Report TR-88-934, Department of Computer Science, Cornell University, Ithaca, NY, 1988.
[26] J. L. Peterson. Petri Net Theory and the Modeling of Systems. Prentice-Hall, Inc., Englewood
Cliffs, NJ, 1981.
[27] C. V. Ramamoorthy and G. S. Ho. Performance evaluation of asynchronous concurrent
systems using Petri nets. IEEE Transactions on Computers, pages 440-448, September 1980.
[28] C. Ramchandani. Analysis of asynchronous concurrent systems. Technical Report TR-120,
Laboratory for Computer Science, MIT, 1974.
[29] R. Shapiro and H. Saint. A new approach to optimization of sequencing decisions. Annual
Review in Automatic Programming, 6:257-288, 1970.
[30] R. Tio. The A-Code assembly language reference manual. ACAPS Design Note 02, School of
Computer Science, McGill University, Montreal, July 1988.
[31] R. Tio. DASM: The A-Code data-driven assembler program reference manual. ACAPS Design
Note 03, School of Computer Science, McGill University, Montreal, July 1988.