Yue-Bong Wong
Qi Ning
Most scientific computations spend a significant part of their time in the execution of
loops. As a result, compiler designers for high-performance computer architectures have
been focusing their attention on the effective and efficient exploitation of parallelism within
loops.
In this paper, we use dataflow graphs as our program representation for a class of loops,
and we develop a timed Petri-net model to model these loops. The behavior graph of the Petri
net is used to determine, at compile time, a repetitive execution pattern. Such a pattern,
once detected, can be used to generate a parallel instruction schedule during code generation.
We show that, for an ideal model, after the earliest firing schedule is relaxed to satisfy
an initial token-distribution constraint, the repetitive pattern of a loop can be detected in
O(n^2) iterations. This improves considerably the polynomial bound reported earlier [17].
Furthermore, we show that our results apply to all nodes on or off any critical cycles.
We have examined the application of the behavior graph approach to a target pipelined
architecture with multiple clean execution pipelines. A number of typical benchmarks (loops)
from scientific programs have been simulated, and we find that the repetitive patterns can
be reached very quickly. This verifies the feasibility of employing the proposed method in a
compiler.
1 Introduction
Most scientific computations spend a significant part of their time in the execution of loops. As a
result, compiler designers for high-performance computer architectures have been focusing their
attention on the effective and efficient exploitation of parallelism within loops. Unfortunately,
many optimizing compilers rely on ad-hoc schemes to handle loop parallelism and so far have
achieved only limited success. In contrast, the technology and architectures for high-performance
machines have been advancing rapidly, and aggressive fine-grain loop scheduling techniques are
needed to take advantage of the extensive parallelism available in these machines.
This paper was partially inspired by recent work in software pipelining proposed for fine-grain
loop scheduling. Software pipelining performs loop scheduling by computing a static parallel
schedule to overlap instructions of a loop body from different iterations. An advantage of software
pipelining is that it provides a direct way of exploiting parallelism across all iterations of the
loop. This is achieved without the explicit use of loop unrolling, resulting in highly compact
object code. Software pipelining has been proposed for synchronous parallel machines as well as
pipelined machines [1, 2, 3, 10, 11, 22].
Software pipelining requires the exploitation of parallelism from a partially ordered set of
operations of the loop body which are to be performed repetitively over a sequence of iterations.
When a loop contains loop-carried dependences, the data dependence graph of the loop is no longer
acyclic. Therefore, it is important that the execution of the loop can be modeled at compile time,
and a steady-state pattern can be derived for code scheduling. To effectively model the code
scheduling process for such loops is the primary objective of this paper.
This work is related to our work in dataflow software pipelining [14, 15, 16]. Dataflow software
pipelining is a compiler method for structuring the fine-grain parallelism in loops that are to be
executed by a dataflow computer. The generalization of our earlier work, to cover loops with
loop-carried dependences as well as pipelined machines with resource constraints, motivates
us to find a model for fine-grain loop scheduling.
We use dataflow graphs as our program representation for a class of loops, and we develop a
timed Petri-net model to model these loops. The behavior graph of the Petri-net model is used
to determine, at compile time, a repetitive execution pattern. After the execution pattern is
determined, the compiler can then generate a time-optimal schedule to guide code generation.
Therefore, our work provides a model for the software pipelining mentioned earlier.
In a companion paper, we established O(n^3) iterations as the polynomial bound for the
scheduling pattern to occur in a loop with a single critical cycle under the earliest firing schedule
for the ideal machine model [17]. In the case of loops with multiple critical cycles, we were only
able to show that a polynomial time bound exists for nodes on the critical cycles. In this paper we
show that, after the earliest firing schedule is relaxed to satisfy a certain initial token-distribution
constraint, the repetitive execution pattern of a loop can be detected in O(n^2) iterations. This
improves considerably the polynomial bound reported previously. Furthermore, we show that
our results apply to all loops with single or multiple critical cycles, including nodes on or off
critical cycles. We have also examined the application of the behavior graph approach to a target
pipelined architecture with multiple clean execution pipelines.1 We simulate the execution of a
number of typical benchmarks (loops) from scientific programs and find that repetitive patterns
can be reached very quickly. This verifies the feasibility of employing such methods in a compiler.
Section 2 defines timed Petri nets and reviews the basic theory of timed marked graphs (for
an introduction to basic Petri-net theory, see [26, 28, 27]). Section 3 defines a class of loops known
as a static dataflow software pipeline (SDSP). This class includes loops both with and without
loop-carried dependences. In the third section we also describe how to obtain a corresponding
Petri-net loop representation, the SDSP-PN. In Section 4, the technique for constructing the behavior
graph to obtain steady-state behavior for an SDSP-PN operated under the earliest firing rule is
discussed. In Section 5 we show that steady-state behavior for the ideal machine model, the SDSP-PN,
can always be reached in a polynomial number of steps. In Section 6 we compose a new model,
called SDSP-MCP-PN, integrating the concept of a clean-pipelined processor architecture (with
multiple clean pipelines, MCP) into the basic SDSP-PN model. We then provide experimental
evidence to show that the cyclic frustum for both an SDSP-PN and its extension SDSP-MCP-PN
can be quickly reached for a set of Livermore loops. Finally, our conclusions are presented in
Section 7.

1 A pipeline is clean if it is free of structural hazards: resource conflicts that arise when the hardware cannot
support simultaneous operations by two possibly-independent instructions [20].
Adding the notion of time to the basic Petri-net model enables the characterization of system
performance. In this paper we assign a deterministic time, expressed by a non-negative integer,
to each transition in the basic Petri net. The model described below combines the original
timed model introduced by Ramchandani [28] with the concept of an instantaneous state subsequently
developed by Chretienne [7].
Formally, a timed Petri net is defined by a pair (PN, τ), where PN is the basic Petri-net tuple
(P, T, A). P is a non-empty set of places denoted by {p1, p2, ..., pn}, T is a non-empty set of
transitions denoted by {t1, t2, ..., tm}, and A is a non-empty set of directed arcs such that P ≠ ∅,
T ≠ ∅, P ∩ T = ∅, and A ⊆ (P × T) ∪ (T × P). Pictorially, P, T, and A are represented by circles, bars,
and directed arcs, respectively. The symbol τ is a function that assigns a non-negative integer
τi to each transition ti in the net. The value τi denotes the execution time (or the firing time)
taken by transition ti.
The state of the timed Petri net at time u is no longer described only by the current marking
at time u (Mu), because some transitions may still be processing at time u. A new concept, the
residual firing time vector R, is introduced to keep track of on-going executions at each time step.
Ru(ti) stores the remaining execution time of transition ti at time u. Accordingly, Mu and Ru
together define the instantaneous state of a timed Petri net. We also make the following two
assumptions regarding the firing rule of enabled transitions:
Assumption 2.1.1 Two distinct firings of the same transition cannot overlap. To formally
enforce this rule, each transition in the net is assigned a distinct self-loop of its own with only one
token on it. Though we do not draw these self-loops explicitly, they are implicitly assumed. This
assumption is also known as simple-server semantics.

Assumption 2.1.2 Transitions are fired as soon as they are enabled. This is known as the
earliest firing rule.
This section defines and reviews previously known results for a class of Petri nets known as marked
graphs [8]. These graphs are important to the development of our work.

Definition 2.2.1 A Petri net PN = (P, T, A) is called a marked graph if and only if there is exactly one
input transition and one output transition for each place in P.

Theorem 2.2.1 A marking is live if and only if the token count of every simple cycle is positive.2

Theorem 2.2.2 A live marking is safe if and only if every edge in the graph is in a simple cycle
with token count 1.
2 A simple cycle is a path pi tl ... tm pm such that all places and transitions are different except pi and pm.
Theorem 2.2.3 If σ is a cyclic firing sequence leading from marking M back to M, then all transitions have been fired
an equal number of times.
Timed Petri nets have been applied in the study of concurrent systems to determine the computation rate (or, equivalently, the cycle time), which describes the number of firings of a transition
per unit time when the modeled system is operating at its maximum rate. Listed below is a review
of results regarding the cycle time of a timed marked graph [27]:
- The number of tokens in a simple cycle remains the same after any firing sequence.

- All transitions in a marked graph have the same cycle time.

- The cycle time is computed by

      α = max { Θ(Ck)/M(Ck), τ(ti) },   where k = 1, 2, ..., q and ti ∈ T;

  Θ(Ck) = Σ_{ti ∈ Ck} τ(ti) = sum of the execution times of the transitions in simple cycle Ck;
  M(Ck) = Σ_{pi ∈ Ck} M(pi) = total number of tokens in simple cycle Ck;
  q = number of simple cycles in the net, excluding the self-loop implicitly assumed for each
  transition;
  the cycle time of each self-loop is reflected by τ(ti), ∀ ti ∈ T.
- The computation rate β of a transition is the average number of firings of that transition in
  unit time and is computed by the reciprocal of the cycle time:

      β = min { M(Ck)/Θ(Ck), 1/τ(ti) },   where k = 1, 2, ..., q and ti ∈ T.

- The simple cycle Ck which gives the maximum cycle time, or equivalently the minimum
  computation rate, is known as the critical cycle.
The cycle time of a timed marked graph can be obtained by enumerating every simple cycle
in the graph; however, the enumeration can take exponential time because there exist marked
graphs with an exponential number of simple cycles [23]. A more efficient approach is to formulate
the cycle-time problem as a linear programming problem, which has a theoretical polynomial bound [23].
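As a concrete illustration of the enumeration approach (with the exponential caveat just noted), the formula α = max{Θ(Ck)/M(Ck), τ(ti)} can be evaluated directly once the simple cycles are listed. The net and names below are our own toy example, not one from the paper.

```python
# Sketch: cycle time of a timed marked graph from an explicit cycle list.
from fractions import Fraction

def cycle_time(cycles, tau, marking):
    """cycles: list of simple cycles, each a list of (transition, place)
    pairs traversed in order; tau: firing times; marking: tokens per place."""
    best = Fraction(0)
    for cyc in cycles:
        theta = sum(tau[t] for t, _ in cyc)        # value sum  Θ(Ck)
        tokens = sum(marking[p] for _, p in cyc)   # token sum  M(Ck)
        best = max(best, Fraction(theta, tokens))
    # Each implicit self-loop contributes τ(ti) to the maximum.
    return max(best, max(Fraction(tau[t]) for t in tau))

# Example: the cycle A -> p1 -> B -> p2 -> A has Θ = 2 and M = 1,
# so the cycle time is 2.
alpha = cycle_time([[("A", "p1"), ("B", "p2")]],
                   {"A": 1, "B": 1}, {"p1": 0, "p2": 1})
```

Using exact fractions avoids rounding when comparing Θ/M ratios across cycles.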
Figure 1: Two example loops and their dataflow graphs, with forward data arcs, feedback data
arcs, and acknowledgement arcs (adjacency lists A(B,C), C(A,D), E(D), ...).

(a) L1:  doall i from 1 to n
           A[i] := X[i] + 5;
           B[i] := Y[i] + A[i];
           C[i] := A[i] + Z[i];
           D[i] := B[i] + C[i];
           E[i] := W[i] + D[i];
         endall

(b) L2:  do i from 1 to n
           A[i] := X[i] + 5;
           B[i] := Y[i] + A[i];
           C[i] := A[i] + E[i-1];
           D[i] := B[i] + C[i];
           E[i] := W[i] + D[i];
         end
A static dataflow software pipeline is a class of loops which have the following features: First,
these loops are non-nested.3 Second, loop-carried dependences (if any) are from one iteration to
the next.

The body of a loop can be represented as a dataflow graph G. Formally, it is expressed as a
tuple (V, E, E', F, F') where V is the set of nodes in G, and sets E and E' contain forward data arcs

3 If a loop is nested, we consider the innermost loop. It is known that an innermost loop often accounts for most
of the execution time; hence, it is critical to loop scheduling.
Figure 2: The Petri-net representations (a) SDSP-PN for L1 and (b) SDSP-PN for L2, with places
p1-p12 and transitions A-E.
Figure 3: The behavior graph constructed for the SDSP-PN of L1, with the markings of the
initial and terminal instantaneous states indicated.
The construction of the behavior graph provides an alternative way to describe the behavior of
a Petri net, besides a reachability tree [26]. It becomes particularly useful when we want to
describe the concurrency and cyclic firing sequences of a Petri net. From a different standpoint,
the behavior graph is actually a trace, generated while executing the SDSP-PN according to the
earliest firing rule. At each time step, the behavior graph records the set of newly marked places
and the set of enabled transitions to be fired at that step. In addition, directed arcs are introduced
among them to denote the token-flow relation from place to transition (token consumption) and
from transition to place (token production). The instantaneous state of the behavior graph at
time i can be described by the current residual firing time vector Ri and the current marking
Mi. The algorithm for constructing the behavior graph is omitted [18]. Figure 3 illustrates the
behavior graph constructed for the marked graph, the SDSP-PN of L1, shown in Figure 2(a),
where the execution times of all transitions are assumed to be equal.

As can be seen, the construction process of the behavior graph can continue forever, and the
behavior graph can be infinitely extended. One key observation is that the behavior graph exhibits
repetitive behavior after an initial period, the amount of time elapsed before the repetitive
behavior is reached. This is shown by the following lemmas.
Lemma 4.2.2 There exists an instantaneous state in the behavior graph of an SDSP-PN that appears
repeatedly.

Proof of Lemma 4.2.2.
The total number of distinct Mi is finite because the SDSP-PN has a safe marking. Similarly, the
total number of distinct Ri is also finite because each transition in the SDSP-PN has a known firing
time. As a result, the total number of possible instantaneous states is also finite. Hence, if the
behavior graph is infinitely extended, some instantaneous states must be repeated. □
From Lemmas 4.2.1 and 4.2.2 we can see that an instantaneous state, once repeated, will repeat
forever. As a result, the region of the behavior graph between two repeated instantaneous states
can be used to represent the steady-state behavior of the studied SDSP-PN operated under the
earliest firing rule. Thus we have the following definition:

Definition 4.2.1 A cyclic frustum of a behavior graph B is the portion of B between two consecutive
occurrences of some repeated instantaneous state. In addition, the two instantaneous states
that surround the frustum are termed the initial instantaneous state and the terminal instantaneous
state.

The marking portions of both the initial and terminal instantaneous states found in the behavior
graph for L1 are marked in Figure 3, where the two associated residual firing time vectors are
composed of zero entries. Notice that the cyclic frustum is actually a cyclic firing sequence
since it fires each transition at least once and returns the net to its initial state. Once the behavior
graph reaches its frustum, it will keep repeating itself. This suggests a way of capturing the
repetitive behavior of the system. Instead of extending the behavior graph indefinitely, we extract the
cyclic frustum and coalesce the initial and terminal instantaneous states to form another strongly-connected
Petri net, known as the steady-state equivalent net.
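Operationally, extracting the cyclic frustum amounts to hashing the instantaneous states as the behavior graph is extended and stopping at the first repeat. A minimal sketch follows; the trace shown is a made-up stand-in for a real sequence of (Mi, Ri) states.

```python
# Sketch of frustum detection: find the first repeated instantaneous state
# in the trace produced while extending the behavior graph.

def find_frustum(states):
    """states: sequence of hashable instantaneous states (M_i, R_i).
    Returns (start_time, repeat_time): the times at which the initial and
    terminal instantaneous states are identified, or None if no repeat."""
    seen = {}
    for time, state in enumerate(states):
        if state in seen:
            return seen[state], time
        seen[state] = time
    return None

# A toy trace: a transient state "s0" followed by a period-3 steady state.
trace = ["s0", "a", "b", "c", "a", "b", "c", "a"]
start, repeat = find_frustum(trace)
```

The frustum length is `repeat - start`; real instantaneous states would need to be encoded as hashable values (e.g., tuples of sorted marking and residual entries).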
in P . Similarly,
(P ), the value sum, denotes the sum of i of each transition ti in P . The i
of transition ti is taken in the sum as many times as the transition is run through in P . Let
Ph (ti; tj ) denote the set of possible paths in G from ti to tj having exactly h tokens along the
path, and let ah(ti; tj ) denote the value sum of the maximum value path in Ph(ti; tj ). We
also use the notation Phy (ti; tj ) to denote the subset y of Ph (ti; tj ) and the notation ayh(ti; tj )
to denote the maximum path value of subset Phy (ti; tj ). Since each transition has a self-loop
with one token on it (Assumption 2.1.1), Ph (ti; tj ) 6= ;, for h h0 where h0 is a positive
integer.
A simple cycle C in G is critical if the ratio of the value sum to the token sum is maximal,
i.e., if

    Θ(C)/M(C) ≥ Θ(Ci)/M(Ci),

where Ci denotes any other simple cycle in G. Let αi denote the cycle time of the simple
cycle Ci in G, that is, αi = Θ(Ci)/M(Ci).
Recall that the average cycle time of firing a transition in an SDSP-PN is determined by
Θ(C)/M(C), where C is the critical cycle in G [27, 28]. In Chretienne's thesis, a more precise
description of the firing time of a transition executed under the earliest firing rule is developed
and expressed as the time constraint Xi^{h+k} − Xi^h = p, where k equals the least common multiple
of the token sums of all critical cycles in G and p equals kα [6]. Thus, for G with only one
critical cycle, say C, k equals M(C) and p equals Θ(C). This time constraint means that every
k-th firing of a transition ti must be p time steps apart.
Recently, it has been observed that the repetitive execution pattern can be detected
after a polynomial number of iterations of the loop body [1, 2]. However, the effect of multiple
critical cycles on the length of the steady state, raised by Chretienne, has been ignored.
The problem of determining the repetitive execution pattern was reinvestigated in a companion
paper [17]. In that paper, we showed that for G with one critical cycle C, operated under the
earliest firing rule, a repeated execution pattern can be found after O(n^3) iterations of G, i.e.,
the time constraint mentioned above holds. Note that, in this case, k always equals M(C). For G with
multiple critical cycles, the problem of finding the repetitive execution pattern then involves the
determination of an upper bound for the least common multiple of the token sums of all critical
cycles. We are unaware of any polynomial result. Instead, we have shown that transitions tj,
residing on the critical cycles Ci, always settle into a repetitive firing pattern after O(n^2) iterations
of G, i.e., the time constraint is obeyed again. In this case, k = M(Ci) and p = Θ(Ci) for tj ∈ Ci.
Accordingly, transitions from different critical cycles have different repeating periods, but they
all keep the same computation rate M(Ci)/Θ(Ci).

Through our simulation results (Section 6), we have found surprisingly short initial periods
for all benchmarks we tested. In fact, the observed bound was within O(n). It appears that the
bound derived above is too pessimistic, and a tighter bound should be possible. The rest of this
section is devoted to answering this question.
The initial token distribution depicted by Theorem 5.2.1 characterizes an initial marking for G
such that, starting from it, a repeated pattern can be found after O(n^2) iterations of G. The
length of the pattern and the number of firings of each transition in it equal Θ(C) and
M(C), respectively, where C is the critical cycle holding all enabled transitions initially. Formally, the time
constraints Xi^{h+k} − Xi^h = p, ∀ti ∈ G, are satisfied after O(n^2) iterations of G, where k = M(C)
and p = Θ(C). Notice that Theorem 5.2.1 is independent of the number of critical cycles in
G. The validity of Theorem 5.2.1 is critical because the required initial condition can always be
reached after at most O(n) iterations of G, as discussed in the next paragraph. Consequently,
the repetitive firing pattern for a general SDSP-PN G can be found after O(n^2) iterations of G,
regardless of the number of critical cycles.
Token-Distribution Constraint Satisfaction: Assume that ti is a transition which resides on
a critical cycle. To meet the initial condition, one simply executes G using the earliest firing
rule, but prohibits any firing of transition ti. The firing process soon stops and deadlocks.
Since G is strongly connected, there always exists a cycle-free path P from ti to tj for every tj
in G. If ti is never fired, tj stops firing soon after all tokens along P have been consumed.
Note also that there can be at most n tokens along a cycle-free path; that is, tj can be fired
at most n times before the initial condition is met. Equivalently, it requires the scheduling
of at most O(n) iterations of G to reach the required state.
Before Theorem 5.2.1 is proven, we first introduce several important lemmas. The first is
Lemma 5.2.1, which relates the time at which transition tj starts its (h+1)-st firing to the computation
of ah(ti, tj), the value sum of the maximum-value path in Ph(ti, tj) (Chretienne and others [5, 6, 7]).
Lemma 5.2.2 states that from a given set of k integers a subset can always be found whose
total sum is a multiple of k [2]. Lemma 5.2.3 is an inequality based upon the fact that the
value-per-token ratio on a critical cycle is always equal to or greater than the ratio along any
cycle [17].
Lemma 5.2.1 For any G executed under the earliest firing rule, the time Xj^h at which transition
tj starts its (h+1)-st firing equals

    max_{ti} ah(ti, tj),   ti ∈ set of enabled transitions at time 0.
Lemma 5.2.2 Given k integers I1, ..., Ik, there is a subset S of the Ii such that

    ( Σ_{Ii ∈ S} Ii ) mod k = 0.
Lemma 5.2.3

    Σ_{Ci ∈ R} Θ(Ci) ≤ m Θ(Cj),

where R = {Ca, ..., Cb | M(Ca) + ... + M(Cb) = m M(Cj) and Ca, ..., Cb are simple cycles in G}
and Cj is a critical cycle.
for all ti in the set of initially enabled transitions at time 0. Equivalently, we show that for
h ≥ O(n^2),

    ah+k(ti, tj) = ah(ti, tj) + p,   ∀tj ∈ G,

for all ti in the set of initially enabled transitions at time 0.

Notice that Pz(ti, tj), the set of paths from ti to tj with exactly z tokens, can be partitioned
into two disjoint subsets Pz^a(ti, tj) and Pz^b(ti, tj), where z > 0. Subset Pz^a(ti, tj) denotes the set of
paths that iterate through C at least once, while subset Pz^b(ti, tj) denotes the set of paths which
only touch C (i.e., C is not embedded entirely within the path).
We show that for every path in subset Ph^b(ti, tj) there always exists a corresponding path in
subset Ph^a(ti, tj) which has an equal or greater value sum, provided h ≥ (n + 1)k + n, where
k = M(C). Consequently, the maximum-value path in Ph(ti, tj) for h ≥ O(n^2) can always be
found in subset Ph^a(ti, tj). Note that, for h ≥ (n + 1)k + n, there exist at least k cycles along any
possible path in Ph(ti, tj). By Lemma 5.2.2 there exists a subset S of those cycles Ci such that
Σ_{Ci ∈ S} M(Ci) is a multiple of k. Assume Σ_{Ci ∈ S} M(Ci) = m M(C), with m a positive integer, for
any path Px ∈ Ph(ti, tj). Either S is composed of C m times (i.e., Px ∈ Ph^a(ti, tj)) or Px may
not have the maximum path value. This is because a path Py can be constructed from Px by
replacing all Ci ∈ S with exactly m copies of C. Py must also exist in Ph(ti, tj), and by Lemma 5.2.3, it
has a greater or equal value sum. Therefore, the maximum-value path of Ph(ti, tj) for h ≥ O(n^2)
is always a member of subset Ph^a(ti, tj).
In addition, notice that subset Ph+k^a(ti, tj) can be constructed by having every path in the
subsets Ph^a(ti, tj) and Ph^b(ti, tj) iterate through C one more time. However, as was shown, the
maximum-value path can always be found in subset Ph^a(ti, tj). Consequently,

    ah+k(ti, tj) = ah+k^a(ti, tj)
                 = ah^a(ti, tj) + Θ(C)
                 = ah^a(ti, tj) + p
                 = ah(ti, tj) + p.
As can be seen, to meet the initial token-distribution constraint, the search for a transition
on the critical cycle is significant. One possible approach to find such a transition is to first
determine the computation rate restricted by the critical cycle using Magott's linear programming
formulation [23] and then apply the shortest-path algorithm with the distance formulation defined
by Ramamoorthy and Ho to obtain a critical cycle [27]. Using the critical cycle, the required
transition can be selected arbitrarily. Since Magott's formulation can be solved by linear programming
within a theoretical polynomial bound and the shortest-path problem can be solved in O(n^3)
steps, the final problem of determining the repetitive pattern is also polynomially bounded.
An alternative approach to the problem is to designate a transition as the initially enabled one
and then construct the behavior graph from then on for n^2 + n iterations. If the repetitive execution
pattern cannot be found, an untouched transition is selected and the procedure is repeated. In the
process a maximum of n transitions will be checked, and at most n iterations are required to satisfy
the initial token-distribution constraint for a selected transition. The time complexity of
the entire approach is therefore bounded by the amount of time required to schedule n(n + n^2 + n) iterations,
i.e., O(n^3) iterations. Note that this algorithm also suggests a totally different way of approaching
the problem of determining the computation rate for G.
5.2.1 Remarks
Note that the requirement of an acknowledgement arc for each data arc in the model and the
resulting safeness property are both characteristics of a static dataflow model. To keep the concept
of an ideal machine model, we have assumed a unit firing time for each transition. In general,
however, the proofs presented both in the previous section and in the companion paper can be
easily extended to cover a more general class of strongly-connected marked graphs where the
acknowledgement-arc restriction is eliminated, each individual transition is assigned a different firing
time, and the number of tokens residing at a place may be more than one (but bounded).
However, the assumption of having a self-loop on each transition, Assumption 2.1.1, is required.
Without the self-loop control, the relation between the earliest firing schedule and the maximum
path value, Lemma 5.2.1, cannot be established.
In this section, we illustrate the tightness of the previously derived polynomial upper bound, using
the example in Figure 4. The example ultimately illustrates the need for initiating at least n−1
iterations before the repetitive firing pattern is reached. It contains a chain of n nodes with only
one critical cycle (tn−2 tn−1 tn tn−2) located at the right end. The computation rate of the critical
cycle, and hence of the chain, is 1/3. All other simple cycles have a computation rate of 1/2. In
addition, note that there is a total of n−2 tokens along the path from tn to t1. Initially, at time
0, t1 is the only enabled node. By Lemma 5.2.1, the time for t1 to commence its (h+1)-st firing
can be computed by ah(t1, t1), the maximum path value among the set of possible paths from t1
to t1 with exactly h tokens. However, due to the chain of n−2 tokens from tn to t1, the set of
paths from t1 to t1 with fewer than n−2 tokens can never reach the critical cycle. This indicates
that t1 is required to initiate at least n−1 times (i.e., n−1 iterations, or O(n) iterations) before
the effect of the critical cycle can be propagated back to t1.
Besides using the behavior graph to derive the schedule, Ramamoorthy and Ho [27] demonstrated
another way to compute a static schedule. They showed that the schedule for each transition ti
can be derived with the following time constraint, once the cycle time α is determined:

    Si^h = ai + hα.
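Read this way, the constraint gives each transition a fixed offset ai and spaces its successive firings by the cycle time. A tiny sketch, where the offsets and cycle time are made-up values rather than ones from the paper:

```python
# Sketch of a Ramamoorthy/Ho-style static schedule: the h-th firing of
# transition t_i is placed at time a_i + h * alpha.

def static_schedule(offsets, alpha, iterations):
    """offsets: {transition: a_i}; alpha: cycle time of the marked graph.
    Returns firing times for the first `iterations` firings of each t_i."""
    return {t: [a + h * alpha for h in range(iterations)]
            for t, a in offsets.items()}

# Two transitions with offsets 0 and 1 on a net with cycle time 2.
sched = static_schedule({"A": 0, "B": 1}, 2, 3)
```

The per-transition firing lists are evenly spaced by α, so every transition achieves the computation rate 1/α.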
Figure 4: A chain of n nodes t1, t2, ..., tn−2, tn−1, tn whose only critical cycle (execution rate
1/3) is located at the right end.
In this section the notion of multiple clean pipelines (MCP) is incorporated into the SDSP-PN
model, producing a unified Petri-net model, the SDSP-MCP-PN. It models the execution of the
SDSP on a machine with multiple clean execution pipelines of l stages. The construction process
consists of two steps: series expansion and run-place introduction. Notice that once an instruction
enters a pipeline, it runs through the pipeline without interference from other enabled instructions.
This implies that the detailed structure of the pipeline does not need to be explicit. Figure 5(a)
shows a model of two execution pipelines.

Run-place introduction: We introduce a place pr, known as the run place, to denote the MCP
and modify all transitions ti in the SDSP-PN to include pr as both an input and an output
place. Place pr is initially assigned a number of tokens, each representing the existence
of one pipeline. When a transition becomes enabled, it competes for pr to get fired.

Series expansion: To denote the fact that one traversal through the MCP takes l time units, a
series expansion procedure is performed which introduces a new transition for each place
Figure 5: (a) A model of two execution pipelines; (b) the SDSP-MCP-PN of L1 and its behavior
graph, with the markings of the initial and terminal instantaneous states indicated.
in the SDSP-PN to account for the time delay. We call the transitions originally appearing in
the SDSP-PN the SDSP transitions, and the ones newly introduced in series expansion the
dummy transitions. Every SDSP transition is assigned an execution time of 1, while every
dummy transition is assigned an execution time of l−1, where l denotes the length of the
execution pipeline. When l = 1, no dummy transitions remain in the final model.
In the figure we distinguish dummy transitions by bars of different length.
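The two construction steps can be sketched as a net-to-net transformation. The encoding below (place triples, dummy names `d0`, `d1`, ..., and the run place carried as a bare token count rather than wired into every SDSP transition) is our own simplification for illustration, not the paper's construction procedure.

```python
# Sketch of the SDSP-MCP-PN construction: series-expand each place with a
# dummy transition of firing time l-1, and introduce a run place p_r that
# initially holds one token per clean pipeline.

def to_mcp_pn(places, tau, n_pipelines, l):
    """places: {name: (src, dst, tokens)}; tau: SDSP firing times (all 1)."""
    expanded, dummy_tau = {}, {}
    for i, (p, (src, dst, tok)) in enumerate(places.items()):
        if l > 1:
            dummy = f"d{i}"              # dummy transition for place p
            dummy_tau[dummy] = l - 1
            expanded[p] = (src, dummy, tok)
            expanded[p + "_out"] = (dummy, dst, 0)
        else:                            # l = 1: no dummy transitions remain
            expanded[p] = (src, dst, tok)
    run_place = ("pr", n_pipelines)      # one token per clean pipeline
    return expanded, {**tau, **dummy_tau}, run_place

# One place A -> p1 -> B, two pipelines of length 4: p1 is split around a
# dummy transition d0 with firing time 3.
net, times, pr = to_mcp_pn({"p1": ("A", "B", 1)}, {"A": 1, "B": 1}, 2, 4)
```

A full implementation would additionally add pr as an input and output place of every SDSP transition, so that enabled transitions compete for the pipeline tokens as described above.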
The behavior graph for the combined model can be constructed in a way similar to that for
the SDSP-PN model described previously. With the existence of the run place as a structural
conflict, choices appear whenever more than one SDSP transition is enabled. We assume that the
firing mechanism in the machine will always choose enabled transitions to fire; it will never idle
as long as there is at least one enabled node. The machine can break ties by assigning priorities to the
nodes that simultaneously become enabled. The particular priority does not matter; we assume
only that the machine exhibits repeatable behavior, i.e., it always makes the same choice given
its rule for priority and the machine condition (the instantaneous state).

Multiple tokens in a run place can be represented in the behavior graph as multiple run places,
as shown in Figure 5(b). In addition, the assumption above implies that a repeated instantaneous
state will be encountered as the behavior graph is continuously extended for a sufficient period
of time. The notions of the cyclic frustum and the steady-state equivalent net can again be defined.
A valid scheduling pattern for a multiple-clean-pipeline machine can then be derived from the
steady-state equivalent net.

With the existence of a resource constraint imposed by a limited number of pipelines, the
computation rate of an SDSP-MCP-PN is no longer reflected directly by the critical cycle. The
impact of the resource constraint is illustrated by Theorem 6.1.1. It ultimately imposes an upper
bound on the execution rate of each node in the SDSP-MCP-PN. Intuitively, there can be at most
I enabled transitions sent for execution at each time step. Thus, it takes at least n/I cycles to
compute one iteration of the loop body even if the cycle time of the critical cycles is far
less. Note also that this upper bound is the result of the resource constraint imposed by the I
multiple clean pipelines and is independent of the approach used for conflict resolution. When
such a bound is achieved, all pipelines are 100% utilized.

Theorem 6.1.1 Given an SDSP-MCP-PN G which models I clean pipelines and contains n
SDSP transitions, the computation rate of any SDSP transition in G can never be greater than
I/n.
A set of Livermore Loops was chosen for the study; all were written in SISAL [12, 24]. Our
simulations were performed on a compiler/simulator testbed developed at McGill University [16].
The testbed consists of a prototype SISAL compiler capable of producing dataflow code (known
as A-Code) [30, 31]. For this particular study, we modified the simulator to permit analysis of
the cyclic frustums generated for both the SDSP-PN and SDSP-MCP-PN models. The simulator accepts
A-Code as input and simulates the corresponding firing sequence of the code.
Table 1 shows the simulation results after executing an SDSP on an ideal machine with
infinitely many single-stage pipelines. That is, the SDSP-PN was executed under the earliest firing
rule with the firing time of each transition equal to one. In the table, the size of the loop body reflects
the number of nodes that were repeatedly executed, excluding the start-up initiation sequence.
Start time and repeat time indicate the times when the initial and the terminal instantaneous
states are identified. Length of frustum is the time difference between the terminal and initial
instantaneous states. Transition count records the number of occurrences of a transition that
appears in the cyclic frustum. Note that all transitions are fired an equal number of times in the
steady state (Theorem 2.2.3). Computation rate is the average firing rate of each SDSP transition
in the loop body and equals

    Transition count / (Repeat time - Start time)

Finally, BD is a tight bound derived by observation and is intended only for comparison purposes.
Note that in each example the repeated instantaneous state is found within 2n time steps.
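The derived columns follow mechanically from the detected pair of matching states; a trivial sketch (field names are ours):

```python
from fractions import Fraction

def frustum_metrics(start_time, repeat_time, transition_count):
    # The frustum length is the gap between the two matching
    # instantaneous states; the computation rate averages one
    # transition's firings over that gap.
    length = repeat_time - start_time
    rate = Fraction(transition_count, length)
    return length, rate
```

For instance, a frustum with start time 341, repeat time 372, and a transition count of 1 yields length 31 and rate 1/31, as in the first column of the table.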
Table 2 illustrates the corresponding pattern-detection results for the set of benchmarks with
one, two, four, and eight clean pipeline(s) assumed, respectively. The overall results demonstrate
that the steady state can be found efficiently in all cases. They also reveal the following facts:

(Footnote: Loop9 is a potential candidate for parallelizing as a DOALL loop; however, it requires subscript
analysis to expose its parallelism. Here we have examined the loop both ways, with and without LCDs, to
increase the diversity of our testing.)
1 Clean Pipeline:

    Benchmark             1      2      3      4      5      6      7      8
    Start Time          341    296    749    157     86     86    132    138
    Repeat Time         372    360    833    184    106    121    263    172
    Length of Frustum    31     64     84     27     20     35    131     34
    Transition Count      1      1      1      1      1      1      1      1
    Computation Rate   1/31   1/64   1/84   1/27   1/20   1/35  1/131   1/34
    Processor Usage   98.9%  99.7%  98.4%  55.8%  72.9%  51.0%  64.1%  40.9%

2 Clean Pipelines:

    Benchmark             1      2      3      4      5      6      7      8
    Start Time          282    206    395    147     82     78    115    133
    Repeat Time         308    244    438    145    100    145    230    166
    Length of Frustum    26     38     43     25     18     67    115     33
    Transition Count      1      1      1      1      1      2      1      1
    Computation Rate   1/26   1/38   1/43   1/25   1/18   2/67  1/115   1/33

4 Clean Pipelines:

    Benchmark             1      2      3      4      5      6      7      8
    Start Time          268    207    242    145     81     74    113    154
    Repeat Time         293    240    277    194     98    139    226    186
    Length of Frustum    25     33     35     49     17     65    113     32
    Transition Count      1      1      1      2      1      2      1      1
    Computation Rate   1/25   1/33   1/35   2/49   1/17   2/65  1/113   1/32

8 Clean Pipelines:

    Benchmark             1      2      3      4      5      6      7      8
    Start Time          265    174    223    128     64     72    112    128
    Repeat Time         338    206    323    152     80    104    224    160
    Length of Frustum    73     32    100     24     16     32    112     32
    Transition Count      3      1      3      1      1      1      1      1
    Computation Rate   3/73   1/32  3/100   1/24   1/16   1/32  1/112   1/32
The condition raised by Theorem 6.1.1 is verified in the case of one clean pipeline, where
some test programs can keep the single pipeline fully busy. In Loop1, Loop7, and Loop9,
the upper bound on the computation rate, 1/n, imposed by the single pipeline is reached.
All three cases indicated that the respective pipelines were fully utilized at all times except
when they were filling and draining. The varying processor usage in the three loops also
reflects the impact of the prelude and postlude execution sequences. Though the length of
the postlude sequence was not recorded, the relatively shorter prelude sequence in Loop7,
indicated by its start time, is obviously a major factor in utilization.
Though each transition was fired an equal number of times in the steady state of a marked
graph, the number of firings was not necessarily one.
The amount of time required for the emergence of the steady state decreased as the number of
pipelines increased, except in a few cases where the transition count was greater than one.
As the number of pipelines exceeded the amount of parallelism in the loop, the behavior
graph obtained was exactly the same as the one obtained for the ideal model. For instance,
when Loop12, Loop3, Loop5, Loop9, and Loop11 were run with eight pipelines, their start times
and their repeat times were simply eight times the corresponding times derived for the ideal
model.
6.3 Comparison
To construct a schedule for multiple-pipeline machines, Aiken and Nicolau suggested that the
same schedule obtained from the ideal case be applied directly by scheduling the steady state one
row at a time [2]. It was also shown that when such a schedule is adopted for a multiple-pipeline
machine, the total expected run time is always bounded by two times the optimal run time
obtained for the same machine [25]. Nevertheless, the resulting schedule is still unsatisfactory
because, after all instructions from row i are scheduled for execution, a period of l-1 idle cycles
(where l is the length of the pipeline) is always required to delay the initiation of row i+1, in
order to avoid possible data conflicts between the last operation of row i and the first operation
of row i+1. Consider the use of the steady state of L1, shown in Figure 3, as the schedule for
a machine with two clean pipelines, each having two stages. The part of the schedule
which involves the steady state will be
    processor1:  A  noop  B  E     noop
    processor2:  D  noop  C  noop  noop
At each iteration, A and D cannot be sent for execution until B, C, and E all complete firing,
even though transition A is free for execution right after B and C complete their firing. Similarly,
since Ramamoorthy and Ho's schedule is derived on a marked graph, which is equivalent to an ideal
machine model, it incurs the same inefficiency when it is applied to the multiple clean pipeline
case.
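The cost of those forced idle cycles shows up directly in processor usage; a small sketch that scores the two-pipeline schedule above (the "noop" slot encoding is ours):

```python
def utilization(schedule):
    # schedule: one list of issue slots per processor; "noop" marks an
    # idle slot inserted to let the pipeline drain.
    slots = [op for pipe in schedule for op in pipe]
    busy = sum(1 for op in slots if op != "noop")
    return busy / len(slots)

# The row-at-a-time schedule for L1 on two 2-stage clean pipelines:
row_at_a_time = [["A", "noop", "B", "E", "noop"],
                 ["D", "noop", "C", "noop", "noop"]]
```

Here `utilization(row_at_a_time)` is 0.5: half of the ten issue slots are idle, which is exactly the waste that gap filling recovers.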
In the SDSP-MCP-PN model, the problem of data conflicts in a multiple pipeline was considered
in the process of constructing the behavior graph. While imposing the earliest firing rule, the
gap required in the former case is filled with enabled instructions that are safe to execute.
The corresponding schedule which involves the steady state, derived from the behavior graph, is

    processor1:  B  E  A
    processor2:  D  C  noop  noop  noop
Thus, this scheme will always render better processor usage. In addition, the assurance of a
repeatable state in the SDSP-MCP-PN, together with the simulation results obtained so far, reveals
the feasibility of employing the behavior graph to generate a static schedule in practical compilers.
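Turning the steady-state portion of the behavior graph into a static schedule is then a matter of packing each time step's firings onto the available pipelines; a hedged sketch (the trace format and function name are our own):

```python
def schedule_frustum(firings, n_pipes):
    # firings: (time, transition) pairs taken from one cyclic frustum.
    by_time = {}
    for t, tr in firings:
        by_time.setdefault(t, []).append(tr)
    rows = []
    for t in sorted(by_time):
        ops = by_time[t]
        # The behavior graph already respects the resource constraint,
        # so no time step carries more than n_pipes firings.
        assert len(ops) <= n_pipes
        rows.append(ops + ["noop"] * (n_pipes - len(ops)))
    return rows
```

Each returned row is one issue cycle; padding the short rows with noops yields the per-pipeline instruction streams directly.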
7 Conclusions
The application of Petri-net theory to compiler design received attention as early as 1970 [29].
Similar work, reported recently, is the study of microprogram scheduling using Petri nets, where
resource constraints such as registers and functional units are modeled in a unified Petri-net
model [19]. There, the search for an optimal schedule is believed to be NP-complete.
In this paper, we have established a much simpler timed Petri-net model for loop scheduling.
The goal is to exploit fine-grain parallelism across iteration boundaries in a loop, and the scheduling
method should have a manageable complexity so that it is feasible for compiler implementation.
The intuition is that as long as the steady state is repeated a large number of times and there
is enough parallelism to keep the architecture pipelines fully utilized during each steady-state
period, our scheduling approach should achieve close to time-optimal throughput.
Based upon the model, we have established a new and improved polynomial time bound for
our scheduling technique to reach the steady state in a class of loops with single or multiple critical
cycles for the ideal machine model. An example has been given to demonstrate the tightness of
the new bound. We have also refined our scheduling approach to exploit fine-grain parallelism for
clean pipelined machines. Experimental results show that the steady states occur very quickly in real
scientific benchmark programs.
8 Acknowledgment
We thank the Natural Sciences and Engineering Research Council (NSERC) for a grant supporting
this work. During the development of this paper, we enjoyed the support of the
members of the ACAPS (Advanced Computer Architecture and Program Structures) Group at
McGill University. In particular, we thank Herbert Hum and Jean-Marc Monti for their valuable
discussions. We also thank Russell Olsen for proofreading the final draft and for his suggestions
for its improvement.
References
[1] A. Aiken. Compaction-based parallelization. PhD thesis, Technical Report 88-922, Cornell
University, 1988.
[2] A. Aiken and A. Nicolau. Optimal loop parallelization. In Proceedings of the 1988 ACM
SIGPLAN Conference on Programming Language Design and Implementation, June 1988.
[18] G. R. Gao, Y. B. Wong, and Qi Ning. A Petri-net model for fine-grain loop scheduling.
ACAPS Technical Memo 18, School of Computer Science, McGill University, Montreal, January 1991.
[19] C. Hanen. Optimizing microprograms for recurrent loops on pipelined architectures using
timed Petri nets. In G. Rozenberg, editor, Advances in Petri Nets, LNCS 424, pages 236-261.
Springer-Verlag, 1989.
[20] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach.
Morgan Kaufmann Publishers, Inc., 1990.
[21] K. M. Kavi, B. P. Buckles, and U. N. Bhat. Isomorphisms between Petri nets and dataflow
graphs. IEEE Transactions on Software Engineering, 13(10):1127-1134, October 1987.
[22] Monica Lam. Software pipelining: An effective scheduling technique for VLIW machines. In
Proceedings of the 1988 ACM SIGPLAN Conference on Programming Language Design and
Implementation, pages 318-328, Atlanta, GA, June 1988.
[23] J. Magott. Performance evaluation of concurrent systems using Petri nets. Information
Processing Letters, 18:7-13, January 1984.
[24] J. R. McGraw et al. SISAL: Streams and iteration in a single assignment language --
language reference manual, version 1.2. Technical Report M-146, Lawrence Livermore National
Laboratory, 1985.
[25] A. Nicolau, K. Pingali, and A. Aiken. Fine-grain compilation for pipelined machines. Technical
Report TR-88-934, Department of Computer Science, Cornell University, Ithaca, NY, 1988.
[26] J. L. Peterson. Petri Net Theory and the Modeling of Systems. Prentice-Hall, Inc., Englewood
Cliffs, NJ, 1981.
[27] C. V. Ramamoorthy and G. S. Ho. Performance evaluation of asynchronous concurrent
systems using Petri nets. IEEE Transactions on Computers, pages 440-448, September 1980.
[28] C. Ramchandani. Analysis of asynchronous concurrent systems. Technical Report TR-120,
Laboratory for Computer Science, MIT, 1974.
[29] R. Shapiro and H. Saint. A new approach to optimization of sequencing decisions. Annual
Review in Automatic Programming, 6:257-288, 1970.
[30] R. Tio. The A-Code assembly language reference manual. ACAPS Design Note 02, School of
Computer Science, McGill University, Montreal, July 1988.
[31] R. Tio. DASM: The A-Code data-driven assembler program reference manual. ACAPS Design
Note 03, School of Computer Science, McGill University, Montreal, July 1988.