Escolar Documentos
Profissional Documentos
Cultura Documentos
DECEMBER
1979
907
Abstract-A formal mathematical model of single instruction stream-multiple data stream (SIMD) machines is defined. It is
used as a basis for analyzing various types of interconnection
networks that have been discussed in the literature. The networks are
evaluated in terms ofthe lower and upper bounds on the tune required
for each of the networks discussed to simulate the other networks.
SINM machine algorithms are presented as proofs of the upper time
bounds on these simulation tasks. These simulations are used to
demonstrate techniques for proving the correctness of SIMD machine algorithms, i.e., analyzing the simultaneous flow of N data items
among N processors. Processor address masks, a concise notation
for activating and deactivating processors, are used in the algorithms.
The methods used to prove the lower bounds and to construct (and
prove correct) simulation algorithms to show the upper bounds can be
generalized and applied to the analysis of other networks.
Index Terms-Algorithm correctness, array processors, computer
architecture, Illiac IV, interconnection networks, n-cube array,
parallel processing, perfect shuffle, permutation networks, SIMD
machines, STARAN.
I. INTRODUCTION
908
PE number i
to CONTROL UUNIT
PROCESSOR
A
ODTR
B LI
ADDR
to NEWORK
MEMORY
PlPo
In Fig. 2 these interconnections are shown. This network has
been shown to be useful by Lang [15], Lang and Stone [16],
and Stone [30]. It is the basis of Lawrie's Omega network
[17] and is included in the networks of the Omen [14] and
RAP [7] systems.The Illiac network consists of four functions defined as
follows:
Il+l (x)=x+ 1 mod N
IL_l (x)=x-l modN
Il+ (x) = x + n mod N
Iln (x) = x-n mod N
where n is the square root of N, which is assumed to be a
perfect square. If the PE's are considered as a n x n array,
then each PE is connected to its north, south, east, and west
neighbors, as shown in Fig. 3. This is the network implemented in the Illiac IV [6] and DAP [12] systems. Its ability
to perform various tasks is discussed in [1], [6], [12], [18].
The cube network consists of m functions defined by
...
...
Po)
Pm-I Pi+IPiPi-I Po
for 0 < i < m. When the PE addresses are considered as the
corners of an m-dimensional cube this network connects
each PE to its m neighbors, as shown in Fig. 4. Pease's binary
n-cube [19], the network used in STARAN [2], [3] and the
network proposed for Phoenix [9] are each wired series of
cube functions. In [2], [4], [19] the applicability of this
network to practical problems is discussed.
The plus-minus 2' (PM2I) network consists of 2m functions defined by
PM+i (j) = j+ 2' mod N
PM_i (j) =j-22mod N
for 0. i <im. Note that PM+(ml) =PM (m1) Fig. 5
shows the PM+i interconnections for N = 8. Diagramatically, PM_i is the same as PM+i except the direction is
reversed. A network similar to the PM2I is included in the
network of the Omen computer [14]. The concept underlying the SIMDA array machine's interconrnection network is
similar to that of the PM2I [32]. Feng describes a network
that is basically a wired series of PM2I functions in [10],
where various practical permutations this network can
perform are discussed.
909
--'~~~~~--
f
(a)
(b)
b
connec-
pi*
Po)= qm 1-
..
qi
...
Fig. 5. PM2I network for N = 8; (a) PMo,O (b) PM+1, and (c) PM+2-
qo,
(a)
=(pii- 1
...
(b)
and
WPM-i (Pm- 1
Pi * * * po) = qm-
...
qi ... qo,
where
~~~~
qi+iqi
..popm- Io
=(Pi- I*
..pi+lpi) -1 mod N,
for 0 < i < m. The function WPM+i adds 2i to the PE
addresses, with any "carry" wrapping around up to and
including the i 1st bit position. A carry cannot affect the
ith bit position. The function WPM _behaves analogously.
Fig. 6 shows WPM+i interconnections for N = 8. Diagramatically, WPM_, is the same as WPM except the
direction is reversed. WPM2I is similar to PM2I, except any
"carry" or "borrow" will "wrap-around" through the i 1st
bit position. For example, if N 8, then WPM +
(101)= 010 whereas PM+2 (101) 001. When the
networks are treated as sets ofpermutations on the integers
from 0 to N 1, the WPM2I network has the ability to
simulate any other interconnection function, i.e., in terms of
group theory, WPM2I can generate the entire group of
permutations on N elements. Of the five networks presented
here, only the WPM2I network has this ability [21].
Cross bar networks, Benes networks [5], and Swanson's
networks [31] are not considered here for reasons discussed
in [21]. Note that the PEPE machine has no interconnection
network [8].
67I
(c)
qi- 1, qoqm- 1
Fig. 6. WPM2I network for N = 8; (a) WPM+ c, (b) WPM+1, and (c)
WPM+2.
910
IEEE TRANSACTIONS ON
...
elsewhere ...
SIMD machine interconnection network simulation algorithms are presented. Various examples are given by examining the time required for each of the networks defined to
simulate the others. These specific simulations are used since
each of the networks discussed here has been proposed in
some form in the literature and shown to be useful. However,
911
TABLE I
THE ENTRIES IN Row AND COLUMN Y ARE LOWER AND UPPER BOUNDS
ON THE TIME REQUIRED FOR NETWORK X TO SIMULATE NETWORK y
x
PM2I
Cube
Ittiac
PS
WPM421
PM2I
Cube
Ittiac
uppe r
2m-2
2 Lm/2J
lower
Lower
upper
-PS
WPM2I
lower
n/2
(n/2)+1
(n/2)+1
(nf2)+1
upper
n/2
(n/2)+1
3n-4
(n/2)+1
2.-1
lower
2.-1
m+1
2m-1
upper
2m
M+1
2m
lower
m12
upper
2m-2
2m
there is little in the literature directly comparing the simulation abilities ofthese types of networks. In [23], the ability of
networks to simulate an arbitrary interconnection by sorting destination tags (in O(m2) time for most of these
networks) is discussed. This section is a comparison of the
abilities of networks to simulate a particular interconnection (in less time than it would take to simulate an arbitrary
one using sorting techniques).
The following assumptions are made.
1) When simulating the interconnection function f, the
data that is originally in the DTR of PE,must be transferred
to the DTR of PEf(X), 0 <x < N.
2) The interconnection function is to be simulated as if it
were executed with all PE's being active. It will be shown
after the theorem how this restriction can be removed.
3) The bounds are in terms of the number of executions of
interconnection functions required to perform the
simulation.
4) The interconnection function of the network to be
simulated which requires the most time to simulate will
determine the time bounds for the simulation of that
network.
The instructions in the simulation algorithms can be
divided into three categories: control unit operations, interprocessor data transfers, and register-to-register operations.
Control unit operations, such as incrementing an index register in a "for loop," can usually be done in parallel with the
previously broadcast PE instruction, thus, taking no additional time. Register-to-register operations, within a PE, will
probably involve a single chip, or at worst adjacent chips.
The interprocessor data transfers will involve setting the
controls of the interconnection network and passing data
among the PE's, involving board-to-board, and probably
rack-to-rack, distances. Furthermore, if there is a sequence
of data transfers and the two DTR model is implemented (to
reduce clocking problems), then an input DTR to output
DTR move will be required after each transfer. Thus, unless
the number of register-to-register operations is much greater
than the number of interprocessor data transfers, the time
for the interprocessor transfers will be the dominating factor
in determining the execution time.
-*
Cube
(Lower Bound):
cubei.
PM2I
-+
Cube
(Upper Bound):
Il+I
-+
Illiac
PM+O,
(Lower
=
I1-
and
II_n
PM2I
For
exch
the
PS
(Upper
exchange:
cubeo.
Upper Bound):
PM-0, Il+n
=
PM+(m12),
PM_(m12),
Bound):
see
PM2I
-+
cube
analysis,
since
912
WPM+ .
(S5)
DTR.-A
[Om]
register operations.
equivalent.
...
...
...
913
Case 2: Pm-I $ Pm- 2. In Step 3 this data is stored in A cubej(n/2) = 0, and no subsequence of (11+ JTM12 can perform
when j = m - 1, and in Step 5 it is reloaded into the DTR. this mapping.
This is correct since it was originally in POPm-2NPm-2
Illiac -+ Cube (Upper Bound):
PI = POPm-i Pm-2.. Pl
For
cubem,,-: see the Illiac -* PM2I analysis, since
Consider what happens to the data in the A register of
cubem_ = PM+M-1.
Pm- I PI PoFor cubei, m/2 <i < m - 2:
Case 1: Pm- I = Pm - 2. In Step 2 this data is loaded into
the DTR. This is correct since it was originally in (SI) A -DTR [XM-(i+l)1X11
: active PE's save DTR contents in A
POPm-2Pm-2... = POPm-1 Pm-2 .PI
=
until 2'/n do 11 +,, [Xm]
1
(S2)
for
j
Case 2: Pm- I Pm- 2. In Step 3 this data is loaded into
:
use 11 +, 2k/n times (= +2')
the DTR and Step 4 sends it to im-iPm-2
Po when
A-*
DTR
[Xm-(I+l)lXi]
= m - 1. This is correct since it was originally in (S3)
: active PE's switch DTR and A contents
POPm-2Pm-2 PI = POPm-1 Pm-2.. Pl
This algorithm uses m comparisons and m register-to-reg- (S4) for j = 1 until 2'/n do ILI [Xm]
: use ILn 2'/n times (= -2i)
ister moves.
...
914
and
-+
shuffle
Step
algorithm
4 in the PM2I
for the
-+
shuffle
m-2
-*
-+
--
+ 0,
--
-+
[1m
-+
...
...
+ 0
see
915
computed [20].
Consider the case where the interconnection function to
be simulated is executed with a PE address mask inside a
where statement. If network k can simulate interconnection
function f executed with PE address mask [Xm] using y
interprocessor data transfers, then k can simulate the effect
off executed with arbitrary PE address mask [R] within a
where statement using at most 2y interprocessor data
transfers. To implement conditional masks each PE must
have a flag (or stack of flags if nested where statements are
allowed [26]) to indicate whether or not the PE is active for
the current "do" or "elsewhere" block. The simulation is
done by having an instruction which allows all PE's (active
or not) to save their status flag. All PE's save their status flag,
all are activated, and all save their DTR contents; only PE's
which were active and match [R] set a tag to 1, and f is
simulated, moving the tag with the data. Where tag is not 1,
the original DTR contents are restored. If there is a reserved
bit position which can be used for the tag, then the simulation can be done using y interprocessor data transfers. This
reserved bit could be one of the W bits of the data word that
is to be transferred or a W + 1st bit position added to the
network, DTR, and fast access registers.
IV. CONCLUSIONS
Current technology has created the "age of the microprocessor." This has made feasible the future construction of
computer systems with 21o-214 processors. In order to
design a multimicroprocessor system capable of operating
in the SIMD mode of parallelism, a model of SIMD
916
...
APPENDIX II
...
...
...
Pj+ PiPjPj-I. Pl
ACKNOWLEDGMENT
The author wishes to thank J. D. Ullman for his guidance
and suggestions and L. J. Siegel for her many comments. The
=P1=POPm-2
APPENDIX I
[1] G. Barnes et al., "The Illiac IV computer," IEEE Trans. Comput., vol.
C-17, pp. 746-757, Aug. 1968.
[2] K. E. Batcher, "The flip network in STARAN," in 1976 Int. Conf/
PiOPi-i
REFERENCES
Parallel Processing, Aug. 1976, pp. 65-71.
, "The multidimensional access memory in STARAN,"
IEEE Trans. Comput., vol. C-26, pp. 174-177, Feb. 1977.
[4] L. H. Bauer, "Implementation of data manipulating functions on the
STARAN associative processor," in 1974 Sagamore Comput. Conf
Parallel Processing, Aug. 1974, pp. 209-227.
[5] V. E. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic. New York: Academic, 1965.
[3]
[6] W. J. Bouknight et al., 'The Illiac IV system," Proc. IEEE, vol. 60, pp.