DATA FLOW ARCHITECTURE (DFA)

To demonstrate the behavior of a data flow machine, a graph, called a data flow graph, is used. The data flow graph represents the data dependencies between individual instructions; it represents the steps of a program.
The nodes in the data flow graph, also called actors,
represent the operators and are connected by input
and output arcs that carry tokens bearing values.
Tokens are placed on and removed from the arcs
according to certain firing rules.
When tokens are present on all required input arcs of an actor, that actor is enabled and thus fires. Upon firing, it removes one token from each required input arc, applies the specified function to the values associated with the tokens, and places the result tokens on the output arcs.

(a) Data flow graph and (b) its corresponding activity template
Each node of a data flow graph can be represented
as an activity template. An activity template consists
of fields for operation type, for storage of input
tokens, and for destination addresses. Like a data
flow graph, a collection of activity templates can be
used for representing a program.
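The firing rule and the activity-template representation can be sketched in a few lines of Python. This is a minimal illustrative model; the class layout, the two-input assumption, and the scheduler loop are assumptions of the sketch, not part of any particular machine.

```python
# Minimal sketch of activity templates and the firing rule described above.
from dataclasses import dataclass, field

@dataclass
class ActivityTemplate:
    op: str                                       # operation type, e.g. "+" or "*"
    inputs: list = field(default_factory=list)    # storage for input tokens
    arity: int = 2                                # tokens required before firing
    dests: list = field(default_factory=list)     # destination template indices

def run(templates, initial_tokens):
    """Deliver the initial tokens, then fire enabled templates until quiescent."""
    for dest, value in initial_tokens:
        templates[dest].inputs.append(value)
    results = []
    fired = True
    while fired:
        fired = False
        for t in templates:
            if len(t.inputs) >= t.arity:          # enabled: all input tokens present
                a, b = t.inputs.pop(0), t.inputs.pop(0)
                value = a + b if t.op == "+" else a * b
                fired = True
                if t.dests:
                    for d in t.dests:             # place result tokens on output arcs
                        templates[d].inputs.append(value)
                else:
                    results.append(value)         # final output of the graph
    return results

# (a + b) * (c + d) as three templates: two adds feeding one multiply.
templates = [ActivityTemplate("+", dests=[2]),
             ActivityTemplate("+", dests=[2]),
             ActivityTemplate("*")]
print(run(templates, [(0, 1), (0, 2), (1, 3), (1, 4)]))  # [21]
```

Note that execution order is driven purely by token availability: the two adds may fire in either order, and the multiply fires only once both of its operands have arrived.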
Constructing a data flow graph
Two types of tokens are used: data tokens and Boolean tokens. To distinguish the types of inputs to an actor, solid arrows are used for carrying data tokens and outlined arrows for Boolean tokens.

The switch actor places the input token on one of the output arcs based on the Boolean token arriving on its input. The merge actor places one of the input tokens on the output arc based on the Boolean token arriving on its input.

The T gate places the input token on the output arc whenever the Boolean token arriving on its input is true; the F gate does so whenever the Boolean token arriving on its input is false.

Set of actors for constructing a data flow graph
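The token-level behavior of these four control actors can be sketched as plain functions. The function names and the convention that None means "no token emitted" are assumptions of this sketch.

```python
# Illustrative semantics of the four control actors described above.

def switch(data_token, bool_token):
    """Route the data token to the True output arc or the False output arc."""
    return (data_token, None) if bool_token else (None, data_token)

def merge(true_token, false_token, bool_token):
    """Pass one of the two input tokens through, selected by the Boolean token."""
    return true_token if bool_token else false_token

def t_gate(data_token, bool_token):
    """Forward the data token only when the Boolean token is true."""
    return data_token if bool_token else None    # None = no token emitted

def f_gate(data_token, bool_token):
    """Forward the data token only when the Boolean token is false."""
    return data_token if not bool_token else None

print(switch(42, True))    # (42, None)
print(merge(1, 2, False))  # 2
```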
A data flow graph and its corresponding activity template for an if statement

A data flow graph for calculating N!
Basic Structure of a Data Flow Computer
The MIT computer consists of five
major units:
(1) The processing unit, consisting of
specialized processing elements;
(2) The memory unit, consisting of
instruction cells for holding the
instructions and their operands (i.e.,
each instruction cell holds one activity
template);
(3) The arbitration network, which
delivers instructions to the processing
elements for execution;
(4) The distribution network for
transferring the result data from
processing elements to memory; and
(5) The control unit, which manages all
other units.
The instruction is enabled when all
the required operands are received.
The arbitration network sends the
enabled instruction as an operation
packet to the proper processing
element.
The result is sent as a result packet through the distribution network to the destination in memory.

A simplified MIT data flow machine
Types of Data Flow Machines
Proposed data flow machines can be classified into two groups: Static and Dynamic. In a Static data flow machine, an instruction is enabled whenever all the required operands have been received and its previous result has been consumed by the successor instruction; otherwise, the instruction remains disabled.

Each arc in a Static data flow graph can carry at most one token at any instance. For example, in the graph, the multiply instruction must not be enabled until its previous result has been used by the add instruction. Often, this constraint is enforced through the use of acknowledgment signals.
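The single-token-per-arc constraint can be sketched as a tiny model in which removing the token from the arc plays the role of the acknowledgment signal. The Arc class and its method names are assumptions of this sketch.

```python
# Sketch of the Static single-token-per-arc rule: the producer may not fire
# again until the consumer has removed (acknowledged) the previous token.
class Arc:
    def __init__(self):
        self.token = None                 # at most one token at any instance

    def put(self, value):
        if self.token is not None:
            return False                  # arc full: producer stays disabled
        self.token = value
        return True

    def take(self):
        value, self.token = self.token, None   # removal acts as the acknowledgment
        return value

arc = Arc()
assert arc.put(3)          # multiply places its result on the arc
assert not arc.put(4)      # multiply must not fire again yet
assert arc.take() == 3     # add consumes the token, acknowledging it
assert arc.put(4)          # now the multiply may fire again
```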
Dynamic Data Flow Machines
In a Dynamic data flow machine, several sets of operands may become ready for an instruction at the same time, and an arc may contain more than one token at any instance.

Compared with Static data flow, the Dynamic approach allows more parallelism, because an instruction need not wait for an empty location on its output arc before placing its result.

In the Dynamic approach a mechanism must be established to distinguish the instances of different sets of operands for an instruction. One way would be to queue up the instances of each operand in order of their arrival. However, maintaining many large queues becomes very expensive; moreover, because queues are FIFO, they do not allow operands to be taken out in an arbitrary order.
Instead, an associative memory called a matching store is used to store the operands. The result packet format is extended to include a label field; hence matching operands can be found by comparing these labels in the associative memory.

Each time a result packet arrives at the memory unit, its address and label fields are stored in the matching store and its value field is stored in the data store.

The matching store uses the address and label fields as its search key for determining which instruction is enabled. As an example, assume that three packets arrive at the memory unit at times t0, t1, and t2. At time t2, when both operands with the same label for the add instruction (stored at location 100 in the instruction store) have arrived, the add instruction becomes enabled.
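The matching step can be sketched with a Python dictionary standing in for the associative memory, keyed by (address, label). The packet field names and the return convention are assumptions of this sketch.

```python
# Sketch of operand matching in a Dynamic machine: operands are paired by
# (instruction address, label), mimicking the matching store's search key.
def deliver(matching_store, packet):
    """Return an enabled (address, label, operands) triple, or None if waiting."""
    key = (packet["address"], packet["label"])
    if key in matching_store:                    # partner operand already waiting
        other = matching_store.pop(key)
        return (packet["address"], packet["label"], sorted([other, packet["value"]]))
    matching_store[key] = packet["value"]        # wait for the matching operand
    return None

store = {}
# t0, t1: packets for two different instructions; nothing matches yet
assert deliver(store, {"address": 100, "label": 0, "value": 5}) is None
assert deliver(store, {"address": 104, "label": 0, "value": 7}) is None
# t2: second operand for the add at address 100, same label -> enabled
print(deliver(store, {"address": 100, "label": 0, "value": 9}))
# (100, 0, [5, 9])
```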
SYSTOLIC ARRAYS
Finding fast, accurate, and cost-effective methods for solving a large-scale linear
system of equations is an important requirement. Due to the lengthy sequences of
arithmetic computations, most large-scale matrix algebra is performed on high-speed
(parallel) digital computers.

A major drawback of processing matrix algebra on general-purpose computers by software programs is the long computation time required. Also, in a general-purpose computer the main memory is not large enough to accommodate very large-scale matrices, so many time-consuming I/O transfers are needed in addition to the CPU computation time.

One solution to the need for highly parallel computational power is the connection of a large number of identical simple processors or processing elements (PEs), arranged in a well-organized structure such as a linear or two-dimensional array. This type of structure is referred to as a Systolic array.

Usually, a Systolic array has a rectangular or hexagonal geometry, but it can have any geometry. Interleaved memories are used to feed data into such arrays, so that each data stream experiences an equal delay.

Various designs of Systolic arrays with different data stream schemes have been proposed. Some of the proposed designs are Hexagonal arrays, Pipelined arrays, Semibroadcast arrays, Broadcast arrays, and Wavefront arrays.
Proposed Arrays of Processors
Terminology common to all designs

The processing element primarily used in each design is basically an inner-product step processor that consists of three registers: Ra, Rb, and Rc. These registers are used to perform the following multiplication and addition in one unit of computational time:
Rc = Rc + Ra*Rb
The unit of computational time is defined as ta + tm, where ta and tm are the times to perform one addition and one multiplication, respectively.
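The inner-product step processor can be sketched as a small class whose step method models one unit of computational time (ta + tm). The class and method names are assumptions of this sketch.

```python
# Sketch of the inner-product step processor: each step performs
# Rc = Rc + Ra * Rb in one unit of computational time (ta + tm).
class PE:
    def __init__(self):
        self.Ra = self.Rb = self.Rc = 0

    def step(self, a, b):
        """One time unit: latch the inputs, then multiply-accumulate."""
        self.Ra, self.Rb = a, b
        self.Rc = self.Rc + self.Ra * self.Rb
        return self.Rc

pe = PE()
for a, b in zip([1, 2, 3], [4, 5, 6]):   # streaming one row against one column
    pe.step(a, b)
print(pe.Rc)  # 32  (1*4 + 2*5 + 3*6)
```

Repeating this step as matrix elements stream past is exactly what accumulates one entry of C = A*B inside a single PE.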

Let P denote the number of required PEs. The turnaround time, T, is defined as the total time, in units of ta + tm, needed to complete the entire computation. The structure of each design is illustrated by an example that performs the computation C = A*B, where A and B are n-by-n matrices.
Hexagonal array processor
P = (2n-1)²,
T = 4n-2 when part of the result stays in the array,
T = 5n-3 when the result is transferred out of the array.
For n=3: P=25; T=12
Pipelined array processor
P = n(2n-1),
T = 4n-2.
For n=3: P=15; T=10
Semibroadcast array processor
P = n²,
T = 2n when the result stays in the array,
T = 3n when the result is transferred out of the array.
For n=3: P=9; T=9
Broadcast array processor
P = n²,
T = n when the result stays in the array,
T = 2n when the result is transferred out of the array.
For n=3: P=9; T=6
Wavefront array processor
Given two n-by-n matrices A and B, the matrix multiplication can be carried out by the following n iterations:
Cij(k) = Cij(k-1) + aik*bkj, for k = 1, 2, ..., n, with Cij(0) = 0.
Successive pipelining of the wavefronts will accomplish the computation of all iterations. The process starts with PE(1,1) computing C11(1) = C11(0) + a11*b11. The computational activity then propagates to the neighboring processing elements, PE(1,2) and PE(2,1). While PE(1,2) and PE(2,1) are executing C12(1) = C12(0) + a11*b12 and C21(1) = C21(0) + a21*b11, respectively, PE(1,1) can execute C11(2) = C11(1) + a12*b21.
P = n²,
T = 3n-2 when the result stays in the array,
T = 4n-2 when the result is transferred out of the array.
For n=3: P=9; T=10
However, its drawback is that the utilization of the PEs is at most 50%. Compared with the previous designs, it uses many fewer PEs than the Pipelined array, but it needs more turnaround time than the Semibroadcast array.
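The recursion above can be checked with a short sequential sketch: after n iterations, C equals A*B. The wavefront array performs the same per-element updates, but with the iteration fronts propagating diagonally across the PEs rather than in nested loops.

```python
# Sequential check of the wavefront recursion Cij(k) = Cij(k-1) + aik*bkj:
# after n iterations the accumulated C equals the matrix product A*B.
def wavefront_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]        # Cij(0) = 0
    for k in range(n):                     # iterations k = 1 .. n
        for i in range(n):
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(wavefront_matmul(A, B))  # [[19, 22], [43, 50]]
```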