In this paper, we describe the design process and implementation results of an instance of the Pleiades

[Figure 1. Energy and Flexibility Spectrum for Different Architectures. Labels: µP (Microprocessor), DSP, Pleiades, ASIC; axis: Flexibility.]

[Figure 2. Heterogeneous Architecture Template. Labels: µP, Reconfigurable Interconnect, SAT1, SAT2, SAT3, control clocks, flags, control signals, handshake, config register.]
[Figure 4 (caption not recovered). Architecture Instance: 3 Address Generators, 3 Memories, 1 MAC/MUL and 1 ALU. Blocks: AG, MEM, MAC/MUL, ALU; connection patterns C1 to C5, C'1, C'4, C'8; Time axis: t0, t1, t2.]

and satellites and between each satellite. For each algorithm domain (communication, speech coding, video coding), an architecture instance can be created (with known satellite types and numbers)

exchange data streams with other satellites efficiently, without the help of a global controller.
The control means available to the programmer are basic satellite configurations to specify the kind of operation to be performed by the satellite, and configurations for the reconfigurable interconnect to build a cluster of satellites.

2.2 Model of Computation and Reconfiguration

While multiple threads of an application can be run on an instance of the Pleiades architecture template, the compilation of a single thread down to the reconfigurable components is the core of the higher-level scheduling tools that can utilize multiple threads. Therefore, the design methodology described later in the paper aims to support a smooth transition from a single-thread algorithm to an optimized implementation on Pleiades. Figure 3 illustrates the flow of computation supported by this software methodology. As shown in the figure, a sequential thread is first initialized on the microprocessor. After configuration codes are executed on the processor, control is transferred to the reconfigurable satellites (the “split” point in Figure 3) and the computation is returned to the processor after all satellite operations are finished (the “join” point). Multiple split points exist within a sequential thread, and the satellites and connections have to be reconfigured for each split point.

[Figure 3 (caption not recovered). Labels: Application, Thread1, split, Thread2 and Thread3 on satellites, join, on programmable processor.]

The main idea behind reconfigurable computing that is advocated by the Pleiades system is to build a computational engine through spatially-programmed connections of processing elements (satellites). The interconnect model that needs to support such a system is depicted in Figure 4. On the time axis, t0, t1 and t2 indicate the times of reconfiguration. The bars (C1, C2 etc.) between two reconfiguration times represent a set of inter-satellite connections that has to be realized simultaneously by the reconfigurable interconnect.

3 OVERVIEW OF THE ARCHITECTURE DESIGN METHODOLOGY

There are two key issues to be resolved in order to make the methodology practical for designers. Firstly, the architecture combines two very distinct models of computation: control-driven computation on the general-purpose microprocessor and data-driven computing on the clusters of satellites. Therefore, the goal of the architectural exploration process is to partition the application over these two computing paradigms so that performance and energy dissipation constraints are met (during the compilation process). Secondly, optimizations related to reconfigurability have to be supported at both the architecture design and the compilation stage. Both of these issues require careful modeling of the algorithm and the underlying heterogeneous architectures.
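As one way to picture how the two models of computation interact, the split/join flow of Section 2.2 can be sketched as a small simulation. This is only an illustration under stated assumptions: the `Kernel` type, the `run_thread` function and the event names are hypothetical and are not part of the Pleiades tools. Plain functions stand in for control-driven code on the microprocessor, and each `Kernel` stands in for a data-driven cluster of satellites that is reconfigured at its split point.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Kernel:
    """A data-driven kernel run on a cluster of satellites (hypothetical type)."""
    name: str
    connections: FrozenSet[Tuple[str, str]]  # inter-satellite links to set up
    body: Callable[[], object]               # stands in for the satellite computation


def run_thread(steps, log: List[tuple]):
    """Run one sequential thread; every Kernel step is a split point."""
    results = []
    for step in steps:
        if isinstance(step, Kernel):
            # Satellites and connections are reconfigured for each split point.
            log.append(("reconfigure", step.name, len(step.connections)))
            log.append(("split", step.name))   # control passes to the satellites
            results.append(step.body())
            log.append(("join", step.name))    # control returns to the processor
        else:
            log.append(("cpu", step.__name__))  # control-driven code on the processor
            results.append(step())
    return results
```

A thread such as `[init, dot, dot]` would then log one reconfigure/split/join triple per kernel invocation, which makes the reconfiguration overhead per split point explicit.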
After a satisfactory algorithm formulation is obtained, the architectural mapping and partitioning process can be entered. To be meaningful, the partitioning process should be based on realistic bottom-up information regarding the cost of implementing functions and

The architecture partitioning and mapping process is started by establishing an initial solution. Given the implementation simplicity of a pure software implementation, we have adopted a “software-centric” approach that assumes that the whole algorithm is initially mapped onto the microprocessor (stage 3). This establishes how closely such a solution adheres to the design specifications and helps to identify the design bottlenecks. A rank ordering of the dominant compute kernels is established. In stage 4, dominant kernels are

[Figure 5. The Software Methodology Flow. Labels: Applications, Satellites, Microprocessor(s); stage 1: Specification (timing, power constraints); Algorithm Refinement; Architecture Characterization; stage 5; stage 6; Partitioning; Interconnect Optimization; Compilation/Code Generation; Reconfig. Hardware Implementation Optimization.]
At different phases of the design process, it is important to show the impact of particular design choices on the overall performance and energy of the application in order to give meaningful design guidance. A spreadsheet-like environment does precisely that, and it is utilized in our methodology.
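The kind of bookkeeping such a spreadsheet-like environment performs can be sketched in a few lines. This is a minimal illustration, not the authors' tool: the profile layout, the function names `rank_kernels` and `what_if`, and all numbers in the usage note are hypothetical. It rank-orders kernels by their share of total energy and re-evaluates the total when one kernel is moved to a different implementation.

```python
def kernel_energy(entry):
    """Energy of one kernel: number of calls times energy per call (arbitrary units)."""
    return entry["calls"] * entry["energy_per_call"]


def rank_kernels(profile):
    """Return (kernel, share of total energy) pairs, dominant kernel first."""
    total = sum(kernel_energy(e) for e in profile.values())
    shares = {name: kernel_energy(e) / total for name, e in profile.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


def what_if(profile, kernel, new_energy_per_call):
    """Total energy if `kernel` is re-mapped to an implementation with a new cost."""
    total = 0
    for name, entry in profile.items():
        per_call = new_energy_per_call if name == kernel else entry["energy_per_call"]
        total += entry["calls"] * per_call
    return total
```

For a made-up profile `{"filter": {"calls": 10, "energy_per_call": 6}, "control": {"calls": 4, "energy_per_call": 10}}`, `rank_kernels` puts `filter` first with a 60% share, and `what_if(profile, "filter", 1)` shows the total dropping from 100 to 50 units.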
4 CHIP IMPLEMENTATION

Using the methodology described above, a prototype architecture, Maia, has been designed targeting the domain of voice processing (CELP-based speech coders such as 16 kb/s VSELP and LDC-CELP) for wireless devices.

[Figure residue (caption not recovered). Block labels: Mem1K, Mem512, AG, ALU, MAC, FPGA, Interface, Reconfigurable Module, Network; signal label: Req in.]
[Figure 7(a) signal labels: Req in, Req out, Clk, delay clk, Enable, Done.]

loops with loop counters and stride counters. It behaves as the local controller of data-flow kernels by initiating the data-flow threads, and by signaling the end of the data-flow threads to the ARM8.
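The loop-counter and stride-counter behaviour mentioned above can be illustrated with a short generator. This is a sketch under assumptions: the function name `address_stream`, its parameters, and the choice of making the first counter the outermost loop are hypothetical, since the text does not give the address generator's actual programming model.

```python
def address_stream(base, counts, strides):
    """Yield memory addresses for nested loops.

    counts[i] is the trip count and strides[i] the address step of loop i;
    index 0 is the outermost loop (an assumption made for this sketch).
    """
    if len(counts) != len(strides):
        raise ValueError("counts and strides must have the same length")

    def walk(level, addr):
        if level == len(counts):
            yield addr                      # innermost point: emit one address
            return
        for i in range(counts[level]):      # loop counter for this level
            yield from walk(level + 1, addr + i * strides[level])

    yield from walk(0, base)
```

For example, `list(address_stream(0, [2, 3], [16, 1]))` walks two rows of three consecutive words, and `list(address_stream(1, [3], [4]))` walks one column of a 4-word-wide array. Exhaustion of the generator plays the role of the end-of-thread signal sent to the processor.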
4.1.3 Embedded FPGA

Commercial FPGAs are often notorious for their energy consumption and most of them cannot be embedded in a system-on-a-chip. Therefore, we make use of an in-house low-energy embedded FPGA [15].

The embedded FPGA contains a 4×8 array of 5-input 3-output CLBs, optimized for arithmetic operations and data-flow control functions. Its energy-efficiency has been measured to be 70 times higher than equivalent commercial solutions. This energy-efficient FPGA design is realized by combining both architectural and circuit-level modifications, which are outlined below.

Logic block: The logic block is designed to improve the interconnect utilization, and hence the interconnect energy. It is made up of a cluster of 3-input look-up tables. It can be used to implement 5-input random logic, or 2-bit arithmetic operations.

Low-swing circuit: The low-swing interconnect circuit improves the energy by a factor of 2 as compared to a full-swing circuit. The logic blocks operate on 1.5V while the low-swing signal lines have a 0.8V swing.

Interconnect Architecture: The interconnect is made up of 3 levels of connectivity. Each level is targeted at providing low-energy connections for specific path lengths. The Level0 structure is targeted at connections between nearest neighbors. Each logic block can connect to 8 of its immediate neighbors. The Level1 structure is the traditional symmetric mesh architecture, and is good for intermediate-length wires. The Level2 structure is used for implementing connections that span a significant fraction of the chip. The connectivity of each of these structures has been optimized using architecture evaluation tools to obtain energy efficiency.

Clock Distribution: More than 80% of the clock energy is dissipated in the clock distribution network. Double-edge-triggered flip-flops are used to reduce the clock activity by a factor of 2, and hence give a proportional reduction in energy. The clock distribution network also uses the low-swing technique for energy reduction.

[Figure 7. Data-flow driven globally asynchronous locally synchronous communication protocol. (a) Globally asynchronous - locally synchronous signaling. (b) Control tokens differentiate and delineate data streams and data structures (scalar, vector, matrix). Labels: MPY1 to MPYn, MAC1; legend: data associated with an end-of-vector token, regular data.]

4.2 Communication Interface Description

4.2.1 Inter-satellite Communication Interface

The data-flow driven synchronization between the processing elements employs a 2-phase self-timed handshaking scheme with REQUEST and ACKNOWLEDGE signals (Figure 7a), realized in a globally asynchronous, locally synchronous fashion. This approach not only reduces the power consumption by ensuring that a module is only activated when data is ready, but also allows various modules to operate at different and dynamically varying rates. Data links combine 16-bit fixed-width data words with 2-bit control tokens that serve as tags for the different data structures (scalar, vector, or matrix) that are supported by the network (Figure 7b). Each module includes a network interface controller to coordinate communication and synchronization.

4.2.2 Communication Interface between the Microprocessor and Satellites

This interface control unit coordinates synchronization and communication between the synchronous ARM8 core and the asynchronous reconfigurable data-paths, most importantly helping the core perform the reconfiguration of satellites by mapping all the configuration memories to the ARM8 memory space.
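Returning briefly to the inter-satellite data links of Section 4.2.1, the pairing of a 16-bit data word with a 2-bit control token can be sketched as follows. Only the field widths come from the text; the token names and numeric values, the placement of the token in the upper bits, and the helper names `pack`, `unpack` and `split_vectors` are assumptions made for illustration.

```python
from enum import IntEnum


class Token(IntEnum):
    """Hypothetical encoding of the 2-bit control token."""
    DATA = 0            # regular data
    END_OF_VECTOR = 1   # last element of a vector
    END_OF_MATRIX = 2   # last element of a matrix


def pack(data, token):
    """Combine a 16-bit data word and a 2-bit token into one 18-bit link word."""
    if not 0 <= data < (1 << 16):
        raise ValueError("data must fit in 16 bits")
    return (int(token) << 16) | data


def unpack(word):
    """Split an 18-bit link word back into (data, token)."""
    return word & 0xFFFF, Token(word >> 16)


def split_vectors(words):
    """Group a stream of link words into vectors using the end tokens."""
    vectors, current = [], []
    for word in words:
        data, token = unpack(word)
        current.append(data)
        if token in (Token.END_OF_VECTOR, Token.END_OF_MATRIX):
            vectors.append(current)
            current = []
    return vectors
```

With this layout a receiving module needs no separate length counter: a consumer such as an accumulator can clear or emit its result whenever an end token arrives, which is the role the text assigns to the tokens in delineating data structures.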
The interface logic controls the strobe generation for configuration reads/writes, handshakes, network reset, and start requests for the address generators and IO ports. The acknowledge signals for the address generators and IO ports are used to detect the end of a kernel, at which point the ARM8 core is interrupted. Interrupt mask registers and control registers are used to synchronize the ARM8 with the asynchronous satellite array.

The system supports two modes of operation: TEST and SYSTEM modes. As part of the test strategy, the TEST mode allows us to bypass the ARM8 processor and execute individual kernels through the interface. In the SYSTEM mode, instead of an on-chip cache for the embedded ARM8, an external SRAM (with zero bus turnaround) serves as the memory for the processor. In order to meet the 40MHz performance for the application, the off-chip memory is clocked twice as fast as the core. The interface is designed to meet this bandwidth.

4.3 Reconfigurable Interconnect Architecture

Keeping the energy of the reconfigurable interconnect network as low as possible while still meeting the flexibility requirement is crucial to the success of our heterogeneous reconfigurable architecture approach. This is realized by a combination of architecture and circuit optimizations.

4.3.1 Hierarchical Interconnect Network Architecture

An energy-efficient architecture must take advantage of the locality and regularity of computation. Exploiting locality by identifying natural isolated clusters of operations can be used to guide hardware partitioning, resulting in the minimization of global busses and thus reducing the interconnect power. Although the underlying system is heterogeneous, DSP algorithms usually have inherently repetitive computation patterns. Partitioning the hardware by preserving such regularity will lead to a simpler interconnect structure with reduced fan-ins and fan-outs. Especially for reconfigurable architectures, a more regular interconnect architecture achieves better routability and less reconfiguration overhead. There is a trade-off between flexibility and energy-efficiency. For instance, the crossbar network has the most flexibility, but also the least energy efficiency. In stage 6 of the design methodology, crossbar, mesh and hierarchical mesh structures are evaluated, and a 2-level hierarchical mesh was chosen for this implementation.

The implemented hierarchical interconnect mesh network can provide the optimum energy-efficiency with the right degree of flexibility within the application domain of interest. Several clusters of tightly connected modules are formed based on the communication locality. Each cluster has a local mesh with 2 buses per channel, and a universal switchbox at every intersection point (Figure 6). Global interconnections are supported by a 2nd-level larger-granularity mesh (implemented on the higher metal layers) with 2 buses per channel and hierarchical switchboxes, located at the key connection points. The hierarchical switchbox (Figure 6) contains a universal switchbox for each mesh level, as well as a number of cross-level interconnect switches. This hierarchical network architecture requires only a limited number of buses to achieve sufficient connection flexibility for our target applications, and cuts the interconnect energy cost by a factor of 7 compared to a straightforward crossbar network implementation.

4.3.2 Low-swing Interconnect Interface Circuits

Communication energy is further reduced by employing a low-swing (0.4V) pseudo-differential signaling scheme (Figure 8). The wire capacitance loads are also reduced by simplifying the switch network with NMOS-only switches. The circuit employs an NMOS-only push-pull driver with a very low voltage supply. The receiver is a clocked sense amplifier with low input offset and good sensitivity, followed by a static flip-flop. It contains double pairs of input transistors, with the gates of P1 and P3 connected to d, while the gates of P4 and P2 are biased at GND and REF respectively. Figure 8 shows the signaling waveforms. Based on our asynchronous clocking protocol, the clock signal is generated from the handshaking signals. The low-swing signaling reduces the interconnect energy by a factor of 3.4 compared to a full-swing CMOS implementation [17].

[Figure 8: Pseudo-differential low-swing interconnect circuitry. Labels: VDD, GND, REF, clk, in, out, d; transistors P1 to P7 and N1 to N4; nodes n1, n2, A, B; waveform levels 1V and 0.4V.]

5 RESULTS AND STATISTICS
Maia is a 210-pin chip that contains 1.2 million transistors and measures 5.2 × 6.7 mm² in a 0.25 µm 6-metal CMOS technology. Figure 9 shows the die photo of the Maia chip and Table 1 summarizes the implementation statistics of the chip.

Technology                   0.25 µm 6-level metal CMOS
Main Supply Voltage          1 V
Additional Voltages          0.4 V, 1.5 V
Die Size                     5.2 mm x 6.7 mm
Transistor Count             1.2 Million transistors
Average Cycle Speed          40 MHz
Average Power Dissipation    1.5 - 2 mW

Table 1: Chip Characteristics

Hardware module        Pipeline speed (ns)   Energy per operation (pJ)   Area (mm²)
MAC                    24                    21                          0.25
ALU                    20                    8                           0.09
Memory (1K x 16)       14                    8                           0.32
Memory (512 x 16)      11                    7                           0.16
Address generator      20                    6                           0.12
Interconnect network   10                    1*                          NA
FPGA                   25                    18**                        2.76

Table 2: Performances of hardware modules
*This number is the average energy consumption per connection
**This number is the average energy consumption across various arithmetic functions

Table 2 shows the performance of the different chip components (based on a per-block analysis) from PowerMill simulation.

Figure 10 (see the end of the paper, after references) illustrates the signals that are available at the I/O pins. During the TEST mode, all satellites and the reconfigurable interconnect can be configured by writing to the Taddr and Tdata pins (to the ConfigAdd and ConfigData buses), and the result of the computation can be read on the Tdata and FIQ pins (from the ReadData and ACK buses). In addition, simple programs can also be fed to the ARM8 via the Tdata pins to test satellite configuration reading and writing. The current test set-up supports the test mode described above, and a board to verify the SYSTEM mode is being designed. The HP 16702A logic analysis system was used for generating the test vectors (derived from Timemill simulations) for the TEST mode. Pattern acquisition was used for verifying the results of the computations after detecting the end of a kernel using an external interrupt signal.

Energy and performance of all kernels are tested in the TEST mode. Based on this information, the estimated energy dissipation of the processor when programmed for a VCELP voice coder (with 1.8mW total power consumption) is presented in Table 3, including a breakdown of the energy over the major functions. Dominant kernels are directly mapped onto hardware satellites, and their run-time reconfiguration is performed by the ARM core. Therefore, the kernel energy presented in the table incorporates contributions from both satellite and ARM8 configuration. The program control part of the algorithm is completely mapped to software. The total energy efficiency is a factor of 8 better than the best reported in the literature [18].
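The per-operation figures in Table 2 lend themselves to quick back-of-the-envelope estimates. The sketch below is illustrative only: the dictionary keys, the function names, and in particular the operation counts assumed for a dot product (one MAC, two 1K-memory reads, two address-generator steps and four link transfers per sample, with the per-connection figure applied once per transfer) are assumptions, not measurements from the paper.

```python
# Energy per operation in pJ, copied from Table 2 (key names are hypothetical).
ENERGY_PJ = {
    "MAC": 21, "ALU": 8, "MEM_1K": 8, "MEM_512": 7,
    "AG": 6, "LINK": 1, "FPGA": 18,
}


def kernel_energy_pj(op_counts):
    """Sum module energies for a kernel given operation counts per module."""
    return sum(ENERGY_PJ[module] * count for module, count in op_counts.items())


def dot_product_ops(n):
    """Assumed operation mix for an n-element dot product (see lead-in)."""
    return {"MAC": n, "MEM_1K": 2 * n, "AG": 2 * n, "LINK": 4 * n}


def energy_per_cycle_pj(power_mw, clock_mhz):
    """Average energy per clock cycle: mW / MHz gives nJ, times 1000 gives pJ."""
    return power_mw * 1000.0 / clock_mhz
```

Under these assumptions a 40-element dot product costs `kernel_energy_pj(dot_product_ops(40))` = 2120 pJ, and the quoted 1.8 mW at the 40 MHz cycle speed of Table 1 corresponds to about 45 pJ per cycle on average.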
[Figure residue (caption not recovered). Block labels: MEM, ALU, MAC, AGU, Interconnect Network, Interface, ARM8 Core; bus labels: Wdata (32), Rdata (32), ConfigAdd (16).]