A Low-Energy Heterogeneous Reconfigurable DSP IC

1 INTRODUCTION

The advent of the third generation of wireless applications creates a need for digital signal processing platforms that simultaneously display high computational performance, ultra-low energy consumption and a high degree of flexibility and adaptability. Flexibility and adaptability are a necessity in the presence of multiple and evolving standards, and help to increase quality-of-service in the presence of dynamically evolving channel conditions. (Re)configurable processors offer the advantage of combining flexibility and low energy [1][2] by providing a direct spatial mapping from algorithm to architecture, hence reducing the control overhead typically associated with instruction-set processors.

A low-power reconfigurable DSP architecture template (Pleiades), which encapsulates heterogeneous computing elements, has been proposed [2][3] to meet the requirements of flexibility, speed and energy efficiency at the same time (Figure 1). The Pleiades architecture style echoes the current trend in system-on-a-chip design, which includes a wide variety of macromodules including core processors, DSPs, programmable logic, embedded memory, and custom modules [4]. The heterogeneous architecture style of Pleiades allows better algorithm-architecture matching, giving better power/performance than many heterogeneous reconfigurable processors which incorporate only a microprocessor and fine-grained FPGAs.

In this paper, we describe the design process and implementation results of an instance of the Pleiades architecture, Maia, targeting the speech coding domain. In section 2, we give a description of the Pleiades architecture template and the model of reconfiguration. In sections 3 and 4, the methodology used to map algorithms to an architecture is given and the implementation of the architecture is discussed. Section 5 reports the testing strategy for the design and results of the final chip.

Figure 1. Energy and Flexibility Spectrum for Different Architectures (axes: flexibility and energy efficiency; architectures shown: microprocessor (µP), Pleiades DSP, ASIC)

2 HETEROGENEOUS RECONFIGURABLE DSP

Reconfigurable architectures [5][6][7] have received significant attention in recent years in both general-purpose computing and embedded processing. Mixing a processor with fine-grain reconfigurable elements has been the main approach attempted by the above systems. The Pleiades reconfigurable architecture achieves low energy consumption by providing a computational platform with mixed programming granularity (i.e. microprocessor, reconfigurable dataflow, FPGA) [8]. In this section, we explain our architecture in concept, and provide a description of the reconfiguration and computation models used in our design methodology.

Figure 2. Heterogeneous Architecture Template (a microprocessor and satellites SAT1, SAT2, SAT3 joined by a reconfigurable interconnect and a reconfiguration bus; each satellite wraps an ASIC or FPGA module with data ports, control signals, clocks, flags, handshake lines and a configuration register)

2.1 Architecture Template

The Pleiades architecture (Figure 2) is composed of a programmable microprocessor and heterogeneous computing elements (referred to as satellites in the rest of the paper). The architecture template fixes the communication primitives between the microprocessor
and satellites and between each satellite. For each algorithm domain (communication, speech coding, video coding), an architecture instance can be created (with known satellite types and numbers).

To reduce overhead in terms of instruction fetch and global control, the architecture utilizes distributed control and configuration. To achieve distributed control, each satellite is equipped with an interface that enables it to exchange data streams with other satellites efficiently, without the help of a global controller. The communication mechanism between satellites is dataflow driven [9].

Figure 4. Model of Reconfiguration (architecture instance: 3 address generators, 3 memories, 1 MAC/MUL and 1 ALU; the reconfiguration interconnect pattern is drawn as connection sets C1, C2, ... and C'1, ..., C'8 over a time axis marked t0, t1, t2)
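As a rough illustration of the model of reconfiguration in Figure 4, the sketch below treats each interval between two reconfiguration times as a set of inter-satellite connections that must be held simultaneously, and derives which connections change at each reconfiguration time. The satellite names, the connection sets and the function names are hypothetical; they are not taken from the Pleiades tools.

```python
from typing import FrozenSet, List, Tuple

Connection = Tuple[str, str]  # (source satellite, destination satellite)
Schedule = List[Tuple[str, FrozenSet[Connection]]]

# Hypothetical schedule: the connection set that must be realized
# simultaneously from each reconfiguration time onward.
SCHEDULE: Schedule = [
    ("t0", frozenset({("AG1", "MEM1"), ("MEM1", "MAC"),
                      ("AG2", "MEM2"), ("MEM2", "MAC")})),
    ("t1", frozenset({("AG1", "MEM1"), ("MEM1", "ALU"),
                      ("ALU", "MEM3"), ("AG3", "MEM3")})),
]


def reconfiguration_deltas(schedule: Schedule):
    """For each reconfiguration time, list connections to remove and to add."""
    deltas = []
    active: FrozenSet[Connection] = frozenset()
    for time, wanted in schedule:
        to_remove = sorted(active - wanted)
        to_add = sorted(wanted - active)
        deltas.append((time, to_remove, to_add))
        active = wanted
    return deltas


def peak_simultaneous_connections(schedule: Schedule) -> int:
    """Largest number of connections the interconnect must hold at once."""
    return max(len(connections) for _, connections in schedule)
```

Connections that survive a reconfiguration time (here AG1 to MEM1) appear in neither list, which is one simple way to see how much reconfiguration work a split point actually requires.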
The control means available to the programmer are basic satellite configurations to specify the kind of operation to be performed by the satellite, and configurations for the reconfigurable interconnect to build a cluster of satellites.

2.2 Model of Computation and Reconfiguration

While multiple application threads can be run on an instance of the Pleiades architecture template, the compilation of a single thread down to the reconfigurable components is the core of the higher-level scheduling tools that can utilize multiple threads. Therefore, the design methodology described later in the paper aims to support a smooth transition from a single-thread algorithm to an optimized implementation on Pleiades. Figure 3 illustrates the flow of computation supported by this software methodology. As shown in the figure, a sequential thread is first initialized on the microprocessor. After configuration codes are executed on the processor, control is transferred to the reconfigurable satellites (the “split” point in Figure 3) and the computation returns to the processor after all satellite operations are finished (the “join” point). Multiple split points exist within a sequential basic thread, and the satellites and connections have to be reconfigured for each split point.

Figure 3. Flow of Computation on Pleiades (an application thread on the programmable processor, with split and join points delimiting the threads that run on satellites)

The main idea behind reconfigurable computing that is advocated by the Pleiades system is to build a computational engine through spatially-programmed connections of processing elements (satellites). The interconnect model that needs to support such a system is depicted in Figure 4. On the time axis, t0, t1 and t2 indicate the times of reconfiguration. The bars (C1, C2 etc.) in between two reconfiguration times represent a set of inter-satellite connections that have to be realized simultaneously by the reconfigurable interconnect.

3 OVERVIEW OF THE ARCHITECTURE DESIGN METHODOLOGY

There are two key issues to be resolved in order to make the methodology practical for designers. Firstly, the Pleiades architecture combines two very distinct models of computation: control-driven computation on the general-purpose microprocessor and data-driven computing on the clusters of satellites. Therefore, the goal of the architectural exploration process is to partition the application over these two computing paradigms so that performance and energy dissipation constraints are met (during the compilation process). Secondly, optimizations related to reconfigurability have to be supported at both the architecture design and the compilation stage. Both of these issues require careful modeling of the algorithm and the underlying heterogeneous architectures.

The basic flow of the design exploration methodology [10] is presented in Figure 5. After the introduction of terminology, a short overview of the overall flow is given
in section 3.1.1. A more detailed description of this methodology and the tools developed can be found in [11].

Definition of Kernel - A computationally intensive part of the algorithm that often resides in nested loops.

3.1.1 Basic Methodology Flow

The methodology flow takes DSP or communication algorithms specified in a high-level language (e.g. C) as input. The initiation of the design process requires the establishment of a first-order baseline model of the algorithm complexity and bottlenecks. Such a model allows for the selection and execution of architecture-independent optimizations (stage 1). As architectural choices have yet to be made, this model assumes the presence of a “virtual architecture” with some generic operator costs attached to it. Optimizations at this stage only address either win-only situations or order-of-magnitude improvements, so that absolute accuracy is not that important.

After a satisfactory algorithm formulation is obtained, the architectural mapping and partitioning process can be entered. To be meaningful, the partitioning process should be based on realistic bottom-up information regarding the cost of implementing functions and operations on the different architectural choices. Our design-exploration methodology relies extensively on the availability of power-delay models for all components in its architectural library (stage 2). The estimation methods employed in each of these models vary depending upon the type of the module and the desired accuracy. While the absolute accuracy of these characterizations is not crucial, it is important that bounds on the prediction accuracy are known. Only “Improvements” that fall within the noise level of the estimations are accepted.

The architecture partitioning and mapping process is started by establishing an initial solution. Given the simplicity of a pure software implementation, we have adopted a “software-centric” approach that assumes that the whole algorithm is initially mapped onto the microprocessor (stage 3). This establishes how closely such a solution adheres to the design specifications and helps to establish the design bottlenecks. A rank ordering of the dominant compute kernels is established. In stage 4, dominant kernels are evaluated in order of importance and mapped to satellites for better power and performance [12]. If a hardware implementation is deemed worthwhile, a repartitioning of the design is established.

After all costly kernels are mapped to accelerators, a final partition of the algorithm across different architectures is obtained (stage 5), and memory assignment and allocation is performed to minimize memory transfers. While the rest of the algorithm remains in the high-level language, the portions of the algorithm to be implemented by satellites are specified in an intermediate form that is capable of modeling the structure of the reconfigurable satellite operations (i.e. as a netlist). Based on this conceptual netlist, implementation optimizations (stage 6) [13] are invoked to choose a good reconfigurable interconnect architecture (during the architecture design path) and to generate efficient configuration and interface code (during the compilation and test vector generation path) [14].

Figure 5. The Software Methodology Flow (boxes: Specification with timing and power constraints; Algorithm Refinement; Architecture Characterization; Mapping to Core; Kernel Ranking; PDA models; Mapping to Accelerators; Exploration; Kernel PDA macro-model; Performance Evaluation; Partitioning; Interconnect Optimization; Compilation/Code Generation; Reconfigurable Hardware Implementation Optimization; stages 1 to 6 as described in the text)
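The software-centric flow of stages 3 and 4 can be sketched as a ranking followed by a per-kernel decision. This is only an illustration under stated assumptions: the `Kernel` fields, the example numbers and the acceptance rule (a kernel moves to satellites only when the estimated saving exceeds the known error bound of the models, which is one reading of the remark on prediction accuracy) are hypothetical and are not the authors' tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Kernel:
    name: str
    sw_energy: float    # estimated energy on the microprocessor (arbitrary units)
    hw_energy: float    # estimated energy on satellites, including reconfiguration
    model_error: float  # known relative error bound of the estimates, e.g. 0.2


def rank_kernels(kernels: List[Kernel]) -> List[Kernel]:
    """Stage 3: order kernels by their cost in the all-software solution."""
    return sorted(kernels, key=lambda k: k.sw_energy, reverse=True)


def partition(kernels: List[Kernel]) -> Tuple[List[str], List[str]]:
    """Stage 4: visit kernels in rank order and decide where each one runs."""
    to_satellites, to_processor = [], []
    for kernel in rank_kernels(kernels):
        saving = kernel.sw_energy - kernel.hw_energy
        noise = kernel.model_error * kernel.sw_energy
        if saving > noise:  # assumed rule: the saving must exceed model noise
            to_satellites.append(kernel.name)
        else:
            to_processor.append(kernel.name)
    return to_satellites, to_processor


# Hypothetical estimates for three kernels.
EXAMPLE = [
    Kernel("fir", 40.0, 8.0, 0.2),
    Kernel("control", 5.0, 4.5, 0.2),
    Kernel("dot_product", 100.0, 10.0, 0.2),
]
```

With these made-up numbers the two vector kernels move to satellites while the control code stays on the processor, because its estimated saving is smaller than the model uncertainty.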
At different phases of the design process, it is important to show the impact of particular design choices on the overall performance and energy of the application in order to give meaningful design guidance. A spreadsheet-like environment does precisely that, and it is utilized in our methodology.

4 CHIP IMPLEMENTATION

Using the methodology described above, a prototype architecture, Maia, has been designed targeting the domain of voice processing (CELP-based speech coders, e.g. 16kb/sec VSELP, LDC-CELP etc.) for wireless devices.

The most dominant kernels in this domain are vector computations (dot products, FIR and IIR filters etc.). After going through stage 1 to stage 5 of the methodology, the following specifications of the architecture are obtained. The Maia processor (Figure 6) combines a microprocessor core (embedded ARM8) with 21 satellite processors: two MACs, two ALUs, eight address generators, eight embedded memories (four 512×16-bit, four 1K×16-bit), and an embedded low-energy FPGA [15]. The FPGA is used for infrequent functions (theta function generation etc.) that do not justify custom ASIC implementations.

Figure 6. The Maia floorplan (ARM core and interface; satellites: MACs, ALUs, address generators (AG), 1K and 512-word memories, and the FPGA; universal and hierarchical switchboxes; only cross-mesh connections are shown)
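The satellite inventory quoted above can be written down as a small description of the architecture instance and checked for consistency. The dictionary layout and the function names below are illustrative assumptions; only the counts and memory sizes come from the text.

```python
# Satellite counts for the Maia instance, as listed in the text.
MAIA_SATELLITES = {
    "MAC": 2,
    "ALU": 2,
    "address_generator": 8,
    "memory_512x16": 4,
    "memory_1Kx16": 4,
    "FPGA": 1,
}

# Words and bits per word for each embedded memory type.
MEMORY_SHAPES = {
    "memory_512x16": (512, 16),
    "memory_1Kx16": (1024, 16),
}


def satellite_count(instance: dict) -> int:
    """Total number of satellites in an architecture instance."""
    return sum(instance.values())


def embedded_memory_bits(instance: dict) -> int:
    """Total capacity of the embedded satellite memories, in bits."""
    return sum(
        instance.get(name, 0) * words * width
        for name, (words, width) in MEMORY_SHAPES.items()
    )
```

The counts add up to the 21 satellites stated in the text, and the eight memories together hold 98,304 bits (12 KiB).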
Connections between satellites are accomplished through a 2-level hierarchical mesh-structured reconfigurable interconnect network. Through an interface control unit, the ARM8 configures the satellites and communicates data with satellites using IO interface ports and direct memory reads/writes.

In the rest of this section, the architecture and circuit designs of each component are described. We first present the computational elements (microprocessor, ASIC and FPGA designs) in section 4.1. The processor-satellite and inter-satellite interface designs are discussed in section 4.2. The reconfigurable interconnect architecture is described in section 4.3.

4.1 Component Description

4.1.1 Embedded microprocessor

The embedded ARM8 core is optimized for low-energy operation, and can operate under variable supply voltages [16]. The core is synthesized from VHDL with hand optimizations on the critical path to enhance performance and power.

4.1.2 Programmable ASIC elements

Since filter operations in the applications make frequent use of MAC and memory units, the cores of both the MAC and the memory are custom designed. The ALU and address generator are synthesized from VHDL.

Both the dual-stage pipelined MAC (including shift/round/saturate functions) and the ALU can be configured to handle a range of operations. The address generators and embedded memories are distributed to supply multiple parallel data streams to the computational elements. The address generator features a
small local instruction memory, and can be programmed to support various types of addressing patterns and nested loops with loop counters and stride counters. It behaves as the local controller of data-flow kernels by initiating the data-flow threads, and by signaling the end of the data-flow threads to the ARM8.

4.1.3 Embedded FPGA

Commercial FPGAs are often notorious for their energy consumption and most of them cannot be embedded in a system-on-a-chip. Therefore, we make use of an in-house low-energy embedded FPGA [15].

The embedded FPGA contains a 4×8 array of 5-input 3-output CLBs, optimized for arithmetic operations and data-flow control functions. Its energy efficiency has been measured to be 70 times higher than equivalent commercial solutions. This energy-efficient FPGA design is realized by combining both architectural and circuit-level modifications, which are outlined below.

Logic block - The logic block is designed to improve the interconnect utilization, and hence the interconnect energy. It is made up of a cluster of 3-input look-up tables. It can be used to implement 5-input random logic, or 2-bit arithmetic operations.

Low-swing circuit - The low-swing interconnect circuit improves the energy by a factor of 2 as compared to a full-swing circuit. The logic blocks operate on 1.5V while the low-swing signal lines have a 0.8V swing.

Interconnect Architecture - The interconnect is made up of 3 levels of connectivity. Each level is targeted at providing low-energy connections for specific path lengths. The Level0 structure is targeted at connections between nearest neighbors. Each logic block can connect to 8 of its immediate neighbors. The Level1 structure is the traditional symmetric mesh architecture, and is good for intermediate-length wires. The Level2 structure is used for implementing connections that span a significant fraction of the chip. The connectivity of each of these structures has been optimized using architecture evaluation tools to obtain energy efficiency.

Clock Distribution - More than 80% of the clock energy is dissipated in the clock distribution network. Double-edge-triggered flip-flops are used to reduce the clock activity by a factor of 2, giving a proportional reduction in energy. The clock distribution network also uses the low-swing technique for energy reduction.

4.2 Communication Interface Description

4.2.1 Inter-satellite Communication Interface

The data-flow driven synchronization between the processing elements employs a 2-phase self-timed handshaking scheme with REQUEST and ACKNOWLEDGE signals (Figure 7a), realized in a globally asynchronous locally synchronous fashion. This approach not only reduces the power consumption by ensuring that a module is only activated when data is ready, but also allows various modules to operate at different and dynamically varying rates. Data links combine 16-bit fixed-width data words with 2-bit control tokens that serve as tags for different data structures (scalar, vector, or matrix) that are supported by the network (Figure 7b). Each module includes a network interface controller to coordinate communication and synchronization.

Figure 7. Data-flow driven globally asynchronous locally synchronous communication protocol. (a) Globally asynchronous, locally synchronous signaling. (b) Control tokens differentiate and delineate data streams and data structures (scalar, vector, matrix).

4.2.2 Communication Interface between the Microprocessor and Satellites

This interface control unit coordinates synchronization and communication between the synchronous ARM8 core and the asynchronous reconfigurable data-paths, most importantly helping the core perform the reconfiguration of satellites by mapping all the configuration memories to the ARM8 memory space.
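To make the idea of mapping configuration memories into the processor's memory space concrete, the following sketch models a flat address map in which each satellite owns a contiguous block of configuration words. The class, its methods, the satellite names and the address layout are hypothetical; the 32-bit word width follows the ConfigData bus width listed in Figure 10, and nothing here reflects the actual Maia address map.

```python
class ConfigSpace:
    """Toy model of satellite configuration registers in one address space."""

    def __init__(self) -> None:
        self._regions = {}   # satellite name -> (base address, size in words)
        self._memory = {}    # word address -> 32-bit configuration word
        self._next_base = 0

    def add_satellite(self, name: str, num_words: int) -> int:
        """Reserve a block of configuration words and return its base address."""
        base = self._next_base
        self._regions[name] = (base, num_words)
        self._next_base += num_words
        return base

    def write(self, address: int, word: int) -> str:
        """Store a word and return the name of the satellite that owns it."""
        for name, (base, size) in self._regions.items():
            if base <= address < base + size:
                self._memory[address] = word & 0xFFFFFFFF  # keep 32 bits
                return name
        raise ValueError(f"address {address:#x} is not mapped to any satellite")

    def satellite_config(self, name: str) -> list:
        """Read back all configuration words of one satellite."""
        base, size = self._regions[name]
        return [self._memory.get(base + i, 0) for i in range(size)]
```

In this model, reconfiguring a satellite is just a sequence of ordinary stores issued by the processor, which is the property the interface control unit provides.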
The interface logic controls the strobe generation for configuration reads/writes, handshakes, network reset, and start requests for the address generators and IO ports. The acknowledge signals for the address generators and IO ports are used to detect the end of a kernel, upon which the ARM8 core is interrupted. Interrupt mask registers and control registers are used to synchronize the ARM8 with the asynchronous satellite array.
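One way to picture end-of-kernel detection is as a masked comparison over the acknowledge lines. The sketch below assumes that the interrupt mask selects which acknowledge sources take part in the current kernel and that the interrupt is raised once all of them have acknowledged; this rule and the bit assignment are assumptions made for illustration, not a description of the Maia interface logic.

```python
def end_of_kernel(ack_bits: int, mask: int) -> bool:
    """Return True when every acknowledge source enabled in `mask` is set.

    ack_bits: one bit per acknowledge line (address generators and IO ports).
    mask:     one bit per line that participates in the current kernel.
    """
    return mask != 0 and (ack_bits & mask) == mask


# Hypothetical kernel that uses acknowledge lines 0 and 2.
KERNEL_MASK = 0b0000000101
```

An acknowledge from a line outside the mask is ignored, so satellites that are not part of the running kernel cannot trigger the interrupt.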
The system supports two modes of operation: TEST and SYSTEM modes. As part of the test strategy, the TEST mode allows us to bypass the ARM8 processor and execute individual kernels through the interface. In the SYSTEM mode, instead of an on-chip cache for the embedded ARM8, an external SRAM (with zero bus turnaround) serves as the memory for the processor. In order to meet the 40MHz performance for the application, the off-chip memory is clocked twice as fast as the core. The interface is designed to meet this bandwidth.

4.3 Reconfigurable Interconnect Architecture

Keeping the energy of the reconfigurable interconnect network as low as possible while still meeting the flexibility requirement is crucial to the success of our approach of heterogeneous reconfigurable architecture. This is realized by a combination of architecture and circuit optimizations.

4.3.1 Hierarchical Interconnect Network Architecture

An energy-efficient architecture must take advantage of the locality and regularity of computation. Exploiting locality by identifying naturally isolated clusters of operations can be used to guide hardware partitioning, resulting in the minimization of global busses and thus reducing the interconnect power. Although the underlying system is heterogeneous, DSP algorithms usually have inherently repetitive computation patterns. Partitioning the hardware by preserving such regularity will lead to a simpler interconnect structure with reduced fan-ins and fan-outs. Especially for reconfigurable architectures, a more regular interconnect architecture achieves better routability and less reconfiguration overhead. There is a trade-off between flexibility and energy efficiency. For instance, the crossbar network has the most flexibility, but also the least energy efficiency. In stage 6 of the design methodology, crossbar, mesh and hierarchical mesh structures are evaluated, and a 2-level hierarchical mesh is chosen for this implementation.

The implemented hierarchical interconnect mesh network can provide the optimum energy efficiency with the right degree of flexibility within the application domain of interest. Several clusters of tightly connected modules are formed based on the communication locality. Each cluster has a local mesh with 2 buses per channel, and a universal switchbox at every intersection point (Figure 6). Global interconnections are supported by a 2nd-level larger-granularity mesh (implemented on the higher metal layers) with 2 buses per channel and hierarchical switchboxes, located at the key connection points. The hierarchical switchbox (Figure 6) contains a universal switchbox for each mesh level, as well as a number of cross-level interconnect switches. This hierarchical network architecture requires only a limited number of buses to achieve sufficient connection flexibility for our target applications, and cuts the interconnect energy cost by a factor of 7 compared to a straightforward crossbar network implementation.

4.3.2 Low-swing Interconnect Interface Circuits

Communication energy is further reduced by employing a low-swing (0.4V) pseudo-differential signaling scheme (Figure 8). The wire capacitance loads are also reduced by simplifying the switch network with NMOS-only switches. The circuit employs an NMOS-only push-pull driver with a very low voltage supply. The receiver is a clocked sense amplifier with low input offset and good sensitivity, followed by a static flip-flop. It contains double pairs of input transistors, with the gates of P1 and P3 connected to d, while the gates of P4 and P2 are biased at GND and REF respectively. Figure 8 shows the signaling waveforms. Based on our asynchronous clocking protocol, the clock signal is generated from the handshaking signals. The low-swing signaling reduces the interconnect energy by a factor of 3.4 compared to a full-swing CMOS implementation [17].

Figure 8. Pseudo-differential low-swing interconnect circuitry (schematic with transistors P1 to P7 and N1 to N4 and signals in, d, clk, REF and out; waveform levels GND, 0.4V and 1V)

5 RESULTS AND STATISTICS
Maia is a 210-pin chip that contains 1.2 million transistors and measures 5.2×6.7 mm² in a 0.25 µm 6-metal CMOS technology. Figure 9 shows the die photo of the Maia chip and Table 1 summarizes all the implementation statistics of the chip.

Table 2 shows the performance of the different chip components (based on a per-block analysis) from PowerMill simulation.

Figure 10 (see the end of the paper, after references) illustrates the signals that are available at the I/O pins. During the TEST mode, all satellites and the reconfigurable interconnect can be configured by writing to the Taddr and Tdata pins (to the ConfigAdd and ConfigData buses), and the result of the computation can be read on the Tdata and FIQ pins (from the ReadData and ACK buses). In addition, simple programs can also be fed to the ARM8 via the Tdata pins to test satellite configuration reading and writing. The current test set-up supports the test mode described above, and a board to verify the SYSTEM mode is being designed. The HP 16702A logic analysis system was used for generating the test vectors (derived from Timemill simulations) for the TEST mode. Pattern acquisition was used for verifying the results of the computations after detecting the end of a kernel using an external interrupt signal.

Energy and performance of all kernels are tested in the TEST mode. Based on this information, the estimated energy dissipation of the processor when programmed for a VCELP voice coder (with 1.8mW total power consumption) is presented in Table 3, including a breakdown of the energy over the major functions. Dominant kernels are directly mapped onto hardware satellites, and their run-time reconfiguration is performed by the ARM core. Therefore, the kernel energies presented in the table incorporate contributions from both satellite and ARM8 configuration. The program control part of the algorithm is completely mapped to software. The total energy efficiency is a factor of 8 better than the best reported in the literature [18].

Technology                   0.25 µm 6-level metal CMOS
Main Supply Voltage          1 V
Additional Voltages          0.4 V, 1.5 V
Die Size                     5.2 mm x 6.7 mm
Transistor Count             1.2 million transistors
Average Cycle Speed          40 MHz
Average Power Dissipation    1.5 - 2 mW

Table 1: Chip Characteristics

Hardware module        Pipeline speed (ns)   Energy consumption per operation (pJ)   Area (mm²)
MAC                    24                    21                                      0.25
ALU                    20                    8                                       0.09
Memory (1K x 16)       14                    8                                       0.32
Memory (512 x 16)      11                    7                                       0.16
Address generator      20                    6                                       0.12
Interconnect network   10                    1*                                      NA
FPGA                   25                    18**                                    2.76

Table 2: Performance of hardware modules
*This number is the average energy consumption per connection
**This number is the average energy consumption across various arithmetic functions
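The per-operation figures in Table 2 lend themselves to quick back-of-the-envelope estimates. The sketch below adds up the energy of one dot-product step under an assumed mapping: two address generators, two 1K memory reads, one MAC operation and four interconnect connections per element. That mapping, the dictionary keys and the function names are illustrative assumptions; only the numbers in `TABLE2` come from the table.

```python
# module: (pipeline speed in ns, energy per operation in pJ), from Table 2
TABLE2 = {
    "MAC": (24, 21),
    "ALU": (20, 8),
    "MEM_1K": (14, 8),
    "MEM_512": (11, 7),
    "AG": (20, 6),
    "INTERCONNECT": (10, 1),
    "FPGA": (25, 18),
}

# Assumed operations per vector element for a dot product (hypothetical mapping).
DOT_PRODUCT_OPS = {"AG": 2, "MEM_1K": 2, "MAC": 1, "INTERCONNECT": 4}


def energy_per_element_pj(ops: dict) -> int:
    """Sum the per-operation energies of every module used for one element."""
    return sum(TABLE2[module][1] * count for module, count in ops.items())


def slowest_stage_ns(ops: dict) -> int:
    """Pipeline speed of the slowest module involved, which bounds throughput."""
    return max(TABLE2[module][0] for module in ops)
```

Under these assumptions one element costs 53 pJ and the MAC (24 ns) is the slowest stage; a different mapping of the kernel onto satellites would change both numbers.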
Figure 9. Maia die photo (labeled blocks: MEM, AGU, FPGA, ALU, MAC, Interconnect Network, Interface, ARM8 Core)
Functionality                        Energy consumption (mJ) for 1 sec of VCELP speech processing
Kernels:
  Dot product                        0.738
  FIR filter                         0.131
  IIR filter                         0.021
  Vector sum with scalar multiply    0.042
  Compute code                       0.011
  Covariance matrix compute          0.006
Program control                      0.838
Total                                1.787

Table 3. VSELP energy breakdown
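The entries of Table 3 can be cross-checked with a few lines of arithmetic: the rows should add up to the stated total, and an energy in mJ spent over one second of speech is numerically the average power in mW. The function and key names below are only for illustration.

```python
# Energy in mJ for 1 second of speech processing, copied from Table 3.
ENERGY_MJ_PER_SECOND = {
    "Dot product": 0.738,
    "FIR filter": 0.131,
    "IIR filter": 0.021,
    "Vector sum with scalar multiply": 0.042,
    "Compute code": 0.011,
    "Covariance matrix compute": 0.006,
    "Program control": 0.838,
}


def breakdown(table: dict) -> dict:
    """Summarize total energy, kernel energy and their ratio."""
    total = sum(table.values())
    kernels = total - table["Program control"]
    return {
        "total_mJ": round(total, 3),
        "kernels_mJ": round(kernels, 3),
        "average_power_mW": round(total / 1.0, 3),  # mJ over 1 s equals mW
        "kernel_share": round(kernels / total, 3),
    }
```

The rows sum to 1.787 mJ, matching the table total and the roughly 1.8 mW quoted in the text; the kernels mapped to satellites account for about 53% of the energy and program control on the processor for the rest.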
6 CONCLUSION

In this paper, Pleiades, a heterogeneous reconfigurable architecture template, is introduced and a design methodology to map algorithms to architectures is summarized. The details of the design and implementation of an instance of the Pleiades architecture are presented. The implementation echoes the current trend in system-on-a-chip design, which contains embedded components of various flexibility and reconfigurability (microprocessor, ASICs, FPGA). The heterogeneity and reconfigurability of the architecture prove to be very energy efficient when compared to state-of-the-art programmable processors.

7 ACKNOWLEDGEMENTS

We would like to acknowledge DARPA’s support for the Pleiades project (DABT-63-96-C-0026). The authors would like to thank Seno Katsunori and Yuji Ichikawa for their early work on the Pleiades prototype and evaluation. We would like to acknowledge other members of the Maia design team.

8 REFERENCES

[1] M. Goel and N. R. Shanbhag, “Low-power equalizers for 51.84 Mb/s very high-speed digital subscriber loop [VDSL] modems”, Proceedings of IEEE Workshop on Signal Processing Systems, Oct. 1998, Boston.
[2] A. Abnous and J. Rabaey, “Ultra-Low-Power Domain-Specific Multimedia Processors”, Proceedings of the IEEE VLSI Signal Processing Workshop, San Francisco, California, USA, October 1996.
[3] A. Abnous et al., “Evaluation of a Low-Power Reconfigurable DSP Architecture”, Proceedings of the Reconfigurable Architectures Workshop, Orlando, Florida, USA, March 1998.
[4] J. Borel, “Technologies for multimedia systems on a chip”, 1997 IEEE International Solid-State Circuits Conference, pages 18-21.
[5] G. R. Goslin, “A Guide to Using Field Programmable Gate Arrays for Application-Specific Digital Signal Processing Performance”, Proceedings of SPIE, vol. 2914, p. 321-331.
[6] J. Hauser and J. Wawrzynek, “GARP: A MIPS processor with a reconfigurable coprocessor”, in J. Arnold and K. L. Pocek, editors, Proceedings of IEEE Workshop on FPGA for Custom Computing Machines, Napa, CA, April 1997.
[7] T. Garverick et al., NAPA1000, http://www.national.com/appinfo/milaero/napa1000
[8] J. M. Rabaey, “Reconfigurable Computing: the Solution to Low Power Programmable DSP”, Proceedings of the 1997 ICASSP Conference, Munich, April 1997.
[9] M. Benes, “Design and Implementation of Communication and Switching Techniques for the Pleiades Family of Processors”, Master’s Thesis, UC Berkeley, 1999.
[10] M. Wan, D. Lidsky, Y. Ichikawa and J. Rabaey, “An Energy-Conscious Methodology for Early Exploration of Heterogeneous DSPs”, Proceedings of CICC 1998.
[11] M. Wan, H. Zhang, V. George, M. Benes, A. Abnous and J. Rabaey, “Design Methodology of a Low-Energy Reconfigurable Single-Chip DSP System”, Journal of VLSI Signal Processing, 2000.
[12] M. Wan, H. Zhang, M. Benes and J. Rabaey, “A Low-Power Reconfigurable Data-Driven DSP System”, Proceedings of SiPS, 1999.
[13] H. Zhang, M. Wan, V. George and J. Rabaey, “Interconnect Architecture Exploration for Low Energy Reconfigurable Single-Chip DSPs”, Proceedings of the WVLSI, Orlando, FL, USA, April 1999.
[14] S. Li, M. Wan and J. Rabaey, “Configuration Code Generation and Optimizations for Heterogeneous Reconfigurable DSPs”, Proceedings of SiPS, 1999.
[15] V. George, H. Zhang and J. Rabaey, “Low Energy FPGA Design”, Proceedings of ISLPED 1999.
[16] T. Burd, T. Pering, A. Stratakos and R. Brodersen, “A Dynamic Voltage-Scaled Microprocessor System”, Proceedings of ISSCC 2000.
[17] Hui Zhang et al., “Low-Swing Interconnect Interface Circuits”, Proceedings of ISLPED 1997.
[18] Wai Lee et al., “A 1V DSP for Wireless Communication”, Digest of Technical Papers of ISSCC 97.
Figure 10. Maia chip testing strategy (SYSTEM mode: off-chip SRAM on Addr<31:0>, Dq<31:0> and other controls; TEST mode: logic analyzer on Taddr<15:0>, Tdata<31:0> and Test, TRwn, TClk, FIQ etc.; on chip, the IO pins connect the ARM8 core (Wdata 32, Rdata 32, VAddress 32, requests, responses, interrupt) through the interface to the satellites (ConfigAdd 16, ConfigData 32, ReadData 32, Strobe 22, Start, ACKs 10))
